GPU Performance Profiling encompasses systematic measurement and analysis of kernel execution, memory access patterns, and hardware utilization using Nsight tools, roofline models, and application-specific metrics to identify bottlenecks and guide optimization.
Nsight Compute and Nsight Systems Overview
- Nsight Compute: Kernel-centric profiler. Analyzes single kernel execution: register/shared memory usage, L1/L2 cache hit rates, warp stall reasons, SM efficiency.
- Nsight Systems: System-wide profiler. Timeline view of entire application: kernel launches, memory transfers, CPU-GPU synchronization, context switches, power consumption.
- Guided Analysis Workflow: Nsight Compute recommends optimizations based on measured metrics (e.g., flagging that achieved occupancy is 50% and limited by shared memory, and suggesting a smaller per-block allocation).
- Overhead: Profiling adds roughly 5-50% runtime overhead depending on the metric set. Light profiling (e.g., SM efficiency) is near-minimal; heavy profiling (full metric sets collected via kernel replay) is substantial.
NVTX Annotations for Custom Metrics
- NVTX (NVIDIA Tools Extension): API to annotate application code. Marks user-defined ranges, domains, events with custom names.
- Range Annotation: nvtxRangePush/Pop() delineate code sections. Nsight timeline shows annotated regions, enabling user-level performance tracking.
- Domain Separation: nvtxDomainCreate() organizes related annotations. Example: separate domains for preprocessing, compute, postprocessing.
- Color and Category: Annotations assigned colors (visual grouping) and categories (filtering). Facilitates timeline analysis of complex multi-threaded applications.
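The push/pop discipline above can be illustrated with a pure-Python mock of the range stack (the real API is the C `nvtxRangePush`/`nvtxRangePop` pair; the `MockNvtx` class and its method names here are invented purely for illustration):

```python
import time

class MockNvtx:
    """Toy stand-in for nvtxRangePush/Pop: records nested, named
    time ranges the way an NVTX-aware profiler timeline shows them."""
    def __init__(self):
        self._stack = []   # open ranges: (name, start_time)
        self.ranges = []   # closed ranges: (name, nesting_depth, duration_s)

    def range_push(self, name):
        self._stack.append((name, time.perf_counter()))

    def range_pop(self):
        name, start = self._stack.pop()
        depth = len(self._stack)  # depth after popping = nesting level
        self.ranges.append((name, depth, time.perf_counter() - start))

nvtx = MockNvtx()
nvtx.range_push("preprocess")   # outer range, depth 0
nvtx.range_push("normalize")    # nested range, depth 1
nvtx.range_pop()
nvtx.range_pop()

# Ranges close innermost-first, mirroring the NVTX stack discipline.
print([(name, depth) for name, depth, _ in nvtx.ranges])
# → [('normalize', 1), ('preprocess', 0)]
```

The same nesting is what Nsight Systems renders as stacked bars on the timeline.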
Roofline Model for GPU Analysis
- Roofline Concept: 2D plot of achievable GFLOP/s vs arithmetic intensity (FLOP per byte transferred). Machine peak provides "roofline" ceiling.
- Peak Compute Roofline: Theoretical peak FLOP/s for the datatype in use. Ampere A100: 19.5 TFLOP/s FP32; the often-quoted 312 TFLOP/s figure is the FP16 Tensor Core peak.
- Peak Bandwidth Roofline: GPU memory bandwidth (theoretical throughput). A100 (80 GB, HBM2e): ~2 TB/s peak. Roofline ceiling = MIN(peak_compute, intensity × peak_bandwidth).
- Application Characterization: Measure kernel arithmetic intensity (FLOP count / memory bytes transferred). Points below roofline indicate under-utilization.
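The ceiling formula above can be sketched directly; this uses the A100-style roof numbers from the text (312 TFLOP/s FP16 Tensor Core, ~2 TB/s) as illustrative constants:

```python
def roofline_gflops(intensity_flop_per_byte, peak_gflops, peak_gbps):
    """Attainable GFLOP/s at a given arithmetic intensity:
    min(compute roof, bandwidth roof)."""
    return min(peak_gflops, intensity_flop_per_byte * peak_gbps)

PEAK_GFLOPS = 312_000.0   # 312 TFLOP/s (FP16 Tensor Core roof)
PEAK_GBPS = 2_000.0       # 2 TB/s memory bandwidth roof

# Ridge point: intensity where the two roofs meet.
ridge = PEAK_GFLOPS / PEAK_GBPS
print(ridge)                                           # → 156.0 FLOP/byte
print(roofline_gflops(10.0, PEAK_GFLOPS, PEAK_GBPS))   # → 20000.0 (bandwidth-limited)
print(roofline_gflops(400.0, PEAK_GFLOPS, PEAK_GBPS))  # → 312000.0 (compute-limited)
```

A measured kernel plotted well below this ceiling at its intensity indicates under-utilization rather than a hardware limit.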
Achieved Occupancy and Bottleneck Analysis
- Occupancy Metric: Percentage of SM warp slots filled. Occupancy = (resident_warps / max_warps_per_sm) × 100%. Max warps/SM: 64 on Volta and A100 (GA100); 48 on consumer Ampere (GA10x).
- Limiting Factors: Register pressure (64K 32-bit registers per SM, max 255 per thread), shared memory allocation (96 KB per SM on Volta; up to 164 KB on A100), and the per-SM block limit (varies by architecture).
- Occupancy vs Performance: Higher occupancy generally improves performance (more warps hide memory latency), but not always. Some high-register kernels benefit from lower occupancy.
- Warp Stall Reasons: Nsight reports stall causes (memory, dependency, execution resource, synchronization). Prioritize fixing most-common stall.
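The occupancy limiters above can be combined into a rough calculator; this is a sketch that ignores allocation granularity (real occupancy calculators round register and shared memory allocations up to hardware granules), with Volta-like defaults:

```python
def achieved_occupancy_limit(regs_per_thread, smem_per_block, threads_per_block,
                             regs_per_sm=65_536,       # 64K 32-bit registers/SM
                             smem_per_sm=96 * 1024,    # 96 KB shared memory/SM
                             max_warps_per_sm=64,
                             max_blocks_per_sm=32):
    """Theoretical occupancy upper bound from the three usual limiters:
    registers, shared memory, and warp/block slots per SM."""
    warps_per_block = (threads_per_block + 31) // 32
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    by_warps = max_warps_per_sm // warps_per_block
    blocks = min(by_regs, by_smem, by_warps, max_blocks_per_sm)
    return 100.0 * (blocks * warps_per_block) / max_warps_per_sm

# 256 threads/block, 64 registers/thread, 16 KB shared memory/block:
# register file allows only 4 resident blocks → 32 of 64 warp slots.
print(achieved_occupancy_limit(64, 16 * 1024, 256))  # → 50.0
```

Here the register file is the binding limiter, so reducing shared memory would not help — which is exactly the kind of distinction the profiler's occupancy section makes.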
Memory Bandwidth Utilization
- Effective Bandwidth: Bytes actually moved (from profiler counters) divided by kernel time, compared against theoretical peak. Typical ratios: 50-90% depending on access pattern.
- Coalescing Efficiency: Consecutive threads accessing consecutive memory addresses coalesce into a single transaction. Scattered access fetches full cache lines for only a few useful bytes, wasting bandwidth.
- Bank Conflicts: Shared memory bank conflicts serialize accesses. 32 threads hitting 32 distinct addresses in the same bank serialize into 32 transactions (~32x slowdown); all threads reading the same address broadcast without conflict. Padding or permuted indexing avoids conflicts.
- L2 Cache Effectiveness: L2 cache hit rate impacts bandwidth. Reuse distance (iterations between data access) determines cache utility.
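The effective-bandwidth ratio above is simple arithmetic over profiler counters; a minimal sketch (the function name and the example kernel's numbers are hypothetical):

```python
def effective_bandwidth_pct(bytes_read, bytes_written, kernel_time_s, peak_gbps):
    """Measured bandwidth (bytes moved / kernel time) as a
    percentage of the theoretical peak."""
    achieved_gbps = (bytes_read + bytes_written) / kernel_time_s / 1e9
    return 100.0 * achieved_gbps / peak_gbps

# Hypothetical kernel: streams 4 GB in and 4 GB out in 5 ms
# on a 2000 GB/s part → 1600 GB/s achieved.
print(effective_bandwidth_pct(4e9, 4e9, 5e-3, 2000.0))  # → 80.0
```

A result near the top of the 50-90% band suggests the kernel is already streaming efficiently; a low figure points at coalescing or reuse problems rather than raw bandwidth.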
Cache Utilization and Patterns
- L1 Cache: Per-SM cache (roughly 32-96 KB depending on architecture and configuration). On recent GPUs, L1 and shared memory share the same physical SRAM, so the configurable split trades L1 capacity against shared memory. Caches global loads when enabled.
- L2 Cache: Shared across all SMs (4-40 MB depending on GPU). Services L1 misses and uncached loads; also the point of coherence for global atomics.
- Hit Rate Interpretation: High L1 hit rate (>80%) indicates good locality; a low rate indicates poor spatial/temporal reuse.
- Profiler L2 Analysis: The misses-per-1k-instructions (MPKI) metric; well-optimized kernels often land below roughly 2-5 MPKI.
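The MPKI metric above is a direct ratio of two profiler counters; a trivial sketch with hypothetical counter values:

```python
def l2_misses_per_kilo_instr(l2_misses, instructions_executed):
    """Profiler-style 'misses per 1k instructions' (MPKI) metric."""
    return 1000.0 * l2_misses / instructions_executed

# Hypothetical counters: 3.0M L2 misses over 1.5B executed instructions.
print(l2_misses_per_kilo_instr(3.0e6, 1.5e9))  # → 2.0
```

Normalizing misses by instruction count (rather than by time) makes the metric comparable across kernels with different runtimes.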
SM Efficiency and Load Balancing
- SM Efficiency: Percentage of cycles in which an SM has at least one active warp issuing instructions. Idle cycles come from warp stalls, divergence, or under-occupancy.
- Warp Divergence Analysis: Branch divergence metrics show divergence frequency and impact. Serialization within warp reduces throughput.
- Grid-Level Load Balancing: Blocks distributed unevenly → some SMs idle while others compute. Profiler shows block-per-SM histogram.
- Dynamic Parallelism Overhead: Child kernels launched from device code incur launch and synchronization overhead, which hurts SM efficiency when the child kernels are small.
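The grid-level imbalance above shows up most clearly as a partial final "wave" of blocks; a sketch of that tail effect, assuming equal-duration blocks and one block per SM (both simplifications):

```python
import math

def tail_utilization(num_blocks, num_sms, blocks_per_sm=1):
    """Rough grid-level load balance: blocks are scheduled in waves of
    (num_sms * blocks_per_sm); the last, partial wave leaves SMs idle."""
    slots = num_sms * blocks_per_sm
    waves = math.ceil(num_blocks / slots)
    return 100.0 * num_blocks / (waves * slots)

# 130 blocks on a 108-SM GPU: the second wave runs only 22 blocks,
# so most SMs sit idle for half the kernel's lifetime.
print(round(tail_utilization(130, 108), 1))  # → 60.2
```

Sizing the grid to a multiple of the SM count (or to many waves, so the tail is amortized) is the usual fix the block-per-SM histogram motivates.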
Optimization Workflows
- Memory-Bound Analysis: If the kernel's arithmetic intensity falls left of the ridge point, its ceiling is the bandwidth line and the kernel is memory-bound. Optimize: improve coalescing, increase data reuse, add prefetching.
- Compute-Bound Analysis: If intensity falls right of the ridge point, the ceiling is the compute roof and the kernel is compute-bound. Optimize: reduce instruction count, use Tensor Cores, improve ILP.
- Iterative Refinement: Profile → identify bottleneck → optimize → re-profile. A 5-10 iteration cycle commonly yields a 2-5x speedup.
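The memory-bound/compute-bound split above reduces to comparing a kernel's intensity against the ridge point; a minimal sketch (the function name and the A100-style roof numbers are illustrative):

```python
def bound_class(intensity_flop_per_byte, peak_gflops, peak_gbps):
    """Classify a kernel by its position relative to the roofline
    ridge point (the intensity where the two roofs meet)."""
    ridge = peak_gflops / peak_gbps
    return "memory-bound" if intensity_flop_per_byte < ridge else "compute-bound"

# With 312 TFLOP/s and 2 TB/s roofs, the ridge point is 156 FLOP/byte.
print(bound_class(10.0, 312_000.0, 2_000.0))   # → memory-bound
print(bound_class(400.0, 312_000.0, 2_000.0))  # → compute-bound
```

This first classification decides which optimization checklist applies before any kernel code is touched.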