GPU Profiling and Debugging is the systematic analysis of GPU application performance and correctness using specialized tools that provide detailed metrics, timeline visualization, and error detection. NVIDIA Nsight Compute delivers kernel-level analysis with 1000+ metrics covering memory bandwidth (achieved vs. a peak of roughly 1.5-3 TB/s on modern datacenter GPUs), compute throughput (achieved vs. peak of roughly 20-80 TFLOPS depending on precision), occupancy (50-100%), and warp efficiency (target >90%). Nsight Systems provides a system-wide timeline showing CPU-GPU interaction, kernel launches, memory transfers, and API calls. Together they let developers identify bottlenecks (memory-bound, compute-bound, latency-bound), optimize resource utilization, and typically achieve 2-10× speedups through data-driven optimization. Profiling is essential for GPU development because intuition often misleads; measurement is the only reliable path to understanding actual performance characteristics.
Nsight Compute (Kernel Profiling):
- Purpose: detailed single-kernel analysis; 1000+ metrics; memory, compute, occupancy, warp efficiency; identifies kernel bottlenecks
- Launch: ncu ./app or ncu --set full ./app; GUI: ncu-ui; command-line or graphical interface
- Metrics: Memory Throughput (GB/s), Compute Throughput (TFLOPS), SM Efficiency (%), Occupancy (%), Warp Execution Efficiency (%), Branch Efficiency (%)
- Sections: Memory Workload Analysis, Compute Workload Analysis, Launch Statistics, Occupancy, Scheduler Statistics, Warp State Statistics
Nsight Systems (System Profiling):
- Purpose: system-wide timeline; CPU-GPU interaction; kernel launches, memory transfers, API calls; identifies system-level bottlenecks
- Launch: nsys profile ./app or nsys profile --trace=cuda,nvtx ./app; generates a .nsys-rep report (.qdrep in older releases); open in nsys-ui
- Timeline View: visualizes all GPU activity; shows overlaps, gaps, synchronization points; identifies idle time
- Use Cases: multi-GPU profiling, stream concurrency, CPU-GPU overlap, kernel launch overhead, memory transfer analysis
Memory Profiling:
- Memory Throughput: achieved bandwidth / peak bandwidth; target 80-100% for memory-bound kernels; A100: 1.5-2 TB/s, H100: 2-3 TB/s
- Memory Replay / Transactions per Request: indicates uncoalesced access; a ratio well above the coalesced minimum (rule of thumb: >1.5) means poor coalescing; restructure data layout
- L1/L2 Hit Rate: cache effectiveness; high hit rate (>80%) good for reused data; low hit rate indicates streaming access
- Global Load/Store Efficiency: percentage of useful bytes loaded; low efficiency (<50%) indicates wasted bandwidth; improve coalescing
- Bank Conflicts: shared memory bank conflicts; high conflicts (>10%) cause serialization; add padding or change access pattern
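The memory metrics above reduce to simple ratios; a back-of-envelope sketch in Python (the byte counts, kernel time, and 2 TB/s peak are illustrative A100-class numbers, not output from a real profile):

```python
def bandwidth_utilization(bytes_moved, kernel_time_s, peak_gbps):
    """Achieved bandwidth and its fraction of peak (profiler: Memory Throughput)."""
    achieved_gbps = bytes_moved / kernel_time_s / 1e9
    return achieved_gbps, achieved_gbps / peak_gbps

def load_efficiency(useful_bytes, transferred_bytes):
    """Global load efficiency: useful bytes / bytes actually transferred."""
    return useful_bytes / transferred_bytes

# Hypothetical kernel: streams 64 GB in 40 ms on a GPU with 2 TB/s peak.
gbps, frac = bandwidth_utilization(64e9, 40e-3, 2000)
print(f"{gbps:.0f} GB/s = {frac:.0%} of peak")  # 1600 GB/s = 80% of peak

# Strided access pulling full 32-byte sectors but using only 4 bytes of each:
print(f"{load_efficiency(4, 32):.0%} of loaded bytes useful")  # well under 50%
```

The second ratio is the pattern behind low Global Load Efficiency: the hardware moves whole sectors, so strided or misaligned access wastes most of each transfer.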
Compute Profiling:
- Compute Throughput: achieved TFLOPS / peak TFLOPS; target 50-80% for compute-bound kernels; A100: 19.5 TFLOPS FP32, 312 TFLOPS FP16
- SM Efficiency: percentage of time SMs are active; target 80-100%; low efficiency indicates insufficient work or poor scheduling
- Tensor Core Utilization: percentage of time Tensor Cores active; target 50-80% for matrix operations; 312 TFLOPS on A100
- IPC (Instructions Per Cycle): instructions executed per cycle; higher is better; target 2-4 for well-optimized kernels
Occupancy Analysis:
- Achieved Occupancy: percentage of maximum warps active; target 50-100%; higher occupancy hides latency
- Theoretical Occupancy: maximum possible based on resource usage; limited by registers, shared memory, block size
- Occupancy Limiter: identifies limiting factor (registers, shared memory, block size); guides optimization
- Occupancy Calculator: CUDA Occupancy Calculator spreadsheet (also integrated into Nsight Compute's Occupancy section); predicts occupancy from resource usage; cudaOccupancyMaxActiveBlocksPerMultiprocessor gives the same answer at runtime; useful for tuning
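The calculator's core logic is a min over per-resource limits; a simplified sketch with A100-like SM limits (real calculators also model register/shared-memory allocation granularity, so this is slightly optimistic):

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_threads=2048, max_blocks=32,
                          regs_per_sm=65536, smem_per_sm=164 * 1024):
    """Simplified occupancy model: blocks per SM = min over each resource limit."""
    warps_per_block = (threads_per_block + 31) // 32
    limits = [
        max_threads // threads_per_block,                      # thread limit
        max_blocks,                                            # block-slot limit
        regs_per_sm // (regs_per_thread * threads_per_block),  # register limit
        smem_per_sm // smem_per_block if smem_per_block else max_blocks,  # shared mem
    ]
    blocks = min(limits)
    occupancy = blocks * warps_per_block / (max_threads // 32)
    limiter = ["threads", "blocks", "registers", "shared memory"][limits.index(blocks)]
    return occupancy, limiter

# Hypothetical kernel: 256 threads/block, 64 registers/thread, 16 KB shared memory.
occ, limiter = theoretical_occupancy(256, 64, 16 * 1024)
print(f"occupancy {occ:.0%}, limited by {limiter}")  # occupancy 50%, limited by registers
```

Here the register file caps the SM at 4 blocks (32 warps of 64); dropping to 32 registers per thread would double the theoretical occupancy, which is exactly the trade-off the Occupancy Limiter section surfaces.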
Warp Efficiency:
- Warp Execution Efficiency: percentage of active threads in executed warps; target >90%; low efficiency indicates divergence
- Branch Efficiency: percentage of branches without divergence; target >90%; divergent branches cause serialization
- Predication Efficiency: percentage of instructions not predicated off; target >90%; high predication indicates divergence
- Optimization: minimize divergence; use ballot/shuffle for divergent code; restructure algorithms
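Why divergence halves warp execution efficiency: both sides of a divergent branch execute serially with lanes masked off. A toy lane-counting model (not a profiler API):

```python
WARP_SIZE = 32

def warp_execution_efficiency(active_lane_counts):
    """Average fraction of active lanes over the warp-instructions executed."""
    return sum(active_lane_counts) / (len(active_lane_counts) * WARP_SIZE)

# Uniform branch: all 32 lanes take the same path -> one pass, full warp active.
print(warp_execution_efficiency([32]))      # 1.0

# Branch on lane parity: both paths run serially with 16 active lanes each.
print(warp_execution_efficiency([16, 16]))  # 0.5
</n```

The parity case is the worst common pattern: restructuring so whole warps take the same path (e.g., branching on blockIdx or on warp-aligned data) restores full efficiency.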
Roofline Model:
- Concept: plots achieved performance vs arithmetic intensity; shows whether memory-bound or compute-bound
- Memory Roof: horizontal line at peak memory bandwidth; memory-bound kernels hit this ceiling
- Compute Roof: diagonal line at peak compute throughput; compute-bound kernels hit this ceiling
- Optimization: move toward upper-right (higher intensity, higher performance); tiling increases intensity
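The roofline reduces to one formula: attainable performance = min(compute roof, bandwidth × arithmetic intensity). A sketch with A100-like roofs (19.5 FP32 TFLOPS, 2 TB/s; the example intensities are illustrative):

```python
def roofline(ai_flops_per_byte, peak_tflops, peak_tbps):
    """Attainable TFLOPS under the roofline model, plus which roof applies."""
    memory_roof = peak_tbps * ai_flops_per_byte
    if memory_roof >= peak_tflops:
        return peak_tflops, "compute"
    return memory_roof, "memory"

# Ridge point for these roofs: 19.5 / 2.0 ~ 10 FLOP/byte.
for ai in (0.25, 4, 64):  # e.g. vector add, stencil, tiled GEMM (illustrative)
    tflops, bound = roofline(ai, 19.5, 2.0)
    print(f"AI={ai}: attainable {tflops:.2f} TFLOPS ({bound}-bound)")
```

Below the ridge point (~10 FLOP/byte here) a kernel cannot exceed bandwidth × intensity no matter how it is tuned, which is why tiling (raising intensity) must come before instruction-level tweaks.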
Timeline Analysis:
- Kernel Gaps: idle time between kernels; indicates launch overhead or synchronization; use streams to overlap
- Memory Transfer Gaps: idle time during transfers; use async transfers and streams; overlap with compute
- CPU-GPU Sync: cudaDeviceSynchronize() causes gaps; minimize synchronization; use events for fine-grained control
- Multi-GPU: visualize cross-GPU communication; identify load imbalance; optimize data distribution
NVTX Markers:
- Purpose: annotate code regions; shows in Nsight Systems timeline; helps identify bottlenecks in application logic
- API: nvtxRangePush("label"), nvtxRangePop(); marks code regions; nvtxMark("event") for single events
- Use Cases: mark training iterations, data loading, preprocessing, inference; correlate with GPU activity
- Best Practice: annotate all major code sections; hierarchical markers; color-code by category
Debugging Tools:
- cuda-memcheck: detects memory errors; out-of-bounds access, race conditions, uninitialized memory; run with cuda-memcheck ./app; deprecated in recent CUDA releases in favor of Compute Sanitizer
- Compute Sanitizer: newer tool replacing cuda-memcheck; more features; memcheck, racecheck, initcheck, synccheck modes
- CUDA_LAUNCH_BLOCKING=1: makes every kernel launch synchronous, so errors surface at the faulting launch; easier debugging; disables async execution; use only for debugging
- cuda-gdb: command-line debugger; breakpoints, watchpoints, inspect variables; cuda-gdb ./app
Performance Metrics:
- Achieved Bandwidth: GB/s of memory traffic; compare to peak (1.5-3 TB/s); target 80-100% for memory-bound
- Achieved TFLOPS: floating-point operations per second; compare to peak (20-80 TFLOPS); target 50-80% for compute-bound
- Kernel Time: total kernel execution time; identify slow kernels; focus optimization efforts
- Launch Overhead: time between kernel launches; target <1% of total time; use CUDA Graphs to reduce
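Two of these metrics as arithmetic, with hypothetical numbers (a 4096³ GEMM in 8 ms against the A100's 19.5 TFLOPS FP32 peak, and a ~5 µs per-launch overhead, both illustrative):

```python
def achieved_tflops(flop_count, kernel_time_s):
    """Achieved TFLOPS from an operation count and a measured kernel time."""
    return flop_count / kernel_time_s / 1e12

def launch_overhead_fraction(n_launches, overhead_per_launch_s, total_time_s):
    """Fraction of wall time lost to kernel-launch overhead."""
    return n_launches * overhead_per_launch_s / total_time_s

# Hypothetical GEMM: 2*M*N*K FLOPs with M=N=K=4096, measured at 8 ms.
t = achieved_tflops(2 * 4096**3, 8e-3)
print(f"{t:.1f} TFLOPS = {t / 19.5:.0%} of 19.5 TFLOPS FP32 peak")  # ~88%

# 10,000 tiny kernels at ~5 us launch overhead over 100 ms of wall time:
print(f"{launch_overhead_fraction(10_000, 5e-6, 0.1):.0%} lost to launch overhead")
```

The second print is the pathology CUDA Graphs address: when per-launch cost rivals kernel runtime, batching launches into a graph recovers most of that lost time.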
Bottleneck Identification:
- Memory-Bound: <50% compute throughput, high memory throughput; optimize memory access patterns, use shared memory, reduce accesses
- Compute-Bound: <50% memory throughput, high compute throughput; use Tensor Cores, increase ILP, reduce divergence
- Latency-Bound: low occupancy, low throughput; increase occupancy, reduce register usage, increase block size
- Instruction-Bound: high instruction overhead; reduce branches, use warp primitives, optimize control flow
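The triage rules above can be expressed as a small decision function (the 50% thresholds are the heuristics from this section, not profiler-defined cutoffs):

```python
def classify_bottleneck(compute_pct, memory_pct, occupancy_pct):
    """Rough bottleneck triage from percent-of-peak metrics."""
    if compute_pct < 50 and memory_pct >= 50:
        return "memory-bound"      # optimize access patterns, shared memory
    if memory_pct < 50 and compute_pct >= 50:
        return "compute-bound"     # Tensor Cores, ILP, less divergence
    if compute_pct < 50 and memory_pct < 50 and occupancy_pct < 50:
        return "latency-bound"     # raise occupancy, cut register usage
    return "balanced / inspect further"

print(classify_bottleneck(20, 85, 70))  # memory-bound
print(classify_bottleneck(70, 30, 80))  # compute-bound
print(classify_bottleneck(15, 20, 25))  # latency-bound
```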
Optimization Workflow:
- Profile: run Nsight Compute and Nsight Systems; identify bottleneck (memory, compute, latency)
- Analyze: examine relevant metrics; memory throughput, compute throughput, occupancy, warp efficiency
- Optimize: apply targeted optimizations; memory coalescing, shared memory, occupancy tuning, divergence reduction
- Measure: re-profile; verify improvement; compare metrics before and after
- Iterate: repeat for next bottleneck; diminishing returns after 3-5 iterations; 2-10× total speedup typical
Common Profiling Patterns:
- Baseline: profile unoptimized code; establish baseline metrics; identify major bottlenecks
- Incremental: optimize one aspect at a time; measure impact; easier to attribute improvements
- Comparison: compare against reference implementation (cuBLAS, cuDNN); identify gaps; target 80-95% of library performance
- Regression: profile after code changes; detect performance regressions; maintain performance over time
Multi-GPU Profiling:
- Nsight Systems: visualizes all GPUs simultaneously; shows cross-GPU communication; identifies load imbalance
- NCCL Profiling: NCCL_DEBUG=INFO shows communication details; bandwidth, latency, algorithm selection
- Per-GPU Metrics: profile each GPU separately; identify stragglers; optimize slowest GPU first
- Scaling Analysis: measure scaling efficiency; compare 1 GPU vs N GPUs; target 80-95% efficiency
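Scaling efficiency is a one-line calculation from per-step timings (the 800 ms / 115 ms figures below are hypothetical):

```python
def scaling_efficiency(t1, tn, n_gpus):
    """Strong-scaling efficiency: speedup over N GPUs divided by N."""
    return (t1 / tn) / n_gpus

# Hypothetical per-step times: 1 GPU = 800 ms, 8 GPUs = 115 ms.
eff = scaling_efficiency(0.800, 0.115, 8)
print(f"{eff:.0%} scaling efficiency")  # ~87%: inside the 80-95% target band
```

If efficiency falls below target, the per-GPU timelines in Nsight Systems plus NCCL_DEBUG=INFO usually show whether the gap is communication time or a straggler GPU.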
Advanced Profiling:
- Sampling: sample-based profiling for long-running applications; lower overhead; nsys profile --sample=cpu --trace=cuda ./app
- Metrics Collection: collect specific metrics; ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed; reduces overhead
- Kernel Replay: Nsight Compute replays kernels to collect metric sets that cannot be gathered in one pass; ncu --launch-count 1 limits profiling to a single launch; --replay-mode kernel (vs. application) controls replay granularity
- Source Correlation: correlate metrics with source and SASS lines on the Source page in ncu-ui; identify hot spots; compile with nvcc -lineinfo and profile with ncu --import-source yes
Performance Targets:
- Memory Bandwidth: 80-100% of peak (1.5-3 TB/s); coalesced access, minimal bank conflicts
- Compute Throughput: 50-80% of peak (20-80 TFLOPS); use Tensor Cores, high ILP, minimal divergence
- Occupancy: 50-100%; balance register and shared memory usage; 256 threads per block typical
- Warp Efficiency: >90%; minimize divergence; uniform control flow
- Kernel Time: <1 ms for small kernels, <100 ms for large; on display-attached GPUs, very long kernels risk the watchdog timeout; split if necessary
Best Practices:
- Profile Early: profile from the start; avoid premature optimization but measure early; establish baseline
- Profile Often: profile after each optimization; verify improvement; catch regressions
- Use Both Tools: Nsight Compute for kernel details, Nsight Systems for system view; complementary insights
- Focus on Bottlenecks: optimize slowest kernels first; 80/20 rule applies; 20% of kernels often account for 80% of time
- Measure, Don't Guess: intuition often wrong; always measure; data-driven optimization
Common Mistakes:
- Optimizing Wrong Thing: optimizing fast kernels instead of slow ones; profile to identify bottlenecks
- Ignoring Occupancy: assuming higher occupancy always better; balance with resource usage; profile to find optimal
- Over-Optimizing: diminishing returns after 3-5 iterations; 2-10× total speedup typical; know when to stop
- Not Profiling: relying on intuition; guessing bottlenecks; always measure actual performance
Real-World Impact:
- Matrix Multiplication: profiling reveals 20% of peak; optimization achieves 80-95% of peak; 4-5× speedup
- Reduction: profiling shows bank conflicts; optimization eliminates conflicts; 2-3× speedup; 60-80% of peak
- Convolution: profiling reveals memory-bound; shared memory tiling achieves 70-90% of peak; 5-10× speedup
- Custom Kernels: profiling guides optimization; 2-10× improvement typical; achieves 50-80% of peak
GPU Profiling and Debugging are the essential tools for GPU performance optimization. By providing detailed metrics, timeline visualization, and error detection through Nsight Compute and Nsight Systems, they let developers identify bottlenecks, optimize resource utilization, and typically achieve 2-10× improvement through data-driven optimization. Profiling is often the difference between GPU code that achieves 10% of theoretical peak and code that achieves 80%: intuition often misleads, and measurement is the only path to understanding actual performance characteristics.