GPU Performance Profiling

Keywords: gpu performance profiling nsight, nvtx annotation, roofline model gpu, achieved bandwidth occupancy, gpu bottleneck analysis

GPU Performance Profiling encompasses systematic measurement and analysis of kernel execution, memory access patterns, and hardware utilization using Nsight tools, roofline models, and application-specific metrics to identify bottlenecks and guide optimization.

Nsight Compute and Nsight Systems Overview

- Nsight Compute: Kernel-centric profiler. Analyzes single kernel execution: register/shared memory usage, L1/L2 cache hit rates, warp stall reasons, SM efficiency.
- Nsight Systems: System-wide profiler. Timeline view of entire application: kernel launches, memory transfers, CPU-GPU synchronization, context switches, power consumption.
- Guided Analysis Workflow: Nsight Compute recommends optimizations based on measured metrics (e.g., "achieved occupancy 50%, limited by registers per thread; reduce register usage to raise occupancy").
- Overhead: Profiling adds ~5-50% runtime overhead depending on metric set. Light profiling (SM efficiency) minimal; heavy profiling (register spills) substantial.

NVTX Annotations for Custom Metrics

- NVTX (NVIDIA Tools Extension): API to annotate application code. Marks user-defined ranges, domains, events with custom names.
- Range Annotation: nvtxRangePush/Pop() delineate code sections. Nsight timeline shows annotated regions, enabling user-level performance tracking.
- Domain Separation: nvtxDomainCreate() organizes related annotations. Example: separate domains for preprocessing, compute, postprocessing.
- Color and Category: Annotations assigned colors (visual grouping) and categories (filtering). Facilitates timeline analysis of complex multi-threaded applications.

Roofline Model for GPU Analysis

- Roofline Concept: 2D plot of achievable GFLOP/s vs arithmetic intensity (FLOP per byte transferred). Machine peak provides "roofline" ceiling.
- Peak Compute Roofline: GPU theoretical compute peak. Ampere A100: 19.5 TFLOP/s FP32 (312 TFLOP/s FP16 via Tensor Cores).
- Peak Bandwidth Roofline: GPU memory bandwidth (theoretical throughput). A100 HBM2e: ~2 TB/s peak (80 GB variant). Roofline ceiling = MIN(peak_compute, intensity × peak_bandwidth).
- Application Characterization: Measure kernel arithmetic intensity (FLOP count / memory bytes transferred). Points below roofline indicate under-utilization.
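The ceiling formula above is easy to compute directly. This sketch assumes A100-class peaks (19.5 TFLOP/s FP32, 2.0 TB/s HBM) and classifies a kernel by which side of the ridge point its arithmetic intensity falls on:

```python
def roofline_attainable(intensity_flop_per_byte, peak_flops, peak_bw_bytes):
    """Attainable FLOP/s = min(compute ceiling, bandwidth ceiling)."""
    return min(peak_flops, intensity_flop_per_byte * peak_bw_bytes)

# Assumed A100-class peaks: 19.5 TFLOP/s FP32, 2.0 TB/s HBM.
PEAK_FLOPS = 19.5e12
PEAK_BW = 2.0e12

# Ridge point: the intensity where the two ceilings meet.
ridge = PEAK_FLOPS / PEAK_BW  # ≈ 9.75 FLOP/byte

def classify(intensity):
    return "memory-bound" if intensity < ridge else "compute-bound"

print(classify(0.25))   # vector add-like kernel → memory-bound
print(classify(64.0))   # large GEMM-like kernel → compute-bound
```

A kernel whose measured FLOP/s sits well below its attainable ceiling is under-utilizing the machine even for its intensity, which points to overheads beyond the memory/compute balance.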

Achieved Occupancy and Bottleneck Analysis

- Occupancy Metric: Percentage of SM warp slots filled. Occupancy = (resident_warps / max_warps_per_sm) × 100%. Max warps/SM: 64 (Volta, Ampere A100), 48 (consumer Ampere GA10x).
- Limiting Factors: Register pressure (64K 32-bit registers per SM), shared memory allocation (96-164 KB per SM depending on architecture), and the per-SM thread-block limit (varies by GPU).
- Occupancy vs Performance: Higher occupancy generally improves performance (more warps hide memory latency), but not always. Some high-register kernels benefit from lower occupancy.
- Warp Stall Reasons: Nsight reports stall causes (memory, dependency, execution resource, synchronization). Prioritize fixing most-common stall.
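The limiting factors above can be combined into a simplified occupancy estimate. This is a sketch only: the default SM limits are assumed A100-like values, and real hardware rounds register and shared-memory allocations to architecture-specific granularities, which the CUDA occupancy calculator handles exactly.

```python
import math

def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, regs_per_sm=65536, smem_per_sm=164 * 1024,
              max_blocks=32):
    """Simplified achieved-occupancy estimate (assumed A100-like SM limits).

    Ignores allocation granularity: real GPUs round register and shared
    memory allocations up, so this is an optimistic upper bound.
    """
    warps_per_block = math.ceil(threads_per_block / 32)
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_regs, by_smem, max_blocks, max_warps // warps_per_block)
    return blocks * warps_per_block / max_warps

# 256 threads/block, 64 registers/thread, 16 KB shared memory/block:
# registers cap residency at 4 blocks = 32 warps out of 64.
print(occupancy(256, 64, 16 * 1024))  # → 0.5
```

Halving register use per thread (64 → 32) doubles the register-limited block count and lifts this example to full occupancy, illustrating why register pressure is the first limiter to check.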

Memory Bandwidth Utilization

- Effective Bandwidth: Measured memory bytes (profiler) vs theoretical peak. Typical ratios: 50-90% depending on access pattern.
- Coalescing Efficiency: Consecutive threads accessing consecutive memory addresses coalesce into few wide transactions. Scattered access fetches full cache lines but uses only a fraction of each, wasting bandwidth.
- Bank Conflicts: Shared memory bank conflicts serialize accesses. 32 threads hitting 32 different addresses in the same bank → 32-way serialization (same-address access is a conflict-free broadcast). Padding or permuted access patterns avoid conflicts.
- L2 Cache Effectiveness: L2 cache hit rate impacts bandwidth. Reuse distance (iterations between data access) determines cache utility.
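Effective bandwidth is a straightforward ratio of profiler-measured traffic to kernel time. The numbers below are hypothetical counters, compared against an assumed 2,000 GB/s A100-class peak:

```python
def effective_bandwidth_gbps(bytes_read, bytes_written, kernel_ms):
    """Achieved bandwidth in GB/s from measured DRAM traffic and kernel time."""
    return (bytes_read + bytes_written) / (kernel_ms * 1e-3) / 1e9

# Hypothetical profile: 256 MiB read + 256 MiB written in a 0.4 ms kernel,
# measured against an assumed 2,000 GB/s peak (A100-class HBM).
gb_s = effective_bandwidth_gbps(256 * 2**20, 256 * 2**20, 0.4)
utilization = gb_s / 2000.0
print(round(gb_s, 1), round(utilization, 3))
```

At ~67% of peak this hypothetical kernel sits inside the typical 50-90% range; a figure far below 50% usually signals uncoalesced access rather than a genuinely bandwidth-saturated kernel.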

Cache Utilization and Patterns

- L1 Cache: Per-SM cache (32-128 KB depending on configuration); on Volta and later it shares a unified SRAM block with shared memory. Caches global loads (and, on some architectures, stores).
- L2 Cache: Shared across all SMs (4-40 MB depending on GPU). Victim cache for L1, also receives uncached loads.
- Hit Rate Interpretation: High L1 hit rate (>80%) indicates good locality; a low rate indicates poor spatial/temporal locality.
- Profiler L2 Analysis: Misses per 1k instructions (MPKI) metric. A rough target for well-optimized kernels is ~2-5 misses per 1k instructions or fewer.
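The MPKI metric is just a normalized ratio of two profiler counters. A minimal sketch, using hypothetical counter values:

```python
def l2_mpki(l2_misses, instructions_executed):
    """L2 misses per thousand executed instructions (MPKI)."""
    return 1000.0 * l2_misses / instructions_executed

# Hypothetical counters from a profile: 1.2M L2 misses over 400M instructions.
print(l2_mpki(1.2e6, 400e6))  # → 3.0
```

Normalizing by instruction count makes kernels of different sizes comparable, which raw miss counts are not.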

SM Efficiency and Load Balancing

- SM Efficiency: Percentage of SM slots executing useful instructions. Idle slots due to warp stalls, divergence, or under-occupancy.
- Warp Divergence Analysis: Branch divergence metrics show divergence frequency and impact. Serialization within warp reduces throughput.
- Grid-Level Load Balancing: Blocks distributed unevenly → some SMs idle while others compute. Profiler shows block-per-SM histogram.
- Dynamic Parallelism Overhead: Child kernels launched from kernel require synchronization overhead. Impacts SM efficiency if child kernels small.

Optimization Workflows

- Memory-Bound Analysis: If the kernel sits under the sloped bandwidth ceiling of the roofline (arithmetic intensity left of the ridge point), it is memory-bound. Optimize: improve coalescing, increase data reuse, prefetch.
- Compute-Bound Analysis: If the kernel sits under the flat compute ceiling (arithmetic intensity right of the ridge point), it is compute-bound. Optimize: reduce instruction count, use Tensor Cores, improve ILP.
- Iterative Refinement: Profile → identify bottleneck → optimize → re-profile. Typical 5-10 iteration cycle for 2-5x speedup.
