Occupancy is the ratio of active warps on a streaming multiprocessor (SM) to the maximum number of warps that SM can hold. It estimates how much parallelism is available for latency hiding, but optimal performance depends on more than occupancy alone.
What Is Occupancy?
- Definition: Active-warp fraction determined by block size, register use, and shared memory allocation.
- Resource Limits: High per-thread register or shared-memory use can cap active blocks and warps.
- Not Absolute: Maximum occupancy does not guarantee maximum throughput; a kernel limited by instruction throughput or memory bandwidth may gain little from additional resident warps.
- Measurement: Reported by profilers alongside issue efficiency and stall breakdown.
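The way block size, register use, and shared memory combine into an occupancy figure can be sketched as a small calculation. This is a simplified model, not a vendor calculator: the `SMLimits` values are illustrative assumptions rather than the limits of any specific GPU, and the model ignores register and shared-memory allocation granularity, so a profiler may report somewhat different numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Assumed per-SM limits for illustration only; real values vary by architecture.
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536
    warp_size: int = 32


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          sm=SMLimits()):
    """Return active warps / max warps under a simplified resource model."""
    # Blocks occupy whole warps, so round the thread count up to a warp multiple.
    warps_per_block = -(-threads_per_block // sm.warp_size)

    # Each resource independently caps how many blocks can be resident at once.
    by_warps = sm.max_warps // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * sm.warp_size
    by_regs = sm.registers // regs_per_block if regs_per_block else sm.max_blocks
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks

    # The tightest limit determines the resident block count.
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / sm.max_warps
```

Under these assumed limits, doubling registers per thread from 32 to 64 at 256 threads per block halves the modeled occupancy, and very small blocks hit the resident-block cap before the warp cap.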
Why Occupancy Matters
- Latency Hiding: Higher occupancy often helps mask long memory and synchronization delays.
- Launch Tuning: Occupancy analysis guides block-size and resource tradeoff decisions.
- Performance Diagnosis: Low occupancy can explain underutilization in memory-latency-sensitive workloads, where too few warps are resident to cover stalls.
- Portability: Occupancy-aware kernels adapt better across GPU generations with different limits.
- Optimization Balance: Occupancy analysis helps weigh aggressive unrolling, which tends to raise register use, against resident-warp count.
How It Is Used in Practice
- Kernel Resource Audit: Measure register usage per thread and shared-memory usage per thread block.
- Launch Sweep: Benchmark multiple block dimensions to find the best balance of throughput and occupancy.
- Combined Metrics: Interpret occupancy together with memory and instruction-efficiency counters.
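A launch sweep along these lines might look like the sketch below. The hardware limits are the same kind of illustrative assumptions as before, and `time_kernel` is a hypothetical callable that the reader would supply to launch the kernel at a given block size and return a measured runtime; nothing here measures a real GPU.

```python
def occupancy(threads, regs_per_thread, smem_per_block,
              max_warps=64, max_blocks=32, reg_file=65536, smem=65536, warp=32):
    """Simplified theoretical occupancy; default limits are assumptions."""
    warps = -(-threads // warp)  # ceiling division: blocks occupy whole warps
    limits = [max_blocks, max_warps // warps]
    if regs_per_thread:
        limits.append(reg_file // (regs_per_thread * warps * warp))
    if smem_per_block:
        limits.append(smem // smem_per_block)
    return min(limits) * warps / max_warps


def launch_sweep(block_sizes, regs_per_thread, smem_per_thread, time_kernel):
    """Pair modeled occupancy with measured runtime for each block size.

    time_kernel(threads) is a hypothetical user-supplied benchmark hook.
    Returns (threads, occupancy, runtime) tuples, fastest first.
    """
    rows = []
    for threads in block_sizes:
        occ = occupancy(threads, regs_per_thread, smem_per_thread * threads)
        rows.append((threads, occ, time_kernel(threads)))
    # Rank by measured time, not occupancy: the fastest launch is the goal.
    return sorted(rows, key=lambda row: row[2])
```

Sorting by runtime rather than by occupancy is deliberate: the occupancy column is there to help explain the ranking, not to define it.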
Occupancy is a key parallelism indicator for GPU kernel tuning. The best results come from balancing occupancy with instruction efficiency and memory behavior, not from maximizing any one metric blindly.