GPU occupancy measures the ratio of active warps to maximum possible warps on a streaming multiprocessor (SM) — indicating how well a kernel utilizes GPU parallel resources, with higher occupancy generally (but not always) correlating with better performance for memory-bound workloads.
What Is Occupancy?
- Definition: Active warps ÷ Maximum warps per SM.
- Range: 0% to 100%.
- Unit: Warps (groups of 32 threads).
- Goal: Keep GPU execution units busy.
Why Occupancy Matters
- Latency Hiding: More warps = better memory latency hiding.
- Utilization: Higher occupancy often means better GPU use.
- Memory-Bound: Particularly important for memory-bound kernels.
- Not Always Key: Compute-bound kernels may not need high occupancy.
Occupancy Calculation
Factors Limiting Occupancy:
Resource | Limit Per SM | Impact
------------------|-------------------|------------------
Registers | 65,536 (typical) | More regs → fewer threads
Shared memory | 48-164 KB | More shmem → fewer blocks
Block size | 1024 threads max | Limits parallelism
Warp slots | 64 warps (2048 threads)| Hardware maximum
Example Calculation:
GPU: A100 (64 warps max per SM)
Kernel uses:
- 64 registers per thread
- 256 threads per block
- 8 KB shared memory per block
Registers: 65,536 / (64 × 256) = 4 blocks
Shared memory: 164 KB / 8 KB = 20 blocks
Thread limit: 2048 / 256 = 8 blocks
Bottleneck: Registers (4 blocks)
Active warps: 4 × (256/32) = 32 warps
Occupancy: 32/64 = 50%
Checking Occupancy
NVIDIA Tools:
# Nsight Compute profiling
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active
./my_cuda_program
# CUDA Occupancy Calculator (spreadsheet tool)
# Also available as API
CUDA API:
int blockSize = 256;
int minGridSize;
int maxBlockSize;
cudaOccupancyMaxPotentialBlockSize(
&minGridSize, &maxBlockSize,
myKernel, 0, 0
);
// Use maxBlockSize for kernel launch
PyTorch Kernel Info:
import torch
from torch.utils.benchmark import Timer
# Profile to see occupancy
with torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA],
) as prof:
result = model(input)
print(prof.key_averages().table())
Improving Occupancy
Strategies:
Issue | Solution
-------------------|----------------------------------
Too many registers | Use -maxrregcount compiler flag
| Spill to local memory (slower)
| Reduce kernel complexity
|
Too much shared mem| Reduce shared memory usage
| Use global memory (slower)
| Split kernel
|
Block size too small| Increase threads per block
| Aim for multiple of 32
|
Block size too large| Reduce to allow more blocks
Register Limiting:
// Limit registers per thread
__launch_bounds__(256, 4) // 256 threads, 4 blocks/SM
__global__ void myKernel() {
// Compiler will limit registers to achieve this
}
Occupancy vs. Performance
Not Always Correlated:
Scenario | High Occupancy | Performance
----------------------|----------------|------------
Memory-bound kernel | Helps | Improves
Compute-bound kernel | May not help | Depends
High ILP | Less important | Good anyway
Low latency needed | Very important | Critical
When Low Occupancy Is OK:
- Kernel is compute-bound
- High instruction-level parallelism (ILP)
- Data fits in cache
- Register usage enables optimizations
Occupancy Guidelines
Occupancy | Interpretation
----------|----------------------------
>75% | Good for memory-bound
50-75% | Usually acceptable
25-50% | May leave performance on table
<25% | Likely suboptimal
Balance With:
Higher occupancy trades:
- Registers (more spills)
- Shared memory (less per block)
- Block flexibility
Lower occupancy allows:
- More registers (faster compute)
- More shared memory
- Compiler optimization
GPU occupancy is one metric among many for kernel optimization — while important for memory-bound workloads, blindly maximizing occupancy without understanding the kernel's characteristics can actually hurt performance.
Related Topics
Explore 500+ Semiconductor & AI Topics
From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.