Home Knowledge Base GPU occupancy

GPU occupancy measures the ratio of active warps to maximum possible warps on a streaming multiprocessor (SM) — indicating how well a kernel utilizes GPU parallel resources, with higher occupancy generally (but not always) correlating with better performance for memory-bound workloads.

What Is Occupancy?

Why Occupancy Matters

Occupancy Calculation

Factors Limiting Occupancy:

Resource          | Limit Per SM      | Impact
------------------|-------------------|------------------
Registers         | 65,536 (typical)  | More regs → fewer threads
Shared memory     | 48-164 KB         | More shmem → fewer blocks
Block size        | 1024 threads max  | Limits parallelism
Warp slots        | 64 warps (2048 threads)| Hardware maximum

Example Calculation:

GPU: A100 (64 warps max per SM)
Kernel uses:
- 64 registers per thread
- 256 threads per block
- 8 KB shared memory per block

Registers: 65,536 / (64 × 256) = 4 blocks
Shared memory: 164 KB / 8 KB = 20 blocks
Thread limit: 2048 / 256 = 8 blocks

Bottleneck: Registers (4 blocks)
Active warps: 4 × (256/32) = 32 warps
Occupancy: 32/64 = 50%

Checking Occupancy

NVIDIA Tools:

# Nsight Compute profiling
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active 
    ./my_cuda_program

# CUDA Occupancy Calculator (spreadsheet tool)
# Also available as API

CUDA API:

int blockSize = 256;
int minGridSize;
int maxBlockSize;

cudaOccupancyMaxPotentialBlockSize(
    &minGridSize, &maxBlockSize, 
    myKernel, 0, 0
);

// Use maxBlockSize for kernel launch

PyTorch Kernel Info:

import torch
from torch.utils.benchmark import Timer

# Profile to see occupancy
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
) as prof:
    result = model(input)

print(prof.key_averages().table())

Improving Occupancy

Strategies:

Issue              | Solution
-------------------|----------------------------------
Too many registers | Use -maxrregcount compiler flag
                   | Spill to local memory (slower)
                   | Reduce kernel complexity
                   |
Too much shared mem| Reduce shared memory usage
                   | Use global memory (slower)
                   | Split kernel
                   |
Block size too small| Increase threads per block
                   | Aim for multiple of 32
                   |
Block size too large| Reduce to allow more blocks

Register Limiting:

// Limit registers per thread
__launch_bounds__(256, 4)  // 256 threads, 4 blocks/SM
__global__ void myKernel() {
    // Compiler will limit registers to achieve this
}

Occupancy vs. Performance

Not Always Correlated:

Scenario              | High Occupancy | Performance
----------------------|----------------|------------
Memory-bound kernel   | Helps          | Improves
Compute-bound kernel  | May not help   | Depends
High ILP              | Less important | Good anyway
Low latency needed    | Very important | Critical

When Low Occupancy Is OK:

- Kernel is compute-bound
- High instruction-level parallelism (ILP)
- Data fits in cache
- Register usage enables optimizations

Occupancy Guidelines

Occupancy | Interpretation
----------|----------------------------
>75%      | Good for memory-bound
50-75%    | Usually acceptable
25-50%    | May leave performance on table
<25%      | Likely suboptimal

Balance With:

Higher occupancy trades:
- Registers (more spills)
- Shared memory (less per block)
- Block flexibility

Lower occupancy allows:
- More registers (faster compute)
- More shared memory
- Compiler optimization

GPU occupancy is one metric among many for kernel optimization — while important for memory-bound workloads, blindly maximizing occupancy without understanding the kernel's characteristics can actually hurt performance.

occupancyutilizationefficiencywarpssmregistersgpu

Explore 500+ Semiconductor & AI Topics

From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.