GPU Warp Scheduling and Divergence covers the hardware mechanism by which a GPU streaming multiprocessor (SM) selects warps of 32 threads for execution each cycle, and how control-flow divergence is handled when threads within a warp take different branch paths. Understanding warp scheduling is essential for writing high-performance CUDA and GPU compute code, because divergence directly reduces throughput by serializing execution paths.
Warp Execution Model:
- Warp Definition: a warp is the fundamental scheduling unit on NVIDIA GPUs, consisting of 32 threads that execute in lockstep under the Single Instruction Multiple Thread (SIMT) model
- Instruction Issue: each cycle the warp scheduler selects an eligible warp and issues one instruction to all 32 threads simultaneously — a single SM typically has 2-4 warp schedulers operating in parallel
- Occupancy: the ratio of active warps to maximum supported warps per SM — higher occupancy helps hide memory latency by allowing the scheduler to switch between warps while others wait for data
- Eligible Warps: a warp becomes eligible for scheduling when its next instruction's operands are ready and execution resources are available — stalls occur when no warp is eligible
Thread Divergence Mechanics:
- Branch Divergence: when threads in a warp encounter a conditional branch (if/else) and take different paths, the warp must serialize execution — first executing the taken path while masking inactive threads, then executing the not-taken path
- Active Mask: a 32-bit mask tracks which threads are active for each instruction — masked-off threads don't write results but still consume a scheduling slot
- Divergence Penalty: in the worst case a warp with 32-way divergence executes at 1/32 throughput — each unique path executes sequentially while 31 threads sit idle
- Reconvergence Point: after divergent branches complete, threads reconverge at the immediate post-dominator of the branch — the hardware stack tracks reconvergence points automatically
Warp Scheduling Policies:
- Greedy-Then-Oldest (GTO): favors issuing from the same warp until it stalls, then switches to the oldest ready warp — reduces instruction cache pressure and improves data locality
- Loose Round-Robin (LRR): cycles through warps in a roughly round-robin fashion — provides fairness but may increase cache thrashing compared to GTO
- Two-Level Scheduling: partitions warps into fetch groups and applies round-robin between groups while using GTO within each group — balances latency hiding with cache locality
- Criticality-Aware: prioritizes warps on the critical path of barrier synchronization to reduce overall execution time — prevents stragglers from delaying __syncthreads() barriers
Minimizing Divergence in Practice:
- Data-Dependent Branching: reorganize data so that threads within a warp follow the same path — sorting input data by branch condition or using warp-level voting (__ballot_sync) to detect uniform branches
- Predication: for short branches (a few instructions), the compiler replaces branches with predicated instructions that execute both paths but conditionally commit results — this avoids branch and reconvergence overhead, though instructions from both paths still occupy issue slots
- Warp-Level Primitives: __shfl_sync, __ballot_sync, and __match_any_sync enable threads to communicate without shared memory, often eliminating branches entirely
- Branch-Free Algorithms: replace conditional logic with arithmetic (e.g., using min/max instead of if/else) to maintain full warp utilization
Performance Impact and Profiling:
- Branch Efficiency: NVIDIA Nsight Compute reports branch efficiency as the ratio of non-divergent branches to total branches — target >90% for compute-bound kernels
- Warp Stall Reasons: profilers categorize stalls as memory dependency, execution dependency, synchronization, or instruction fetch — guides optimization priority
- Thread Utilization: average active threads per warp instruction indicates divergence severity — ideal is 32.0, values below 24 suggest significant divergence
- Occupancy vs. Performance: higher occupancy doesn't always improve performance — sometimes fewer warps with better cache utilization outperform high-occupancy configurations
Modern architectures (Volta and later) introduce independent thread scheduling, where each thread has its own program counter and call stack, enabling fine-grained interleaving of divergent paths and supporting thread-level synchronization primitives that weren't possible under the older lockstep model.