Warp-Level Primitives are the low-level CUDA intrinsics that let threads within a 32-thread warp communicate and coordinate directly, without using shared memory. They include shuffle operations (__shfl_sync, __shfl_down_sync, __shfl_up_sync, __shfl_xor_sync) that exchange data between lanes at register speed (2-10× faster than shared memory), ballot operations (__ballot_sync) that collect predicate results into a bitmask, and vote operations (__any_sync, __all_sync) that enable warp-wide decisions. Reductions built on these primitives achieve 500-1000 GB/s effective bandwidth and 2-5× speedup over shared-memory implementations, making warp primitives essential for high-performance GPU kernels where eliminating shared-memory traffic and synchronization overhead is critical for reaching 60-90% of theoretical peak performance.
Shuffle Operations:
- Shuffle Sync: __shfl_sync(mask, var, srcLane); broadcasts value from srcLane to all active lanes; mask specifies participating threads; 2-10× faster than shared memory
- Shuffle Down: __shfl_down_sync(mask, var, delta); shifts data down by delta lanes; lane i receives from lane i+delta; optimal for tree reductions
- Shuffle Up: __shfl_up_sync(mask, var, delta); shifts data up by delta lanes; lane i receives from lane i-delta; useful for prefix sums
- Shuffle XOR: __shfl_xor_sync(mask, var, laneMask); butterfly exchange; lane i exchanges with lane i^laneMask; optimal for FFT, bitonic sort
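The four variants above can be sketched in one device function; a minimal illustration, with the lane index assumed to be `threadIdx.x & 31`:

```cuda
// Sketch of the four shuffle variants; lane IDs run 0..31 within the warp.
__device__ int shuffle_variants(int lane) {
    const unsigned mask = 0xffffffff;          // all 32 lanes participate
    int v = lane;                              // each lane starts with its own ID
    int bcast = __shfl_sync(mask, v, 0);       // every lane reads lane 0's value
    int down  = __shfl_down_sync(mask, v, 4);  // lane i reads lane i+4 (top lanes keep their value)
    int up    = __shfl_up_sync(mask, v, 4);    // lane i reads lane i-4 (bottom lanes keep their value)
    int bfly  = __shfl_xor_sync(mask, v, 1);   // lane i exchanges with lane i^1
    return bcast + down + up + bfly;
}
```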
Ballot and Vote Operations:
- Ballot: __ballot_sync(mask, predicate); returns 32-bit bitmask where bit i set if thread i's predicate is true; 10-100× faster than shared memory for collecting boolean results
- Any: __any_sync(mask, predicate); returns true if any active thread's predicate is true; early exit optimization; convergence detection
- All: __all_sync(mask, predicate); returns true if every active thread's predicate is true; validation, consistency checks
- Match: __match_any_sync(mask, value), __match_all_sync(mask, value); finds threads with matching values; grouping, partitioning
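A short sketch of ballot and vote in practice; the function names here are illustrative, not library APIs:

```cuda
// Count how many lanes satisfy a predicate, without shared memory:
// __ballot_sync packs the per-lane results into a bitmask, __popc counts bits.
__device__ int warp_count_positive(float x) {
    unsigned bits = __ballot_sync(0xffffffff, x > 0.0f); // bit i set if lane i's x > 0
    return __popc(bits);                                 // population count of the mask
}

// Warp-wide convergence check: true iff every lane's error is under tolerance.
__device__ bool warp_converged(float err, float tol) {
    return __all_sync(0xffffffff, err < tol);
}
```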
Warp Reduction Pattern:
- Algorithm: use __shfl_down_sync() in loop; each iteration halves active threads; log2(32) = 5 iterations; no shared memory needed
- Code Pattern: for (int offset = 16; offset > 0; offset /= 2) { val += __shfl_down_sync(0xffffffff, val, offset); } — result ends up in lane 0
- Performance: 500-1000 GB/s effective bandwidth; 2-5× faster than shared memory reduction; no synchronization overhead
- Use Cases: sum, max, min, product across warp; building block for block-level and grid-level reductions
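The code pattern above, wrapped as a reusable device function (a minimal sketch; only lane 0's return value is the full sum):

```cuda
// Warp sum reduction: 5 butterfly steps halve the active span each time.
// After the loop, lane 0 holds the sum of all 32 lanes; other lanes hold partials.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // valid in lane 0
}

// Max variant: swap the operator, keep the same structure.
__device__ float warp_reduce_max(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val = fmaxf(val, __shfl_down_sync(0xffffffff, val, offset));
    return val;  // valid in lane 0
}
```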
Warp Prefix Sum:
- Inclusive Scan: use __shfl_up_sync() in loop; each iteration doubles scan distance; log2(32) = 5 iterations; 400-800 GB/s
- Exclusive Scan: inclusive scan + shift; subtract own value; 400-800 GB/s; useful for stream compaction and computing per-thread output offsets
- Code Pattern: for (int offset = 1; offset < 32; offset *= 2) { int temp = __shfl_up_sync(0xffffffff, val, offset); if (lane >= offset) val += temp; }
- Applications: histogram, radix sort, stream compaction; 30-60% faster than shared memory implementations
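The scan pattern above as device functions; a sketch, assuming the lane index is `threadIdx.x & 31`:

```cuda
// Inclusive warp scan: each step doubles the scan distance (1, 2, 4, 8, 16).
__device__ int warp_inclusive_scan(int val) {
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset *= 2) {
        int n = __shfl_up_sync(0xffffffff, val, offset);
        if (lane >= offset) val += n;   // lanes below `offset` keep their value
    }
    return val;  // lane i holds the sum of lanes 0..i
}

// Exclusive scan: inclusive scan minus the lane's own contribution.
__device__ int warp_exclusive_scan(int val) {
    return warp_inclusive_scan(val) - val;
}
```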
Synchronization Mask:
- Full Mask: 0xffffffff; all 32 threads participate; most common usage; assumes no divergence
- Partial Mask: specify subset of threads; handles divergence; __activemask() returns currently active threads
- Convergence: warp primitives require convergent execution; divergent warps may have undefined behavior; use __syncwarp() to reconverge
- Best Practice: use 0xffffffff for convergent code; use __activemask() for divergent code; verify with profiler
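The mask rules above in miniature; a sketch showing __activemask() inside a divergent branch and __syncwarp() before the next full-warp primitive:

```cuda
// Inside a divergent branch, pass __activemask() -- not 0xffffffff -- so the
// mask names only the lanes that actually took the branch.
__device__ int divergence_safe(int x) {
    if (x > 0) {
        unsigned active = __activemask();          // lanes currently in this branch
        int any_big = __any_sync(active, x > 100); // vote restricted to those lanes
        x += any_big;
    }
    __syncwarp();                                  // reconverge the full warp
    return __shfl_sync(0xffffffff, x, 0);          // safe: warp is convergent again
}
```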
Warp-Level Atomics:
- Atomic Add: atomicAdd_block() for block-scope atomics; faster than global atomics; 10-100× speedup for high contention
- Warp Aggregation: reduce within warp first, then single atomic; reduces atomic contention by 32×; 5-20× faster than per-thread atomics
- Pattern: warp reduction → lane 0 performs atomic; optimal for histograms, counters; 300-600 GB/s
- Use Cases: histogram, binning, counting; 40-70% faster than global atomics
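The warp-aggregation pattern above can be sketched as (illustrative function name; reduces 32 atomics to 1 per warp):

```cuda
// Warp-aggregated atomic: reduce within the warp via shuffles, then lane 0
// commits a single atomicAdd -- one atomic per warp instead of one per thread.
__device__ void warp_aggregated_add(float *counter, float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if ((threadIdx.x & 31) == 0)
        atomicAdd(counter, val);  // lane 0 holds the warp's partial sum
}
```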
Warp Divergence Handling:
- Active Mask: __activemask() returns bitmask of active threads; changes with divergence; use for correct shuffle operations
- Reconvergence: __syncwarp(mask) forces reconvergence; ensures all threads reach same point; necessary after divergent branches
- Ballot for Divergence: __ballot_sync() identifies divergent paths; enables warp specialization; 20-40% speedup for heterogeneous workloads
- Best Practice: minimize divergence; use ballot to handle when unavoidable; profile warp efficiency (target >90%)
Performance Characteristics:
- Latency: shuffle operations 1-2 cycles; ballot/vote 1 cycle; shared memory 20-30 cycles; 10-20× latency advantage
- Bandwidth: 500-1000 GB/s effective for reductions; 400-800 GB/s for scans; 2-5× faster than shared memory
- Occupancy: no shared memory usage; enables higher occupancy; more active warps; better latency hiding
- Scalability: performance independent of warp count; shared memory has bank conflicts; warp primitives scale linearly
Common Patterns:
- Warp Reduction: sum, max, min across warp; 5 shuffle operations; 2-5× faster than shared memory; 10-20 lines of code
- Warp Scan: prefix sum across warp; 5 shuffle operations; 30-60% faster than shared memory; 15-25 lines of code
- Warp Broadcast: distribute value to all threads; single shuffle; 10-100× faster than shared memory; 1 line of code
- Warp Vote: collect boolean results; single ballot; 10-100× faster than shared memory; 1 line of code
Integration with Block-Level Operations:
- Hierarchical Reduction: warp reduction → shared memory → warp reduction; optimal at each level; 30-60% faster than flat reduction
- Two-Level Scan: warp scan → block scan → warp scan; 30-60% faster than pure shared memory; 400-800 GB/s
- Hybrid Approach: warp primitives for intra-warp, shared memory for inter-warp; best of both worlds; 20-40% improvement
- Best Practice: always use warp primitives within warp; shared memory only for inter-warp communication
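A minimal sketch of the hierarchical pattern, assuming blockDim.x is a multiple of 32 (so at most 32 warps per block):

```cuda
// Two-level block reduction: shuffles intra-warp, shared memory only to pass
// one partial per warp, then the first warp reduces those partials.
__device__ float block_reduce_sum(float val) {
    __shared__ float partials[32];                 // one slot per warp
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    for (int offset = 16; offset > 0; offset /= 2) // level 1: warp reduction
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (lane == 0) partials[warp] = val;           // publish warp sum
    __syncthreads();

    int nwarps = blockDim.x >> 5;                  // level 2: first warp reduces partials
    val = (threadIdx.x < nwarps) ? partials[lane] : 0.0f;
    if (warp == 0)
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // valid in thread 0
}
```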
Advanced Techniques:
- Warp Specialization: different warps perform different tasks; use ballot to coordinate; 20-40% speedup for heterogeneous workloads
- Segmented Operations: operate on multiple segments within warp; use ballot to identify boundaries; 30-60% faster than naive approach
- Warp-Level Sorting: bitonic sort with shuffle; 40-70% faster than shared memory sort; 100-300 GB/s
- Warp-Level Hash: parallel hash computation; shuffle for conflict resolution; 2-5× faster than shared memory
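A compact sketch of the warp-level bitonic sort mentioned above, using __shfl_xor_sync for the butterfly exchanges (one value per lane; 32 values sorted across the warp):

```cuda
// Warp bitonic sort: at stage k the sort direction alternates per k-sized
// group; at substage j, lane i exchanges with lane i^j and keeps min or max.
__device__ int warp_bitonic_sort(int val) {
    int lane = threadIdx.x & 31;
    for (int k = 2; k <= 32; k <<= 1) {
        for (int j = k >> 1; j > 0; j >>= 1) {
            int partner   = __shfl_xor_sync(0xffffffff, val, j);
            bool ascending = ((lane & k) == 0);
            bool keep_min  = (((lane & j) == 0) == ascending);
            val = keep_min ? min(val, partner) : max(val, partner);
        }
    }
    return val;  // lanes 0..31 hold the values in ascending order
}
```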
Debugging Warp Primitives:
- Nsight Compute: shows warp efficiency, divergence; identifies synchronization issues; guides optimization
- Warp Efficiency Metric: percentage of active threads; target >90%; low efficiency indicates divergence
- Assertions: use assert() to verify mask correctness; check lane IDs; disabled in release builds
- CUDA_LAUNCH_BLOCKING=1: forces synchronous kernel launches so errors surface at the offending launch; use only while debugging, as it hurts performance
Compute Capability Requirements:
- Shuffle: compute capability 3.0+; widely available; A100, V100, T4 all support
- Ballot/Vote: compute capability 3.0+; standard on modern GPUs
- Match: compute capability 7.0+; Volta and newer; A100, H100 support
- Sync Suffix: _sync variants introduced in CUDA 9.0; the legacy non-_sync variants are deprecated and unsafe under the independent thread scheduling of compute capability 7.0+; always pass an explicit mask for correctness
Performance Optimization:
- Minimize Divergence: ensure all threads in warp take same path; use ballot to handle unavoidable divergence; target >90% warp efficiency
- Use Full Mask: 0xffffffff when possible; avoids overhead of computing active mask; 5-10% faster
- Unroll Loops: unroll shuffle loops; reduces loop overhead; 10-20% speedup; compiler often does automatically
- Combine Operations: fuse multiple warp operations; reduces instruction count; 10-30% improvement
Common Use Cases:
- Reduction: sum, max, min, product; 500-1000 GB/s; 2-5× faster than shared memory; critical building block
- Prefix Sum: inclusive/exclusive scan; 400-800 GB/s; 30-60% faster than shared memory; used in compaction, sorting
- Broadcast: distribute value to all threads; 10-100× faster than shared memory; 1 line of code
- Voting: collect boolean results; 10-100× faster than shared memory; early exit, convergence detection
- Histogram: warp aggregation + atomics; 40-70% faster than per-thread atomics; 300-600 GB/s
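For the histogram case, __match_any_sync (compute capability 7.0+) lets lanes that target the same bin elect a leader, so each group issues one atomic; a sketch with an illustrative function name:

```cuda
// Warp-aggregated histogram increment: lanes with the same bin form a group,
// the lowest lane in the group commits the whole group's count in one atomic.
__device__ void histogram_add(int *bins, int bin) {
    unsigned peers = __match_any_sync(__activemask(), bin); // lanes with equal bin
    int leader = __ffs(peers) - 1;                          // lowest lane in the group
    int count  = __popc(peers);                             // group size
    if ((threadIdx.x & 31) == leader)
        atomicAdd(&bins[bin], count);                       // one atomic per group
}
```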
Comparison with Shared Memory:
- Latency: warp primitives 1-2 cycles vs shared memory 20-30 cycles; 10-20× advantage
- Bandwidth: 500-1000 GB/s vs 300-600 GB/s for shared memory; 2-3× advantage
- Occupancy: no shared memory usage; enables higher occupancy; more active warps
- Complexity: simpler code; no bank conflict concerns; easier to optimize
Best Practices:
- Use Warp Primitives: prefer shuffle over shared memory for intra-warp communication; 2-10× faster
- Full Mask: use 0xffffffff for convergent code; avoids overhead; 5-10% faster
- Hierarchical: warp primitives for intra-warp, shared memory for inter-warp; optimal at each level
- Profile: use Nsight Compute to verify warp efficiency; target >90%; measure achieved bandwidth
- Minimize Divergence: ensure convergent execution; use ballot to handle unavoidable divergence
Performance Targets:
- Reduction: 500-1000 GB/s; 60-80% of peak memory bandwidth; 2-5× faster than shared memory
- Scan: 400-800 GB/s; 50-70% of peak bandwidth; 30-60% faster than shared memory
- Warp Efficiency: >90%; indicates minimal divergence; optimal resource utilization
- Occupancy: 50-100%; warp primitives don't use shared memory; enables higher occupancy
Real-World Examples:
- CUB Library: uses warp primitives extensively; 500-1000 GB/s reductions; 400-800 GB/s scans; production-quality implementations
- Thrust: warp-level optimizations in algorithms; 2-5× faster than naive implementations; widely used
- cuDNN: warp primitives in batch normalization, layer normalization; 20-40% speedup; critical for training
- Custom Kernels: histogram, reduction, scan; 40-70% faster with warp primitives; 300-1000 GB/s
Warp-Level Primitives are the key to maximum GPU performance. By enabling direct register-to-register communication between threads at 2-10× the speed of shared memory and eliminating synchronization overhead, they achieve 500-1000 GB/s effective bandwidth and 60-90% of theoretical peak performance. They are essential for high-performance GPU kernels where every cycle counts, and the difference between good and great performance often comes down to using warp primitives instead of shared memory for intra-warp operations.