Warp-Level Primitives are CUDA intrinsics that allow threads within a warp to directly exchange data and perform collective operations without shared memory — enabling extremely efficient intra-warp communication at register speed.
Why Warp-Level Operations?
- Warp: 32 threads executing in SIMT lockstep.
- Traditional communication: Thread A → shared memory → Thread B (a store plus a load, with synchronization in between).
- Warp shuffle: Thread A → direct register transfer → Thread B (no memory traffic, a single instruction); both paths are sketched below.
- Typically 4-8x faster than shared memory for intra-warp patterns.
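To make the contrast concrete, here is a minimal sketch of both paths in a single kernel. It assumes a one-warp launch (<<<1, 32>>>); the kernel name exchangeDemo and its variables are illustrative, not part of any API.

```cuda
__global__ void exchangeDemo(float* out) {
    __shared__ float smem[32];          // staging buffer for the traditional path
    int lane = threadIdx.x & 31;
    float val = (float)lane;

    // Path 1: through shared memory (store, synchronize, load)
    smem[lane] = val;
    __syncwarp();
    float via_smem = smem[(lane + 1) & 31];

    // Path 2: warp shuffle (one instruction, register to register)
    float via_shfl = __shfl_sync(0xffffffff, val, (lane + 1) & 31);

    out[lane] = via_smem + via_shfl;    // both read the next lane's value
}
```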
Warp Shuffle Intrinsics
```cuda
// __shfl_sync: broadcast one lane's value across the mask
float bcast = __shfl_sync(0xffffffff, src_val, src_lane);
// Every active lane receives src_val as held by lane src_lane

// __shfl_up_sync: shift values toward higher lanes by delta
float shifted = __shfl_up_sync(mask, val, delta);
// Lane i receives the value from lane i - delta (lanes below delta keep their own value)

// __shfl_xor_sync: butterfly exchange, the building block of warp reductions
float swapped = __shfl_xor_sync(mask, val, lane_mask);
// Lane i exchanges values with lane i ^ lane_mask
```
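These intrinsics compose into classic warp algorithms. As an example, __shfl_up_sync implements a warp-wide inclusive prefix sum in five steps; this is a standard Kogge-Stone sketch, with the helper name warpInclusiveScan chosen here for illustration.

```cuda
__device__ float warpInclusiveScan(float x) {
    // Kogge-Stone scan: log2(32) = 5 steps for a full warp
    for (int delta = 1; delta < 32; delta *= 2) {
        float up = __shfl_up_sync(0xffffffff, x, delta);
        if ((threadIdx.x & 31) >= delta)   // lanes below delta have no source lane
            x += up;
    }
    return x;  // lane i now holds the sum of lanes 0..i
}
```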
Warp Reduction (Classic Pattern)
```cuda
float sum = val;
for (int offset = 16; offset > 0; offset /= 2)
    sum += __shfl_xor_sync(0xffffffff, sum, offset);
// After loop: sum contains total across all 32 lanes (in all lanes)
```
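A sketch of how this reduction is typically used end to end: each warp collapses its 32 values, then one lane per warp publishes the partial sum. It assumes blockDim.x is a multiple of 32 so every lane participates in the shuffle; the kernel name gridSum is illustrative.

```cuda
__global__ void gridSum(const float* in, float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = (i < n) ? in[i] : 0.0f;    // out-of-range lanes contribute 0

    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_xor_sync(0xffffffff, sum, offset);

    if ((threadIdx.x & 31) == 0)           // one atomic per warp instead of 32
        atomicAdd(result, sum);
}
```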
Warp Vote Functions
```cuda
bool all_true = __all_sync(mask, condition); // True if all active lanes satisfy condition
bool any_true = __any_sync(mask, condition); // True if any active lane satisfies condition
uint32_t ballot = __ballot_sync(mask, pred); // 32-bit mask of which lanes satisfy pred
```
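A small hedged example of the vote intrinsics in combination; countAbove and its parameters are illustrative names, not from the original text. __popc counts the set bits in the ballot, turning a lane mask into a lane count.

```cuda
__device__ int countAbove(float x, float threshold) {
    // Assumes all 32 lanes of the warp call this together
    bool pred = (x > threshold);
    // Fast path: if no lane qualifies, return without further work (no divergence)
    if (!__any_sync(0xffffffff, pred))
        return 0;
    uint32_t ballot = __ballot_sync(0xffffffff, pred);
    return __popc(ballot);  // number of qualifying lanes, same value in every lane
}
```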
Cooperative Groups (CUDA 9.0+)
```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h> // cg::reduce itself requires CUDA 11.0+
namespace cg = cooperative_groups;

auto block = cg::this_thread_block();
auto warp  = cg::tiled_partition<32>(block);
float val  = cg::reduce(warp, input, cg::plus<float>()); // warp-wide sum, returned in every lane
```
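Putting it together, a hedged sketch of a dot-product partial sum built on cg::reduce (assuming the includes above); the kernel name dotPartial is illustrative.

```cuda
__global__ void dotPartial(const float* a, const float* b, float* result, int n) {
    namespace cg = cooperative_groups;
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float prod = (i < n) ? a[i] * b[i] : 0.0f;  // out-of-range lanes contribute 0

    float warp_sum = cg::reduce(warp, prod, cg::plus<float>());  // sum in every lane

    if (warp.thread_rank() == 0)
        atomicAdd(result, warp_sum);            // one atomic per warp
}
```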
Applications
- Warp scan/reduce: Building blocks for block-wide and grid-wide reductions.
- Histogram: Privatized per-warp histograms merged via shuffle (a warp-aggregated variant is sketched after this list).
- Sort: Warp-level radix sort without shared memory.
- Attention: Inner products in FlashAttention use warp-level reduction.
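For the histogram case, one concrete warp-aggregated realization uses __match_any_sync (compute capability 7.0+) rather than shuffle: lanes that hit the same bin elect a leader, which issues a single atomic for the whole group. This is a hedged sketch; histWarpAggregated and its parameters are illustrative names, not necessarily the exact scheme meant above.

```cuda
__global__ void histWarpAggregated(const int* bins, int n, unsigned int* hist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int bin = bins[i];

    unsigned int active = __activemask();                 // lanes that passed the bounds check
    unsigned int peers  = __match_any_sync(active, bin);  // lanes holding the same bin
    int leader = __ffs(peers) - 1;                        // lowest peer lane

    if ((threadIdx.x & 31) == leader)
        atomicAdd(&hist[bin], __popc(peers));             // one atomic per distinct bin per warp
}
```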
Warp-level primitives are among the highest-performance building blocks in GPU programming; replacing shared memory with register-to-register shuffles for intra-warp communication is often the final optimization that pushes a latency-bound kernel to peak hardware throughput.