Warp-Level Primitives are CUDA intrinsics that allow threads within a warp to directly exchange data and perform collective operations without shared memory — enabling extremely efficient intra-warp communication at register speed.
Why Warp-Level Operations?
- Warp: 32 threads executing in SIMT lockstep.
- Traditional communication: Thread A → shared memory → Thread B (a store plus a load, with synchronization in between).
- Warp shuffle: Thread A → direct register transfer → Thread B (no memory traffic, a single instruction); both paths are sketched below.
- Typically 4-8x faster than shared memory for intra-warp patterns.
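To make the contrast concrete, here is a minimal sketch of both paths in a single kernel. It assumes a one-warp launch (<<<1, 32>>>); the kernel name exchangeDemo and its variables are illustrative, not part of any API.

```cuda
__global__ void exchangeDemo(float* out) {
    __shared__ float smem[32];          // staging buffer for the traditional path
    int lane = threadIdx.x & 31;
    float val = (float)lane;

    // Path 1: through shared memory (store, synchronize, load)
    smem[lane] = val;
    __syncwarp();
    float via_smem = smem[(lane + 1) & 31];

    // Path 2: warp shuffle (one instruction, register to register)
    float via_shfl = __shfl_sync(0xffffffff, val, (lane + 1) & 31);

    out[lane] = via_smem + via_shfl;    // both read the next lane's value
}
```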
Warp Shuffle Intrinsics
```cuda
// __shfl_sync: broadcast one lane's value across the mask
float bcast = __shfl_sync(0xffffffff, src_val, src_lane);
// Every active lane receives src_val as held by lane src_lane

// __shfl_up_sync: shift values toward higher lanes by delta
float shifted = __shfl_up_sync(mask, val, delta);
// Lane i receives the value from lane i - delta (lanes below delta keep their own value)

// __shfl_xor_sync: butterfly exchange, the building block of warp reductions
float swapped = __shfl_xor_sync(mask, val, lane_mask);
// Lane i exchanges values with lane i ^ lane_mask
```
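These intrinsics compose into classic warp algorithms. As an example, __shfl_up_sync implements a warp-wide inclusive prefix sum in five steps; this is a standard Kogge-Stone sketch, with the helper name warpInclusiveScan chosen here for illustration.

```cuda
__device__ float warpInclusiveScan(float x) {
    // Kogge-Stone scan: log2(32) = 5 steps for a full warp
    for (int delta = 1; delta < 32; delta *= 2) {
        float up = __shfl_up_sync(0xffffffff, x, delta);
        if ((threadIdx.x & 31) >= delta)   // lanes below delta have no source lane
            x += up;
    }
    return x;  // lane i now holds the sum of lanes 0..i
}
```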
Warp Reduction (Classic Pattern)
```cuda
float sum = val;
for (int offset = 16; offset > 0; offset /= 2)
    sum += __shfl_xor_sync(0xffffffff, sum, offset);
// After loop: sum contains total across all 32 lanes (in all lanes)
```
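A sketch of how this reduction is typically used end to end: each warp collapses its 32 values, then one lane per warp publishes the partial sum. It assumes blockDim.x is a multiple of 32 so every lane participates in the shuffle; the kernel name gridSum is illustrative.

```cuda
__global__ void gridSum(const float* in, float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = (i < n) ? in[i] : 0.0f;    // out-of-range lanes contribute 0

    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_xor_sync(0xffffffff, sum, offset);

    if ((threadIdx.x & 31) == 0)           // one atomic per warp instead of 32
        atomicAdd(result, sum);
}
```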
Warp Vote Functions
```cuda
bool all_true = __all_sync(mask, condition); // True if all active lanes satisfy condition
bool any_true = __any_sync(mask, condition); // True if any active lane satisfies condition
uint32_t ballot = __ballot_sync(mask, pred); // 32-bit mask of which lanes satisfy pred
```
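A small hedged example of the vote intrinsics in combination; countAbove and its parameters are illustrative names, not from the original text. __popc counts the set bits in the ballot, turning a lane mask into a lane count.

```cuda
__device__ int countAbove(float x, float threshold) {
    // Assumes all 32 lanes of the warp call this together
    bool pred = (x > threshold);
    // Fast path: if no lane qualifies, return without further work (no divergence)
    if (!__any_sync(0xffffffff, pred))
        return 0;
    uint32_t ballot = __ballot_sync(0xffffffff, pred);
    return __popc(ballot);  // number of qualifying lanes, same value in every lane
}
```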
Cooperative Groups (CUDA 9.0+)
```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h> // cg::reduce itself requires CUDA 11.0+
namespace cg = cooperative_groups;

auto block = cg::this_thread_block();
auto warp  = cg::tiled_partition<32>(block);
float val  = cg::reduce(warp, input, cg::plus<float>()); // warp-wide sum, returned in every lane
```
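Putting it together, a hedged sketch of a dot-product partial sum built on cg::reduce (assuming the includes above); the kernel name dotPartial is illustrative.

```cuda
__global__ void dotPartial(const float* a, const float* b, float* result, int n) {
    namespace cg = cooperative_groups;
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float prod = (i < n) ? a[i] * b[i] : 0.0f;  // out-of-range lanes contribute 0

    float warp_sum = cg::reduce(warp, prod, cg::plus<float>());  // sum in every lane

    if (warp.thread_rank() == 0)
        atomicAdd(result, warp_sum);            // one atomic per warp
}
```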
Applications
- Warp scan/reduce: Building blocks for block-wide and grid-wide reductions.
- Histogram: Privatized per-warp histograms merged via shuffle (a warp-aggregated variant is sketched after this list).
- Sort: Warp-level radix sort without shared memory.
- Attention: Inner products in FlashAttention use warp-level reduction.
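For the histogram case, one concrete warp-aggregated realization uses __match_any_sync (compute capability 7.0+) rather than shuffle: lanes that hit the same bin elect a leader, which issues a single atomic for the whole group. This is a hedged sketch; histWarpAggregated and its parameters are illustrative names, not necessarily the exact scheme meant above.

```cuda
__global__ void histWarpAggregated(const int* bins, int n, unsigned int* hist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int bin = bins[i];

    unsigned int active = __activemask();                 // lanes that passed the bounds check
    unsigned int peers  = __match_any_sync(active, bin);  // lanes holding the same bin
    int leader = __ffs(peers) - 1;                        // lowest peer lane

    if ((threadIdx.x & 31) == leader)
        atomicAdd(&hist[bin], __popc(peers));             // one atomic per distinct bin per warp
}
```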
Warp-level primitives are among the highest-performance building blocks in GPU programming; replacing shared memory with register-to-register shuffles for intra-warp communication is often the final optimization that pushes a latency-bound kernel to peak hardware throughput.