GPU Kernel Fusion is the optimization technique of combining multiple sequential GPU kernel launches into a single kernel: it eliminates kernel launch overhead, reduces global memory round-trips for intermediate results, and increases arithmetic intensity by keeping data in registers or shared memory across the combined operations.
Motivation:
- Launch Overhead: each CUDA kernel launch incurs 3-10 µs of CPU-side overhead (driver calls, command buffer construction, GPU scheduling); for small kernels executing in 5-20 µs, launch overhead constitutes 15-67% of total time
- Memory Traffic: unfused kernels write intermediate results to global memory and the next kernel reads them back; global memory bandwidth is 2-3 TB/s while register bandwidth is roughly 100× higher, so fusion keeps intermediates in registers and eliminates O(N) global memory round-trips per fused operation (see the sketch after this list)
- Occupancy Benefits: larger fused kernels contain more independent instructions per thread, enabling better instruction-level parallelism and reducing the occupancy required to hide memory latency
- Cache Locality: fused operations on the same data exploit L1/L2 cache residency; unfused kernels may evict cached data between launches, especially when multiple kernels compete for limited cache capacity
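To make the launch-overhead and memory-traffic arguments concrete, here is a minimal sketch contrasting an unfused eager-mode chain with a hand-fused Triton kernel for y = relu(a*x + b). The kernel and wrapper names are illustrative, and the sketch assumes a contiguous 1-D fp32 tensor; it is not meant as a production implementation.

```python
import torch
import triton
import triton.language as tl

# Unfused eager PyTorch: roughly three kernel launches, with the
# intermediates (a * x) and (a * x + b) round-tripping through global memory.
def scale_shift_relu_eager(x, a, b):
    return torch.relu(a * x + b)

# Fused Triton kernel: one launch; the intermediates live only in registers.
@triton.jit
def scale_shift_relu_kernel(x_ptr, out_ptr, a, b, n_elements,
                            BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.maximum(a * x + b, 0.0)  # scale, shift, ReLU without leaving registers
    tl.store(out_ptr + offsets, y, mask=mask)

def scale_shift_relu_fused(x, a, b):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    scale_shift_relu_kernel[grid](x, out, a, b, n, BLOCK_SIZE=1024)
    return out
```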
Fusion Categories:
- Element-wise Fusion: combining sequences of point-wise operations (e.g., bias add, scaling, and ReLU applied to the same tensor) so that each thread processes one element through the entire fused computation; the simplest and most common fusion type
- Reduction Fusion: fusing a computation with a subsequent reduction (e.g., a per-element loss followed by its sum or mean); the thread block performs the element-wise computation and reduces within shared memory in one kernel (see the first sketch after this list)
- Producer-Consumer Fusion: fusing a producer kernel with its consumer when the producer's output is consumed exactly once; for example, fusing a GEMM with the subsequent bias addition and activation function (see the second sketch after this list)
- Tiled Loop Fusion: fusing stencil or convolution operations that produce tiles consumed by subsequent operations; requires tile-size coordination between fused stages and halo region management
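A minimal reduction-fusion sketch, assuming each row fits in a single Triton block: the element-wise squared-error stage and the per-row sum happen in one kernel, with Triton handling the intra-block reduction (which may use shared memory under the hood). Kernel and function names are illustrative.

```python
import torch
import triton
import triton.language as tl

# Reduction fusion: each program computes (pred - target)^2 element-wise for
# one row, then reduces it to a scalar, all in a single kernel. Unfused, the
# squared errors would be written to global memory and re-read by a sum kernel.
@triton.jit
def fused_sq_err_rowsum_kernel(pred_ptr, tgt_ptr, out_ptr, n_cols,
                               BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    pred = tl.load(pred_ptr + row * n_cols + cols, mask=mask, other=0.0)
    tgt = tl.load(tgt_ptr + row * n_cols + cols, mask=mask, other=0.0)
    diff = pred - tgt                      # element-wise stage, kept in registers
    row_sum = tl.sum(diff * diff, axis=0)  # reduction stage, same kernel
    tl.store(out_ptr + row, row_sum)

def fused_row_sq_err(pred, tgt):
    n_rows, n_cols = pred.shape
    out = torch.empty(n_rows, device=pred.device, dtype=pred.dtype)
    block = triton.next_power_of_2(n_cols)  # assumes one block covers a row
    fused_sq_err_rowsum_kernel[(n_rows,)](pred, tgt, out, n_cols, BLOCK_SIZE=block)
    return out
```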
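And a producer-consumer sketch along the lines of the GEMM example above: a naive Triton GEMM whose output tile stays in registers while a bias-add + ReLU epilogue runs before the single store. It assumes row-major contiguous fp32 inputs and makes no attempt at the tiling tricks a tuned GEMM would use.

```python
import torch
import triton
import triton.language as tl

# Producer-consumer fusion: the GEMM "producer" keeps its output tile in
# registers, and the bias-add + ReLU "consumers" run as an epilogue before
# the single store to global memory.
@triton.jit
def gemm_bias_relu_kernel(a_ptr, b_ptr, bias_ptr, c_ptr, M, N, K,
                          BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                          BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k0 in range(0, K, BLOCK_K):
        rk = k0 + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + rm[:, None] * K + rk[None, :],
                    mask=(rm[:, None] < M) & (rk[None, :] < K), other=0.0)
        b = tl.load(b_ptr + rk[:, None] * N + rn[None, :],
                    mask=(rk[:, None] < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
    bias = tl.load(bias_ptr + rn, mask=rn < N, other=0.0)
    out = tl.maximum(acc + bias[None, :], 0.0)  # fused epilogue: bias + ReLU
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], out,
             mask=(rm[:, None] < M) & (rn[None, :] < N))

def gemm_bias_relu(a, b, bias):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    gemm_bias_relu_kernel[grid](a, b, bias, c, M, N, K,
                                BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```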
Fusion Compilers and Frameworks:
- TorchInductor (PyTorch 2.0): torch.compile() traces PyTorch operations and generates fused Triton kernels; automatically identifies fusible operation sequences and generates optimized GPU code without manual kernel writing (usage sketch after this list)
- XLA (TensorFlow/JAX): HLO (High-Level Optimizer) aggressively fuses element-wise operations, broadcasts, and reductions; produces large fused kernels that minimize memory traffic, jit-compiled for specific input shapes (see the JAX sketch after this list)
- Triton: Python-based GPU kernel language that makes fusion accessible; programmers write fused operations at a higher abstraction level than CUDA, with the Triton compiler handling tiling, memory coalescing, and register allocation
- NVIDIA TensorRT: inference optimizer that fuses convolutional layers, batch normalization, activation functions, and skip connections into single optimized kernels, yielding 2-5× inference speedup over unfused PyTorch execution
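For TorchInductor, usage is a one-line wrap; the function below is an illustrative stand-in, and whether the whole chain actually lands in one fused kernel depends on the PyTorch version and the shapes involved.

```python
import torch

def bias_gelu_scale(x, bias):
    # Three element-wise ops; in eager mode each is (roughly) its own launch.
    return torch.nn.functional.gelu(x + bias) * 0.5

# torch.compile traces the function and lets TorchInductor emit fused Triton
# code; a chain like this is a typical candidate for one fused kernel.
compiled = torch.compile(bias_gelu_scale)

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = compiled(x, bias)
```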
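The XLA path looks similar from JAX: jax.jit compiles the traced computation for the given input shapes, and XLA's fusion pass would be expected to collapse an element-wise chain plus reduction like this into very few kernels. The function is again an illustrative stand-in.

```python
import jax
import jax.numpy as jnp

def scaled_softplus_mean(x, scale):
    # Element-wise chain followed by a reduction; XLA fuses aggressively here.
    return jnp.mean(jax.nn.softplus(x * scale))

# jit-compiled, and re-compiled per input shape, as noted above
fast = jax.jit(scaled_softplus_mean)

x = jnp.ones((4096, 4096))
print(fast(x, 0.5))
```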
Fusion Limitations:
- Register Pressure: fused kernels use more registers per thread (all intermediate values live simultaneously); exceeding the register file capacity causes spilling to slow local memory, potentially negating fusion benefits
- Occupancy Reduction: higher register usage reduces the number of active warps per SM; for memory-bound computations, the occupancy reduction may outweigh the fusion benefit, so profiling determines the optimal fusion boundary
- Shape Dependencies: fusion decisions depend on tensor shapes; changing input dimensions may invalidate fusion strategies, so dynamic shape handling requires either re-compilation or conservative fusion decisions
- Debugging Complexity: fused kernels are harder to debug and profile; individual operation timing disappears when operations are fused, making performance bottleneck identification more difficult (see the inspection sketch after this list)
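One partial mitigation for the debugging point, assuming a PyTorch 2.x setup: the TORCH_COMPILE_DEBUG environment variable makes TorchInductor dump its generated kernels and intermediate IR, so fusion decisions can at least be inspected after the fact.

```python
import os
os.environ["TORCH_COMPILE_DEBUG"] = "1"  # set before torch loads its compiler config

import torch

def f(x):
    # small fusion candidate: add + ReLU + scale
    return torch.relu(x + 1.0) * 2.0

# Running the compiled function dumps the generated Triton kernels and
# Inductor IR (typically under ./torch_compile_debug/) for inspection.
torch.compile(f)(torch.randn(1024, device="cuda"))
```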
GPU kernel fusion is arguably the most impactful compiler optimization for deep learning workloads: frameworks like PyTorch 2.0 (TorchInductor) and JAX (XLA) achieve 1.5-3× end-to-end training speedup primarily through aggressive kernel fusion, making it the default optimization strategy for modern deep learning compilers.