Memory Bandwidth Optimization

Keywords: memory bandwidth optimization, bandwidth bound kernel, memory throughput, dram bandwidth, bandwidth efficiency, roofline memory

Memory Bandwidth Optimization is the performance engineering discipline of maximizing the effective utilization of available memory bandwidth in compute kernels. It is the central challenge for bandwidth-bound applications, where the GPU or CPU spends its time waiting for data from DRAM rather than executing compute instructions. Most deep learning inference workloads, large language model generation (the decode phase), sparse computations, and data-processing kernels are memory bandwidth bound rather than compute bound, making memory access optimization the primary path to performance improvement.

Bandwidth Bound vs. Compute Bound

- Roofline Model: Performance = min(Peak FLOPS, Arithmetic Intensity × Memory Bandwidth); see the worked example after this list.
- Arithmetic Intensity (AI): FLOPs performed per byte of data loaded from memory.
- Ridge point: Peak FLOPS ÷ Memory Bandwidth, the AI at which the two limits meet.
- Memory bound: AI < ridge point → limited by bandwidth, not compute.
- Compute bound: AI > ridge point → limited by peak FLOPS.
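
As a worked example, the sketch below evaluates the roofline formula for a few arithmetic intensities. The peak-FLOPS and bandwidth constants are illustrative A100-class values, not exact vendor specs.

```cuda
#include <algorithm>
#include <cstdio>

// Roofline: attainable GFLOP/s = min(peak FLOPS, AI x bandwidth).
// Peak and bandwidth below are illustrative A100-class numbers.
int main() {
    const double peak_gflops = 312000.0;               // ~312 TFLOP/s FP16
    const double bw_gb_s     = 2000.0;                 // ~2 TB/s HBM
    const double ridge       = peak_gflops / bw_gb_s;  // ~156 FLOP/byte

    const double ais[] = {0.5, 1.0, 16.0, 128.0, 512.0};
    for (double ai : ais) {
        double attainable = std::min(peak_gflops, ai * bw_gb_s);
        std::printf("AI %6.1f FLOP/B -> %9.0f GFLOP/s (%s bound)\n",
                    ai, attainable, ai < ridge ? "memory" : "compute");
    }
    return 0;
}
```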

LLM Decode Is Memory Bandwidth Bound

- During token generation (autoregressive decode), all model weights must be streamed from memory to produce each token: 7B parameters × 2 bytes = 14 GB per token for an FP16 7B model.
- Arithmetic intensity: ~1 FLOP per byte at batch size 1 → extremely memory bound.
- A100 GPU: ~2 TB/s HBM bandwidth → the 14 GB of weights can be streamed ~140 times per second → a theoretical ceiling of ~140 tokens/second at batch size 1; real systems land below this once KV cache traffic and overheads are included (see the back-of-envelope calculation below).
- Batching: Serve 100 requests simultaneously → the same 14 GB is loaded once per step but produces 100 tokens → 100× the compute per byte → the workload approaches compute bound.
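
The ceiling above follows from a single division; a minimal sketch, using the same illustrative numbers:

```cuda
#include <cstdio>

// Batch-1 decode ceiling: every token requires streaming all weights from
// HBM, so tokens/s <= bandwidth / model_bytes. Illustrative numbers only.
int main() {
    const double bw_bytes_per_s = 2.0e12;           // ~2 TB/s HBM
    const double model_bytes    = 7.0e9 * 2.0;      // 7B params x 2 B (FP16)
    std::printf("Decode ceiling: %.0f tokens/s\n",
                bw_bytes_per_s / model_bytes);      // ~143 tokens/s
    return 0;
}
```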

Memory Hierarchy and Effective Bandwidth

| Level | Bandwidth (A100) | Latency | Reuse Scope |
|-------|------------------|---------|-------------|
| Registers | >80 TB/s | ~1 cycle | Per-thread |
| L1/Shared | ~19 TB/s | ~20 cycles | Per-CTA |
| L2 | ~4 TB/s | ~200 cycles | Per-GPU |
| HBM (DRAM) | ~2 TB/s | ~600 cycles | Global |
| PCIe (host) | ~64 GB/s | µs | Host |

Techniques to Improve Memory Bandwidth Utilization

1. Coalesced Memory Access
- For full bandwidth, all threads in a warp should access contiguous, aligned memory addresses.
- Non-coalesced: 32 threads × scattered addresses → up to 32 separate DRAM transactions → up to 32× bandwidth waste.
- Coalesced: 32 threads × consecutive addresses → 1 DRAM transaction → full bandwidth utilized (see the kernel sketch below).
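
A minimal CUDA sketch of the two access patterns; `scale_strided` is a deliberate counter-example, not a pattern to copy:

```cuda
// Same elementwise operation, two access patterns. In the coalesced
// version, consecutive threads of a warp read consecutive floats, so the
// warp's 32 loads fall into a few contiguous 128-byte segments.
__global__ void scale_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive per warp
    if (i < n) out[i] = 2.0f * in[i];
}

// Strided version: each warp's loads are `stride` elements apart, forcing
// many separate DRAM transactions for the same amount of useful data.
__global__ void scale_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = 2.0f * in[i];
}
```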

2. Shared Memory Tiling
- Load a tile of input data from global memory → shared memory → compute from shared memory.
- Amortize each global memory load over multiple compute operations → increase arithmetic intensity.
- Shared memory bandwidth: ~10× DRAM bandwidth → large speedup for reused data (see the tiled matrix-multiply sketch below).
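
A classic shared-memory-tiled matrix multiply, as a minimal sketch; it assumes N is a multiple of the tile size and skips bounds checks:

```cuda
#define TILE 32

// Each block stages one TILE x TILE tile of A and B in shared memory;
// every staged element is then reused TILE times from shared memory
// instead of being re-read from DRAM, raising arithmetic intensity.
// Assumes square matrices with N divisible by TILE.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                        // tile fully staged
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // safe to overwrite tiles
    }
    C[row * N + col] = acc;
}
```

With TILE = 32, each global load is reused 32 times, multiplying arithmetic intensity by the tile width.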

3. Fused Kernels
- Instead of: load → compute → store → load → compute → store (multiple global memory round-trips).
- Fused: load once → compute everything → store once → less global memory traffic.
- Example: a fused LayerNorm + attention prologue makes a single pass through the activations → roughly 3× less bandwidth than separate kernels (a toy elementwise example follows below).
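
A minimal illustration with elementwise ops (far simpler than the LayerNorm + attention fusion named above): three logical steps fused into one kernel, so the tensor crosses DRAM twice instead of six times.

```cuda
// Scale, bias-add, and ReLU in one pass: `in` is read once and `out` is
// written once (2 DRAM passes), versus 6 passes for three separate kernels.
__global__ void fused_scale_bias_relu(const float* in, float* out,
                                      float scale, float bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = scale * in[i] + bias;   // intermediate stays in a register
        out[i] = v > 0.0f ? v : 0.0f;     // ReLU applied before the one store
    }
}
```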

4. Quantization for Bandwidth Reduction
- FP16 → INT8: 2× less data → 2× more weights streamed per second at the same bandwidth.
- INT4 (4-bit): 4× less data vs. FP16 → 4× bandwidth improvement for weight loading.
- Activation quantization: input activations shrink as well → further bandwidth reduction (see the dequantize-on-load sketch below).
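
A weight-only INT8 matrix-vector sketch: weights cross HBM as one byte each and are dequantized in registers just before use. The per-tensor `scale` factor and the kernel shape are illustrative assumptions; production schemes are often per-channel or per-group.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Weight-only INT8 mat-vec: W travels over HBM at 1 byte/weight (half the
// FP16 traffic) and is dequantized in-register. `scale` is a hypothetical
// per-tensor dequantization factor.
__global__ void int8_matvec(const int8_t* W, const half* x, half* y,
                            float scale, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c) {
        float w = scale * static_cast<float>(W[r * cols + c]);  // dequantize
        acc += w * __half2float(x[c]);
    }
    y[r] = __float2half(acc);
}
```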

5. KV Cache Compression
- In LLM inference, the KV cache grows linearly with sequence length → its reads become bandwidth bound.
- Grouped Query Attention (GQA): a group of query heads shares one KV head → reduces KV cache size 4–8×.
- Paged attention: virtual-memory-style paging for the KV cache → less memory waste → better batching (a footprint calculation follows below).
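
The 4–8× figure falls out of the cache-size formula; a minimal sketch with illustrative 7B-class shapes (32 layers, head dimension 128, FP16, 4K context):

```cuda
#include <cstdio>

// KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x seq_len
//                  x bytes_per_elem. Shapes are illustrative 7B-class values.
int main() {
    const double layers = 32, head_dim = 128, seq_len = 4096, fp16 = 2;
    const double kv_heads[] = {32, 8, 4};   // full MHA vs. 4x / 8x GQA sharing
    for (double h : kv_heads) {
        double gib = 2 * layers * h * head_dim * seq_len * fp16
                     / (1024.0 * 1024.0 * 1024.0);
        std::printf("kv_heads = %2.0f -> %.2f GiB per 4K-token sequence\n",
                    h, gib);
    }
    return 0;   // prints 2.00, 0.50, 0.25 GiB
}
```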

6. Memory Layout Optimization
- Row-major vs. column-major: the layout must match the access pattern to avoid strided access.
- Structure-of-Arrays (SoA) vs. Array-of-Structures (AoS): SoA enables coalesced access (see the sketch below).
- Channels-last format for convolution: NHWC (batch, height, width, channel) → coalesced access along the channel dimension.
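
A minimal AoS-vs-SoA sketch; the `Point` layouts are illustrative:

```cuda
// AoS: thread i reading aos[i].x makes the warp's loads land 16 bytes
// apart, so each 128-byte transaction carries only 25% useful data.
struct PointAoS { float x, y, z, w; };

// SoA: each field lives in its own contiguous array, so reading x[i]
// across a warp is fully coalesced.
struct PointsSoA { const float *x, *y, *z, *w; };

__global__ void read_x_aos(const PointAoS* aos, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = aos[i].x;   // strided: 16-byte gaps within the warp
}

__global__ void read_x_soa(PointsSoA soa, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = soa.x[i];   // coalesced: consecutive 4-byte loads
}
```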

7. Prefetching
- Instruction-level prefetch: issue prefetch hints so data is pulled toward the cores before it is needed.
- Software prefetch: initiate an async copy (cudaMemcpyAsync on a separate stream) while computing on the current batch (see the double-buffering sketch below).
- Hardware prefetch: the GPU's L2 prefetcher predicts sequential access patterns → automatic.
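
A minimal double-buffering sketch of the software-prefetch bullet. `process_chunk` is a placeholder kernel, and the host buffer should be pinned (cudaMallocHost) for the copy to truly overlap with compute:

```cuda
#include <cuda_runtime.h>

__global__ void process_chunk(const float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { volatile float v = data[i]; (void)v; }  // placeholder work
}

// While the kernel consumes dev[cur] on exec_s, cudaMemcpyAsync stages the
// next host chunk into dev[nxt] on copy_s, hiding the transfer behind
// compute. `host` should point to pinned memory for real async behavior.
void pipeline(const float* host, int num_chunks, int chunk_elems) {
    float* dev[2];
    size_t bytes = chunk_elems * sizeof(float);
    cudaMalloc(&dev[0], bytes);
    cudaMalloc(&dev[1], bytes);
    cudaStream_t copy_s, exec_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&exec_s);

    cudaMemcpy(dev[0], host, bytes, cudaMemcpyHostToDevice);  // prime buffer 0
    for (int i = 0; i < num_chunks; ++i) {
        int cur = i % 2, nxt = 1 - cur;
        if (i + 1 < num_chunks)                               // prefetch next
            cudaMemcpyAsync(dev[nxt], host + (size_t)(i + 1) * chunk_elems,
                            bytes, cudaMemcpyHostToDevice, copy_s);
        process_chunk<<<(chunk_elems + 255) / 256, 256, 0, exec_s>>>(
            dev[cur], chunk_elems);
        cudaStreamSynchronize(copy_s);   // next buffer staged
        cudaStreamSynchronize(exec_s);   // current compute done
    }
    cudaFree(dev[0]); cudaFree(dev[1]);
    cudaStreamDestroy(copy_s); cudaStreamDestroy(exec_s);
}
```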

Tools for Memory Bandwidth Analysis

- Nsight Compute: Per-kernel memory throughput, DRAM utilization, L1/L2 hit rate.
- Roofline chart: Plot the measured kernel on the roofline → determine whether it is memory or compute bound.
- DRAM bandwidth utilization metric: Actual vs. peak HBM bandwidth (target >70% for memory-bound kernels).

Memory bandwidth optimization is the essential performance discipline for the inference era of AI. As language models with billions to hundreds of billions of parameters are deployed for real-time inference, the rate at which model weights can be streamed from memory to compute units determines user-experienced latency, server throughput, and ultimately the economics of AI service delivery. That makes bandwidth-aware kernel design one of the highest-value skills in modern systems programming.
