GPU Memory Hierarchy

The GPU memory hierarchy is a multi-level storage system comprising registers, shared memory, caches, and DRAM. Each level has a very different access latency and bandwidth, and how well a kernel uses them usually dominates GPU application performance.

GPU Memory Hierarchy Levels

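- Registers (Per-Thread): Up to 255 32-bit registers per thread, drawn from a 256 KB register file per SM (Ampere). Roughly single-cycle access at full bandwidth (every thread accesses its own registers concurrently). A precious resource: heavy per-thread register use lowers occupancy.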
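- L1 Cache (Per-SM): 32-128 KB per SM; on recent architectures it shares on-chip storage with shared memory. ~20-30 cycle latency. Caches global memory loads if enabled. Private to its SM, with no coherence between the L1 caches of different SMs.
- Shared Memory (Per-SM): 48-96 KB per SM (up to 164 KB on A100), programmer-managed. ~20-30 cycle latency, full bandwidth if free of bank conflicts. Allocated statically with __shared__ or dynamically through the kernel launch configuration.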
- L2 Cache (GPU-wide): 4-40 MB (varies by GPU). ~100-200 cycle latency, shared across all SMs. All global memory traffic passes through it, so loads that bypass L1 are still cached here.
- HBM/GDDR (Main Memory): 16-80 GB on the GPU. ~200-500 cycle latency. Peak bandwidth ~2 TB/s (HBM2e, A100 80 GB) vs. ~700-1000 GB/s (GDDR6X). All SMs contend for the same memory interface.
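
The latencies above matter because a GPU must keep enough requests outstanding to cover them. A minimal sketch of that reasoning using Little's law (data in flight = bandwidth × latency) follows. The 400-cycle latency is picked from the range listed above, and the 1.41 GHz clock and 108 SMs are assumed A100-like values used only for illustration.

```python
def bytes_in_flight(bandwidth_bytes_per_s, latency_cycles, clock_hz):
    """Little's law: data in flight = throughput x latency."""
    latency_s = latency_cycles / clock_hz
    return bandwidth_bytes_per_s * latency_s


# Assumed example values: 2 TB/s DRAM bandwidth, 400-cycle latency,
# 1.41 GHz clock, 108 SMs. Substitute figures for your own GPU.
total = bytes_in_flight(2e12, 400, 1.41e9)
per_sm = total / 108

print(f"in flight: {total / 1024:.0f} KiB total, {per_sm / 1024:.1f} KiB per SM")
```

Under these assumptions each SM needs several KiB of loads outstanding at all times, which is why high occupancy or several independent loads per thread is needed to approach peak DRAM bandwidth.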

Bandwidth Characteristics at Each Level

- Register Bandwidth: ~14-20 TB/s per SM (Ampere), because all active threads read their operands simultaneously. The limiting factor is register count, not bandwidth.
- L1 Bandwidth: Limited by L1 port width, typically 64-128 bytes per cycle per SM depending on architecture. Sufficient for most kernels as long as accesses hit in L1.
- L2 Bandwidth: Shared, measured as the aggregate across all SMs. Peak = L2 clock frequency × total port width. Typically in the low single-digit TB/s range.
- DRAM Bandwidth: HBM2e ~2 TB/s peak (A100 80 GB); GDDR6X ~700-1000 GB/s (RTX GPUs). Practical sustained bandwidth is 80-90% of peak because of protocol overhead and command latency.
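
To see what these bandwidths imply for a kernel, a roofline-style estimate is often enough: attainable throughput is the smaller of peak compute and arithmetic intensity × sustained bandwidth. The sketch below is illustrative only. The 19.5 TFLOP/s FP32 peak is an assumed A100-like figure, the 85% sustained fraction is taken from the middle of the range above, and SAXPY is chosen as a hypothetical example kernel.

```python
def attainable_flops(peak_flops, sustained_bw_bytes_per_s, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, sustained_bw_bytes_per_s * flops_per_byte)


def ridge_point(peak_flops, sustained_bw_bytes_per_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_flops / sustained_bw_bytes_per_s


PEAK_FP32 = 19.5e12          # assumed peak FP32 rate, FLOP/s
SUSTAINED_BW = 0.85 * 2e12   # 85% of a 2 TB/s peak, bytes/s

# SAXPY (y = a*x + y) on float32: 2 FLOPs per element,
# 12 bytes moved per element (read x, read y, write y).
saxpy_intensity = 2 / 12
saxpy_bound = attainable_flops(PEAK_FP32, SUSTAINED_BW, saxpy_intensity)

print(f"ridge point: {ridge_point(PEAK_FP32, SUSTAINED_BW):.1f} FLOP/byte")
print(f"SAXPY bound: {saxpy_bound / 1e12:.2f} TFLOP/s")
```

A kernel whose intensity lies far below the ridge point, as SAXPY does here, is limited by DRAM bandwidth no matter how well its arithmetic is tuned.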

Coalescing Rules for Global Memory

- Coalescing Requirement: The 32 threads of a warp access 32 consecutive, aligned 4-byte words (128 bytes). The hardware merges these into a single 128-byte transaction.
- Coalescing Efficiency: Perfect coalescing = 1 transaction per warp-wide load (32 thread loads). Fully scattered access = up to 32 transactions (one per thread), with most of the fetched bytes unused. Caches can absorb part of this penalty when the extra bytes are used later.
- Cache Benefits: If the working set of an access pattern fits in L1/L2, subsequent accesses hit in cache and generate no additional DRAM traffic. Caching reduces, but does not remove, the importance of coalescing.
- Coalescing Patterns: Stride-1 (consecutive) access is perfect. Stride-2 access to 4-byte words spans 256 bytes and needs 2 transactions (50% efficiency). Irregular access (indices loaded from an array) relies on the caches to recover locality.
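
The transaction counts above can be reproduced with a small model that counts how many distinct 128-byte segments one warp touches. This is a simplified sketch: the helper names are hypothetical, and real hardware also tracks 32-byte sectors within each line, which the model ignores.

```python
def warp_addresses(stride_words, base=0, word_bytes=4, warp_size=32):
    """Byte address accessed by each thread of a warp for a strided pattern."""
    return [base + t * stride_words * word_bytes for t in range(warp_size)]


def transactions(addresses, segment_bytes=128):
    """Number of distinct aligned segments touched by one warp-wide access."""
    return len({addr // segment_bytes for addr in addresses})


for stride in (1, 2, 32):
    print(f"stride {stride}: {transactions(warp_addresses(stride))} transaction(s)")

# A stride-1 access that starts 4 bytes past a 128-byte boundary
# straddles two segments.
print("misaligned stride 1:", transactions(warp_addresses(1, base=4)))
```

The misaligned case shows why alignment is part of the coalescing requirement: the same 128 bytes of useful data cost two transactions once they straddle a segment boundary.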

Bank Conflict in Shared Memory

- Bank Architecture: 32 banks, each 4 bytes wide (Ampere). Successive 32-bit words map to successive banks: bank = (byte address / 4) mod 32. A 32-bit word occupies one bank; a 64-bit double spans 2 banks.
- Conflict Condition: Multiple threads of a warp access different words in the same bank in the same request. The accesses are serialized (a 32-way conflict is the worst case, ~32x slowdown).
- Conflict Avoidance: Stride-1 word access (thread i hits bank i) is conflict-free. Stride-32 word access (all threads hit the same bank) is the worst case. Padding arrays, for example a row length of 33 words instead of 32, shifts each row to a different bank and removes such conflicts.
- Broadcast: Threads that read the same word do not conflict. The hardware performs a single access and broadcasts the value to all requesting threads.
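
The bank rules above can be checked with a short model that maps each thread's word index to a bank and reports the worst-case serialization. It is a sketch assuming 32 banks of 4-byte words. The function name is hypothetical, and distinct words in the same bank are counted as conflicts while repeated reads of one word are treated as a broadcast.

```python
from collections import defaultdict


def conflict_degree(word_indices, num_banks=32):
    """Largest number of distinct words that map to a single bank.

    1 means conflict-free; N means the access is replayed N times.
    """
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


threads = range(32)

print("stride 1:   ", conflict_degree([t for t in threads]))
print("stride 2:   ", conflict_degree([2 * t for t in threads]))
print("stride 32:  ", conflict_degree([32 * t for t in threads]))
print("broadcast:  ", conflict_degree([7 for _ in threads]))

# Column access into a 32-row tile: thread t reads element [t][col].
col = 5
print("tile[32][32]:", conflict_degree([t * 32 + col for t in threads]))
print("tile[32][33]:", conflict_degree([t * 33 + col for t in threads]))
```

The last two lines illustrate the padding trick: with a row length of 33 words, consecutive rows start one bank apart, so a column read touches 32 different banks.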

L2 Cache Policies and Control

- Cache Mode: Persisting or streaming, set per memory region through access-policy hints on GPUs that support L2 residency control (Ampere and later). Persisting marks data expected to be reused; streaming marks data that should not displace it.
- Persisting Mode: Data is preferentially retained in L2 for reuse. Beneficial for loops and stencil operations with repeated access.
- Streaming Mode: Data is still fetched through L2 but is evicted first. Useful for one-time accesses (reduces cache pollution and leaves cache space for reused data).
- Coherency: L2 is the point of coherence for the GPU. Per-SM L1 caches are not coherent with each other, so cross-SM visibility goes through L2. Consistency of shared memory within a block is the software's responsibility (barriers, atomics).
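
One way to see why separating reused data from one-time data helps is a toy cache model. The sketch below is not a model of any real L2: it assumes a small fully associative cache with LRU replacement, and the evict_streaming_first flag is a hypothetical stand-in for treating streaming lines as the first eviction candidates.

```python
from collections import OrderedDict


def reused_hits(capacity, reused_lines, stream_per_iter, iterations,
                evict_streaming_first):
    """Count hits on the reused working set in a toy fully associative cache."""
    cache = OrderedDict()  # line id -> True if the line was loaded as streaming
    hits = 0
    next_stream_id = 0

    def touch(line, streaming):
        nonlocal hits
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
            if not streaming:
                hits += 1
            return
        if len(cache) >= capacity:
            victim = None
            if evict_streaming_first:
                # Prefer the oldest streaming line as the victim.
                victim = next((lid for lid, is_stream in cache.items()
                               if is_stream), None)
            if victim is None:
                victim = next(iter(cache))  # plain LRU victim
            del cache[victim]
        cache[line] = streaming

    for _ in range(iterations):
        for i in range(reused_lines):
            touch(("reused", i), False)
        for _ in range(stream_per_iter):
            touch(("stream", next_stream_id), True)  # never touched again
            next_stream_id += 1
    return hits


plain = reused_hits(8, 6, 4, 3, evict_streaming_first=False)
hinted = reused_hits(8, 6, 4, 3, evict_streaming_first=True)
print(f"reused-data hits: plain LRU = {plain}, streaming evicted first = {hinted}")
```

In this configuration the one-time lines push the reused working set out under plain LRU, while the hinted policy keeps it resident after the first iteration.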

Unified Memory and Page Migration

- Unified Memory Abstraction: Single virtual address space for CPU and GPU. cudaMallocManaged() returns a pointer accessible from both. Data migrates implicitly (CPU ↔ GPU) as needed.
- Page Fault Mechanism: A page fault signals an access to a page that is not resident on the accessing processor. The Unified Memory driver then migrates the page (100-1000 µs latency). Transparent but potentially slow.
- Prefetch Optimization: cudaMemPrefetchAsync() explicitly migrates pages to the GPU before kernel execution, avoiding page-fault latency.
- Managed Memory Overhead: Page table management adds ~5-15% overhead. For pages that migrate frequently, explicit cudaMemcpy() is faster.
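
A rough cost model makes the case for prefetching concrete. The sketch below compares fault-driven migration with a single bulk prefetch. Every number in it is an assumption for illustration: a 64 KiB migration granule, the 100 µs lower end of the fault latency quoted above, an effective 25 GB/s CPU-GPU link, and no batching or overlap of faults.

```python
def transfer_time_s(total_bytes, link_bw_bytes_per_s):
    """Time to move the data in one bulk transfer (prefetch or explicit copy)."""
    return total_bytes / link_bw_bytes_per_s


def fault_count(total_bytes, granule_bytes):
    """Faults needed if every migration granule is faulted in separately."""
    return -(-total_bytes // granule_bytes)  # ceiling division


def on_demand_time_s(total_bytes, link_bw_bytes_per_s, granule_bytes,
                     fault_latency_s):
    """Pessimistic model: serialized faults plus the transfer itself."""
    faults = fault_count(total_bytes, granule_bytes)
    return faults * fault_latency_s + transfer_time_s(total_bytes,
                                                      link_bw_bytes_per_s)


GIB = 1 << 30
LINK_BW = 25e9          # assumed effective link bandwidth, bytes/s
GRANULE = 64 * 1024     # assumed migration granule, bytes
FAULT_LATENCY = 100e-6  # assumed per-fault latency, seconds

prefetch = transfer_time_s(GIB, LINK_BW)
on_demand = on_demand_time_s(GIB, LINK_BW, GRANULE, FAULT_LATENCY)
print(f"prefetch: {prefetch * 1e3:.1f} ms, on-demand: {on_demand * 1e3:.1f} ms")
```

Real drivers batch faults and migrate larger regions, so measured gaps are usually smaller than this pessimistic model suggests, but the qualitative conclusion is the same: per-fault latency, not link bandwidth, dominates fault-driven migration.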

Prefetching Strategies

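- Hardware Prefetching: Fetching the adjacent cache line or sectors on a load miss reduces miss latency for streaming (stride-1) access. The extent of hardware prefetching is architecture-specific; GPUs rely mainly on switching among many resident warps to hide memory latency.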
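- Software Prefetching: Explicitly request data ahead of use so that computation overlaps with pending loads. The PTX prefetch instruction brings a line into cache without writing a register, and on Ampere cp.async (exposed as cuda::memcpy_async) copies global memory to shared memory asynchronously. Note that __ldg() is not a prefetch: it is an ordinary load through the read-only data cache that returns its value in a register.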
- Double Buffering: Prefetch the next iteration's data while the current iteration computes. Hides DRAM latency through pipelining.
- Stream Prefetching: For streaming access patterns, hardware mechanisms are usually sufficient. Irregular patterns need software prefetching plus synchronization before the data is consumed.
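
The benefit of double buffering can be estimated with a simple timing model. The sketch below assumes every tile takes the same load time and compute time, that one load can fully overlap one compute step, and that time is measured in arbitrary units. The function names are hypothetical.

```python
def serial_time(num_tiles, load_time, compute_time):
    """Load, then compute, one tile at a time with no overlap."""
    return num_tiles * (load_time + compute_time)


def double_buffered_time(num_tiles, load_time, compute_time):
    """Load tile i+1 while computing tile i.

    The first load and the last compute cannot be overlapped; each of the
    remaining steps costs whichever of load or compute is slower.
    """
    overlapped_steps = num_tiles - 1
    return (load_time
            + overlapped_steps * max(load_time, compute_time)
            + compute_time)


tiles, load, compute = 10, 4, 6
print("serial:         ", serial_time(tiles, load, compute))
print("double buffered:", double_buffered_time(tiles, load, compute))
```

The model also shows the limit of the technique: once loads are fully hidden, the runtime is bounded by the slower of the two stages, so double buffering cannot help a kernel whose load time alone exceeds its compute time by a wide margin.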

Memory Access Optimization Case Studies

- Matrix Multiplication (GEMM): Choose the layout of B (transposing it if necessary) so that the threads of a warp read consecutive addresses instead of striding through memory. Tiling through shared memory cuts DRAM traffic by a factor close to the tile width (on the order of 10x).
- Stencil Computation: Halo exchange goes through global memory, so coalescing matters. Staging tiles in shared memory reduces DRAM traffic by 4-10x for interior points.
- Sparse Matrix-Vector Product: Access patterns are irregular. Reordering rows improves coalescing and locality. Compressed formats such as CSR reduce the data footprint.
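
The GEMM traffic reduction can be derived by counting global-memory loads. The sketch below assumes square N×N matrices, square T×T tiles with T dividing N, and no reuse through hardware caches in the naive version. It counts element loads only, not bytes or stores.

```python
def naive_loads(n):
    """Each of the n*n outputs reads a full row of A and a full column of B."""
    return 2 * n ** 3


def tiled_loads(n, tile):
    """Each output tile walks n/tile steps, loading one tile of A and one of B per step."""
    if n % tile != 0:
        raise ValueError("tile size must divide the matrix dimension")
    output_tiles = (n // tile) ** 2
    steps_per_tile = n // tile
    loads_per_step = 2 * tile * tile
    return output_tiles * steps_per_tile * loads_per_step


n, tile = 1024, 16
reduction = naive_loads(n) / tiled_loads(n, tile)
print(f"naive: {naive_loads(n):,} loads, tiled: {tiled_loads(n, tile):,} loads, "
      f"reduction: {reduction:.0f}x")
```

The reduction equals the tile width, so larger tiles save more traffic until shared memory capacity or occupancy becomes the limit.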
