The GPU memory hierarchy is the multi-level, bandwidth-stratified storage system that combines registers, caches, shared memory, and DRAM. Each level has a fundamentally different access latency and throughput, and these differences dominate GPU application performance.
GPU Memory Hierarchy Levels
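- Registers (Per-Thread): 64K 32-bit registers (256 KB) per SM on Ampere, with at most 255 registers per thread. Lowest latency of any level, full bandwidth (every thread accesses its own registers concurrently). Precious resource: the register file is shared by all resident threads, so high per-thread usage limits how many threads can be resident.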
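- L1 Cache (Per-SM): 32-128 KB per SM. 20-30 cycle latency, full bandwidth. Caches global memory loads if enabled. On recent architectures it shares one physical array with shared memory, so its capacity depends on the configured split. Private to each SM (hardware does not keep the L1 caches of different SMs coherent).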
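- Shared Memory (Per-SM): 48-96 KB per SM (up to 164 KB on A100), programmer-managed. ~30 cycle latency, full bandwidth (if bank-conflict free). Allocated statically with __shared__ declarations or dynamically through the shared-memory size argument of the kernel launch configuration.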
- L2 Cache (GPU-wide): 4-40 MB (varies by GPU). 100-200 cycle latency, shared across all SMs. Backs every SM's L1 and also serves loads that bypass L1.
- HBM/GDDR (Main Memory): 16-80 GB on GPU. 200-500 cycle latency. Peak bandwidth about 2 TB/s (HBM2e, A100) vs about 700 GB/s (GDDR6X). Shared memory bus (all SMs contend).
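The latency and bandwidth figures above are linked by Little's law: to keep a memory level busy, the bytes in flight must equal bandwidth times latency. The following is a minimal sketch of that arithmetic, not a measurement. The 2 TB/s and 400-cycle inputs are taken from the ranges listed above, while the 1.41 GHz clock and the function name bytes_in_flight are assumptions made for illustration.

```python
def bytes_in_flight(bandwidth_bytes_per_s, latency_cycles, clock_hz):
    """Little's law: required concurrency = throughput x latency."""
    latency_s = latency_cycles / clock_hz
    return bandwidth_bytes_per_s * latency_s


# Illustrative inputs only: 2 TB/s and 400 cycles come from the ranges above,
# the 1.41 GHz clock is an assumed value.
needed = bytes_in_flight(2e12, 400, 1.41e9)
requests = needed / 128  # expressed as outstanding 128-byte transactions

print(f"{needed / 1024:.0f} KiB in flight, "
      f"about {requests:.0f} outstanding 128-byte requests")
```

Under these assumptions the GPU needs several hundred kilobytes of loads outstanding at once, which is why many resident warps (or several independent loads per thread) are needed to approach peak DRAM bandwidth.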
Bandwidth Characteristics at Each Level
- Register Bandwidth: ~14-20 TB/s per SM (Ampere). All threads access simultaneously. Bottleneck: register count, not bandwidth.
- L1 Bandwidth: Limited by L1 port width. ~64 bytes per cycle typical (matching SM bus width). Sufficient for most kernels if L1 hits.
- L2 Bandwidth: Shared, measured as aggregate across all SMs. Peak = L2 frequency × port width. Typically 1-2 TB/s.
- DRAM Bandwidth: HBM2e 2 TB/s peak (Ampere A100). GDDR6X ~700 GB/s (RTX GPUs). Practical sustained: 80-90% of peak (protocol overhead, command latency).
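To check where a kernel sits relative to these peaks, the usual approach is to compute effective bandwidth from the bytes it reads and writes and its elapsed time. The sketch below shows that calculation. The timing value is a hypothetical measurement invented for the example, and the 2000 GB/s peak is the HBM2e figure quoted above.

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Effective bandwidth in GB/s: total bytes moved divided by elapsed time."""
    return (bytes_read + bytes_written) / 1e9 / seconds


# Hypothetical measurement: a copy kernel reads and writes 2**28 floats
# (4 bytes each) in 1.25 ms. The timing is made up for illustration.
n = 2**28
elapsed_s = 1.25e-3
bw = effective_bandwidth_gbs(4 * n, 4 * n, elapsed_s)

peak_gbs = 2000.0  # HBM2e peak quoted above
fraction = bw / peak_gbs

print(f"effective: {bw:.0f} GB/s, {fraction:.0%} of peak")
```

A result inside the 80-90% band suggests the kernel is already close to the practical DRAM limit, so further gains would have to come from moving fewer bytes rather than moving them faster.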
Coalescing Rules for Global Memory
- Coalescing Requirement: The 32 threads of a warp access 32 consecutive 4-byte words (128 bytes, aligned to a 128-byte boundary). Hardware merges the accesses into a single 128-byte transaction.
- Coalescing Efficiency: Perfect coalescing = 1 transaction per 32 loads. Scattered access = up to 32 transactions (one per load), with most of the fetched bytes unused. Efficiency is the ratio of bytes requested to bytes actually transferred.
- Cache Benefits: If coalesced access pattern fits in L1/L2, subsequent accesses hit cache (no additional DRAM traffic). Cache reduces importance of perfect coalescing.
- Coalescing Patterns: Stride-1 (consecutive access) is perfect. Stride-2 requires 2 transactions per warp and uses only half of the bytes fetched. Irregular access (indices loaded from an array) relies on the cache to recover some of the lost efficiency.
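The transaction counts above can be reproduced with a small model that counts how many distinct 128-byte segments one warp touches. This is a simplified sketch: it ignores caches and finer-grained sector handling, and the helper names transactions_per_warp and strided_addresses are hypothetical.

```python
def transactions_per_warp(addresses, line_bytes=128):
    """Count the distinct line_bytes-aligned segments touched by one warp."""
    return len({addr // line_bytes for addr in addresses})


def strided_addresses(stride_words, base=0, word_bytes=4, warp_size=32):
    """Byte address read by each lane when lane i reads word i * stride_words."""
    return [base + lane * stride_words * word_bytes for lane in range(warp_size)]


stride1 = transactions_per_warp(strided_addresses(1))
stride2 = transactions_per_warp(strided_addresses(2))
stride32 = transactions_per_warp(strided_addresses(32))
# Stride-1 but starting 64 bytes past a 128-byte boundary.
misaligned = transactions_per_warp(strided_addresses(1, base=64))

print(stride1, stride2, stride32, misaligned)
```

The misaligned case shows why alignment matters as much as stride: the same 128 bytes straddle two segments, so the warp pays for two transactions.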
Bank Conflict in Shared Memory
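- Bank Architecture: 32 banks, each 4 bytes wide (Ampere). Successive 32-bit words map to successive banks, so word w lives in bank (w mod 32). The bank is chosen by the address accessed, not by the thread index. A 32-bit word occupies one bank; a 64-bit double spans 2 banks.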
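- Conflict Condition: Multiple threads of a warp access different words in the same bank in the same cycle. The accesses are serialized (a 32-way conflict is the worst case, a 32x slowdown).
- Conflict Avoidance: A stride-1 access pattern (thread i accesses word i, hence bank i) is conflict-free. A stride of 32 words (all threads hit the same bank) causes a severe conflict. Padding arrays, for example to a row width of 33 words instead of 32, shifts each row to a different bank and removes the conflict.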
- Broadcast: Special case: all threads read same location (broadcast, no conflict). Hardware optimization reduces to single access.
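The bank rules above can be checked with a short model that maps each lane's word index to a bank and reports the worst-case serialization factor. It is a sketch that assumes 32 four-byte banks and a 32-lane warp as described above. The function names conflict_degree and column_access are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32   # 4-byte-wide banks, as described above
WARP_SIZE = 32


def conflict_degree(word_indices):
    """Return the worst-case serialization factor for one warp access.

    word_indices holds the 32-bit word index each lane touches. Lanes that
    read the same word are served by one broadcast, so only distinct words
    per bank are counted. A result of 1 means conflict-free.
    """
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def column_access(row_width_words, column=0):
    """Word indices when lane i reads tile[i][column] of a row-major tile."""
    return [lane * row_width_words + column for lane in range(WARP_SIZE)]


row = conflict_degree(range(WARP_SIZE))        # lane i reads tile[0][i]
col = conflict_degree(column_access(32))       # column read, 32-word rows
padded = conflict_degree(column_access(33))    # column read, rows padded by one word
broadcast = conflict_degree([7] * WARP_SIZE)   # every lane reads word 7

print(row, col, padded, broadcast)
```

The model reports 1, 32, 1, and 1: the unpadded column read is a 32-way conflict, while one word of padding per row restores conflict-free access at the cost of a little shared memory.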
L2 Cache Policies and Control
- Cache Mode: Persisting or streaming. Persisting accesses mark data that is expected to be reused so that it stays in L2. Streaming accesses mark data that is unlikely to be reused so that it is evicted first (saves cache space).
- Persistent Mode: Data is kept in L2 and reused. Beneficial for loops and stencil operations with repeated access.
- Streaming Mode: Streamed lines are evicted preferentially instead of being retained. Useful for one-time accesses (reduces cache pollution and leaves cache space for reused data and other kernels).
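- Coherency: L2 is the hardware point of coherence. All SMs see a single copy of each line in L2, and any coherence between per-SM L1 caches is achieved through L2. Ordering of shared memory accesses is the software's responsibility (barriers, atomics).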
Unified Memory and Page Migration
- Unified Memory Abstraction: Single virtual address space for CPU and GPU. cudaMallocManaged() returns a pointer that is accessible from both CPU and GPU. Implicit data migration (CPU ↔ GPU) as needed.
- Page Fault Mechanism: An access to a page that is not resident in the accessing processor's memory raises a page fault. The driver migrates the page on fault (100-1000 µs latency). Transparent but potentially slow.
- Prefetch Optimization: cudaMemPrefetchAsync() explicitly migrates pages to the GPU before kernel execution. This avoids page-fault latency.
- Managed Memory Overhead: Page table management overhead is ~5-15%. For frequently migrating pages, an explicit cudaMemcpy() is faster.
Prefetching Strategies
- Hardware Prefetching: Where the hardware supports it, the GPU prefetches the next line (the adjacent cache line) on a load miss. This reduces miss latency for streaming (stride-1) access.
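- Software Prefetching: Explicitly load data ahead of use. The __ldg() intrinsic loads through the read-only data cache and returns the value to a register. A true load-to-cache with no destination register is available through the PTX prefetch instruction (for example prefetch.global.L2). Either approach allows computation to overlap with pending loads.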
- Double Buffering: Prefetch next iteration's data while current iteration computes. Hides DRAM latency via pipelining.
- Stream Prefetching: For streaming access patterns, hardware prefetch usually sufficient. For irregular patterns, software prefetch + synchronization necessary.
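The benefit of double buffering can be estimated with a simple timing model in which loading the next tile overlaps with computing the current one. This is a sketch under idealized assumptions: fixed per-tile load and compute costs, perfect overlap, and a synchronization point after every tile. The cycle counts are hypothetical and the function names are illustrative.

```python
def serial_time(n_tiles, load, compute):
    """No overlap: every tile is loaded, then computed."""
    return n_tiles * (load + compute)


def double_buffered_time(n_tiles, load, compute):
    """Two buffers: load tile i+1 while computing tile i."""
    t = load  # prologue: the first load into buffer 0 is fully exposed
    for i in range(n_tiles - 1):
        # Compute tile i from buffer i % 2 while tile i + 1 streams into
        # buffer (i + 1) % 2. The step takes as long as the slower of the two.
        t += max(load, compute)
    t += compute  # epilogue: the last tile has nothing left to overlap with
    return t


# Hypothetical costs in cycles per tile.
n_tiles, load, compute = 100, 400, 300
serial = serial_time(n_tiles, load, compute)
pipelined = double_buffered_time(n_tiles, load, compute)

print(serial, pipelined, f"speedup {serial / pipelined:.2f}x")
```

In this model the pipelined time approaches n_tiles times the larger of the two costs, so double buffering pays off most when load and compute times are roughly balanced.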
Memory Access Optimization Case Studies
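- Matrix Multiplication (GEMM): Transposing B (or staging its tiles through shared memory) turns strided column reads into coalesced row reads. Tiled computation in shared memory reduces DRAM traffic by roughly the tile width, which is 10x or more for typical tile sizes.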
- Stencil Computation: Halo exchange via global memory (coalescing important). Shared memory staging reduces DRAM by 4-10x for interior points.
- Sparse Matrix-Vector Product: Irregular access patterns. Reordering rows improves coalescing. Compression (CSR) reduces data footprint.
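The GEMM traffic reduction can be made concrete by counting global-memory loads for square N x N matrices. The following is a back-of-the-envelope sketch that ignores caches and the writes to C, and assumes the tile width divides N. The function names are hypothetical.

```python
def naive_global_loads(n):
    """Each of the n*n outputs reads a full row of A and a full column of B."""
    return 2 * n**3


def tiled_global_loads(n, tile):
    """Each thread block loads every A tile and B tile it needs exactly once."""
    if n % tile != 0:
        raise ValueError("this sketch assumes tile divides n")
    blocks = (n // tile) ** 2            # one block per output tile
    steps_per_block = n // tile          # tiles swept along the inner dimension
    loads_per_step = 2 * tile * tile     # one tile of A plus one tile of B
    return blocks * steps_per_block * loads_per_step


n = 1024
for tile in (16, 32):
    ratio = naive_global_loads(n) / tiled_global_loads(n, tile)
    print(f"tile {tile}: {ratio:.0f}x fewer global loads")
```

In this model the reduction equals the tile width, so larger tiles save more traffic until shared memory capacity or register pressure becomes the limit.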