GPU Memory Coalescing is the hardware mechanism that combines multiple per-thread memory accesses within a warp into fewer, wider memory transactions — achieving maximum global memory bandwidth when threads access consecutive addresses, and degrading dramatically when access patterns are scattered or misaligned.
Coalescing Mechanics:
- Transaction Formation: when 32 threads in a warp execute a load/store instruction, the hardware groups their addresses into 32-byte, 64-byte, or 128-byte cache-line-aligned transactions — ideally all 32 threads hit a single 128-byte transaction
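- Alignment Requirements: if the warp's starting address is not aligned to the transaction size, the access spills into an additional transaction; a misaligned base pointer can double the count at 128-byte granularity (one line becomes two), or add a fifth 32-byte sector to an access that would otherwise need four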
- Stride-1 Pattern: consecutive threads accessing consecutive 4-byte elements (thread i reads addr+4i) generate one 128-byte transaction when addr is 128-byte aligned; this is the ideal pattern, in which every fetched byte is used (100% bandwidth utilization)
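- Stride-N Pattern: if threads access every Nth 4-byte element, only about 1/N of the fetched bytes are useful (for N up to 8 at 32-byte sector granularity, or up to 32 at 128-byte line granularity); stride-2 halves effective bandwidth, and stride-32 (column access in a row-major, 32-wide matrix) puts every thread in a different 128-byte line, using 4 of every 128 bytes (about 3%) if whole lines are fetched, or 4 of every 32 bytes (12.5%) if only the touched sectors are fetched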
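The rules above can be checked with a toy address model. The following is a minimal Python sketch, not a description of how any particular GPU forms transactions: it assumes 32-byte sectors, a 32-thread warp, and 4-byte elements, and the helper names `sectors_touched` and `utilization` are hypothetical.

```python
SECTOR_BYTES = 32  # assumed sector size
WARP_SIZE = 32     # threads per warp


def sectors_touched(base, stride_elems, elem_bytes=4):
    """Return the set of sector indices touched by one warp-wide access.

    Lane i accesses elem_bytes bytes starting at
    base + i * stride_elems * elem_bytes.
    """
    sectors = set()
    for lane in range(WARP_SIZE):
        first = base + lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        # An element can straddle a sector boundary if base is oddly aligned.
        sectors.update(range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1))
    return sectors


def utilization(base, stride_elems, elem_bytes=4):
    """Fraction of fetched bytes that the warp actually uses."""
    useful = WARP_SIZE * elem_bytes
    fetched = len(sectors_touched(base, stride_elems, elem_bytes)) * SECTOR_BYTES
    return useful / fetched


cases = [
    ("stride-1, aligned", 0, 1),
    ("stride-1, base offset by 4 bytes", 4, 1),
    ("stride-2", 0, 2),
    ("stride-32", 0, 32),
]
for label, base, stride in cases:
    count = len(sectors_touched(base, stride))
    print(f"{label}: {count} sectors, {utilization(base, stride):.1%} of fetched bytes used")
```

In this model the aligned stride-1 case touches four sectors (one 128-byte line), the offset case touches a fifth, and stride-32 touches one sector per thread.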
Access Pattern Analysis:
- Array of Structures (AoS): storing each structure's fields next to one another in memory causes strided access when every thread reads the same field of a different element; converting to Structure of Arrays (SoA), with one contiguous array per field, restores coalesced access for each field
- Matrix Transpose: naive column reads of a row-major matrix produce a stride-N pattern, where N is the row width; the shared memory transpose technique avoids this by loading a tile with coalesced reads, transposing it in shared memory, and writing the tile back with coalesced writes
- Indirect/Scatter-Gather: index-based access (data[index[tid]]) produces effectively random addresses; this is generally uncoalescable and calls for data reorganization (for example, sorting by access pattern) or, when the accesses have 2D spatial locality, reading through the texture cache
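To make these patterns concrete, the sketch below counts the 32-byte sectors one warp touches under each layout. It is a simplified Python model under stated assumptions: the 16-byte structure of four floats, the 1024-float row width, and the index pattern `(lane * 37) % 1024` are all hypothetical, as is the helper `sector_count`.

```python
SECTOR_BYTES = 32  # assumed sector size
WARP_SIZE = 32
FLOAT_BYTES = 4


def sector_count(addresses, access_bytes=FLOAT_BYTES):
    """Count the distinct sectors covered by a list of per-lane byte addresses."""
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)


# AoS: hypothetical 16-byte struct {x, y, z, w}; lane i reads element[i].x.
STRUCT_BYTES = 4 * FLOAT_BYTES
aos = [lane * STRUCT_BYTES for lane in range(WARP_SIZE)]

# SoA: all x values are contiguous; lane i reads x[i].
soa = [lane * FLOAT_BYTES for lane in range(WARP_SIZE)]

# Row-major matrix with a hypothetical width of 1024 floats.
WIDTH = 1024
row_read = [lane * FLOAT_BYTES for lane in range(WARP_SIZE)]             # m[0][lane]
column_read = [lane * WIDTH * FLOAT_BYTES for lane in range(WARP_SIZE)]  # m[lane][0]

# Gather: data[index[lane]] with an arbitrary scattered index pattern.
index = [(lane * 37) % 1024 for lane in range(WARP_SIZE)]
gather = [i * FLOAT_BYTES for i in index]

for label, addrs in [("AoS field read", aos), ("SoA field read", soa),
                     ("row read", row_read), ("column read", column_read),
                     ("gather", gather)]:
    print(f"{label}: {sector_count(addrs)} sectors for 128 useful bytes")
```

The SoA and row reads need four sectors each, while the AoS read needs sixteen and the column read needs one sector per thread. The tiled transpose works because both of its global memory phases follow the row pattern.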
Performance Impact:
- Bandwidth Utilization: against a theoretical HBM2e bandwidth of ~2 TB/s, uncoalesced access can deliver less than 100 GB/s of useful data, while properly coalesced access typically reaches 80-95% of the theoretical figure
- Profiling Tools: NVIDIA Nsight Compute reports L1/L2 cache sector utilization and global memory load/store efficiency — target >80% sector utilization for memory-bound kernels
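- Sector vs. Line Requests: recent NVIDIA GPUs (including Ampere and later) manage 128-byte cache lines as four 32-byte sectors and fetch only the sectors a request touches; a partially used line therefore costs only its touched sectors, but unused bytes inside a touched sector are still transferred and waste bandwidth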
- L2 Cache Assistance: L2 cache partially mitigates poor access patterns by buffering recently accessed lines, but its capacity is limited (tens of MB on recent data-center GPUs, for example 40 MB on A100 and 50 MB on H100) and shared across all SMs
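The link between sector utilization and delivered bandwidth can be sketched as simple arithmetic. The Python snippet below is a rough upper-bound estimate, not a performance model: it assumes the ~2 TB/s peak quoted above and 32-byte sectors, and it ignores latency, caching, and DRAM effects, so real kernels will land below these numbers. The function name `useful_bandwidth_bound` is hypothetical.

```python
PEAK_GBPS = 2000.0  # ~2 TB/s theoretical figure quoted above
SECTOR_BYTES = 32   # assumed sector size


def useful_bandwidth_bound(useful_bytes_per_sector, peak_gbps=PEAK_GBPS):
    """Upper bound on useful GB/s when every fetched sector carries only
    useful_bytes_per_sector bytes that the kernel actually needs."""
    if not 0 < useful_bytes_per_sector <= SECTOR_BYTES:
        raise ValueError("useful bytes must be in (0, SECTOR_BYTES]")
    return peak_gbps * useful_bytes_per_sector / SECTOR_BYTES


cases = [
    ("fully coalesced (32 of 32 bytes)", 32),
    ("80% sector utilization", 0.8 * SECTOR_BYTES),
    ("one float per sector (4 of 32 bytes)", 4),
    ("one byte per sector (1 of 32 bytes)", 1),
]
for label, useful in cases:
    print(f"{label}: at most {useful_bandwidth_bound(useful):.1f} GB/s useful")
```

Under these assumptions, a kernel that uses a single float from each fetched sector cannot exceed one eighth of peak, however well the rest of the kernel is tuned.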
GPU memory coalescing is typically the most impactful optimization for memory-bound GPU kernels: understanding and achieving coalesced access patterns can improve kernel performance by 10-100× compared to naive scattered memory access.