GPU shared memory bank conflicts are the performance hazard that occurs when multiple threads within a warp simultaneously access different addresses that map to the same shared memory bank. The hardware serializes what should be parallel accesses, so effective shared memory bandwidth drops by a factor equal to the conflict degree.
Bank Architecture:
- Bank Organization: shared memory is divided into 32 banks (matching warp width), each 4 bytes wide; consecutive 4-byte words map to consecutive banks (bank = (byte_address / 4) mod 32, using integer division)
- Conflict-Free Access: when all 32 threads access addresses in 32 different banks, or when all threads access the exact same address (broadcast), the access completes in a single cycle
- N-Way Conflict: when N threads access different addresses in the same bank, the hardware serializes into N sequential accesses — a 32-way conflict (all threads hit bank 0) is 32× slower than conflict-free
- Broadcast Mechanism: when multiple threads read the identical address, the hardware broadcasts the single read to all requesting threads in one cycle — this is NOT a conflict
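The mapping and conflict rules above can be sketched as a small host-side model. This is an illustrative simulation, not hardware-accurate code: the helper names `bank_of` and `conflict_degree` are hypothetical, and the model assumes 32 banks of 4 bytes with one warp-wide access at a time.

```python
from collections import defaultdict

NUM_BANKS = 32   # banks in shared memory (from the text)
BANK_WIDTH = 4   # bytes per bank word (from the text)

def bank_of(byte_address):
    """Bank index for a byte address: (address / 4) mod 32."""
    return (byte_address // BANK_WIDTH) % NUM_BANKS

def conflict_degree(byte_addresses):
    """Number of serialized passes needed for one warp-wide access.

    Threads that hit the same 4-byte word share a single broadcast,
    so only distinct words within a bank count toward the degree.
    """
    words_per_bank = defaultdict(set)
    for address in byte_addresses:
        word = address // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

warp = range(32)
consecutive = [4 * t for t in warp]       # thread t reads word t
same_bank = [4 * 32 * t for t in warp]    # thread t reads word 32 * t
broadcast = [64 for _ in warp]            # every thread reads the same word

print(conflict_degree(consecutive))  # 1  (conflict-free)
print(conflict_degree(same_bank))    # 32 (32-way conflict on bank 0)
print(conflict_degree(broadcast))    # 1  (broadcast, not a conflict)
```

The model counts distinct words per bank rather than threads per bank, which is what makes the broadcast case come out as a single pass.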
Common Conflict Patterns:
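- Stride-Based Access: accessing 4-byte words with stride 32 (or any multiple of 32) causes all threads to hit the same bank; stride 1 is conflict-free; stride 2 produces 2-way conflicts; in general a stride of s words produces a gcd(s, 32)-way conflict, so any odd stride is conflict-free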
- Matrix Column Access: storing a 32×32 matrix in shared memory row-major, then reading columns produces 32-way bank conflicts — the classic transpose problem
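- Reduction Operations: a naive tree-based reduction with interleaved addressing, where the stride doubles each step, sees the conflict degree grow as the stride grows (2-way, then 4-way, and so on); sequential addressing, where each thread adds an element a fixed offset away, keeps the warp's accesses at stride 1
- Histogram Binning: multiple threads atomically updating the same histogram bin in shared memory are serialized; strictly this is same-address atomic contention rather than a bank conflict, but the cost is similar, and updates to different bins that share a bank conflict as well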
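A rough way to check the stride, column, and reduction patterns listed above is to model them on the host. The sketch below assumes 32 banks of 4 bytes and full warps of 32 active threads; `conflict_degree` is a hypothetical helper, and the reduction indexing (`2 * s * t` for the interleaved form, `t + s` for the sequential form) is one common formulation that the text does not spell out.

```python
from collections import defaultdict
from math import gcd

NUM_BANKS = 32  # 4-byte banks, as described in the text

def conflict_degree(word_indices):
    """Serialized passes for one warp access, given 4-byte word indices."""
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

warp = range(32)

# Strided access: thread t touches word t * stride.
stride_degrees = {s: conflict_degree([t * s for t in warp]) for s in (1, 2, 3, 32)}
print(stride_degrees)  # {1: 1, 2: 2, 3: 1, 32: 32}
print(all(d == gcd(s, NUM_BANKS) for s, d in stride_degrees.items()))  # True

# Column read of a row-major 32x32 tile: thread t reads element (t, 5).
column_degree = conflict_degree([row * 32 + 5 for row in warp])
print(column_degree)  # 32

# Tree reduction, first warp only, block assumed large enough that all
# 32 threads stay active at every step shown.
steps = (1, 2, 4, 8, 16)
interleaved = {s: conflict_degree([2 * s * t for t in warp]) for s in steps}
sequential = {s: conflict_degree([t + s for t in warp]) for s in steps}
print(interleaved)  # {1: 2, 2: 4, 4: 8, 8: 16, 16: 32}
print(sequential)   # {1: 1, 2: 1, 4: 1, 8: 1, 16: 1}
```

The histogram case is not modeled here because atomic serialization depends on the data distribution rather than on a fixed address pattern.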
Conflict Avoidance Techniques:
- Padding: adding one extra element per row of a 2D shared memory array shifts column addresses across banks — declaring float smem[32][33] instead of float smem[32][32] eliminates column-access conflicts with minimal memory overhead
- Index Permutation: XOR-based index remapping (for example, storing element (row, col) at column col XOR row, so that bank = col XOR row) distributes accesses across banks without extra memory for specific access patterns like matrix transpose
- Access Reordering: restructuring algorithms so each warp accesses shared memory with stride-1 pattern wherever possible; converting AoS to SoA layout in shared memory
- Warp-Level Primitives: using __shfl_sync for register-to-register communication eliminates shared memory bank conflicts entirely for warp-local data exchange
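The effect of padding and XOR remapping on column reads can be illustrated with the same kind of host-side model. This is a sketch under stated assumptions: a 32×32 tile of 4-byte elements, 32 banks, and a swizzle that stores element (row, col) at physical column `col ^ row`. The function names are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

NUM_BANKS = 32  # 4-byte banks, as described in the text
TILE = 32       # tile is TILE x TILE elements of 4 bytes each

def conflict_degree(word_indices):
    """Serialized passes for one warp access, given 4-byte word indices."""
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

def column_read(row_pitch, column):
    """Words touched when thread t reads element (t, column)."""
    return [row * row_pitch + column for row in range(TILE)]

def swizzled_column_read(column):
    """Column read when element (row, col) lives at column col ^ row."""
    return [row * TILE + (column ^ row) for row in range(TILE)]

def swizzled_row_read(row):
    """Row read under the same swizzle: thread t reads element (row, t)."""
    return [row * TILE + (col ^ row) for col in range(TILE)]

# Worst case over all columns (or rows) for each layout.
unpadded = max(conflict_degree(column_read(32, c)) for c in range(TILE))
padded = max(conflict_degree(column_read(33, c)) for c in range(TILE))
swizzled_cols = max(conflict_degree(swizzled_column_read(c)) for c in range(TILE))
swizzled_rows = max(conflict_degree(swizzled_row_read(r)) for r in range(TILE))

print(unpadded)       # 32: smem[32][32], every row of a column hits one bank
print(padded)         # 1:  smem[32][33], bank = (row + column) mod 32
print(swizzled_cols)  # 1:  bank = column XOR row
print(swizzled_rows)  # 1:  XOR with a fixed row permutes the 32 columns
```

Padding costs one extra word per row, while the XOR layout keeps the tile at exactly 32×32 words at the price of extra index arithmetic on every access.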
Profiling and Diagnosis:
- Nsight Compute Metrics: l1tex__data_pipe_lsu_wavefronts_mem_shared reports the actual shared memory wavefront count; comparing it to the ideal (1 per instruction for 32-bit accesses) reveals the conflict ratio
- Bank Conflict Ratio: (actual_wavefronts / issued_instructions) - 1 gives the average number of additional serialized accesses per instruction; as a rule of thumb, values above 0.2 warrant optimization
- Occupancy Impact: severe bank conflicts do not reduce occupancy but extend instruction latency, stalling dependent operations and reducing instruction-level parallelism within each warp
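The ratio described above is simple enough to compute by hand, but a small helper makes the threshold check explicit. In this sketch the function name `bank_conflict_ratio`, the kernel names, and the counter values are all made up for illustration; real values would come from a profiler report, and the 0.2 cutoff is the rule of thumb from the text rather than a hard limit.

```python
def bank_conflict_ratio(wavefronts, instructions):
    """Average extra serialized accesses per shared memory instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return wavefronts / instructions - 1.0

THRESHOLD = 0.2  # rule-of-thumb cutoff from the text

# Hypothetical (wavefronts, instructions) pairs for three kernels.
samples = {
    "conflict_free": (1000, 1000),
    "mild": (1150, 1000),
    "two_way": (2000, 1000),
}

flagged = []
for name, (wavefronts, instructions) in samples.items():
    ratio = bank_conflict_ratio(wavefronts, instructions)
    if ratio > THRESHOLD:
        flagged.append(name)
    print(f"{name}: ratio {ratio:.2f}")

print(flagged)  # ['two_way']
```

A ratio of 1.0, as in the `two_way` sample, corresponds to every instruction needing one extra pass on average, which is what a uniform 2-way conflict would produce.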
GPU shared memory bank conflicts are a subtle but significant performance bottleneck that can reduce shared memory throughput by up to 32× — understanding bank mapping, applying padding or index permutation, and profiling with Nsight Compute are essential skills for achieving peak shared memory performance in CUDA kernels.