The GPU Streaming Multiprocessor (SM) is the fundamental compute building block of NVIDIA GPUs. Each SM contains CUDA cores, warp schedulers, register files, shared memory, and cache, and the full GPU is composed of tens to hundreds of SMs that independently execute thread blocks. Understanding SM architecture is essential for optimizing kernel occupancy, register usage, and shared memory allocation, and for achieving peak throughput on any CUDA workload.
SM Components (H100 Example)
```
┌─────────────────── Streaming Multiprocessor (SM) ───────────────────┐
│ [Warp Scheduler 0] [Warp Scheduler 1] [Warp Scheduler 2] [WS 3] │
│ ↓ ↓ ↓ ↓ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ 32 FP32 │ │ 32 FP32 │ │ 32 FP32 │ │ 32 FP32 │ = 128 FP32 │
│ │ cores │ │ cores │ │ cores │ │ cores │ cores/SM │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ 16 FP64 │ │ 16 FP64 │ │ 16 FP64 │ │ 16 FP64 │ = 64 FP64 │
│ │ units │ │ units │ │ units │ │ units │ cores/SM │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ┌────────────────────────┐ │
│ │ 4 Tensor Cores (4th gen)│ = Matrix multiply acceleration │
│ └────────────────────────┘ │
│ ┌────────────────────────┐ │
│ │ 16 SFU (Special Func) │ = sin, cos, rsqrt, etc. │
│ └────────────────────────┘ │
│ ┌────────────────────────┐ │
│ │ 16 LD/ST Units │ = Memory load/store │
│ └────────────────────────┘ │
│ Register File: 256 KB (65536 × 32-bit registers) │
│ L1 Cache / Shared Memory: 256 KB combined (shared configurable up to 228 KB) │
│ Max threads: 2048 Max warps: 64 Max blocks: 32 │
└────────────────────────────────────────────────────────────────────┘
```
SM Evolution Across Generations
| Architecture | Year | SMs | FP32/SM | Max Shared Mem/SM | Registers/SM |
|-------------|------|-----|---------|-------------------|-------------|
| Pascal (P100) | 2016 | 56 | 64 | 64 KB | 256 KB |
| Volta (V100) | 2017 | 80 | 64 | 96 KB | 256 KB |
| Ampere (A100) | 2020 | 108 | 64 | 164 KB | 256 KB |
| Hopper (H100) | 2022 | 132 | 128 | 228 KB | 256 KB |
| Blackwell (B200) | 2024 | 160+ | 128 | 228 KB | 256 KB |
Warp Scheduling
- Each SM has 4 warp schedulers (Volta and later).
- Each cycle, every scheduler selects one ready warp and issues one instruction from it.
- 4 schedulers × 1 instruction/cycle = up to 4 instructions issued per cycle per SM.
- When a warp stalls (e.g., on a memory access), the scheduler switches to another ready warp on the next cycle; the context switch is free because every warp's registers stay resident.
- This is why occupancy matters: more resident warps → more scheduling options → better latency hiding.
Resource Partitioning per Thread Block
```
Thread block requests:
- 256 threads (8 warps)
- 32 registers per thread = 8192 registers
- 4 KB shared memory
SM capacity: 65536 registers, 228 KB shared mem, 64 warps, 32 blocks
→ Can fit: min(65536/8192, 228/4, 64/8, 32) = min(8, 57, 8, 32) = 8 blocks
→ 64 warps active → 100% occupancy
```
Performance Optimization Based on SM
| Bottleneck | Symptom | Solution |
|-----------|---------|----------|
| Low occupancy | Few active warps | Reduce registers or shared mem per block |
| Register spill | Slow local memory access | Reduce variables, use __launch_bounds__ |
| Shared mem limited | Can't fit all data | Tile the computation |
| Compute bound | All cores busy | Algorithmic optimization |
| Memory bound | Cores waiting | Improve coalescing, caching |
GPU SM architecture is the hardware foundation that every CUDA optimization decision ultimately targets. How warps are scheduled, how registers and shared memory are partitioned across thread blocks, and how many SMs compose a given GPU together determine whether a kernel achieves 20% or 90% of theoretical peak throughput. That makes SM architecture knowledge the essential bridge between writing correct GPU code and writing fast GPU code.