GPU Streaming Multiprocessor (SM) Architecture

The Streaming Multiprocessor (SM) is the fundamental compute building block of NVIDIA GPUs. Each SM contains a set of CUDA cores, warp schedulers, a register file, shared memory, and cache. A full GPU is composed of tens to hundreds of SMs, each of which executes thread blocks independently. Understanding SM architecture is essential for optimizing kernel occupancy, register usage, and shared memory allocation, and for approaching peak throughput on a CUDA workload.

SM Components (H100 Example)

 ┌────────────────── Streaming Multiprocessor (SM) ───────────────────┐
 │  [Warp Scheduler 0] [Warp Scheduler 1] [Warp Scheduler 2] [WS 3]   │
 │          ↓                  ↓                  ↓            ↓      │
 │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐                │
 │  │ 32 FP32 │  │ 32 FP32 │  │ 32 FP32 │  │ 32 FP32 │  = 128 FP32    │
 │  │  cores  │  │  cores  │  │  cores  │  │  cores  │    cores/SM    │
 │  └─────────┘  └─────────┘  └─────────┘  └─────────┘                │
 │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐                │
 │  │ 16 FP64 │  │ 16 FP64 │  │ 16 FP64 │  │ 16 FP64 │  = 64 FP64/SM  │
 │  └─────────┘  └─────────┘  └─────────┘  └─────────┘  (if enabled)  │
 │  ┌──────────────────────────┐                                      │
 │  │ 4 Tensor Cores (4th gen) │  = Matrix multiply acceleration      │
 │  └──────────────────────────┘                                      │
 │  ┌──────────────────────────┐                                      │
 │  │ 16 SFU (Special Func)    │  = sin, cos, rsqrt, etc.             │
 │  └──────────────────────────┘                                      │
 │  ┌──────────────────────────┐                                      │
 │  │ 16 LD/ST Units           │  = Memory load/store                 │
 │  └──────────────────────────┘                                      │
 │  Register File: 256 KB (65536 × 32-bit registers)                  │
 │  L1 Cache / Shared Memory: 256 KB (configurable split)             │
 │  Max threads: 2048   Max warps: 64   Max blocks: 32                │
 └────────────────────────────────────────────────────────────────────┘
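
The summary figures in the diagram follow from a few base quantities. As a quick sanity check, the sketch below recomputes them in Python. The dictionary keys and variable names are illustrative choices, not part of any NVIDIA API, and the numbers are simply those shown in the diagram.

```python
# Base quantities read off the H100 SM diagram (key names are illustrative).
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

h100_sm = {
    "warp_schedulers": 4,
    "fp32_cores_per_scheduler": 32,
    "registers": 65536,   # number of 32-bit registers
    "max_threads": 2048,  # resident threads per SM
}

# Derived figures: each should match a number printed in the diagram.
fp32_cores = h100_sm["warp_schedulers"] * h100_sm["fp32_cores_per_scheduler"]
register_file_kb = h100_sm["registers"] * 4 // 1024  # 4 bytes per 32-bit register
max_warps = h100_sm["max_threads"] // WARP_SIZE
warps_per_scheduler = max_warps // h100_sm["warp_schedulers"]

print(f"FP32 cores per SM:   {fp32_cores}")
print(f"Register file (KB):  {register_file_kb}")
print(f"Max resident warps:  {max_warps}")
print(f"Warps per scheduler: {warps_per_scheduler}")
```

The last figure is the one the diagram does not state directly: with 64 resident warps shared among four schedulers, each scheduler has at most 16 warps to choose from.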

SM Evolution Across Generations

Architecture       Year   SMs    FP32/SM   Max Shared Mem/SM   Registers/SM
Pascal (P100)      2016   56     64        64 KB               256 KB
Volta (V100)       2017   80     64        96 KB               256 KB
Ampere (A100)      2020   108    64        164 KB              256 KB
Hopper (H100)      2022   132    128       228 KB              256 KB
Blackwell (B200)   2024   160+   128       228 KB              256 KB
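
Multiplying the SM count by the FP32 cores per SM gives the total CUDA core count of each GPU, which is one useful way to read the table. The sketch below does that arithmetic for the four rows with an exact SM count; Blackwell is left out because the table lists only a lower bound for it. The tuple layout and the function name are assumptions made for illustration.

```python
# (architecture, SM count, FP32 cores per SM), copied from the table.
GENERATIONS = [
    ("Pascal (P100)", 56, 64),
    ("Volta (V100)", 80, 64),
    ("Ampere (A100)", 108, 64),
    ("Hopper (H100)", 132, 128),
]


def total_fp32_cores(sms: int, fp32_per_sm: int) -> int:
    """Total FP32 (CUDA) cores on the GPU: one SM's cores times the SM count."""
    return sms * fp32_per_sm


totals = {name: total_fp32_cores(sms, per_sm) for name, sms, per_sm in GENERATIONS}

for name, cores in totals.items():
    print(f"{name:<14} {cores:>6} FP32 cores")
```

Note that the jump from A100 to H100 comes from both columns at once: more SMs and twice as many FP32 cores in each.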

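Warp Scheduling

Threads execute in warps of 32. Each of the four warp schedulers in an SM manages a share of the resident warps (up to 16 of the 64 on H100) and, every cycle, issues an instruction from a warp whose operands are ready. When one warp stalls on a memory access, the scheduler switches to another ready warp, which is how an SM hides latency and why keeping many warps resident matters.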

Resource Partitioning per Thread Block

 Thread block requests:
   - 256 threads (8 warps)
   - 32 registers per thread × 256 threads = 8192 registers
   - 4 KB shared memory

 SM capacity: 65536 registers, 228 KB shared mem (max configurable), 64 warps, 32 blocks
   → Blocks that fit: min(65536/8192, 228K/4K, 64/8, 32) = min(8, 57, 8, 32) = 8 blocks
   → 8 blocks × 8 warps = 64 active warps → 100% occupancy
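
The calculation above can be sketched as a small occupancy estimator. This is a simplified model, not NVIDIA's occupancy calculator: it ignores register and shared-memory allocation granularity, and the class and function names are hypothetical. The shared-memory limit is a parameter; the default of 228 KB is an assumption, being the most that H100 exposes as shared memory out of its 256 KB L1/shared pool.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp


@dataclass(frozen=True)
class SMLimits:
    """Per-SM limits; defaults follow the H100 figures used in this article."""
    registers: int = 65536
    shared_mem_bytes: int = 228 * 1024  # assumption: max shared-memory carve-out
    max_warps: int = 64
    max_blocks: int = 32


def warps_in_block(threads_per_block: int) -> int:
    """Warps needed for one block, rounding a partial warp up."""
    return -(-threads_per_block // WARP_SIZE)


def blocks_per_sm(threads_per_block: int, regs_per_thread: int,
                  smem_per_block: int, sm: SMLimits = SMLimits()) -> int:
    """Resident blocks per SM: the tightest of the four resource limits."""
    warps = warps_in_block(threads_per_block)
    by_regs = sm.registers // (regs_per_thread * warps * WARP_SIZE)
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block > 0 else sm.max_blocks
    by_warps = sm.max_warps // warps
    return min(by_regs, by_smem, by_warps, sm.max_blocks)


def occupancy(threads_per_block: int, regs_per_thread: int,
              smem_per_block: int, sm: SMLimits = SMLimits()) -> float:
    """Fraction of the SM's warp slots that are filled."""
    blocks = blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block, sm)
    return blocks * warps_in_block(threads_per_block) / sm.max_warps


print(blocks_per_sm(256, 32, 4 * 1024))  # worked example: 8 blocks
print(occupancy(256, 32, 4 * 1024))      # 1.0
print(occupancy(256, 64, 4 * 1024))      # 0.5: doubling registers halves occupancy
```

The third call shows why register pressure matters: at 64 registers per thread the register file fits only 4 such blocks, so half the warp slots stay empty. For real kernels, the CUDA runtime's occupancy functions account for the allocation details this sketch leaves out.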

Performance Optimization Based on SM

Bottleneck           Symptom                    Solution
Low occupancy        Few active warps           Reduce registers or shared mem per block
Register spill       Slow local memory access   Reduce variables, use __launch_bounds__
Shared mem limited   Can't fit all data         Tile the computation
Compute bound        All cores busy             Algorithmic optimization
Memory bound         Cores waiting              Improve coalescing, caching
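
The first two rows of the table trade registers against resident threads. As a rough guide, the sketch below computes how many registers each thread can use before the register file, rather than the thread limit, caps occupancy. It assumes the H100 figures used earlier (65536 registers and 2048 threads per SM) and the CUDA limit of 255 registers per thread. The function name is hypothetical, and real hardware allocates registers in coarser units than this arithmetic suggests.

```python
def max_regs_per_thread(target_occupancy: float,
                        regs_per_sm: int = 65536,
                        max_threads_per_sm: int = 2048,
                        hw_cap: int = 255) -> int:
    """Largest per-thread register count that still allows the target occupancy.

    target_occupancy is the desired fraction of resident threads (0 < x <= 1).
    hw_cap is the per-thread register limit imposed by the architecture.
    """
    if not 0.0 < target_occupancy <= 1.0:
        raise ValueError("target_occupancy must be in (0, 1]")
    resident_threads = int(target_occupancy * max_threads_per_sm)
    if resident_threads == 0:
        raise ValueError("target_occupancy is too small for one resident thread")
    return min(regs_per_sm // resident_threads, hw_cap)


for target in (1.0, 0.5, 0.25, 0.125):
    print(f"{target:>6.1%} occupancy -> up to {max_regs_per_thread(target)} registers/thread")
```

A budget of this kind is what `__launch_bounds__` or the `-maxrregcount` compiler option asks the compiler to respect. Pushing the budget too low causes the register spills described in the second row, so the right value is usually found by measurement.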

SM architecture is the hardware foundation that every CUDA optimization decision ultimately targets. How warps are scheduled, how registers and shared memory are partitioned across thread blocks, and how many SMs a given GPU contains together determine whether a kernel reaches 20% or 90% of theoretical peak throughput. Knowledge of the SM is therefore the bridge between writing correct GPU code and writing fast GPU code.
