HBM (High Bandwidth Memory) is 3D-stacked DRAM designed to deliver very high memory bandwidth to GPUs and accelerators. Current HBM-based accelerators reach roughly 2-5 TB/s, versus ~100 GB/s for a typical DDR5 system. That gap is critical for LLM inference, where moving weights from memory to compute is usually the primary bottleneck.
What Is HBM?
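- Definition: 3D-stacked DRAM connected to the processor through a silicon interposer.
- Innovation: Very wide interface (1024 bits per stack in HBM1 through HBM3e), made practical by vertical stacking and short interposer wiring.
- Bandwidth: Roughly 2-5× that of GDDR6X graphics memory and 20-50× that of DDR5 system memory at the device level.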
- Use: AI accelerators (H100, MI300), HPC, graphics.
Why Bandwidth Matters for AI
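- Memory-Bound: At small batch sizes, LLM token generation is limited by memory bandwidth, not compute.
- Weight Movement: Generating each token requires reading all model weights from memory once.
- Bottleneck Equation: Tokens/sec ≤ Bandwidth / Model Size in bytes, where Model Size = 2 bytes × parameter count for FP16.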
- More Bandwidth = More Tokens/Second.
Memory Technology Comparison
Memory | Device bandwidth | Capacity | Cost | Example device
----------|------------------|-------------|-----------|------------------
HBM3e | 4.8 TB/s | 141 GB | Very high | H200
HBM3 | 3.35 TB/s | 80 GB | High | H100
HBM2e | 2.0 TB/s | 80 GB | High | A100
GDDR6X | 1.0 TB/s | 24 GB | Medium | RTX 4090
GDDR6 | 0.5 TB/s | 16-48 GB | Medium | RX 6800 XT
DDR5 | 0.1 TB/s | 128+ GB | Low | CPU RAM
How HBM Works
Architecture (schematic side view; the GPU die and the HBM stacks sit side by side on top of the interposer, and the stack count varies by product):
┌─────┐ ┌─────┐ ┌───────────┐ ┌─────┐ ┌─────┐
│ HBM │ │ HBM │ │  GPU Die  │ │ HBM │ │ HBM │
│Stack│ │Stack│ │           │ │Stack│ │Stack│
└──┬──┘ └──┬──┘ └─────┬─────┘ └──┬──┘ └──┬──┘
┌──┴───────┴──────────┴──────────┴───────┴──┐
│            Silicon Interposer             │
└───────────────────────────────────────────┘
Each HBM stack:
- 8-12 DRAM dies stacked vertically
- Connected via Through-Silicon Vias (TSVs)
- 1024-bit wide interface per stack
- H100 has 5 active stacks = 5120-bit total width
Bandwidth Calculation:
HBM3 (H100):
Width: 5 stacks × 1024 bits = 5120 bits
Speed: ~5.2 Gbps per pin
Bandwidth: 5120 bits × 5.2 Gbps / 8 bits per byte ≈ 3,330 GB/s, in line with the 3.35 TB/s quoted for the H100
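The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the function name `hbm_bandwidth_tb_s` is illustrative, and the 5.2 Gbps pin rate is the rounded figure used in the text, not a datasheet value.

```python
def hbm_bandwidth_tb_s(stacks: int, pin_rate_gbps: float, bits_per_stack: int = 1024) -> float:
    """Peak bandwidth in TB/s: total interface width times per-pin data rate."""
    total_bits = stacks * bits_per_stack               # e.g. 5 x 1024 = 5120 bits
    gigabytes_per_s = total_bits * pin_rate_gbps / 8   # Gbit/s -> GB/s
    return gigabytes_per_s / 1000                      # GB/s -> TB/s (decimal units)


if __name__ == "__main__":
    # H100 figures from the text: 5 active stacks at roughly 5.2 Gbps per pin.
    print(f"{hbm_bandwidth_tb_s(5, 5.2):.2f} TB/s")    # prints 3.33 TB/s
```

The result lands slightly under the quoted 3.35 TB/s because the pin rate is rounded to one decimal place.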
LLM Inference Throughput Limit
Theoretical Maximum:
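Max tokens/sec = Memory Bandwidth / Bytes Read per Token
At batch size 1, the bytes read per token are approximately the full model size.
For 70B model (FP16 = 140 GB):
H100: 3.35 TB/s / 140 GB = 24 tokens/sec (theoretical max)
A100: 2.0 TB/s / 140 GB = 14 tokens/sec
RTX 4090: 1.0 TB/s / 140 GB = 7 tokens/sec
These are bandwidth ceilings only: 140 GB of weights does not fit in any one of these GPUs (80 GB or 24 GB), so a real deployment needs multiple GPUs or quantization.
In practice, achieved throughput is typically ~70-80% of the theoretical figure due to overhead.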
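The ceiling formula above is simple enough to script. The sketch below assumes batch size 1 and that every generated token reads all weights exactly once; the function name `max_tokens_per_sec` is illustrative.

```python
def max_tokens_per_sec(bandwidth_tb_s: float, params_billions: float,
                       bytes_per_param: float = 2.0) -> float:
    """Upper bound on batch-1 decode speed when each token reads all weights once."""
    model_gb = params_billions * bytes_per_param   # 70B x 2 bytes = 140 GB
    return bandwidth_tb_s * 1000 / model_gb        # (GB/s) / (GB per token)


if __name__ == "__main__":
    # Bandwidth figures are the ones used in the text.
    for name, bw in [("H100", 3.35), ("A100", 2.0), ("RTX 4090", 1.0)]:
        print(f"{name}: {max_tokens_per_sec(bw, 70):.0f} tokens/sec")
    # prints 24, 14 and 7 tokens/sec respectively
```

Multiplying the result by 0.7-0.8 gives a rough estimate of what the text describes as achievable in practice.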
Impact on Different Models:
Model | Size (FP16) | H100 Max | A100 Max
--------|-------------|----------|----------
7B | 14 GB | 239 tok/s| 143 tok/s
13B | 26 GB | 129 tok/s| 77 tok/s
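70B | 140 GB | 24 tok/s* | 14 tok/s*
405B | 810 GB | 4 tok/s* | 2.5 tok/s*
* Weights exceed a single GPU's 80 GB, so multiple GPUs are required. Figures use one GPU's bandwidth for comparison only.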
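To see which rows need more than one GPU, a capacity check can sit alongside the bandwidth ceiling. This is a rough sketch that counts weights only; it ignores KV cache, activations, and framework overhead, so real deployments need extra headroom. The function name `min_gpus_for_weights` is illustrative.

```python
import math


def min_gpus_for_weights(model_gb: float, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(model_gb / gpu_memory_gb)


if __name__ == "__main__":
    # FP16 sizes from the table, against 80 GB GPUs.
    for name, size_gb in [("7B", 14), ("13B", 26), ("70B", 140), ("405B", 810)]:
        print(f"{name}: {min_gpus_for_weights(size_gb, 80)} x 80 GB GPU(s)")
    # prints 1, 1, 2 and 11 GPUs respectively
```

Note that the 70B model at FP16 already needs two 80 GB GPUs by this count, not only the 405B model.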
HBM Generations
Generation | Bandwidth/stack | GPU Example | Year
-----------|-----------------|-------------|------
HBM1 | 128 GB/s | Fiji | 2015
HBM2 | 256 GB/s | P100, V100 | 2016
HBM2e | 450 GB/s | A100 | 2020
HBM3 | 665 GB/s | H100 | 2022
HBM3e | 1.2 TB/s | H200 | 2024
HBM4 | 2+ TB/s | (Future) | 2025+
Implications for ML
GPU Selection:
- For LLM inference, prioritize bandwidth over FLOPS.
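- H100 vs. A100: About 3× the dense FP16 tensor FLOPS but only ~1.7× the bandwidth (3.35 vs. 2.0 TB/s), so memory-bound decoding speeds up by roughly 1.7×, not by the FLOPS ratio.
- RTX 4090: Great for small models, but its 24 GB capacity and ~1 TB/s bandwidth limit it for 70B+ models.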
Quantization Impact:
Quantization reduces model size → more tokens/sec:
70B model:
FP16 (140 GB): 24 tok/s on H100
INT8 (70 GB): 48 tok/s on H100
INT4 (35 GB): 96 tok/s on H100
4-bit raises the throughput ceiling by up to ~4× over FP16, before dequantization overhead and any accuracy trade-off.
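The quantization figures above follow directly from the smaller weight footprint. The sketch below reproduces them under the assumption that weights are the only memory traffic; it ignores quantization scales, zero-points, and dequantization cost. The function name `quantized_ceiling` is illustrative.

```python
def quantized_ceiling(bandwidth_tb_s: float, params_billions: float,
                      bits_per_weight: int) -> tuple[float, float]:
    """Return (model size in GB, bandwidth-bound tokens/sec) for a weight precision."""
    model_gb = params_billions * bits_per_weight / 8   # bits -> bytes per parameter
    return model_gb, bandwidth_tb_s * 1000 / model_gb


if __name__ == "__main__":
    # 70B model at the H100 bandwidth used in the text (3.35 TB/s).
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        size, tps = quantized_ceiling(3.35, 70, bits)
        print(f"{label} ({size:.0f} GB): {tps:.0f} tok/s")
    # prints 140 GB / 24 tok/s, 70 GB / 48 tok/s, 35 GB / 96 tok/s
```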
Batching Benefit:
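Single request: Bandwidth limited
Batching N requests: Weights are read once per step and shared by all N sequences, so one pass over memory yields N output tokens
Illustrative aggregate throughput for a 70B FP16 model (the exact crossover to compute bound depends on the GPU's FLOPS-to-bandwidth ratio and on sequence length):
Batch size 1: 24 tok/s (memory bound)
Batch size 8: 140 tok/s (becoming compute bound)
Batch size 32: 500 tok/s (compute bound)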
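The shift from memory bound to compute bound can be sketched with a two-term roofline model. This is a simplified illustration, not a performance predictor: it assumes about 2 FLOPs per parameter per generated token, ignores KV-cache reads, attention cost, and kernel overhead, and takes `peak_tflops` as a hypothetical input that must be supplied for the hardware in question.

```python
def decode_throughput(batch: int, model_gb: float, bandwidth_tb_s: float,
                      params_billions: float, peak_tflops: float) -> float:
    """Aggregate tokens/sec for one decode step under a two-term roofline."""
    # Time to stream all weights once; shared by every sequence in the batch.
    memory_time = model_gb / (bandwidth_tb_s * 1000)
    # Assumed ~2 FLOPs per parameter per token, times the number of sequences.
    flops_per_step = 2 * params_billions * 1e9 * batch
    compute_time = flops_per_step / (peak_tflops * 1e12)
    # Whichever resource is slower sets the step time.
    step_time = max(memory_time, compute_time)
    return batch / step_time


if __name__ == "__main__":
    # peak_tflops=1000 is a round illustrative number, not a vendor spec.
    for batch in (1, 8, 32, 512):
        tps = decode_throughput(batch, model_gb=140, bandwidth_tb_s=3.35,
                                params_billions=70, peak_tflops=1000)
        print(f"batch {batch}: {tps:.0f} tok/s")
```

Because this model leaves out KV-cache traffic and overheads, it scales linearly with batch size until compute takes over, and measured throughput typically grows more slowly than it predicts.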
HBM and memory bandwidth set the fundamental limit on LLM inference performance. Understanding this constraint explains why quantization, batching, and newer GPUs with more and faster HBM are essential for efficient AI serving.