HBM (High Bandwidth Memory)

HBM (High Bandwidth Memory) is specialized 3D-stacked DRAM designed to provide massive memory bandwidth to GPUs and accelerators. It delivers 2-5 TB/s per device versus ~100 GB/s for standard DDR, which makes it critical for LLM inference, where moving weights from memory to compute is the primary bottleneck.

What Is HBM?

Why Bandwidth Matters for AI

Memory Technology Comparison

Memory    | Bandwidth | Capacity    | Cost     | Use Case
----------|-----------|-------------|----------|------------------
HBM3e     | 4.8 TB/s  | 141 GB      | Very high| H200
HBM3      | 3.35 TB/s | 80 GB       | High     | H100
HBM2e     | 2.0 TB/s  | 80 GB       | High     | A100
GDDR6X    | 1.0 TB/s  | 24 GB       | Medium   | RTX 4090
GDDR6     | 0.5 TB/s  | 16-48 GB    | Medium   | RX 6800 XT
DDR5      | 0.1 TB/s  | 128+ GB     | Low      | CPU RAM

How HBM Works

Architecture:

┌─────┬─────┬───────────────────┬─────┬─────┐
│HBM  │HBM  │      GPU Die      │HBM  │HBM  │
│Stack│Stack│                   │Stack│Stack│
├─────┴─────┴───────────────────┴─────┴─────┤
│            Silicon Interposer             │
└───────────────────────────────────────────┘

(Side view, not to scale: the HBM stacks sit beside the GPU die, and both are mounted on the silicon interposer.)

Each HBM stack:
- 8-12 DRAM dies stacked vertically
- Connected via Through-Silicon Vias (TSVs)
- 1024-bit wide interface per stack
- H100 has 5 stacks = 5120-bit total width

Bandwidth Calculation:

HBM3 (H100):
Width: 5 stacks × 1024 bits = 5120 bits
Speed: ~5.2 Gbps per pin
Bandwidth: 5120 × 5.2 Gbps / 8 ≈ 3.33 TB/s (the 3.35 TB/s spec corresponds to ~5.23 Gbps per pin)
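The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `hbm_bandwidth_gb_s` and its parameters are illustrative, and the inputs are the stack count, interface width, and pin rate quoted in this section.

```python
def hbm_bandwidth_gb_s(stacks: int, bits_per_stack: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s: total interface width times per-pin rate, bits -> bytes."""
    total_bits = stacks * bits_per_stack
    return total_bits * gbps_per_pin / 8


# H100 figures from this section: 5 stacks, 1024-bit interface each, ~5.2 Gbps per pin.
h100 = hbm_bandwidth_gb_s(stacks=5, bits_per_stack=1024, gbps_per_pin=5.2)
per_stack = hbm_bandwidth_gb_s(stacks=1, bits_per_stack=1024, gbps_per_pin=5.2)

print(f"H100 total: {h100:.0f} GB/s ({h100 / 1000:.2f} TB/s)")
print(f"Per stack:  {per_stack:.1f} GB/s")
```

Doubling either the interface width or the pin rate doubles the result, which is why each HBM generation raises one or both.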

LLM Inference Throughput Limit

Theoretical Maximum:

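Max tokens/sec = Memory Bandwidth / Bytes Read per Token

At batch size 1, generating each token reads every weight once, so bytes read per token ≈ model size.

For 70B model (FP16 = 140 GB):
H100: 3.35 TB/s / 140 GB = 24 tokens/sec (theoretical max)
A100: 2.0 TB/s / 140 GB = 14 tokens/sec
RTX 4090: 1.0 TB/s / 140 GB = 7 tokens/sec

These figures are bandwidth-only bounds: 140 GB of weights does not fit in the memory of any single GPU listed, so real deployments shard the model across several GPUs.

In practice, throughput is roughly 70-80% of the theoretical maximum due to overhead.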
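As a rough sketch, the bound can be written as a small Python function. The name `max_tokens_per_sec` and the `efficiency` parameter are illustrative; the value 0.75 is simply the midpoint of the 70-80% range quoted above, not a measured figure.

```python
def max_tokens_per_sec(bandwidth_tb_s: float, model_gb: float, efficiency: float = 1.0) -> float:
    """Bandwidth-bound decode rate at batch size 1 (every token reads all weights once)."""
    bytes_per_sec = bandwidth_tb_s * 1e12
    bytes_per_token = model_gb * 1e9
    return efficiency * bytes_per_sec / bytes_per_token


# Bandwidth figures from the comparison table; 140 GB is a 70B model in FP16.
gpus = {"H100": 3.35, "A100": 2.0, "RTX 4090": 1.0}
for name, bw in gpus.items():
    ideal = max_tokens_per_sec(bw, 140)
    derated = max_tokens_per_sec(bw, 140, efficiency=0.75)  # assumed overhead factor
    print(f"{name:>8}: {ideal:5.1f} tok/s ideal, ~{derated:4.1f} tok/s at 75% efficiency")
```

The function ignores memory capacity, so it reports a bound even for GPUs that cannot hold the model.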

Impact on Different Models:

Model   | Size (FP16) | H100 Max | A100 Max
--------|-------------|----------|----------
7B      | 14 GB       | 239 tok/s| 143 tok/s
13B     | 26 GB       | 129 tok/s| 77 tok/s
70B     | 140 GB      | 24 tok/s*| 14 tok/s*
405B    | 810 GB      | 4 tok/s* | 2.5 tok/s*
* Multi-GPU required (model exceeds the 80 GB capacity of a single H100 or A100); figures use single-GPU bandwidth
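The table can be regenerated, together with a capacity check, using a short script. This is a sketch under stated assumptions: `bandwidth_bound` is an illustrative helper, the 80 GB capacities come from the comparison table, and the GPU count covers weights only (no KV cache or activations).

```python
import math

# (bandwidth in TB/s, capacity in GB), as listed in the comparison table.
GPUS = {"H100": (3.35, 80), "A100": (2.0, 80)}
# FP16 weight sizes in GB (2 bytes per parameter).
MODELS = {"7B": 14, "13B": 26, "70B": 140, "405B": 810}


def bandwidth_bound(model_gb: float, bandwidth_tb_s: float, capacity_gb: float):
    """Return (max tokens/sec at single-GPU bandwidth, minimum GPUs to hold the weights)."""
    tok_s = bandwidth_tb_s * 1000 / model_gb
    min_gpus = math.ceil(model_gb / capacity_gb)
    return tok_s, min_gpus


for model, size_gb in MODELS.items():
    cells = []
    for gpu, (bw, cap) in GPUS.items():
        tok_s, min_gpus = bandwidth_bound(size_gb, bw, cap)
        flag = "*" if min_gpus > 1 else ""  # "*" marks models that need more than one GPU
        cells.append(f"{gpu}: {tok_s:6.1f} tok/s{flag}")
    print(f"{model:>5} ({size_gb:>3} GB)  " + "  ".join(cells))
```

With these inputs the 70B and 405B rows are flagged, since 140 GB and 810 GB both exceed 80 GB.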

HBM Generations

Generation | Bandwidth/stack | GPU Example | Year
-----------|-----------------|-------------|------
HBM1       | 128 GB/s        | Fiji        | 2015
HBM2       | 256 GB/s        | P100        | 2016
HBM2e      | 460 GB/s        | A100        | 2020
HBM3       | 665 GB/s        | H100        | 2022
HBM3e      | 1.2 TB/s        | H200        | 2024
HBM4       | 2+ TB/s         | (Future)    | 2025+

Implications for ML

GPU Selection:

Quantization Impact:

Quantization reduces model size → more tokens/sec:

70B model:
FP16 (140 GB): 24 tok/s on H100
INT8 (70 GB):  48 tok/s on H100
INT4 (35 GB):  96 tok/s on H100

4-bit enables up to ~4× the FP16 throughput while inference remains bandwidth bound.
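The scaling follows directly from bytes per parameter, as this sketch illustrates. `BYTES_PER_PARAM` and `quantized_estimate` are illustrative names, and the estimate assumes the weights shrink exactly in proportion to bit width; real quantized formats also store scales and other metadata, so actual sizes are slightly larger.

```python
# Idealized storage cost per parameter (assumption: no quantization metadata).
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def quantized_estimate(params_billion: float, precision: str, bandwidth_tb_s: float):
    """Return (weight size in GB, bandwidth-bound tokens/sec at batch size 1)."""
    size_gb = params_billion * BYTES_PER_PARAM[precision]
    tok_s = bandwidth_tb_s * 1000 / size_gb
    return size_gb, tok_s


baseline = quantized_estimate(70, "FP16", 3.35)[1]
for precision in BYTES_PER_PARAM:
    size_gb, tok_s = quantized_estimate(70, precision, 3.35)  # 70B model, H100 bandwidth
    print(f"{precision}: {size_gb:5.0f} GB, {tok_s:5.1f} tok/s ({tok_s / baseline:.1f}x FP16)")
```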

Batching Benefit:

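Single request: bandwidth limited
Batching N requests: one pass over the weights yields N output tokens

Illustrative aggregate throughput (70B FP16; actual figures and the crossover batch size depend on the GPU's compute-to-bandwidth ratio and the serving stack):

Batch size 1:  24 tok/s (memory bound)
Batch size 8:  ~140 tok/s (becoming compute bound)
Batch size 32: ~500 tok/s (compute bound)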
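The memory-bound to compute-bound transition can be sketched with a simple roofline-style model. Everything here is an assumption for illustration: `decode_throughput` is a hypothetical helper, the 1000 TFLOPS peak is a placeholder rather than a quoted spec, each token is costed at roughly 2 FLOPs per parameter, and KV-cache traffic and scheduling overhead are ignored. Because of those simplifications its output will not match the figures above, and it places the crossover at a larger batch size than real systems typically see.

```python
def decode_throughput(batch: int, params_billion: float, bytes_per_param: float,
                      bandwidth_tb_s: float, peak_tflops: float):
    """Return (aggregate tokens/sec, regime) for one decode step over `batch` requests."""
    params = params_billion * 1e9
    # Weights are read once per step regardless of batch size.
    t_memory = params * bytes_per_param / (bandwidth_tb_s * 1e12)
    # Compute grows linearly with batch size (~2 FLOPs per parameter per token).
    t_compute = batch * 2 * params / (peak_tflops * 1e12)
    step_time = max(t_memory, t_compute)
    regime = "memory bound" if t_memory >= t_compute else "compute bound"
    return batch / step_time, regime


for batch in (1, 8, 32, 128, 512):
    tok_s, regime = decode_throughput(batch, params_billion=70, bytes_per_param=2,
                                      bandwidth_tb_s=3.35, peak_tflops=1000)  # assumed peak
    print(f"batch {batch:>3}: {tok_s:8.1f} tok/s ({regime})")
```

In this model throughput scales linearly with batch size until the compute time exceeds the weight-read time, after which it plateaus at the compute limit.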

HBM and memory bandwidth are the physics that govern LLM inference performance. Understanding this fundamental constraint explains why quantization, batching, and newer GPUs with more HBM are essential for efficient AI serving.
