DeepSpeed ZeRO

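ZeRO (Zero Redundancy Optimizer) is DeepSpeed's memory optimization strategy. It eliminates redundant storage of model states (parameters, gradients, and optimizer states) across data-parallel GPUs, so the largest trainable model grows with the number of GPUs instead of being capped by what a single GPU can hold under standard data parallelism.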

The Redundancy Problem

- Standard DDP: every GPU stores a full copy of all model states (mixed-precision training with Adam):
  - Parameters (fp16): 2 bytes per param
  - Gradients (fp16): 2 bytes per param
  - Optimizer states (fp32 params + momentum + variance, 4 bytes each): 12 bytes per param
- Total: 2 + 2 + 12 = 16 bytes per parameter per GPU. With N GPUs, the same states are stored N times.

ZeRO Stages

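- ZeRO-1: Partition optimizer states across GPUs. Each GPU stores 1/N of the optimizer states, for 4 + 12/N bytes per param. Up to ~4x memory reduction as N grows.
- ZeRO-2: + Partition gradients. Each GPU also stores only 1/N of the gradients, for 2 + 14/N bytes per param. Up to ~8x reduction as N grows.
- ZeRO-3: + Partition parameters. Each GPU also stores only 1/N of the parameters, for 16/N bytes per param. N× reduction, so the model states can be larger than a single GPU's memory.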
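
To make the partitioning idea concrete, here is a minimal single-process sketch of a ZeRO-1 style update. It is not DeepSpeed code: the "ranks" are loop iterations, the all-gather is a plain list concatenation, and SGD with momentum stands in for Adam to keep it short. The helper names (`shard_bounds`, `full_step`, `zero1_step`) are hypothetical.

```python
# Toy simulation of ZeRO-1: each rank keeps optimizer state for its shard only.
# Assumption: gradients have already been averaged across ranks.

def shard_bounds(num_elems, world_size, rank):
    """Return the [start, end) range of a flat tensor owned by `rank`."""
    per_rank = (num_elems + world_size - 1) // world_size  # ceiling division
    start = min(rank * per_rank, num_elems)
    return start, min(start + per_rank, num_elems)

def full_step(params, grads, momentum, lr=0.1, beta=0.9):
    """Baseline: one replica stores all momentum and updates every element."""
    new_m = [beta * m + g for m, g in zip(momentum, grads)]
    new_p = [p - lr * m for p, m in zip(params, new_m)]
    return new_p, new_m

def zero1_step(params, grads, momentum_shards, lr=0.1, beta=0.9):
    """Sharded: rank r updates only its slice, using only its momentum shard."""
    world_size = len(momentum_shards)
    updated = []
    for rank in range(world_size):
        start, end = shard_bounds(len(params), world_size, rank)
        m = [beta * m_i + g_i
             for m_i, g_i in zip(momentum_shards[rank], grads[start:end])]
        momentum_shards[rank] = m
        updated.append([p - lr * m_i for p, m_i in zip(params[start:end], m)])
    # Stand-in for all-gather: every rank ends up with the full parameters.
    return [p for shard in updated for p in shard]

params = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
grads = [0.5, -0.5, 1.0, -1.0, 0.25, -0.25]
world_size = 3

momentum_shards = []
for rank in range(world_size):
    start, end = shard_bounds(len(params), world_size, rank)
    momentum_shards.append([0.0] * (end - start))

baseline, _ = full_step(params, grads, [0.0] * len(params))
sharded = zero1_step(params, grads, momentum_shards)
print(sharded == baseline)
```

Each simulated rank holds momentum for 2 of the 6 elements, yet the gathered result matches the unsharded update. The memory saving comes from the smaller optimizer state per rank, not from any change to the math.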

Example: 10B Parameter Model with 8 GPUs

| Strategy | Model-state memory per GPU |
|---|---|
| Standard DDP | ~160 GB (doesn't fit!) |
| ZeRO-1 | ~55 GB |
| ZeRO-2 | ~37.5 GB |
| ZeRO-3 | ~20 GB (fits in a 40 GB A100, leaving room for activations) |
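
The per-GPU figures follow directly from the 16-bytes-per-parameter accounting. The sketch below reproduces that arithmetic under the same assumptions (fp16 parameters and gradients, Adam with fp32 states, 1 GB = 1e9 bytes). The function name `model_state_gb` is hypothetical, and activations, temporary buffers, and fragmentation are deliberately left out.

```python
BYTES_PARAMS = 2      # fp16 parameters
BYTES_GRADS = 2       # fp16 gradients
BYTES_OPTIMIZER = 12  # fp32 params + momentum + variance (Adam)

def model_state_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB for ZeRO stage 0 (plain DDP) to 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    # Each stage partitions one more component across the GPUs.
    optimizer = BYTES_OPTIMIZER / num_gpus if stage >= 1 else BYTES_OPTIMIZER
    grads = BYTES_GRADS / num_gpus if stage >= 2 else BYTES_GRADS
    params = BYTES_PARAMS / num_gpus if stage >= 3 else BYTES_PARAMS
    return num_params * (params + grads + optimizer) / 1e9

for stage in range(4):
    gb = model_state_gb(10_000_000_000, 8, stage)
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Real usage will be higher than these numbers, since they cover model states only.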

ZeRO-Offload / ZeRO-Infinity

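- ZeRO-Offload: moves optimizer states (and the optimizer step) from GPU memory to CPU RAM
- ZeRO-Infinity: builds on ZeRO-3 and can offload optimizer states and/or parameters to CPU RAM or NVMe SSD
- Both trade GPU memory for host or storage bandwidth, which enables training much larger models, up to the trillion-parameter scale, on limited GPU hardware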

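Usage: deepspeed --num_gpus=8 train.py --deepspeed ds_config.json

This assumes train.py accepts the --deepspeed arguments (for example by calling deepspeed.add_config_arguments on its argument parser) and hands them to deepspeed.initialize. The ZeRO stage and any offloading are selected in ds_config.json, not on the command line.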
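
As an illustration of what ds_config.json might contain, the sketch below writes a minimal config that enables a ZeRO stage and, optionally, CPU offload of optimizer states. The helper `build_ds_config` is hypothetical, and the batch size and accumulation values are placeholders rather than recommendations. The config keys used are standard DeepSpeed ones, but a real config usually needs more fields (optimizer, scheduler, and so on).

```python
import json

def build_ds_config(stage=2, offload_optimizer_to_cpu=False):
    """Build a minimal DeepSpeed config dict with ZeRO enabled."""
    config = {
        "train_micro_batch_size_per_gpu": 4,  # placeholder value
        "gradient_accumulation_steps": 1,     # placeholder value
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": stage},
    }
    if offload_optimizer_to_cpu:
        # ZeRO-Offload style: keep optimizer states in pinned CPU memory.
        config["zero_optimization"]["offload_optimizer"] = {
            "device": "cpu",
            "pin_memory": True,
        }
    return config

config = build_ds_config(stage=2, offload_optimizer_to_cpu=True)
with open("ds_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Generating the file from Python keeps the stage and offload choice in one place. Writing the JSON by hand works just as well.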

ZeRO is one of the most impactful memory optimizations for LLM training: by sharding model states across GPUs instead of replicating them, it helps make training 70B+ parameter models practical.
