
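ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and parameters across data-parallel devices instead of replicating them.

The problem: Standard data parallelism keeps a full copy of the model state on every device, which wastes memory. With N devices, a 175B-parameter model is stored N times, and each copy must still fit on a single device.

ZeRO insight: Optimizer states (such as Adam moments), gradients, and parameters don't all need to be replicated. Partition them so that each device stores only a 1/N slice.

ZeRO stages: With mixed-precision Adam, model state takes about 16 bytes per parameter: 2 for fp16 parameters, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, variance).
Stage 1: Partition optimizer states. Up to 4x memory reduction, because optimizer states account for 12 of the 16 bytes.
Stage 2: Also partition gradients. Up to 8x reduction.
Stage 3: Also partition parameters. Reduction scales linearly with device count.

How it works (Stage 3): Each device owns a shard of the parameters. Devices all-gather the parameters needed for each forward and backward computation and discard the non-local ones afterward. Gradients are reduce-scattered so that each device receives the reduced gradients for its own shard, and each device then updates only that local shard.

Communication overhead: Stages 1 and 2 keep the same communication volume as standard data parallelism. Stage 3 adds about 50% because parameters must be all-gathered, but it enables training model sizes that would otherwise be impossible.

Memory savings: A 175B-parameter model carries roughly 2.8 TB of model state at 16 bytes per parameter, far more than any single GPU can hold. ZeRO-3 divides that by the device count, so the per-GPU share is about 2.8 TB / N before activations: roughly 44 GB at N = 64. On only 8 GPUs the share is still about 350 GB each, so a model of that size would additionally need offloading.

DeepSpeed: Microsoft library implementing ZeRO. Widely used for large-scale training.

ZeRO-Offload: Offloads optimizer states and gradients to CPU memory for even larger models.

ZeRO-Infinity: Extends offloading to NVMe storage, targeting multi-trillion-parameter models.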
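
The per-stage savings can be checked with a little arithmetic. The sketch below assumes mixed-precision Adam, with 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter. The function names are hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_param(stage: int, num_devices: int, k: int = 12) -> float:
    """Per-device model-state bytes per parameter for a given ZeRO stage.

    Assumption: 2 bytes fp16 params + 2 bytes fp16 grads + k bytes of fp32
    optimizer state (k = 12 for Adam: master weights, momentum, variance).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    params, grads, opt = 2.0, 2.0, float(k)
    if stage >= 1:  # Stage 1: optimizer states are partitioned
        opt /= num_devices
    if stage >= 2:  # Stage 2: gradients are partitioned as well
        grads /= num_devices
    if stage >= 3:  # Stage 3: parameters are partitioned as well
        params /= num_devices
    return params + grads + opt


def model_state_gb(num_params: float, stage: int, num_devices: int) -> float:
    """Per-device model-state size in GB (1 GB = 1e9 bytes)."""
    return num_params * zero_bytes_per_param(stage, num_devices) / 1e9


if __name__ == "__main__":
    # Stage 0 here means plain data parallelism with full replication.
    for stage in range(4):
        gb = model_state_gb(175e9, stage, num_devices=64)
        print(f"stage {stage}: {gb:,.2f} GB per device")
```

Under these assumptions, a 175B-parameter model on 64 devices needs 2,800 GB per device with full replication and 43.75 GB per device at Stage 3. The 4x and 8x figures for Stages 1 and 2 are limits that are approached as the device count grows.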

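The all-gather, reduce-scatter, and local-update cycle can be illustrated with a single-process toy simulation. This is only a sketch under simplifying assumptions: the "devices" are Python lists, the loss is a hypothetical quadratic, the update is plain SGD rather than Adam, and the whole parameter vector is gathered at once, whereas a real implementation would typically gather parameters piece by piece as they are needed. None of the function names come from DeepSpeed.

```python
from typing import List

Vector = List[float]


def all_gather(shards: List[Vector]) -> Vector:
    """Concatenate every device's shard into the full parameter vector."""
    return [w for shard in shards for w in shard]


def reduce_scatter(grads: List[Vector]) -> List[Vector]:
    """Average full gradients across devices; device i keeps only slice i."""
    n = len(grads)
    shard_len = len(grads[0]) // n
    mean = [sum(column) / n for column in zip(*grads)]
    return [mean[i * shard_len:(i + 1) * shard_len] for i in range(n)]


def toy_grad(params: Vector, target: Vector) -> Vector:
    """Gradient of the toy loss 0.5 * ||params - target||^2."""
    return [p - t for p, t in zip(params, target)]


def zero3_step(shards: List[Vector], targets: List[Vector], lr: float) -> List[Vector]:
    """One simulated training step; targets[i] stands in for device i's batch."""
    # 1. All-gather: every device temporarily reconstructs the full parameters.
    full = all_gather(shards)
    # 2. Each device computes a full gradient on its own batch.
    grads = [toy_grad(full, target) for target in targets]
    # 3. Reduce-scatter: each device receives the averaged gradient for its shard.
    grad_shards = reduce_scatter(grads)
    # 4. Each device updates only the shard it owns (plain SGD here).
    return [
        [w - lr * g for w, g in zip(shard, grad_shard)]
        for shard, grad_shard in zip(shards, grad_shards)
    ]


if __name__ == "__main__":
    shards = [[1.0, 2.0], [3.0, 4.0]]           # two devices, two params each
    targets = [[0.0] * 4, [2.0] * 4]            # one toy batch per device
    print(zero3_step(shards, targets, lr=0.5))
```

Concatenating the updated shards gives the same parameters as ordinary data-parallel SGD with averaged gradients. The difference is that no device has to keep the full parameters, gradients, or optimizer state between steps.
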
