Zero Redundancy Optimizer (ZeRO)

Keywords: zero redundancy optimizer, zero deepspeed, memory efficient optimizer, optimizer state partitioning, fsdp fully sharded


Zero Redundancy Optimizer (ZeRO) is a memory optimization technique that eliminates redundant storage of optimizer states, gradients, and parameters across data-parallel processes. It partitions these model states across N GPUs so that, once all three are partitioned, each device stores only about 1/N of the total. This makes it possible to train models whose states would not fit on a single GPU, with the trainable model size growing roughly in proportion to N, while keeping data parallelism's computational efficiency and ease of implementation.
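The 1/N partitioning described above can be made concrete with a rough per-GPU memory estimate. The sketch below is illustrative only and rests on an assumption not stated in this article: mixed-precision Adam training with 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer state (master weights, momentum, variance). The function name `zero_model_state_bytes` and its parameters are hypothetical, and activations, temporary buffers, and fragmentation are not counted.

```python
def zero_model_state_bytes(num_params, num_gpus, stage, optim_bytes_per_param=12):
    """Estimate per-GPU bytes of model states for a given ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, and `optim_bytes_per_param`
    (12 by default) for fp32 optimizer state. These byte counts are
    an assumption of this sketch, not something the article specifies.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0 (no ZeRO), 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = float(optim_bytes_per_param) * num_params

    # Each stage partitions one more component across the GPUs.
    if stage >= 1:
        optim /= num_gpus   # Stage 1: optimizer states
    if stage >= 2:
        grads /= num_gpus   # Stage 2: + gradients
    if stage >= 3:
        params /= num_gpus  # Stage 3: + parameters

    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical setup: a 7.5B-parameter model on 64 GPUs.
    num_params, num_gpus = 7_500_000_000, 64
    for stage in range(4):
        gigabytes = zero_model_state_bytes(num_params, num_gpus, stage) / 1e9
        print(f"stage {stage}: {gigabytes:.1f} GB of model states per GPU")
```

Under these assumptions, only Stage 3 reaches the full 1/N reduction. Stages 1 and 2 still replicate the parameters (and, for Stage 1, the gradients) on every device, so their savings level off as N grows.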

Memory Breakdown in Distributed Training:

ZeRO Stage 1 (Optimizer State Partitioning):

ZeRO Stage 2 (+ Gradient Partitioning):

ZeRO Stage 3 (+ Parameter Partitioning):

ZeRO-Offload:

ZeRO-Infinity:

FSDP (Fully Sharded Data Parallel):

Communication Patterns:

Optimization Techniques:

Combining with Other Parallelism:

Performance Characteristics:

Practical Guidelines:

Framework Support:

Zero Redundancy Optimizer made large-scale model training far more widely accessible: by eliminating redundant memory storage across data-parallel processes, it lets researchers and practitioners train substantially larger models on the same hardware, bringing large-model research within reach of groups beyond the largest tech companies.


Source: ChipFoundryServices
