Home Knowledge Base Fully Sharded Data Parallel (FSDP)

Fully Sharded Data Parallel (FSDP) is the advanced distributed training technique that shards model parameters, gradients, and optimizer states across GPUs — each GPU stores only 1/N of the model (N=number of GPUs), gathering required parameters on-demand during forward/backward passes and immediately discarding them, reducing per-GPU memory from O(model_size) to O(model_size/N), enabling training of 100B+ parameter models on 8×40GB GPUs that would otherwise require 400GB+ per GPU, achieving 80-90% scaling efficiency despite increased communication overhead.

FSDP Sharding Strategy:

ZeRO Stages (DeepSpeed):

FSDP Implementation (PyTorch):

Communication Patterns:

Performance Optimization:

Hybrid Sharding:

Memory Breakdown:

Comparison with DDP:

Scaling to Extreme Sizes:

Debugging FSDP:

Fully Sharded Data Parallel is the memory-efficiency breakthrough that enables training of models 10-100× larger than GPU memory — by sharding all model state across GPUs and carefully orchestrating communication, FSDP makes training 100B+ parameter models accessible on modest GPU clusters, democratizing large-scale model training and enabling researchers to push the boundaries of model scale without requiring massive infrastructure investments.

fully sharded data parallel fsdpzero optimizer deepspeedsharded optimizer statefsdp memory efficiencyzero redundancy optimizer

Explore 500+ Semiconductor & AI Topics

From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.