Distributed Data Parallel (DDP) Training is the foundational parallelism strategy in which the same model is replicated across multiple GPUs and each replica processes a different data batch. Gradients are synchronized through allreduce operations so that all replicas maintain identical weights. DDP provides near-linear scaling with GPU count for models that fit in single-GPU memory, and it is the simplest and most efficient form of distributed training, underlying virtually all multi-GPU neural network training.
How DDP Works
```
Setup: Model replicated on N GPUs (rank 0, 1, ..., N-1)

Each training step:
1. Each GPU gets a DIFFERENT mini-batch (data parallelism)
   GPU 0: batch[0:B]   GPU 1: batch[B:2B]   ...   GPU N-1: batch[(N-1)B:NB]
2. Each GPU runs forward + backward independently
   GPU 0: loss₀, grads₀   GPU 1: loss₁, grads₁   ...
3. AllReduce: Average gradients across all GPUs
   avg_grad = (grad₀ + grad₁ + ... + grad_{N-1}) / N
   Every GPU now has identical averaged gradients
4. Each GPU applies identical optimizer update

Result: All GPUs maintain identical model weights
```
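PyTorch's DDP wrapper automates all of this, but steps 2 through 4 can be written out with raw `torch.distributed` collectives. A minimal sketch, assuming the process group is already initialized; `manual_ddp_step` is a hypothetical helper, and real DDP fuses these per-parameter allreduces into buckets (see the overlap section below):

```python
import torch.distributed as dist

def manual_ddp_step(model, optimizer, loss_fn, inputs, targets):
    """One hand-written data-parallel step; inputs/targets are this rank's shard."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)        # step 2: local forward
    loss.backward()                               # step 2: local backward

    world_size = dist.get_world_size()
    for param in model.parameters():              # step 3: average gradients
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()                              # step 4: identical update on every rank
    return loss.item()
```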
AllReduce Algorithms
| Algorithm | Communication Volume | Steps | Best For |
|-----------|--------------------|----|----------|
| Ring AllReduce | 2(N-1)/N × data_size | 2(N-1) | Large messages, bandwidth-bound |
| Tree AllReduce | 2 × data_size | 2 log N | Small messages, latency-bound |
| Recursive halving-doubling | data_size | 2 log N | Power-of-2 GPU counts |
| NCCL (NVIDIA) | Optimized auto-select | Auto | Default for NVIDIA GPUs |
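As a rough sanity check on the Ring AllReduce row, the per-GPU traffic per step can be computed directly from the formula; the model size and GPU count below are illustrative, not taken from the text:

```python
def ring_allreduce_bytes(data_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring allreduce: 2 * (N - 1) / N * data_size."""
    return 2 * (n_gpus - 1) / n_gpus * data_bytes

# Example: 1.3B parameters, fp32 gradients (4 bytes each), 8 GPUs
grad_bytes = 1.3e9 * 4
print(f"{ring_allreduce_bytes(grad_bytes, 8) / 1e9:.1f} GB sent per GPU per step")  # ~9.1 GB
```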
PyTorch DDP Implementation
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Initialize process group (NCCL backend for GPUs)
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap model
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters())

# Use DistributedSampler so each rank sees a disjoint shard of the dataset
sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(),
                             rank=dist.get_rank())
loader = DataLoader(dataset, batch_size=batch_per_gpu, sampler=sampler)

# Training loop (identical to single-GPU except the sampler)
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # shuffle differently each epoch
    for batch in loader:
        loss = model(batch)
        loss.backward()       # DDP hooks fire allreduce automatically
        optimizer.step()
        optimizer.zero_grad()
```
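This script runs as one process per GPU. It is typically launched with `torchrun --nproc_per_node=8 train.py` (the entry point for `torch.distributed.run`), which starts the processes and sets the `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` environment variables that the snippet reads; the script name `train.py` here is just a placeholder.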
Communication-Computation Overlap
```
DDP optimization: don't wait for ALL gradients before communicating

Bucket-based allreduce:
  Backward pass computes gradients layer by layer (last → first)
  As each bucket fills, start allreduce for that bucket
  Computation and communication overlap → hides latency

Timeline:
  GPU compute: [backward L32] [backward L31] [backward L30] ...
  Network:                    [allreduce bucket 1] [allreduce bucket 2] ...
```
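In PyTorch, the bucket size is exposed through DDP's `bucket_cap_mb` argument (default 25 MB), and the `no_sync()` context manager skips the allreduce entirely during gradient-accumulation micro-steps. A short sketch, assuming `MyModel`, `local_rank`, `loader`, and `optimizer` from the example above; `accum_steps` is illustrative:

```python
# Larger buckets mean fewer allreduce calls; smaller buckets start overlapping earlier
model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank], bucket_cap_mb=50)
accum_steps = 4  # micro-batches accumulated per optimizer step

for i, batch in enumerate(loader):
    if (i + 1) % accum_steps != 0:
        with model.no_sync():        # gradients accumulate locally, no communication
            model(batch).backward()
    else:
        model(batch).backward()      # bucketed allreduce overlaps with this backward
        optimizer.step()
        optimizer.zero_grad()
```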
Scaling Efficiency
| GPUs | Ideal Speedup | Actual Speedup | Efficiency |
|------|-------------|---------------|------------|
| 1 | 1× | 1× | 100% |
| 2 | 2× | 1.95× | 97.5% |
| 4 | 4× | 3.80× | 95% |
| 8 | 8× | 7.20× | 90% |
| 32 | 32× | 26× | 81% |
| 64 | 64× | 48× | 75% |
| 256 | 256× | 160× | 62% |
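The efficiency column is simply actual speedup divided by ideal speedup; computed from measured throughput, a tiny helper (hypothetical, with illustrative numbers):

```python
def scaling_efficiency(throughput_1gpu: float, throughput_ngpu: float, n_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    return (throughput_ngpu / throughput_1gpu) / n_gpus

# Matches the 8-GPU row: 7.2x actual speedup -> 90% efficiency
print(scaling_efficiency(1_000.0, 7_200.0, 8))  # 0.9
```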
DDP vs. Other Parallelism
| Strategy | When to Use | Limitation |
|----------|------------|------------|
| DDP | Model fits in one GPU | Can't train larger-than-GPU models |
| FSDP / ZeRO | Model doesn't fit in one GPU | Communication overhead |
| Pipeline Parallel | Very deep models | Bubble overhead |
| Tensor Parallel | Very wide layers | Requires fast interconnect |
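A quick way to decide between DDP and FSDP/ZeRO is to estimate the per-GPU memory DDP would need for weights, gradients, and optimizer state. A rough sketch assuming fp32 training with Adam; activations are excluded, so this is a lower bound:

```python
def ddp_memory_lower_bound_gb(num_params: float) -> float:
    """Per-GPU memory lower bound for DDP with fp32 weights + Adam.

    4 B weights + 4 B gradients + 8 B Adam moments = 16 bytes per parameter;
    activations are workload-dependent and not counted here.
    """
    return num_params * 16 / 1e9

print(ddp_memory_lower_bound_gb(1.3e9))  # ~20.8 GB -> DDP fits on a 40 GB GPU
print(ddp_memory_lower_bound_gb(13e9))   # ~208 GB -> needs FSDP/ZeRO or model parallelism
```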
Effective Batch Size
```
Effective batch size = per_gpu_batch × num_gpus

Example: 8 GPUs × 32 per GPU = 256 effective batch size

Implication: may need to adjust the learning rate
  Linear scaling rule:  lr × num_gpus     (with warmup)
  Square root scaling:  lr × √num_gpus    (more conservative)
```
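A sketch of the linear scaling rule with a warmup ramp; the base learning rate, warmup length, and helper name are illustrative, not prescriptive:

```python
def scaled_lr(base_lr: float, num_gpus: int, step: int, warmup_steps: int = 500) -> float:
    """Linear scaling rule: ramp up to base_lr * num_gpus over warmup_steps."""
    target_lr = base_lr * num_gpus
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# Base LR 1e-4 was tuned for batch 32 on one GPU; on 8 GPUs (effective batch 256):
print(scaled_lr(1e-4, 8, step=0))       # tiny LR at the start of warmup
print(scaled_lr(1e-4, 8, step=10_000))  # 8e-4 once warmup is done
```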
Distributed Data Parallel is the workhorse of multi-GPU training, scaling near-linearly for models that fit in GPU memory. Its simplicity (replicate the model, split the data, average the gradients) and near-optimal communication efficiency through bucketed allreduce make DDP the default starting point for any distributed training job; more complex parallelism strategies (FSDP, tensor, pipeline) are only needed when model size exceeds single-GPU capacity.