Distributed Data Parallel (DDP) Training is the foundational parallelism strategy in which the same model is replicated across multiple GPUs and each replica processes a different data batch. Gradients are synchronized through allreduce operations so that all replicas maintain identical weights. DDP provides near-linear scaling with GPU count for models that fit in single-GPU memory, and it is the simplest and most efficient form of distributed training, underlying virtually all multi-GPU neural network training.
How DDP Works
```
Setup: Model replicated on N GPUs (rank 0, 1, ..., N-1)

Each training step:
1. Each GPU gets a DIFFERENT mini-batch (data parallelism)
   GPU 0: batch[0:B]   GPU 1: batch[B:2B]   ...   GPU N-1: batch[(N-1)B:NB]
2. Each GPU runs forward + backward independently
   GPU 0: loss₀, grads₀   GPU 1: loss₁, grads₁   ...
3. AllReduce: Average gradients across all GPUs
   avg_grad = (grad₀ + grad₁ + ... + grad_{N-1}) / N
   Every GPU now has identical averaged gradients
4. Each GPU applies identical optimizer update

Result: All GPUs maintain identical model weights
```
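PyTorch's DDP wrapper automates all of this, but steps 2 through 4 can be written out with raw `torch.distributed` collectives. A minimal sketch, assuming the process group is already initialized; `manual_ddp_step` is a hypothetical helper, and real DDP fuses these per-parameter allreduces into buckets (see the overlap section below):

```python
import torch.distributed as dist

def manual_ddp_step(model, optimizer, loss_fn, inputs, targets):
    """One hand-written data-parallel step; inputs/targets are this rank's shard."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)        # step 2: local forward
    loss.backward()                               # step 2: local backward

    world_size = dist.get_world_size()
    for param in model.parameters():              # step 3: average gradients
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()                              # step 4: identical update on every rank
    return loss.item()
```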
AllReduce Algorithms
| Algorithm | Communication Volume | Steps | Best For |
|-----------|--------------------|----|----------|
| Ring AllReduce | 2(N-1)/N × data_size | 2(N-1) | Large messages, bandwidth-bound |
| Tree AllReduce | 2 × data_size | 2 log N | Small messages, latency-bound |
| Recursive halving-doubling | data_size | 2 log N | Power-of-2 GPU counts |
| NCCL (NVIDIA) | Optimized auto-select | Auto | Default for NVIDIA GPUs |
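As a rough sanity check on the Ring AllReduce row, the per-GPU traffic per step can be computed directly from the formula; the model size and GPU count below are illustrative, not taken from the text:

```python
def ring_allreduce_bytes(data_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring allreduce: 2 * (N - 1) / N * data_size."""
    return 2 * (n_gpus - 1) / n_gpus * data_bytes

# Example: 1.3B parameters, fp32 gradients (4 bytes each), 8 GPUs
grad_bytes = 1.3e9 * 4
print(f"{ring_allreduce_bytes(grad_bytes, 8) / 1e9:.1f} GB sent per GPU per step")  # ~9.1 GB
```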
PyTorch DDP Implementation
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Initialize process group (NCCL backend for GPUs)
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap model
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters())

# Use DistributedSampler so each rank sees a disjoint shard of the dataset
sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(),
                             rank=dist.get_rank())
loader = DataLoader(dataset, batch_size=batch_per_gpu, sampler=sampler)

# Training loop (identical to single-GPU except the sampler)
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # shuffle differently each epoch
    for batch in loader:
        loss = model(batch)
        loss.backward()       # DDP hooks fire allreduce automatically
        optimizer.step()
        optimizer.zero_grad()
```
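This script runs as one process per GPU. It is typically launched with `torchrun --nproc_per_node=8 train.py` (the entry point for `torch.distributed.run`), which starts the processes and sets the `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` environment variables that the snippet reads; the script name `train.py` here is just a placeholder.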
Communication-Computation Overlap
```
DDP optimization: don't wait for ALL gradients before communicating

Bucket-based allreduce:
  Backward pass computes gradients layer by layer (last → first)
  As each bucket fills, start allreduce for that bucket
  Computation and communication overlap → hides latency

Timeline:
  GPU compute: [backward L32] [backward L31] [backward L30] ...
  Network:                    [allreduce bucket 1] [allreduce bucket 2] ...
```
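In PyTorch, the bucket size is exposed through DDP's `bucket_cap_mb` argument (default 25 MB), and the `no_sync()` context manager skips the allreduce entirely during gradient-accumulation micro-steps. A short sketch, assuming `MyModel`, `local_rank`, `loader`, and `optimizer` from the example above; `accum_steps` is illustrative:

```python
# Larger buckets mean fewer allreduce calls; smaller buckets start overlapping earlier
model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank], bucket_cap_mb=50)
accum_steps = 4  # micro-batches accumulated per optimizer step

for i, batch in enumerate(loader):
    if (i + 1) % accum_steps != 0:
        with model.no_sync():        # gradients accumulate locally, no communication
            model(batch).backward()
    else:
        model(batch).backward()      # bucketed allreduce overlaps with this backward
        optimizer.step()
        optimizer.zero_grad()
```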
Scaling Efficiency
| GPUs | Ideal Speedup | Actual Speedup | Efficiency |
|------|-------------|---------------|------------|
| 1 | 1× | 1× | 100% |
| 2 | 2× | 1.95× | 97.5% |
| 4 | 4× | 3.80× | 95% |
| 8 | 8× | 7.20× | 90% |
| 32 | 32× | 26× | 81% |
| 64 | 64× | 48× | 75% |
| 256 | 256× | 160× | 62% |
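The efficiency column is simply actual speedup divided by ideal speedup; computed from measured throughput, a tiny helper (hypothetical, with illustrative numbers):

```python
def scaling_efficiency(throughput_1gpu: float, throughput_ngpu: float, n_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    return (throughput_ngpu / throughput_1gpu) / n_gpus

# Matches the 8-GPU row: 7.2x actual speedup -> 90% efficiency
print(scaling_efficiency(1_000.0, 7_200.0, 8))  # 0.9
```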
DDP vs. Other Parallelism
| Strategy | When to Use | Limitation |
|----------|------------|------------|
| DDP | Model fits in one GPU | Can't train larger-than-GPU models |
| FSDP / ZeRO | Model doesn't fit in one GPU | Communication overhead |
| Pipeline Parallel | Very deep models | Bubble overhead |
| Tensor Parallel | Very wide layers | Requires fast interconnect |
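A quick way to decide between DDP and FSDP/ZeRO is to estimate the per-GPU memory DDP would need for weights, gradients, and optimizer state. A rough sketch assuming fp32 training with Adam; activations are excluded, so this is a lower bound:

```python
def ddp_memory_lower_bound_gb(num_params: float) -> float:
    """Per-GPU memory lower bound for DDP with fp32 weights + Adam.

    4 B weights + 4 B gradients + 8 B Adam moments = 16 bytes per parameter;
    activations are workload-dependent and not counted here.
    """
    return num_params * 16 / 1e9

print(ddp_memory_lower_bound_gb(1.3e9))  # ~20.8 GB -> DDP fits on a 40 GB GPU
print(ddp_memory_lower_bound_gb(13e9))   # ~208 GB -> needs FSDP/ZeRO or model parallelism
```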
Effective Batch Size
```
Effective batch size = per_gpu_batch × num_gpus

Example: 8 GPUs × 32 per GPU = 256 effective batch size

Implication: may need to adjust the learning rate
  Linear scaling rule:  lr × num_gpus     (with warmup)
  Square root scaling:  lr × √num_gpus    (more conservative)
```
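A sketch of the linear scaling rule with a warmup ramp; the base learning rate, warmup length, and helper name are illustrative, not prescriptive:

```python
def scaled_lr(base_lr: float, num_gpus: int, step: int, warmup_steps: int = 500) -> float:
    """Linear scaling rule: ramp up to base_lr * num_gpus over warmup_steps."""
    target_lr = base_lr * num_gpus
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# Base LR 1e-4 was tuned for batch 32 on one GPU; on 8 GPUs (effective batch 256):
print(scaled_lr(1e-4, 8, step=0))       # tiny LR at the start of warmup
print(scaled_lr(1e-4, 8, step=10_000))  # 8e-4 once warmup is done
```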
Distributed Data Parallel is the workhorse of multi-GPU training, scaling near-linearly for models that fit in GPU memory. Its simplicity (replicate the model, split the data, average the gradients) and near-optimal communication efficiency through bucketed allreduce make DDP the default starting point for any distributed training job; more complex parallelism strategies (FSDP, tensor, pipeline) are only needed when model size exceeds single-GPU capacity.