Distributed Data Parallel (DDP) Training

Distributed Data Parallel (DDP) training is the foundational parallelism strategy in which the same model is replicated across multiple GPUs and each replica processes a different mini-batch. The replicas synchronize gradients through allreduce operations, so all of them hold identical weights after every optimizer step. For models that fit in single-GPU memory, DDP scales near-linearly at small to moderate GPU counts. It is the simplest form of distributed training and underlies most multi-GPU neural network training.

How DDP Works

Setup: Model replicated on N GPUs (rank 0, 1, ..., N-1)

Each training step:
  1. Each GPU gets a DIFFERENT mini-batch (data parallelism)
     GPU 0: batch[0:B]   GPU 1: batch[B:2B]   ...   GPU N-1: batch[(N-1)B:NB]
  
  2. Each GPU runs forward + backward independently
     GPU 0: loss₀, grads₀    GPU 1: loss₁, grads₁   ...
  
  3. AllReduce: Average gradients across all GPUs
     avg_grad = (grad₀ + grad₁ + ... + grad_{N-1}) / N
     Every GPU now has identical averaged gradients
  
  4. Each GPU applies identical optimizer update
     Result: All GPUs maintain identical model weights
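
The four steps above can be sketched without any GPU. What follows is a minimal pure-Python simulation, assuming a hypothetical one-parameter model y = w * x trained with mean squared error. The helper names `local_grad`, `allreduce_mean`, and `ddp_step` are illustrative and do not belong to any library. The sketch shows that averaging per-replica gradients over equal-sized shards gives the same update as a single process using the full batch, which is why all replicas stay in sync.

```python
# Toy simulation of one DDP step (no GPUs, no torch): illustrative only.

def local_grad(w, shard):
    """Gradient of mean squared error for y = w * x over one replica's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for allreduce: every replica receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def ddp_step(weights, shards, lr):
    """Each replica computes its own gradient, then all apply the averaged one."""
    grads = [local_grad(w, s) for w, s in zip(weights, shards)]
    avg_grads = allreduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, avg_grads)]

# Global batch of 8 samples drawn from y = 3x, split across N = 4 replicas (B = 2).
data = [(float(x), 3.0 * x) for x in range(1, 9)]
N, B = 4, 2
shards = [data[r * B:(r + 1) * B] for r in range(N)]

weights = [0.5] * N  # replicas start with identical weights
new_weights = ddp_step(weights, shards, lr=0.01)

# Reference: a single process using the full batch of N * B samples.
single = 0.5 - 0.01 * local_grad(0.5, data)
```

The equivalence with the full-batch gradient relies on every shard having the same size; with uneven shards the plain average would weight samples unequally.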

AllReduce Algorithms

| Algorithm | Communication Volume (per GPU) | Steps | Best For |
|---|---|---|---|
| Ring AllReduce | 2(N-1)/N × data_size | 2(N-1) | Large messages, bandwidth-bound |
| Tree AllReduce | 2 × data_size | 2 log₂ N | Small messages, latency-bound |
| Recursive halving-doubling | 2(N-1)/N × data_size | 2 log₂ N | Power-of-2 GPU counts |
| NCCL (NVIDIA) | Optimized auto-select | Auto | Default for NVIDIA GPUs |
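
To make the Ring AllReduce row concrete, here is a small pure-Python simulation of the ring algorithm (a reduce-scatter phase followed by an allgather phase). It is a sketch under simplifying assumptions: ranks are modeled as Python lists, the buffer length must divide evenly by the number of ranks, and the function name `ring_allreduce` is hypothetical. It is not how NCCL is implemented. It counts steps and elements sent per rank so they can be compared with 2(N-1) and 2(N-1)/N × data_size.

```python
def ring_allreduce(buffers):
    """Sum-allreduce over equal-length lists, one list per rank.

    Returns (result_buffers, steps, elements_sent_per_rank).
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = size // n
    bufs = [list(b) for b in buffers]
    steps = 0
    sent = 0

    def span(c):
        """Index range covered by chunk c."""
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the full sum
    # for chunk (r + 1) % n.
    for s in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step happen "at once".
        outgoing = [[bufs[r][i] for i in span((r - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r - s) % n)):
                bufs[dst][i] += outgoing[r][k]
        steps += 1
        sent += chunk

    # Phase 2: allgather. Each rank forwards the fully reduced chunk it
    # most recently obtained until every rank has every chunk.
    for s in range(n - 1):
        outgoing = [[bufs[r][i] for i in span((r + 1 - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r + 1 - s) % n)):
                bufs[dst][i] = outgoing[r][k]
        steps += 1
        sent += chunk

    return bufs, steps, sent

# 4 ranks, 8 elements each; rank r holds 8 copies of r + 1.
grads = [[float(r + 1)] * 8 for r in range(4)]
reduced, steps, sent = ring_allreduce(grads)
```

With N = 4 and data_size = 8 elements, the counters come out to 2(N-1) = 6 steps and 2(N-1)/N × 8 = 12 elements sent per rank. Dividing the summed result by N would give the averaged gradient used in DDP.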

PyTorch DDP Implementation

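import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# MyModel, dataset, batch_per_gpu, and num_epochs are assumed to be defined elsewhere.

# Initialize process group (launch with torchrun, which sets RANK, WORLD_SIZE, LOCAL_RANK)
dist.init_process_group(backend="nccl")  # NCCL for GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
rank = dist.get_rank()
world_size = dist.get_world_size()

# Wrap model
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # example optimizer; lr is a placeholder

# Use DistributedSampler for data loading
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=batch_per_gpu, sampler=sampler)

# Training loop (identical to single-GPU except sampler)
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # shuffle differently each epoch
    for batch in loader:
        batch = batch.cuda(local_rank, non_blocking=True)  # assumes each batch is a single tensor
        loss = model(batch)   # assumes MyModel.forward returns a scalar loss
        loss.backward()       # DDP hooks fire allreduce automatically
        optimizer.step()
        optimizer.zero_grad()

dist.destroy_process_group()  # clean shutdown of the process group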

Communication-Computation Overlap

DDP optimization: Don't wait for ALL gradients before communicating

Bucket-based allreduce:
  Backward pass computes gradients layer by layer (last → first)
  As each bucket fills, start allreduce for that bucket
  Computation and communication overlap → hides latency

Timeline:
  GPU compute:  [backward L32] [backward L31] [backward L30] ...
  Network:                [allreduce bucket 1]  [allreduce bucket 2] ...
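
The bucketing idea can be illustrated with a short pure-Python sketch. It assumes a simplified policy: walk the parameters in reverse (backward-pass) order and close a bucket once the next parameter would push it past a size cap. The function name `build_buckets` and the parameter sizes are hypothetical. PyTorch exposes the cap through the `bucket_cap_mb` argument of `DistributedDataParallel`, but its actual bucket assignment has more details than shown here.

```python
def build_buckets(param_sizes, cap_bytes):
    """Group parameters into buckets in reverse (backward-pass) order.

    param_sizes: list of (name, size_in_bytes) in forward order.
    A bucket is closed once adding the next parameter would exceed cap_bytes.
    """
    buckets, current, current_bytes = [], [], 0
    for name, size in reversed(param_sizes):
        if current and current_bytes + size > cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

# Hypothetical 4-layer model; sizes are made up for illustration.
params = [("layer1.weight", 40), ("layer2.weight", 30),
          ("layer3.weight", 30), ("layer4.weight", 20)]
buckets = build_buckets(params, cap_bytes=50)
# buckets[0] holds the last layers, whose gradients are ready first,
# so its allreduce can start while earlier layers are still in backward.
```

A smaller cap starts communication sooner but issues more allreduce calls; a larger cap amortizes per-call latency at the cost of less overlap.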

Scaling Efficiency

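| GPUs | Ideal Speedup | Actual Speedup | Efficiency |
|---|---|---|---|
| 1 | 1× | 1× | 100% |
| 2 | 2× | 1.95× | 97.5% |
| 4 | 4× | 3.80× | 95% |
| 8 | 8× | 7.20× | 90% |
| 32 | 32× | 26× | 81% |
| 64 | 64× | 48× | 75% |
| 256 | 256× | 160× | 62% |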
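
Efficiency in the table above is the actual speedup divided by the ideal speedup, where the ideal speedup equals the GPU count. A short sketch of that arithmetic, using the table's own numbers (the function name `scaling_efficiency` is illustrative):

```python
def scaling_efficiency(num_gpus, actual_speedup):
    """Efficiency = actual speedup / ideal speedup, with ideal speedup = num_gpus."""
    return actual_speedup / num_gpus

# Actual speedups taken from the table above.
measurements = {2: 1.95, 4: 3.80, 8: 7.20, 32: 26.0, 64: 48.0, 256: 160.0}
for gpus, speedup in measurements.items():
    print(f"{gpus:>3} GPUs: {scaling_efficiency(gpus, speedup):.1%}")
```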

DDP vs. Other Parallelism

| Strategy | When to Use | Limitation |
|---|---|---|
| DDP | Model fits in one GPU | Can't train larger-than-GPU models |
| FSDP / ZeRO | Model doesn't fit in one GPU | Communication overhead |
| Pipeline Parallel | Very deep models | Bubble overhead |
| Tensor Parallel | Very wide layers | Requires fast interconnect |

Effective Batch Size

Effective batch size = per_gpu_batch × num_gpus

Example: 8 GPUs × 32 per GPU = 256 effective batch size
Implication: May need to adjust learning rate
  Linear scaling rule: lr × num_gpus (with warmup)
  Square root scaling: lr × √num_gpus (more conservative)
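
The two scaling rules and the warmup can be written as a small sketch. It assumes `base_lr` was tuned for a single GPU at the same per-GPU batch size and uses a simple linear warmup. The function names and the warmup length are illustrative choices, not prescribed values.

```python
import math

def scaled_lr(base_lr, num_gpus, rule="linear"):
    """Scale a single-GPU learning rate for num_gpus data-parallel replicas."""
    if rule == "linear":
        return base_lr * num_gpus
    if rule == "sqrt":
        return base_lr * math.sqrt(num_gpus)
    raise ValueError(f"unknown rule: {rule}")

def warmup_lr(target_lr, step, warmup_steps):
    """Linear warmup: ramp up to target_lr over the first warmup_steps steps."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps

num_gpus, per_gpu_batch, base_lr = 8, 32, 0.1
effective_batch = per_gpu_batch * num_gpus          # 8 x 32 = 256
lr_linear = scaled_lr(base_lr, num_gpus, "linear")  # 0.1 x 8
lr_sqrt = scaled_lr(base_lr, num_gpus, "sqrt")      # 0.1 x sqrt(8)
schedule = [warmup_lr(lr_linear, s, warmup_steps=5) for s in range(7)]
```

Whether the linear or the square-root rule works better depends on the model and optimizer, so the scaled value is best treated as a starting point for tuning.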

Distributed Data Parallel is the workhorse of multi-GPU training for models that fit in GPU memory. Its simplicity (replicate the model, split the data, average the gradients) and its bucketed allreduce, which overlaps communication with the backward pass, make DDP the default starting point for a distributed training job. Scaling is near-linear at small GPU counts and falls off as communication cost grows, as the Scaling Efficiency table shows. More complex parallelism strategies (FSDP, tensor, pipeline) are only needed when model size exceeds single-GPU capacity.
