Data Parallelism and Gradient Synchronization

Keywords: data parallelism gradient synchronization, DDP PyTorch, zero redundancy optimizer, gradient compression, all-reduce data parallel

Data Parallelism and Gradient Synchronization is the foundational distributed training approach where identical model replicas process different data samples, aggregate gradients across replicas, and synchronously apply updates to maintain training consistency.

Distributed Data Parallel (DDP) in PyTorch

- DDP Architecture: Each GPU runs an independent data loader, processes its own batch, and computes gradients locally. Gradients are collected via all-reduce, averaged, and applied to each local replica (a minimal sketch follows this list).
- Backward Hook Integration: PyTorch registers autograd hooks that trigger all-reduce on gradient buckets as they become ready during the backward pass. Transparent to user code.
- Communication Overhead: All-reduce moves roughly 2× the gradient size per GPU (reduce-scatter plus all-gather). For a 1B-parameter model with FP32 gradients (4 GB), that is ~8 GB of all-reduce traffic per GPU per iteration.
- Synchronous Training: All replicas coordinate at gradient application. Stragglers (slower GPUs) block the faster ones, so effective throughput is set by the slowest device.
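A minimal sketch of this loop, assuming a single-node launch via `torchrun --nproc_per_node=N`; the tiny model and random batches are placeholders for a real model and sharded dataset:

```python
# Minimal DistributedDataParallel sketch. Assumes launch via
# `torchrun --nproc_per_node=N train.py`; model and data are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])    # registers gradient hooks
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)  # each rank: its own shard
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()   # hooks fire bucketed all-reduces during backward
        opt.step()        # identical averaged grads keep replicas in sync

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```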

ZeRO (Zero Redundancy Optimizer) Stages

- ZeRO Stage 1 (Optimizer State Partitioning): Optimizer state (momentum, variance) partitioned across GPUs; GPU i stores the state slice [i×n:(i+1)×n]. Reduces optimizer-state memory by a factor of N_gpus, up to ~4x total for mixed-precision Adam (the memory arithmetic is sketched after this list).
- ZeRO Stage 2 (Optimizer State + Gradient Partitioning): Gradients also partitioned (reduce-scatter replaces all-reduce). Memory reduction up to ~8x for Adam (FP16 params + FP16 grads + FP32 master weights, momentum, variance).
- ZeRO Stage 3 (Parameter Partitioning): Model weights themselves partitioned; GPU i stores a subset of weights. Requires an all-gather of parameters before the forward pass (communication overlapped with computation); memory reduction scales linearly with N_gpus.
- ZeRO-Offload: Optimizer state offloaded to CPU. Reduces GPU memory but requires PCIe bandwidth for state updates (typically 10-20 GB/s). Viable for CPU-rich systems.
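The per-GPU model-state arithmetic behind these stages can be sketched directly; this simplified accounting follows the ZeRO paper's mixed-precision Adam figures (16 bytes per parameter) and ignores activations, buffers, and fragmentation:

```python
# Per-GPU model-state memory under ZeRO, using the ZeRO paper's accounting:
# 2 B fp16 params + 2 B fp16 grads + 12 B fp32 state (master weights,
# momentum, variance) = 16 B/param. Ignores activations and fragmentation.

def zero_memory_gb(n_params: float, n_gpus: int, stage: int) -> float:
    params, grads, state = 2.0, 2.0, 12.0      # bytes per parameter
    if stage >= 1:
        state /= n_gpus                        # Stage 1: partition optimizer state
    if stage >= 2:
        grads /= n_gpus                        # Stage 2: also partition gradients
    if stage >= 3:
        params /= n_gpus                       # Stage 3: also partition weights
    return n_params * (params + grads + state) / 1e9

for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):6.1f} GB/GPU")
# stage 0: 120.0, stage 1: 31.4, stage 2: 16.6, stage 3: 1.9  (7.5B params, 64 GPUs)
```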

Gradient Compression Techniques

- PowerSGD: Low-rank approximation of gradient matrices via power iteration. Compresses gradients 10-100x with <1% convergence slowdown. Requires extra computation (power iteration and orthogonalization), though it avoids a full SVD.
- 1-bit Adam: Quantize gradients to 1-bit per parameter (sign bit only) with momentum compensation. 32x compression but requires careful learning rate tuning.
- Top-K Sparsification: Only communicate top-K gradient values (largest magnitude). Reduces communication 10-100x for sparse gradient models (certain domains like NLP).
- Error Feedback/Momentum Correction: Compression error is accumulated in a local buffer and added back into future updates, preventing convergence degradation from compression (see the top-K sketch after this list).
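A sketch of top-K sparsification combined with error feedback; `TopKCompressor` is a hypothetical helper, and a real implementation would communicate only the selected indices and values rather than a dense zero-filled tensor:

```python
import torch

# Top-K gradient sparsification with error feedback: entries that are not
# sent are kept in a residual buffer and added back next step, so the
# compression error is compensated rather than lost. (Hypothetical helper.)
class TopKCompressor:
    def __init__(self, k_ratio: float = 0.01):
        self.k_ratio = k_ratio
        self.residual = {}                     # per-tensor error-feedback buffers

    def compress(self, name: str, grad: torch.Tensor) -> torch.Tensor:
        acc = grad + self.residual.get(name, torch.zeros_like(grad))
        flat = acc.flatten()
        k = max(1, int(flat.numel() * self.k_ratio))
        idx = flat.abs().topk(k).indices       # largest-magnitude entries
        sparse = torch.zeros_like(flat)
        sparse[idx] = flat[idx]
        self.residual[name] = (flat - sparse).view_as(grad)  # what we dropped
        return sparse.view_as(grad)

# usage: replace each gradient with its sparse version before all-reduce
comp = TopKCompressor(k_ratio=0.01)
g_sparse = comp.compress("layer1.weight", torch.randn(256, 256))
```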

All-Reduce Communication Patterns

- Ring All-Reduce: Logical ring of N GPUs; gradient chunks are passed around the ring. Bandwidth-efficient (near-full link utilization) but latency grows as O(N) in the number of steps (a simulation follows this list).
- Tree All-Reduce: Binary tree minimizes latency, O(log N), but underutilizes bandwidth in over-subscribed networks; for large gradient volumes it is typically slower than ring.
- Hybrid Approaches: Two-level hierarchies combine the benefits, e.g., ring within a node or rack (high-bandwidth links) and tree across racks (fewer hops, lower latency). Cluster topology shapes algorithm selection.
- Pipelined All-Reduce: Partition gradients into chunks, stream chunks through reduction pipeline. Overlaps communication phases across multiple GPUs.
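A single-process simulation of ring all-reduce's two phases (reduce-scatter, then all-gather); each peer transfers 2(N−1)/N of the data in total, which is where the bandwidth efficiency comes from. NumPy arrays stand in for per-GPU gradient buffers:

```python
import numpy as np

# Ring all-reduce simulated in one process. Phase 1 (reduce-scatter): after
# N-1 steps each rank holds the full sum of one chunk. Phase 2 (all-gather):
# N-1 more steps circulate the completed chunks. Each rank sends and receives
# 2*(N-1)/N of the data in total -> near-optimal bandwidth, O(N) latency.
def ring_allreduce(peers):
    n = len(peers)
    chunks = [np.array_split(p.astype(float), n) for p in peers]
    for t in range(n - 1):                     # reduce-scatter phase
        sends = [chunks[r][(r - t) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[r][(r - 1 - t) % n] += sends[(r - 1) % n]
    for t in range(n - 1):                     # all-gather phase
        sends = [chunks[r][(r + 1 - t) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[r][(r - t) % n] = sends[(r - 1) % n]
    return [np.concatenate(c) for c in chunks]

grads = [np.random.rand(8) for _ in range(4)]  # 4 "GPUs", 8-element gradients
out = ring_allreduce(grads)
assert all(np.allclose(o, sum(grads)) for o in out)
```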

Overlap of Backward Pass with All-Reduce

- Bucket-Based Gradient Accumulation: Gradients are accumulated into buckets (25 MB each by default in DDP). As soon as every gradient in a bucket is ready, its all-reduce launches immediately, without waiting for the full backward pass to finish.
- Pipelined All-Reduce: Multiple all-reduces are in flight concurrently: while bucket 0's all-reduce runs on the communication stream, the backward pass is still computing the gradients that fill bucket 1 (see the hook sketch after this list).
- Communication Cost Amortization: Gradient computation (~70% of backward cost), all-reduce (~20-30%), gradient application (~5%). Overlap hides ~80% of all-reduce latency.
- Network Saturation: Full overlap requires sufficient computation between synchronization points. Bandwidth-limited clusters struggle to hide all-reduce even with pipelining.
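A manual sketch of the overlap mechanism, assuming PyTorch ≥ 2.1 (for `register_post_accumulate_grad_hook`) and an already-initialized process group; real DDP additionally buckets small gradients and runs communication on a dedicated stream:

```python
import torch
import torch.distributed as dist

# As each parameter's gradient finishes accumulating during backward, launch
# an async all-reduce on it; the NCCL reduction proceeds while autograd keeps
# producing earlier layers' gradients. Drain the handles before optimizer.step().
def attach_overlap_hooks(model: torch.nn.Module, handles: list) -> None:
    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(
                lambda param: handles.append(
                    (dist.all_reduce(param.grad, async_op=True), param)
                )
            )

def drain_and_average(handles: list) -> None:
    world = dist.get_world_size()
    for work, param in handles:
        work.wait()                # block until this grad's all-reduce is done
        param.grad /= world        # all-reduce sums; convert to the average
    handles.clear()

# per step:
#   loss.backward()               # hooks overlap comm with remaining backward
#   drain_and_average(handles)
#   optimizer.step()
```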

Gradient Synchronization and Convergence

- Synchronization Semantics: All replicas must see identical gradient sums before parameter updates. Asynchronous approaches (e.g., parameter servers applying stale gradients) can degrade convergence.
- Variance Reduction: Synchronous averaging reduces variance in stochastic gradient. Larger effective batch size (N_gpu × batch_size_per_gpu) → lower gradient variance.
- Learning Rate Scaling: Learning rate is typically increased in proportion to the global batch size: 10x larger batch → 10x higher learning rate (linear scaling rule), usually combined with a warmup period (see the sketch after this list).
- Communication Cost vs Convergence: More frequent synchronization costs more communication; less frequent synchronization (e.g., local SGD) introduces gradient staleness. The optimal sync interval depends on model, batch size, and cluster size.
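A sketch of the linear scaling rule with warmup, using the example numbers from Goyal et al. (2017): a base LR of 0.1 at batch 256 scales to 3.2 at batch 8192, reached via a linear ramp:

```python
# Linear LR scaling with warmup: scale the base LR by the factor the global
# batch grew, and ramp to it linearly to avoid early divergence. The numbers
# mirror Goyal et al. (2017): 0.1 @ batch 256 -> 3.2 @ batch 8192.
def scaled_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    return base_lr * global_batch / base_batch

def warmup_lr(target: float, step: int, warmup_steps: int) -> float:
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps   # linear ramp from ~0
    return target

target = scaled_lr(base_lr=0.1, base_batch=256, global_batch=8192)  # 3.2
for step in (0, 250, 999, 5000):
    print(step, round(warmup_lr(target, step, warmup_steps=1000), 4))
```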
