Pipeline Parallelism for LLM Training

Keywords: pipeline parallelism llm training, gpipe pipeline stages, micro batch pipeline schedule, pipeline bubble overhead, interleaved pipeline 1f1b

Pipeline Parallelism for LLM Training is a model parallelism strategy that partitions a large neural network into sequential stages assigned to different devices and processes multiple micro-batches concurrently through the pipeline to maximize hardware utilization. This approach is essential for training models too large to fit on a single GPU while maintaining high throughput.

Pipeline Parallelism Fundamentals:
- Stage Partitioning: the model is divided into K contiguous groups of layers (stages), each assigned to a separate GPU — for a 96-layer transformer, 8 GPUs would each handle 12 layers
- Micro-Batching: the global mini-batch is split into M micro-batches that enter the pipeline one after another; while stage k processes micro-batch m, stage k-1 can process micro-batch m+1, enabling concurrent execution
- Pipeline Bubble: at the start and end of each mini-batch, some stages sit idle waiting for work to flow through the pipeline; the bubble fraction is approximately (K-1)/(M+K-1), so more micro-batches reduce overhead (see the sketch after this list)
- Memory vs. Throughput Tradeoff: more stages reduce per-GPU memory requirements but increase pipeline bubble overhead and inter-stage communication
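
To make the partitioning and bubble arithmetic above concrete, here is a minimal Python sketch; the helper names (partition_layers, bubble_fraction) are illustrative and not taken from any particular framework.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate idle fraction of total pipeline time: (K-1)/(M+K-1)."""
    k, m = num_stages, num_microbatches
    return (k - 1) / (m + k - 1)


def partition_layers(num_layers: int, num_stages: int) -> list:
    """Split layers into contiguous, near-equal groups, one group per stage."""
    base, rem = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < rem else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


if __name__ == "__main__":
    # 96-layer transformer over 8 GPUs -> 12 layers per stage.
    print([len(s) for s in partition_layers(96, 8)])
    # More micro-batches shrink the bubble: K=8, M in {8, 32, 64}.
    for m in (8, 32, 64):
        print(f"M={m}: bubble ~ {bubble_fraction(8, m):.1%}")
```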

GPipe Schedule:
- Forward Pass First: all M micro-batches execute their forward passes through all K stages before any backward pass begins; each stage must therefore keep the activations of all M micro-batches alive until its backward passes run (see the schedule sketch after this list)
- Backward Pass: after all forwards complete, backward passes execute in reverse order through the pipeline — gradient accumulation across micro-batches before optimizer step
- Bubble Fraction: with M micro-batches and K stages, the bubble adds roughly (K-1)/M on top of the useful compute time (equivalently about (K-1)/(M+K-1) of total pipeline time); GPipe recommends M ≥ 4K to keep this overhead under 25%
- Memory Impact: storing every intermediate activation for M in-flight micro-batches is costly; activation checkpointing keeps only the stage-boundary activation of each micro-batch and recomputes the rest during the backward pass, cutting per-stage activation memory from roughly O(M × L/K) to O(M + L/K) for a model with L layers
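
The following toy sketch spells out the GPipe-style ordering described above (all forwards, then all backwards in reverse micro-batch order) and why each stage must keep all M micro-batches' activations alive at its peak; it is a schedule illustration only, not GPipe's or any framework's actual implementation.

```python
def gpipe_stage_schedule(num_stages: int, num_microbatches: int):
    """Per-stage op list for a GPipe-style schedule and the peak number of
    micro-batches whose activations a stage must hold at once."""
    order = {}
    for stage in range(num_stages):
        forwards = [("F", m) for m in range(num_microbatches)]
        backwards = [("B", m) for m in reversed(range(num_microbatches))]
        order[stage] = forwards + backwards
    # Every stage finishes all M forwards before its first backward, so at
    # the peak it holds activations for all M micro-batches.
    peak_live_microbatches = num_microbatches
    return order, peak_live_microbatches


if __name__ == "__main__":
    sched, peak = gpipe_stage_schedule(num_stages=4, num_microbatches=8)
    print(sched[0][:3], "...", sched[0][-3:])
    print("peak live micro-batches per stage:", peak)  # == M
```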

1F1B (One Forward One Backward) Schedule:
- Interleaved Execution: after a warm-up phase that fills the pipeline (the first stage issues K-1 forward passes before its first backward), each stage alternates between one forward and one backward pass; the steady-state pattern is F-B-F-B-F-B (see the schedule sketch after this list)
- Memory Advantage: a stage holds at most K micro-batches' activations at once (rather than M as in GPipe), reducing peak activation memory by up to a factor of M/K
- Same Bubble: the 1F1B schedule has the same bubble fraction as GPipe — (K-1)/(M+K-1) — but dramatically lower memory requirements
- PipeDream-Flush: variant that accumulates gradients across micro-batches and performs a single optimizer step per mini-batch, avoiding the weight staleness issues of the original PipeDream
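
Below is a minimal per-stage 1F1B schedule generator in the PipeDream-Flush style sketched above. The warm-up count of K - 1 - stage forwards follows common implementations such as Megatron-LM, but the code itself is an illustrative sketch, not library code.

```python
def one_f_one_b_schedule(num_stages: int, stage: int, num_microbatches: int):
    """Op list for one stage under a 1F1B schedule: warm-up forwards fill
    the pipeline, then the stage alternates one forward with one backward,
    then drains the remaining backwards."""
    warmup = min(num_stages - 1 - stage, num_microbatches)
    steady = num_microbatches - warmup

    ops, fwd, bwd = [], 0, 0
    for _ in range(warmup):                 # warm-up: forwards only
        ops.append(("F", fwd)); fwd += 1
    for _ in range(steady):                 # steady state: F-B-F-B-...
        ops.append(("F", fwd)); fwd += 1
        ops.append(("B", bwd)); bwd += 1
    for _ in range(warmup):                 # cool-down: drain backwards
        ops.append(("B", bwd)); bwd += 1

    # A micro-batch's activations live from its F until its B, so the peak
    # is at most num_stages - stage, independent of M (vs. M for GPipe).
    live, peak = 0, 0
    for op, _ in ops:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return ops, peak


if __name__ == "__main__":
    for s in range(4):
        _, peak = one_f_one_b_schedule(num_stages=4, stage=s, num_microbatches=16)
        print(f"stage {s}: peak live micro-batches = {peak}")  # 4, 3, 2, 1
```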

Interleaved Pipeline Parallelism (Megatron-LM):
- Virtual Stages: each GPU holds multiple non-contiguous stages (e.g., GPU 0 handles stages 0, 4, 8 in a 12-stage pipeline across 4 GPUs) — creates a virtual pipeline of V×K stages
- Reduced Bubble: the bubble fraction decreases to (K-1)/(V×M+K-1), where V is the number of virtual stages per GPU; with V=4, bubble overhead drops by roughly 4× compared to a standard pipeline (see the sketch after this list)
- Increased Communication: non-contiguous stage assignment requires more inter-GPU communication since activations must travel between GPUs more frequently
- Optimal Balance: typically V=2-4 provides the best tradeoff between reduced bubble and increased communication overhead
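
Here is a small sketch of the interleaved assignment and the resulting bubble estimate, assuming the round-robin chunk-to-GPU mapping and the (K-1)/(V×M+K-1) formula given above; the function names are hypothetical.

```python
def interleaved_assignment(num_gpus: int, virtual_per_gpu: int) -> dict:
    """Round-robin mapping of virtual stages (model chunks) to GPUs.
    With K GPUs and V chunks per GPU there are V*K virtual stages;
    GPU g owns virtual stages g, g+K, g+2K, ..."""
    total = num_gpus * virtual_per_gpu
    return {g: [s for s in range(total) if s % num_gpus == g]
            for g in range(num_gpus)}


def interleaved_bubble(num_gpus: int, virtual_per_gpu: int,
                       num_microbatches: int) -> float:
    """Approximate idle fraction: (K-1)/(V*M + K-1)."""
    k, v, m = num_gpus, virtual_per_gpu, num_microbatches
    return (k - 1) / (v * m + k - 1)


if __name__ == "__main__":
    print(interleaved_assignment(4, 3))  # GPU 0 -> [0, 4, 8], as in the example
    for v in (1, 2, 4):
        print(f"V={v}: bubble ~ {interleaved_bubble(8, v, 32):.1%}")
```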

Integration with Other Parallelism Dimensions:
- 3D Parallelism: combines pipeline parallelism (inter-layer), tensor parallelism (intra-layer), and data parallelism — standard approach for training 100B+ parameter models
- Megatron-LM Configuration: for a 175B parameter model across 1024 GPUs: 8-way tensor parallelism × 16-way pipeline parallelism × 8-way data parallelism (see the rank-mapping sketch after this list)
- Stage Balancing: unequal computation per stage (embedding layers vs. transformer blocks) creates load imbalance — careful partitioning ensures <5% imbalance across stages
- Cross-Stage Communication: activation tensors transferred between pipeline stages via point-to-point GPU communication (NCCL send/recv) — bandwidth requirement scales with hidden dimension and micro-batch size
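
The sketch below shows one plausible way to map a flat global rank onto (tensor, pipeline, data) coordinates for the 8 × 16 × 8 = 1024-GPU example above. The ordering (tensor parallelism innermost) is an assumption for illustration; real frameworks may lay ranks out differently.

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int):
    """Map a flat global rank to (tensor, pipeline, data) coordinates,
    assuming tensor parallelism varies fastest, then pipeline, then data."""
    assert 0 <= rank < tp * pp * dp
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return t, p, d


if __name__ == "__main__":
    TP, PP, DP = 8, 16, 8  # 8 x 16 x 8 = 1024 GPUs, as in the example above
    print(rank_to_coords(0, TP, PP, DP))     # (0, 0, 0)
    print(rank_to_coords(1023, TP, PP, DP))  # (7, 15, 7)
```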

Challenges and Solutions:
- Weight Staleness: in asynchronous pipeline approaches, different micro-batches see different weight versions; PipeDream-2BW maintains two weight versions to bound staleness (a toy sketch follows this list)
- Batch Normalization: running statistics computed on micro-batches within a single stage don't reflect global batch statistics — Layer Normalization (used in transformers) avoids this issue entirely
- Fault Tolerance: if one stage's GPU fails, the entire pipeline stalls — elastic pipeline rescheduling can reassign stages to remaining GPUs with temporary throughput reduction
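
As a rough illustration of the double-buffering idea behind PipeDream-2BW (mentioned in the first bullet above), the toy class below keeps at most two weight versions so in-flight micro-batches read a bounded-staleness copy; it is not PipeDream-2BW's actual implementation.

```python
class DoubleBufferedWeights:
    """Toy illustration of 2BW-style double buffering: keep at most two
    weight versions so every in-flight micro-batch reads a copy that is
    at most one update stale."""

    def __init__(self, initial_weights):
        self.versions = {0: initial_weights}  # version id -> weights
        self.latest = 0

    def weights_for_new_microbatch(self):
        # New micro-batches always start their forward on the latest version.
        return self.latest, self.versions[self.latest]

    def publish_update(self, new_weights, oldest_version_in_flight):
        # Install a new version, then drop versions no in-flight micro-batch
        # still needs, so at most two versions coexist.
        self.latest += 1
        self.versions[self.latest] = new_weights
        for v in list(self.versions):
            if v < oldest_version_in_flight and v != self.latest:
                del self.versions[v]
```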

Pipeline parallelism enables training models with trillions of parameters by distributing memory requirements across many devices, but achieving >80% hardware utilization requires careful balancing of micro-batch count, stage partitioning, and integration with tensor and data parallelism.
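
As a quick sanity check on that utilization claim, the bubble formula alone gives an upper bound; the numbers below (K=16 stages, M=128 micro-batches) are hypothetical.

```python
# Rough utilization ceiling from the pipeline bubble alone (hypothetical K, M).
K, M = 16, 128
bubble = (K - 1) / (M + K - 1)                       # ~10.5% idle
print(f"utilization upper bound: {1 - bubble:.1%}")  # ~89.5%, before communication and imbalance losses
```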
