Pipeline Parallelism

Keywords: pipeline parallelism, model parallel, GPipe schedule, 1F1B pipeline schedule, pipeline bubble overhead, inter-stage activation

Pipeline Parallelism is a model parallelism technique that divides neural network layers across multiple devices, enabling concurrent forward and backward passes on different micro-batches to hide latency and maintain high GPU utilization.

GPipe and Synchronous Pipelining

- GPipe Architecture (Google): First practical pipeline parallelism at scale. Splits model layers across sequential GPU stages (Stage_0 → Stage_1 → ... → Stage_N).
- Micro-Batching Strategy: Input batch (size B) divided into M micro-batches (size B/M). Each micro-batch propagates sequentially through pipeline stages.
- Forward Pass Pipelining: Stage 0 computes micro-batch 1 while Stage 1 computes micro-batch 0, overlapping computation across stages and reducing idle time (see the schedule sketch after this list).
- Gradient Accumulation: Gradients from M micro-batches accumulated and applied once (equivalent to large-batch training). Effective batch size increases without memory pressure.
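
A minimal, framework-free sketch of the forward phase of this schedule: it simulates M micro-batches flowing through N_stage stages one time step per hop, so the printed grid makes the pipeline fill/drain idle slots visible. The stage and micro-batch counts are illustrative, not prescriptive.

```python
# Simulate the forward phase of a GPipe-style schedule: at time step t,
# stage s processes micro-batch (t - s) if it exists. Idle slots ("..")
# are the pipeline bubble during fill and drain.
N_STAGE = 4   # number of pipeline stages (illustrative)
M = 8         # number of micro-batches (illustrative)

def gpipe_forward_schedule(n_stage: int, m: int):
    """Return grid[time][stage] holding the micro-batch index or None (idle)."""
    total_steps = n_stage + m - 1            # time to fill and drain the pipeline
    grid = []
    for t in range(total_steps):
        row = []
        for s in range(n_stage):
            mb = t - s                       # micro-batch seen by stage s at time t
            row.append(mb if 0 <= mb < m else None)
        grid.append(row)
    return grid

for t, row in enumerate(gpipe_forward_schedule(N_STAGE, M)):
    cells = [f"mb{mb}" if mb is not None else " ..." for mb in row]
    print(f"t={t:2d}  " + "  ".join(cells))
```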

1F1B (One-Forward-One-Backward) Pipeline Schedule

- Synchronous Schedule: GPipe keeps a fixed schedule (all forward passes before all backward passes), so every micro-batch's activations must be buffered until the backward phase begins.
- 1F1B Asynchronous Schedule: Interleaves forward and backward passes: as soon as a backward computation becomes available, it is executed instead of waiting for all forward passes to finish (see the per-stage schedule sketch after this list).
- Activation Memory Reduction: Because a micro-batch's buffers are freed as soon as its backward pass completes, 1F1B bounds in-flight activations per stage to at most N_stage micro-batches, versus all M micro-batches under GPipe's all-forward-then-all-backward schedule.
- PipeDream Implementation: Extends 1F1B with careful weight-update timing (weight stashing/versioning) and gradient averaging. Critical for large-scale distributed training.
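
The per-stage operation order under a synchronous 1F1B schedule can be written down directly: each stage runs a warm-up of forward passes, then alternates one forward with one backward, then drains the remaining backwards. A minimal sketch, with the stage and micro-batch counts as assumptions:

```python
def one_f1b_schedule(stage: int, n_stage: int, m: int):
    """Return the list of ('F', i) / ('B', i) operations for one stage under
    synchronous 1F1B. Warm-up keeps at most (n_stage - stage) micro-batches
    in flight, which is what bounds peak activation memory."""
    warmup = min(n_stage - stage, m)      # forwards issued before the first backward
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    # Steady state: one backward then one forward, freeing one activation buffer each round.
    while f < m:
        ops.append(("B", b)); b += 1
        ops.append(("F", f)); f += 1
    # Cool-down: drain the remaining backward passes.
    while b < m:
        ops.append(("B", b)); b += 1
    return ops

for s in range(4):
    print(f"stage {s}:", one_f1b_schedule(s, n_stage=4, m=8)[:10], "...")
```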

Pipeline Bubble Overhead

- Bubble Fraction: Fraction of GPU cycles spent idle (no useful computation). Bubble = (N_stage - 1) / (N_stage + M - 1), where N_stage = number of stages and M = number of micro-batches (see the worked example after this list).
- Minimizing Bubbles: Increase the micro-batch count M. With M >> N_stage, the bubble fraction is approximately (N_stage - 1) / M, which tends to 0; the trade-off is smaller micro-batches, which still need enough per-GPU memory bandwidth and kernel efficiency to be worthwhile.
- Optimal Micro-Batch Count: M ≈ 3-5 × N_stage typically balances memory against bubble overhead. For 8 stages, use 24-40 micro-batches.
- Load Imbalance: Uneven stage sizes (e.g., early stages assigned more layers than later ones) create variable compute time per stage: faster stages sit idle while the slowest stage becomes the bottleneck. Requires careful layer partitioning.
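
A worked instance of the bubble formula above; the stage and micro-batch counts are purely illustrative:

```python
def bubble_fraction(n_stage: int, m: int) -> float:
    """Idle fraction of an ideal GPipe/1F1B schedule: (p - 1) / (p + m - 1)."""
    return (n_stage - 1) / (n_stage + m - 1)

p = 8
for m in (8, 24, 32, 40, 64):
    print(f"stages={p}, micro-batches={m}: bubble = {bubble_fraction(p, m):.1%}")
# stages=8: m=24 -> ~22.6% idle, m=40 -> ~14.9%, m=64 -> ~9.9%
```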

Inter-Stage Activation Storage

- Activation Tensors: During forward pass, intermediate activations stored at each stage boundary (input to stage, output from stage). Required for backward pass gradient computation.
- Memory Footprint: Activation memory per stage ≈ (number of micro-batches in flight) × (activation tensor size per layer) × (number of layers per stage) (see the estimator after this list).
- Checkpoint-Recomputation Hybrid: Store checkpoints at stage boundaries, recompute intermediate activations during backward pass. Reduces memory from O(layers) to O(1) per stage.
- Communication Overhead: Activations streamed between stages over network (inter-chip or intra-cluster). Bandwidth requirement: ~10-100 GB/s typical for large models.
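
A back-of-the-envelope estimator for the memory-footprint formula above, applied to a transformer-like stage. The model dimensions and in-flight count below are assumptions for illustration; the checkpointed variant keeps only the stage-boundary activation per micro-batch and recomputes the rest during backward.

```python
def activation_mem_bytes(micro_batch, seq_len, hidden, layers_per_stage,
                         in_flight, bytes_per_elem=2, checkpointed=False):
    """Rough per-stage activation footprint:
    (micro-batches in flight) x (activation size per layer) x (layers kept).
    With boundary checkpointing, only one activation per micro-batch is stored."""
    per_layer = micro_batch * seq_len * hidden * bytes_per_elem   # one activation tensor
    kept_layers = 1 if checkpointed else layers_per_stage
    return in_flight * per_layer * kept_layers

cfg = dict(micro_batch=2, seq_len=2048, hidden=4096, layers_per_stage=6, in_flight=8)
print("full buffering :", activation_mem_bytes(**cfg) / 2**30, "GiB")   # ~1.5 GiB
print("checkpointed   :", activation_mem_bytes(**cfg, checkpointed=True) / 2**30, "GiB")  # ~0.25 GiB
```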

Communication Overlapping with Computation

- Pipelining at Machine Level: While Stage 1 computes a backward pass, Stage 0 runs the forward pass on the next micro-batch, so network transfer of activations is hidden behind computation (see the sketch after this list).
- Gradient Streaming: Gradients flow backward through the stages asynchronously; the all-reduce across replicas (when data parallelism is combined with pipeline parallelism) is overlapped with the forward and backward passes still in flight.
- Synchronization Points: Wait-free pipeline designs minimize hard synchronization; soft synchronization (loose coupling) lets stages run at slightly different rates.
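
A minimal sketch of this overlap pattern using PyTorch's point-to-point primitives. It shows only the forward half of one stage and assumes an already-initialized process group with one rank per stage; the module, rank mapping, and activation shape are placeholders.

```python
import torch
import torch.distributed as dist

def stage_forward_with_overlap(stage_module, n_micro, prev_rank, next_rank, act_shape):
    """Post a non-blocking receive for the *next* activation while computing on
    the current one, so inter-stage transfer hides behind compute.
    Assumes dist.init_process_group() has been called and ranks map to stages."""
    recv_buf = torch.empty(act_shape)                 # staging buffer (on the stage's device)
    pending = dist.irecv(recv_buf, src=prev_rank)     # activation for micro-batch 0
    outputs, sends = [], []
    for i in range(n_micro):
        pending.wait()                                # current input has arrived
        x = recv_buf.clone()
        if i + 1 < n_micro:
            pending = dist.irecv(recv_buf, src=prev_rank)  # prefetch next micro-batch
        y = stage_module(x)                           # compute overlaps the in-flight transfer
        sends.append(dist.isend(y, dst=next_rank))    # ship result downstream, non-blocking
        outputs.append(y)
    for w in sends:
        w.wait()                                      # ensure downstream sends completed
    return outputs
```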

Real-World Implementation Details

- Zero Redundancy Optimizer (ZeRO) Integration: ZeRO stages 1/2/3 combined with pipeline parallelism. Stage 3 (parameter sharding) demands careful activation checkpoint management.
- Gradient Accumulation Steps: Pipeline micro-batches already act as gradient accumulation; extra accumulation steps multiply on top. Effective batch size = micro-batch size × micro-batches per step × accumulation steps × data-parallel replicas; typical setups combine 4-16 accumulation steps with a few micro-batches through 8 pipeline stages, for a total effective batch size of roughly 32-128 (see the helper after this list).
- Convergence Properties: Pipeline parallelism with 1F1B achieves near-identical convergence to sequential training, so hyperparameters transfer between configurations with little or no retuning.
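
The effective-batch arithmetic from the accumulation bullet above, as a small helper; all of the concrete numbers are illustrative assumptions.

```python
def effective_batch(micro_batch_size, micro_batches_per_step,
                    accumulation_steps=1, dp_replicas=1):
    """Samples contributing to one optimizer update when pipeline micro-batching,
    extra gradient accumulation, and data parallelism are combined."""
    return micro_batch_size * micro_batches_per_step * accumulation_steps * dp_replicas

# e.g. 2-sample micro-batches, 4 micro-batches per pipeline flush, 8 accumulation steps
print(effective_batch(2, 4, accumulation_steps=8))   # -> 64 samples per replica
```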
