Tensor Parallelism for LLM Training is a sophisticated model parallelism approach that partitions weight matrices across multiple GPUs/TPUs, enabling training of trillion-parameter language models by distributing computation and memory load.
Column and Row Parallel Linear Layers
- Tensor Parallel Concept: Weight matrices (W) split across device axis (column or row), enabling parallel matrix multiplication without replicating activations.
- Column-Parallel Linear: W divided by output dimension (Y = A × W_col, split across GPUs). Each GPU computes partial output; all-reduce aggregates results.
- Row-Parallel Linear: W divided by input dimension. Each GPU computes partial activation independently; all-gather concatenates results for next layer.
- Mixed Partitioning: Alternating column→row layers reduces synchronization overhead vs all-column. Megatron-LM uses this pattern for optimal efficiency.
Attention Head Distribution
- Multi-Head Attention Parallelism: Attention heads (H heads, typically 96-320) split across tensor-parallel devices. Each device computes subset of attention heads.
- Query/Key/Value Projection Parallelism: Q/K/V projections use column-parallel layers. Attention computation distributed across heads.
- Attention Dot-Product: Each device computes (Q × K^T) for its subset of heads independently. Softmax applied per head, values weighted locally.
- Output Projection: Multi-head outputs concatenated (all-gather), then row-parallel projection aggregates before feeding to MLP.
Megatron-LM 1D/2D/3D Tensor Parallelism
- 1D Tensor Parallelism: Splits along single dimension (typically embedding or head dimension). Simple implementation but less scalable (synchronization barrier every layer).
- 2D Tensor Parallelism: Creates 2D process grid (N_layer × N_tensor). Reduces all-reduce overhead by pipelining across two dimensions. Megatron-LM sweet spot for 100-500 GPU clusters.
- 3D Tensor Parallelism: Combines tensor parallelism with pipeline and data parallelism. Specialized for extreme scales (>1000 GPUs). Complex scheduling, minimal synchronization overhead.
- Sequence Parallelism Extension: Splits along sequence dimension (for transformer auto-regressive generation). Reduces attention O(N²) memory complexity.
All-Reduce Communication Patterns
- All-Reduce Operation: Collective communication reducing across devices (summation typical in gradient averaging). Each device sends/receives partial results.
- Ring All-Reduce: Devices arranged in logical ring. Minimizes bandwidth requirement, tolerates network asymmetry. O(NP) communication steps for N data elements, P processes.
- Tree All-Reduce: Binary tree structure reduces latency to O(log P) hops. Requires bandwidth-saturated links (not always available in over-subscribed networks).
- NCCL (NVIDIA Collective Communications Library): Optimized all-reduce kernels, automatically selects best algorithm based on hardware topology and message size.
Activation Memory and Communication Trade-offs
- Activation Recomputation: Intermediate activations dropped after forward pass, recomputed during backward pass. Reduces memory by 50% but increases computation 33%.
- Tensor Parallel Memory: No activation replicas (unlike data parallelism). Memory scales as O(model_size / tensor_parallel_degree + batch_size).
- Communication vs Computation Ratio: All-reduce bandwidth requirement ~2× (send/receive) weight size per iteration. Optimized via asynchronous communication overlap.
- Network Saturation: Bandwidth-limited at scales >100 GPUs. Network topology (fat-tree, dragonfly) critical to avoiding communication bottleneck.
Efficiency and Scaling Characteristics
- Arithmetic Intensity: Each all-reduce involves O(model_size) bandwidth for O(model_size) computation. Arithmetic intensity ~ 1 FLOP/Byte (memory-bound).
- Scaling Law: Perfect scaling requires communication hidden behind computation. Overlapping communication with matrix multiplications maintains efficiency to ~64-128 GPU clusters.
- Diminishing Returns: Beyond tensor_parallel_degree ~64, synchronization overhead dominates. Hybrid 2D/3D parallelism required for 1000+ GPU training.
- Hyperparameter Tuning: Learning rate, batch size, gradient accumulation adjusted per parallelism configuration. Different configurations yield different convergence behavior.
tensor parallelism distributed llmmegatron tensor parallelcolumn row tensor splittensor parallel attention1d 2d tensor parallel
Explore 500+ Semiconductor & AI Topics
From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.