Tensor parallelism is a distributed deep learning strategy that partitions the individual weight matrices of a single layer across multiple GPUs. It splits the large matrix multiplications that dominate transformer computation across devices, which exchange intermediate results over fast NVLink interconnects. This makes it possible to run layers too wide for one GPU's memory while keeping computational efficiency above 90%, provided the GPUs share a fast intra-node link.
When Tensor Parallelism Is Needed
A transformer with hidden dimension 12,288 (GPT-3) has weight matrices of size 12,288 × 49,152 in each MLP layer. A single such matrix holds about 604 million parameters and occupies about 1.2 GB in FP16 (2.4 GB for the layer's two MLP matrices together). With 96 layers and 175 billion parameters in total, the weights alone occupy about 350 GB in FP16, far beyond any single GPU's memory. Tensor parallelism splits each matrix across T GPUs, so each GPU stores 1/T of the parameters and performs 1/T of the computation.
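The sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the helper names `matrix_bytes` and `per_gpu_bytes` are illustrative and not part of any library, and the calculation counts weights only (no optimizer state, gradients, or activations).

```python
# Back-of-the-envelope weight-memory arithmetic for the GPT-3 sizes in the text.
# Helper names are hypothetical; only weights are counted.
BYTES_FP16 = 2


def matrix_bytes(rows: int, cols: int, bytes_per_param: int = BYTES_FP16) -> int:
    """Bytes needed to store a dense rows x cols weight matrix."""
    return rows * cols * bytes_per_param


def per_gpu_bytes(total_bytes: int, tp_degree: int) -> float:
    """Bytes each GPU stores when a matrix is split evenly across tp_degree GPUs."""
    return total_bytes / tp_degree


hidden = 12_288
ffn = 4 * hidden  # 49,152, the MLP inner dimension
mlp_matrix = matrix_bytes(hidden, ffn)

print(f"one MLP matrix: {mlp_matrix / 1e9:.2f} GB in FP16")
for t in (1, 2, 4, 8):
    print(f"T={t}: {per_gpu_bytes(mlp_matrix, t) / 1e9:.2f} GB per GPU")

total_params = 175e9  # GPT-3 parameter count
print(f"all weights: {total_params * BYTES_FP16 / 1e9:.0f} GB in FP16")
```

Training needs considerably more memory than this because gradients and optimizer state are stored alongside the weights.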
Megatron-LM Approach (Column and Row Partitioning)
For a two-layer MLP: Z = GeLU(XA) × B, with intermediate activation Y = GeLU(XA)
1. Column-Parallel (First Layer): Matrix A is split column-wise across T GPUs. GPU i holds columns [i×k : (i+1)×k], where k is the number of columns of A divided by T. Each GPU independently computes Y_i = GeLU(X × A_i). No communication is needed because GeLU is applied element-wise, so each GPU's output columns depend only on its own shard of A.
2. Row-Parallel (Second Layer): Matrix B is split row-wise across T GPUs. GPU i holds rows [i×k : (i+1)×k], matching the columns of Y it already owns, and computes Z_i = Y_i × B_i (a partial result with the full output shape). The final output Z = sum(Z_i) requires an allreduce across the T GPUs.
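The two steps above can be simulated on a single machine to confirm that the sharded computation matches the unsharded one. This is a minimal NumPy sketch, not Megatron-LM code: the "GPUs" are loop iterations, the allreduce is a plain Python sum, the function names are illustrative, and the tanh approximation of GeLU is an assumption.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GeLU (an assumption; the exact form also works)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp_reference(X, A, B):
    """Unsharded two-layer MLP: Z = GeLU(X A) B."""
    return gelu(X @ A) @ B


def mlp_tensor_parallel(X, A, B, T):
    """Simulate T GPUs: column-parallel first layer, row-parallel second layer."""
    A_shards = np.split(A, T, axis=1)  # GPU i gets columns [i*k : (i+1)*k] of A
    B_shards = np.split(B, T, axis=0)  # GPU i gets rows    [i*k : (i+1)*k] of B
    partials = []
    for A_i, B_i in zip(A_shards, B_shards):
        Y_i = gelu(X @ A_i)         # local, no communication
        partials.append(Y_i @ B_i)  # partial result with the full output shape
    return sum(partials)            # stands in for the allreduce


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 tokens, hidden size 8
A = rng.standard_normal((8, 32))  # hidden -> 4 * hidden
B = rng.standard_normal((32, 8))  # 4 * hidden -> hidden

Z_ref = mlp_reference(X, A, B)
Z_tp = mlp_tensor_parallel(X, A, B, T=4)
print("shapes match:", Z_ref.shape == Z_tp.shape)
print("values match:", np.allclose(Z_ref, Z_tp))
```

Splitting A by rows instead would force a sum before the nonlinearity, because GeLU of a sum is not the sum of GeLUs. That is why the first layer is column-parallel.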
Self-Attention Tensor Parallelism
Query, Key, and Value projections are split column-wise across GPUs (each GPU computes attention for a subset of attention heads). Since multi-head attention is independent per head, no communication is needed during the attention computation. Only the output projection (row-parallel) requires an allreduce.
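The same check can be run for attention. The sketch below is a simplified single-sequence simulation in NumPy: there is no batch dimension, mask, or bias, the function names are illustrative, and it assumes the number of heads is divisible by T. Each simulated GPU receives a contiguous block of heads and the matching rows of the output projection.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_heads(X, Wq, Wk, Wv, n_heads):
    """Attention over the heads stored in Wq/Wk/Wv; returns concatenated head outputs."""
    S = X.shape[0]
    d_head = Wq.shape[1] // n_heads
    # Project, then reshape to (heads, seq, d_head)
    Q = (X @ Wq).reshape(S, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(S, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(S, n_heads, d_head).transpose(1, 0, 2)
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, S, S)
    out = weights @ V                                              # (heads, S, d_head)
    return out.transpose(1, 0, 2).reshape(S, n_heads * d_head)


def attention_reference(X, Wq, Wk, Wv, Wo, n_heads):
    return attention_heads(X, Wq, Wk, Wv, n_heads) @ Wo


def attention_tensor_parallel(X, Wq, Wk, Wv, Wo, n_heads, T):
    """Simulate T GPUs, each owning n_heads // T heads and the matching rows of Wo."""
    heads_per_gpu = n_heads // T
    shards = zip(
        np.split(Wq, T, axis=1),  # column-parallel Q, K, V projections
        np.split(Wk, T, axis=1),
        np.split(Wv, T, axis=1),
        np.split(Wo, T, axis=0),  # row-parallel output projection
    )
    partials = [attention_heads(X, q, k, v, heads_per_gpu) @ o for q, k, v, o in shards]
    return sum(partials)          # stands in for the single allreduce


rng = np.random.default_rng(1)
seq, hidden, n_heads = 5, 16, 4
X = rng.standard_normal((seq, hidden))
Wq, Wk, Wv, Wo = (rng.standard_normal((hidden, hidden)) for _ in range(4))

out_ref = attention_reference(X, Wq, Wk, Wv, Wo, n_heads)
out_tp = attention_tensor_parallel(X, Wq, Wk, Wv, Wo, n_heads, T=2)
print("values match:", np.allclose(out_ref, out_tp))
```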
Communication Cost
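Each transformer layer requires 2 allreduce operations in the forward pass (one for the MLP, one for attention) and 2 more in the backward pass, each communicating a tensor of size [batch × sequence × hidden_dim]. On NVLink (900 GB/s bidirectional per GPU on H100 with NVSwitch), a bandwidth-only estimate gives:
- For hidden=12288, batch×seq=2048 in FP16: the tensor is 2048 × 12288 × 2 bytes ≈ 50 MB. A ring allreduce moves roughly twice the tensor size per GPU, so about 100 MB of traffic per allreduce → ~0.1 ms at 900 GB/s (about 0.2 ms if only the 450 GB/s per-direction bandwidth is usable).
- Computation per layer: ~10-50 ms → with 2 allreduces of ~0.1 ms each in the forward pass, communication overhead is about 0.4-2%. Excellent efficiency.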
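The estimate above can be reproduced with a simple bandwidth-only model. This is a rough sketch under stated assumptions: a ring allreduce in which each GPU sends 2 × (T-1)/T of the tensor, T=8, no latency or protocol overhead, and no overlap of communication with compute. The function names are illustrative, and the 10 ms compute time is the low end of the range quoted in the text.

```python
def allreduce_tensor_bytes(tokens: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Size of the activation tensor being reduced: [batch x seq, hidden] in FP16."""
    return tokens * hidden * bytes_per_elem


def ring_allreduce_traffic_bytes(tensor_bytes: int, T: int) -> float:
    """Bytes each GPU sends in a ring allreduce: 2 * (T - 1) / T of the tensor."""
    return 2 * (T - 1) / T * tensor_bytes


def transfer_time_ms(traffic_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Bandwidth-only transfer time in milliseconds (latency ignored)."""
    return traffic_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


tensor = allreduce_tensor_bytes(tokens=2048, hidden=12_288)
traffic = ring_allreduce_traffic_bytes(tensor, T=8)

print(f"tensor size: {tensor / 1e6:.1f} MB")
print(f"ring traffic per GPU: {traffic / 1e6:.1f} MB")
for bw in (900, 450):  # aggregate vs per-direction NVLink bandwidth, in GB/s
    t_ms = transfer_time_ms(traffic, bw)
    overhead = 2 * t_ms / 10.0  # two forward allreduces vs 10 ms of compute
    print(f"{bw} GB/s: {t_ms:.3f} ms per allreduce, overhead {overhead:.1%}")
```

Measured allreduce times are typically higher than this model predicts, since small messages are latency-bound and achieved bandwidth falls short of the peak figure.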
Scaling Limits
Tensor parallelism is efficient only with ultra-fast interconnects (NVLink/NVSwitch within a node). Over slower interconnects (InfiniBand between nodes), the frequent per-layer allreduce becomes the bottleneck. Typical practice: T=4 or T=8 (within one DGX node) for tensor parallelism, combined with pipeline and data parallelism across nodes.
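To make the combination concrete, the sketch below factors a cluster into tensor, pipeline, and data parallel degrees, using the usual relation world size = T × P × D. The function `parallel_layout` and the example cluster (1024 GPUs, 8 per node, 16 pipeline stages) are hypothetical and chosen only for illustration.

```python
def parallel_layout(world_size: int, tensor_parallel: int, pipeline_parallel: int,
                    gpus_per_node: int = 8) -> dict:
    """Derive the data-parallel degree from world_size = T * P * D.

    Enforces the rule of thumb from the text: the tensor-parallel group
    must fit inside one node so its allreduces stay on NVLink.
    """
    if tensor_parallel > gpus_per_node or gpus_per_node % tensor_parallel:
        raise ValueError("tensor-parallel group must fit evenly within one node")
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by T * P")
    return {
        "tensor": tensor_parallel,
        "pipeline": pipeline_parallel,
        "data": world_size // model_parallel,
    }


# Hypothetical cluster: 128 nodes of 8 GPUs each
layout = parallel_layout(world_size=1024, tensor_parallel=8, pipeline_parallel=16)
print(layout)
```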
Tensor Parallelism is the intra-layer divide-and-conquer strategy that carves massive transformer layers into GPU-sized pieces — exploiting the mathematical structure of matrix multiplication to partition work with minimal communication overhead when connected by fast enough links.