Tensor Parallelism

Tensor parallelism is a distributed deep learning strategy that partitions the individual weight matrices within a layer across multiple GPUs. It splits large matrix multiplications, the dominant operation in transformer models, across devices that exchange intermediate results over fast NVLink interconnects. This makes it possible to run layers too wide for one GPU's memory, and computational efficiency can stay above 90% when the interconnect is fast enough.

When Tensor Parallelism Is Needed

A transformer with hidden dimension 12,288 (GPT-3) has weight matrices of size 12,288 × 49,152 in each MLP layer, so a single weight matrix occupies about 1.2 GB in FP16 (roughly 604 million parameters at 2 bytes each). With 96 layers and 175 billion parameters, the model weights alone take about 350 GB in FP16, far beyond any single GPU's memory. Tensor parallelism splits each matrix across T GPUs, so each GPU stores 1/T of the parameters and performs 1/T of the computation.
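The sizing arithmetic above can be checked with a short script. This is a minimal sketch: the helper name `matrix_bytes` is hypothetical, the 175-billion parameter count is the commonly cited figure for GPT-3, and only weights are counted (no activations, gradients, or optimizer state).

```python
# Back-of-the-envelope memory arithmetic for one MLP weight matrix.
BYTES_FP16 = 2  # bytes per FP16 parameter

def matrix_bytes(rows, cols, bytes_per_param=BYTES_FP16):
    """Storage for one dense weight matrix, in bytes."""
    return rows * cols * bytes_per_param

hidden = 12_288
ffn = 4 * hidden  # 49,152, the MLP inner dimension

full_matrix = matrix_bytes(hidden, ffn)

# Assumption: 175 billion parameters in total, all stored in FP16.
total_params = 175_000_000_000
model_bytes = total_params * BYTES_FP16

# Per-GPU share of one matrix for several tensor-parallel degrees T.
per_gpu = {t: full_matrix // t for t in (1, 2, 4, 8)}

print(f"one MLP matrix: {full_matrix / 1e9:.2f} GB")
print(f"whole model:    {model_bytes / 1e9:.0f} GB")
for t, size in per_gpu.items():
    print(f"T={t}: {size / 1e6:.0f} MB of that matrix per GPU")
```

With T=8, each GPU holds one eighth of the matrix, about 151 MB instead of 1.2 GB.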

Megatron-LM Approach (Column and Row Partitioning)

For a two-layer MLP: Z = GeLU(XA) × B, with intermediate activation Y = GeLU(XA)

1. Column-Parallel (First Layer): Matrix A is split column-wise across T GPUs. GPU i holds columns [i×k : (i+1)×k], where k is the number of columns of A divided by T. The input X is replicated on every GPU, and each GPU independently computes Y_i = GeLU(X × A_i). No communication is needed because GeLU is applied element-wise to independent output columns.

2. Row-Parallel (Second Layer): Matrix B is split row-wise across T GPUs. GPU i holds rows [i×k : (i+1)×k], with the same k so that the rows of B_i line up with the columns of Y_i, and computes the partial result Z_i = Y_i × B_i. The final output Z = sum(Z_i) requires an allreduce across the T GPUs.
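The two steps above can be sketched on a single machine. This is an illustrative simulation, not Megatron-LM code: plain Python lists stand in for GPU tensors, the loop over shards stands in for T GPUs, the final sum stands in for the allreduce, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(m):
    """Exact (erf-based) GeLU applied element-wise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split_columns(m, t):
    """Shard i holds columns [i*k : (i+1)*k]."""
    k = len(m[0]) // t
    return [[row[i * k:(i + 1) * k] for row in m] for i in range(t)]

def split_rows(m, t):
    """Shard i holds rows [i*k : (i+1)*k]."""
    k = len(m) // t
    return [m[i * k:(i + 1) * k] for i in range(t)]

def tensor_parallel_mlp(x, a, b, t):
    partials = []
    for a_i, b_i in zip(split_columns(a, t), split_rows(b, t)):
        y_i = gelu(matmul(x, a_i))         # column-parallel layer, no communication
        partials.append(matmul(y_i, b_i))  # row-parallel layer, partial result Z_i
    z = partials[0]
    for p in partials[1:]:                 # stands in for the allreduce (sum)
        z = add(z, p)
    return z

# Tiny example: X is 2x2, A is 2x4, B is 4x2, split across T=2 "GPUs".
X = [[1.0, -2.0], [0.5, 3.0]]
A = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]
B = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0], [-1.0, 0.5]]

reference = matmul(gelu(matmul(X, A)), B)
sharded = tensor_parallel_mlp(X, A, B, t=2)
max_diff = max(abs(r - s) for rr, rs in zip(reference, sharded) for r, s in zip(rr, rs))
print("max difference vs. unsharded MLP:", max_diff)
```

The sharded result matches the unsharded one up to floating-point rounding, which is why a single allreduce per MLP block is enough.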

Self-Attention Tensor Parallelism

Query, Key, and Value projections are split column-wise across GPUs (each GPU computes attention for a subset of attention heads). Since multi-head attention is independent per head, no communication is needed during the attention computation. Only the output projection (row-parallel) requires an allreduce.
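The head-wise split can be illustrated in the same style. The sketch below is a single-process simulation under stated assumptions: each "GPU" owns a contiguous group of heads, the output projection rows are grouped by head, random weights are used purely for illustration, and every function name is hypothetical.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def attention_head(x, wq, wk, wv):
    """Scaled dot-product attention for one head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(wq[0]))
    scores = matmul(q, [list(c) for c in zip(*k)])  # q times k transposed
    scores = [[s / scale for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)

def concat_columns(blocks):
    """Concatenate matrices side by side."""
    return [sum((blk[r] for blk in blocks), []) for r in range(len(blocks[0]))]

def head_parallel_attention(x, heads, w_o, t):
    d_head = len(heads[0][0][0])
    per_gpu = len(heads) // t
    partials = []
    for g in range(t):
        local = heads[g * per_gpu:(g + 1) * per_gpu]      # this GPU's heads
        local_out = concat_columns([attention_head(x, *h) for h in local])
        rows = w_o[g * per_gpu * d_head:(g + 1) * per_gpu * d_head]
        partials.append(matmul(local_out, rows))          # row-parallel projection
    z = partials[0]
    for p in partials[1:]:                                # stands in for the allreduce
        z = add(z, p)
    return z

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq, hidden, n_heads, d_head = 3, 4, 4, 2
x = rand_matrix(rng, seq, hidden)
heads = [tuple(rand_matrix(rng, hidden, d_head) for _ in range(3)) for _ in range(n_heads)]
w_o = rand_matrix(rng, n_heads * d_head, hidden)

reference = matmul(concat_columns([attention_head(x, *h) for h in heads]), w_o)
sharded = head_parallel_attention(x, heads, w_o, t=2)
max_diff = max(abs(r - s) for rr, rs in zip(reference, sharded) for r, s in zip(rr, rs))
print("max difference vs. unsharded attention:", max_diff)
```

No data crosses between the simulated GPUs until the final sum, mirroring the single allreduce after the output projection.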

Communication Cost

Each transformer layer requires 2 allreduce operations in the forward pass (one for MLP, one for attention) and 2 more in the backward pass, each communicating a tensor of size [batch × sequence × hidden_dim]. On NVLink (900 GB/s bidirectional on H100 NVSwitch), this takes:
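A rough estimate can be sketched as follows. Everything here is an assumption made for illustration: a bandwidth-optimal ring allreduce in which each GPU transfers 2(T−1)/T times the message size, 450 GB/s usable per direction (half of the 900 GB/s bidirectional figure), FP16 activations, no latency term, no overlap with compute, and hypothetical batch and sequence sizes.

```python
def allreduce_seconds(message_bytes, t, bandwidth_bytes_per_s):
    """Bandwidth-only ring allreduce estimate: 2*(T-1)/T * bytes / bandwidth."""
    return 2 * (t - 1) / t * message_bytes / bandwidth_bytes_per_s

# Hypothetical workload: one sequence of 2,048 tokens, GPT-3 hidden size, FP16.
batch, sequence, hidden_dim = 1, 2048, 12_288
bytes_per_element = 2
message_bytes = batch * sequence * hidden_dim * bytes_per_element

T = 8
per_direction_bandwidth = 450e9  # assumption: half of 900 GB/s bidirectional

one_allreduce = allreduce_seconds(message_bytes, T, per_direction_bandwidth)
per_layer_forward = 2 * one_allreduce  # one for attention, one for MLP

print(f"message size:        {message_bytes / 1e6:.1f} MB")
print(f"one allreduce:       {one_allreduce * 1e3:.3f} ms")
print(f"per layer (forward): {per_layer_forward * 1e3:.3f} ms")
```

Measured times will differ, since real collectives add launch latency and may use different algorithms, so the result is best read as a lower bound under these assumptions.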

Scaling Limits

Tensor parallelism is efficient only with ultra-fast interconnects (NVLink/NVSwitch within a node). Over slower interconnects (InfiniBand between nodes), the frequent per-layer allreduce becomes the bottleneck. Typical practice: T=4 or T=8 (within one DGX node) for tensor parallelism, combined with pipeline and data parallelism across nodes.
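The way these degrees combine can be sketched with a small helper. The function names are hypothetical, and the layout assumes ranks are numbered so that consecutive ranks sit on the same node, with 8 GPUs per node.

```python
def parallel_layout(world_size, tensor_parallel, pipeline_parallel, gpus_per_node=8):
    """Return the data-parallel degree and the tensor-parallel rank groups."""
    if tensor_parallel > gpus_per_node or gpus_per_node % tensor_parallel:
        raise ValueError("tensor-parallel degree must divide the GPUs in one node")
    if world_size % (tensor_parallel * pipeline_parallel):
        raise ValueError("world size must be divisible by T * P")
    data_parallel = world_size // (tensor_parallel * pipeline_parallel)
    # Consecutive ranks form one tensor-parallel group so it stays on NVLink.
    tp_groups = [list(range(start, start + tensor_parallel))
                 for start in range(0, world_size, tensor_parallel)]
    return data_parallel, tp_groups

def same_node(group, gpus_per_node=8):
    """True if every rank in the group lives on the same node."""
    return len({rank // gpus_per_node for rank in group}) == 1

# Example: 64 GPUs (8 nodes of 8), T=8 inside each node, P=4 across nodes.
data_parallel, tp_groups = parallel_layout(64, tensor_parallel=8, pipeline_parallel=4)
print("data-parallel degree:", data_parallel)
print("first tensor-parallel group:", tp_groups[0])
print("all groups within a node:", all(same_node(g) for g in tp_groups))
```

Keeping T no larger than the node size means the per-layer allreduce never leaves NVLink, while the less frequent pipeline and data-parallel traffic crosses InfiniBand.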

Tensor Parallelism is the intra-layer divide-and-conquer strategy that carves massive transformer layers into GPU-sized pieces — exploiting the mathematical structure of matrix multiplication to partition work with minimal communication overhead when connected by fast enough links.
