Context Parallelism is a distributed training and inference strategy that partitions long input sequences across multiple GPUs, enabling processing of context lengths (100K-1M+ tokens) that exceed single-device memory. It distributes the sequence dimension rather than the model weights (tensor parallelism) or the batch dimension (data parallelism): each device processes a portion of the sequence and communicates only for attention computations that span device boundaries.
What Is Context Parallelism?
- Definition: A parallelism strategy that splits the input sequence into chunks distributed across multiple devices. Each device holds the full model weights but processes only a portion of the input sequence; inter-device communication is required specifically for attention operations where tokens on one device need to attend to tokens on another.
- The Problem: A single attention layer on a 1M-token sequence requires an attention matrix of 1M × 1M = 1 trillion entries. At FP16 (2 bytes per entry), that is ~2TB of memory for ONE layer, far beyond any single GPU. Even 128K tokens requires ~32GB for the attention matrix alone (see the sketch after this list).
- The Solution: Split the sequence across N devices. Each device computes attention for its own chunk, communicating with other devices only when attention spans chunk boundaries.
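As a quick sanity check on these figures, here is a minimal back-of-the-envelope sketch (assuming FP16 scores and a naively materialized score matrix; kernels such as FlashAttention avoid materializing it, but the scaling is the same):

```python
# Back-of-the-envelope size of a naively materialized attention score matrix.
# Assumes FP16 (2 bytes per score) and a single head; illustrative only.
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    return seq_len * seq_len * bytes_per_element

for n in (128_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_matrix_bytes(n) / 1e9:,.1f} GB")
# 128,000 tokens -> ~32.8 GB; 1,000,000 tokens -> 2,000 GB (2TB)
```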
Types of Parallelism Comparison
| Strategy | What Is Distributed | Communication Pattern | Best For |
|----------|-------------------|---------------------|----------|
| Data Parallelism | Different samples on each device | All-reduce gradients after backward pass | Large batch training |
| Tensor Parallelism | Each layer's weight matrices split across devices | All-reduce within each layer | Large model width |
| Pipeline Parallelism | Different layers on different devices | Forward/backward activation passing between stages | Very deep models |
| Context Parallelism | Different sequence positions on each device | Attention KV exchange between devices | Long sequences (100K+) |
| Expert Parallelism | Different MoE experts on different devices | All-to-all routing of tokens to experts | MoE architectures |
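To make the "what is distributed" column concrete, here is an illustrative, framework-free sketch of how data parallelism and context parallelism shard the same activation tensor along different dimensions (tensor, pipeline, and expert parallelism split the model rather than this tensor, so they are not shown):

```python
# Illustrative only: which activation dimension each strategy shards.
# torch.chunk stands in for the sharding a distributed framework would do.
import torch

x = torch.randn(8, 4096, 1024)                       # (batch, seq_len, hidden)

data_parallel_shards    = torch.chunk(x, 4, dim=0)   # different samples per device
context_parallel_shards = torch.chunk(x, 4, dim=1)   # different sequence positions per device

print(data_parallel_shards[0].shape)     # torch.Size([2, 4096, 1024])
print(context_parallel_shards[0].shape)  # torch.Size([8, 1024, 1024])
```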
Context Parallelism Approaches
| Method | How It Works | Complexity | Communication |
|--------|-------------|-----------|--------------|
| Ring Attention | Devices arranged in a ring; KV blocks circulated in passes | O(n²/p) per device | Ring point-to-point send/recv |
| Sequence Parallelism (Megatron) | Split LayerNorm and Dropout along sequence dimension | Implementation-specific | All-gather / reduce-scatter |
| Striped Attention | Interleave sequence positions across devices (round-robin) | O(n²/p) per device | Same ring pattern; better load balance for causal attention |
| Ulysses | Split along sequence; all-to-all redistributes to head-parallel for attention | O(n²/p) per device | All-to-all communication |
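The difference between Ring Attention's contiguous chunks and Striped Attention's round-robin striping comes down to how sequence positions are assigned to devices; a small sketch (function names are illustrative):

```python
# Contiguous (Ring Attention) vs. round-robin striped (Striped Attention)
# assignment of sequence positions to devices; the ring communication is the same.
def contiguous_assignment(seq_len, num_devices):
    chunk = seq_len // num_devices
    return [list(range(d * chunk, (d + 1) * chunk)) for d in range(num_devices)]

def striped_assignment(seq_len, num_devices):
    return [list(range(d, seq_len, num_devices)) for d in range(num_devices)]

print(contiguous_assignment(16, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], ...]
print(striped_assignment(16, 4))     # [[0, 4, 8, 12], [1, 5, 9, 13], ...]
```

With a causal mask, contiguous chunks leave the device holding the earliest positions nearly idle while the device holding the last positions does almost all the work; striping gives every device a mix of early and late positions, balancing the load.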
Ring Attention (Most Common)
| Step | Action | Communication |
|------|--------|--------------|
| 1. Each device holds one chunk of Q, K, V | Local chunk of sequence positions | None |
| 2. Compute local attention (Q_local × K_local) | Process local-to-local attention | None |
| 3. Pass K, V blocks to next device in ring | Receive K, V from previous device | Point-to-point send/recv |
| 4. Compute cross-attention (Q_local × K_received) | Accumulate attention from remote chunks | Concurrent with step 3 |
| 5. Repeat for P-1 passes (P = number of devices) | All Q-K pairs computed | Ring communication overlapped with compute |
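Below is a minimal single-process simulation of these steps, assuming non-causal attention and a single head; the ring send/recv is simulated by indexing into a list of KV chunks, and the running max/denominator statistics are what let each device fold in one remote KV block at a time (the function name and shapes are illustrative, not from a specific library):

```python
# Single-process simulation of Ring Attention (non-causal, one head).
# The ring pass of steps 3-5 is simulated by indexing k_chunks/v_chunks;
# a real implementation overlaps this send/recv with the block compute.
import torch

def ring_attention(q_chunks, k_chunks, v_chunks):
    """q_chunks[i], k_chunks[i], v_chunks[i]: (chunk_len, d) tensors held by device i."""
    P = len(q_chunks)
    outputs = []
    for i in range(P):                       # "device" i; its queries never move
        q = q_chunks[i]
        m = torch.full((q.shape[0], 1), float("-inf"))   # running row max
        l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
        acc = torch.zeros_like(q)                        # running weighted sum of V
        for s in range(P):                   # P blocks total; block s comes from device (i + s) % P
            k, v = k_chunks[(i + s) % P], v_chunks[(i + s) % P]
            scores = q @ k.T / q.shape[-1] ** 0.5        # only one (chunk x chunk) block in memory
            new_m = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
            scale = torch.exp(m - new_m)                 # rescale previously accumulated statistics
            exp_scores = torch.exp(scores - new_m)
            l = l * scale + exp_scores.sum(dim=-1, keepdim=True)
            acc = acc * scale + exp_scores @ v
            m = new_m
        outputs.append(acc / l)
    return torch.cat(outputs, dim=0)

# Sanity check against ordinary full attention on the concatenated sequence.
torch.manual_seed(0)
make = lambda: [torch.randn(8, 16) for _ in range(4)]
qs, ks, vs = make(), make(), make()
q, k, v = torch.cat(qs), torch.cat(ks), torch.cat(vs)
full = torch.softmax(q @ k.T / 16 ** 0.5, dim=-1) @ v
assert torch.allclose(ring_attention(qs, ks, vs), full, atol=1e-5)
```

The per-block rescaling (the `scale` factor) is the same streaming-softmax trick used by FlashAttention; it allows KV blocks to be consumed in whatever order the ring delivers them while still producing exact attention output.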
Memory and Compute Scaling
| Devices | Sequence Per Device (1M total) | Attention Matrix Memory Per Device (FP16, naive) | Speedup |
|---------|-------------------------------|---------------------------|---------|
| 1 | 1M tokens | ~2TB (impossible) | 1× |
| 4 | 250K tokens | ~125GB | ~4× |
| 8 | 125K tokens | ~31GB | ~8× |
| 16 | 62.5K tokens | ~8GB (fits on one GPU) | ~16× |
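The memory column can be reproduced with a one-line calculation, assuming the naively materialized (n/p) × (n/p) local attention block at FP16 (function name is illustrative):

```python
# Per-device attention-block memory for a 1M-token sequence split over p devices,
# assuming a naively materialized (n/p) x (n/p) FP16 score block.
def per_device_attention_gb(total_tokens=1_000_000, devices=1, bytes_per_element=2):
    chunk = total_tokens // devices
    return chunk * chunk * bytes_per_element / 1e9

for p in (1, 4, 8, 16):
    print(f"{p:>2} devices: {per_device_attention_gb(devices=p):>8,.1f} GB per device")
# 1 -> 2,000.0, 4 -> 125.0, 8 -> 31.2, 16 -> 7.8
```

Note that this per-device memory falls quadratically with the device count (both rows and columns of the local block shrink), while the speedup grows roughly linearly, since the total O(n²) attention compute is divided across p devices.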
Context Parallelism is the essential scaling strategy for long-context AI: splitting input sequences across multiple devices to overcome the quadratic memory requirements of attention, enabling models to process 100K-1M+ token contexts by distributing the sequence dimension with ring or striped communication patterns that overlap data transfer with computation for near-linear scaling.