Attention Mechanisms Beyond Vanilla (Multi-Head, Multi-Query, Grouped-Query, Sliding Window)


Attention Mechanisms Beyond Vanilla (Multi-Head, Multi-Query, Grouped-Query, Sliding Window) describes the evolution of transformer attention from the original scaled dot-product formulation to specialized variants that improve computational efficiency, memory usage, and long-context handling, with each variant making a different tradeoff between representational capacity and inference speed.

Vanilla Scaled Dot-Product Attention

The foundational attention mechanism computes $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ where queries (Q), keys (K), and values (V) are linear projections of input embeddings. Computational complexity is O(n²d) where n is sequence length and d is head dimension. Memory for storing the full attention matrix scales as O(n²), becoming the primary bottleneck for long sequences. The softmax operation creates a probability distribution over all positions, enabling global context aggregation.
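
As a concrete reference point, here is a minimal PyTorch sketch of that formula (tensor shapes and names are illustrative); note that it materializes the full O(n²) score matrix, which is exactly the cost the variants below attack:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len): the O(n^2) matrix
    weights = F.softmax(scores, dim=-1)            # probability distribution over all positions
    return weights @ v                             # global context aggregation
```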

Multi-Head Attention (MHA)

- Parallel heads: Input is projected into h parallel attention heads, each with dimension d_k = d_model/h (typically h=32, d_k=128 for large models)
- Diverse representations: Each head can attend to different positions and learn different relationship types (syntactic, semantic, positional)
- Concatenation: Head outputs are concatenated and projected through a linear layer to produce the final output (see the sketch after this list)
- KV cache: During autoregressive inference, past key/value pairs for all heads are cached, consuming memory proportional to batch_size × n_heads × seq_len × d_k × 2 (the factor of 2 covering both K and V)
- Standard usage: Used in the original Transformer, BERT, GPT-2, and GPT-3
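
As referenced above, a minimal PyTorch sketch of the split-attend-concatenate flow (the weight matrices, shapes, and the omission of biases and masking are simplifying assumptions):

```python
import torch

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    # x: (batch, seq_len, d_model); each w_*: (d_model, d_model)
    b, n, d_model = x.shape
    d_k = d_model // n_heads

    def split_heads(t):  # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_k)
        return t.view(b, n, n_heads, d_k).transpose(1, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # During autoregressive inference, these k and v tensors are what the
    # KV cache stores, for every one of the n_heads heads.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    out = torch.softmax(scores, dim=-1) @ v           # per-head attention
    out = out.transpose(1, 2).reshape(b, n, d_model)  # concatenate heads
    return out @ w_o                                  # final linear projection
```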

Multi-Query Attention (MQA)

- Shared KV projections: All attention heads share a single set of key and value projections while maintaining separate query projections
- Memory reduction: KV cache size is reduced by a factor of h (the number of heads), which is critical for high-throughput inference serving; a worked size comparison follows this list
- Speed improvement: 3-10x faster inference with minimal quality degradation (typically <1% accuracy loss)
- Adoption: Used in PaLM, Falcon, and StarCoder models
- Trade-off: Slight reduction in model capacity due to shared representations, partially offset by faster training throughput that allows more tokens to be processed
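
To make the factor-of-h saving concrete, here is a back-of-the-envelope sketch; the batch size, sequence length, head count, and fp16 precision are assumptions chosen for illustration, not values from the models above:

```python
def kv_cache_bytes(batch, seq_len, n_kv_heads, d_k, bytes_per_elem=2):
    # Factor of 2 covers K and V; bytes_per_elem=2 assumes fp16/bf16.
    return 2 * batch * n_kv_heads * seq_len * d_k * bytes_per_elem

# MHA caches all 32 heads; MQA caches a single shared KV head.
mha = kv_cache_bytes(batch=1, seq_len=4096, n_kv_heads=32, d_k=128)
mqa = kv_cache_bytes(batch=1, seq_len=4096, n_kv_heads=1, d_k=128)
print(mha // 2**20, "MiB vs", mqa // 2**20, "MiB")  # 64 MiB vs 2 MiB per sequence
```

The 32x gap is exactly the factor-of-h reduction noted above.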

Grouped-Query Attention (GQA)

- Balanced approach: Keys and values are shared within groups of heads rather than all heads or no heads
- Group count: Typically 8 KV groups for 32 query heads (each KV group serves 4 query heads)
- Performance: Achieves near-MHA quality with near-MQA efficiency, making it the best practical compromise in most settings; a minimal sketch follows this list
- Adoption: LLaMA 2 (70B), Mistral, LLaMA 3, and most modern LLMs use GQA
- Uptraining from MHA: Existing MHA models can be converted to GQA by mean-pooling adjacent KV heads and briefly fine-tuning (roughly 5% of the original pretraining compute)
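
A minimal sketch of grouped KV sharing, assuming the KV heads are simply broadcast to the query heads in their group (shapes and names are illustrative):

```python
import torch

def grouped_query_attention(q, k, v, n_groups):
    # q: (batch, n_q_heads, seq_len, d_k); k, v: (batch, n_groups, seq_len, d_k)
    b, n_q_heads, n, d_k = q.shape
    group_size = n_q_heads // n_groups          # query heads served per KV group
    k = k.repeat_interleave(group_size, dim=1)  # expand KV heads to match query heads
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```

Setting n_groups=1 recovers MQA, and n_groups equal to the number of query heads recovers MHA, which is why GQA is often described as interpolating between the two.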

Sliding Window Attention (SWA)

- Local attention: Each token attends only to a fixed window of w surrounding tokens rather than the full sequence
- Linear complexity: Computation scales as O(n × w) instead of O(n²), enabling processing of very long sequences; a mask sketch follows this list
- Information propagation: With L layers and window size w, information can propagate L × w positions through the network—sufficient for most tasks with adequate depth
- Mistral: Mistral 7B uses sliding window attention with w=4096 in every layer, relying on depth (per the propagation bound above) to cover contexts longer than a single window
- Longformer pattern: Combines sliding window (local) with global attention tokens (e.g., [CLS] token attends to all positions) for tasks requiring global context
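
A minimal sketch of a causal sliding-window mask that can be applied to the score matrix before the softmax; this is the naive O(n²) masking formulation for clarity, not the banded kernels production systems actually use:

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: token i sees tokens [i - window + 1, i].
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    return (j <= i) & (j > i - window)      # causal AND within the window

# Usage: scores.masked_fill(~sliding_window_mask(n, w), float("-inf"))
```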

Flash Attention and Hardware-Aware Implementations

- IO-aware algorithm: FlashAttention (Dao et al., 2022) computes exact attention without materializing the O(n²) attention matrix by tiling the computation to fit in on-chip SRAM
- Speedup: 2-4x faster than standard attention and uses O(n) memory instead of O(n²)
- FlashAttention-2: Improved parallelism across sequence length and better work partitioning between CUDA warps, achieving 50-73% of theoretical peak FLOPS
- FlashAttention-3: Leverages Hopper GPU features (TMA, FP8, warp specialization) for further speedup on H100s
- Broad adoption: Now the default fused attention path in PyTorch (via scaled_dot_product_attention), HuggingFace Transformers, and most major training frameworks; a usage sketch follows this list
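
In PyTorch 2.x, these fused kernels are reached through torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention-style backend when the device, dtype, and shapes are eligible (the shapes below are illustrative and assume a CUDA device):

```python
import torch
import torch.nn.functional as F

# (batch, n_heads, seq_len, d_k) in fp16 on a CUDA device is eligible for the
# fused backend; the (seq_len, seq_len) score matrix is never materialized.
q = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```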

Emerging Attention Variants

- Ring Attention: Distributes attention computation across multiple devices by passing KV blocks in a ring topology, enabling near-infinite context lengths
- Linear attention: Replaces softmax with kernel feature maps to achieve O(n) complexity, but may sacrifice quality on tasks requiring precise attention patterns (see the sketch after this list)
- Differential attention: Computes attention as the difference between two softmax attention maps, reducing noise and improving signal extraction
- Multi-head latent attention (MLA): DeepSeek-V2's approach that jointly compresses KV into a low-rank latent space, reducing KV cache by 93% while maintaining quality
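
Of these variants, linear attention is the simplest to sketch. The following non-causal version assumes the elu(x) + 1 feature map popularized by Katharopoulos et al. (2020); note how a small d_k × d_v summary replaces the n × n score matrix:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, n_heads, seq_len, d_k); v: (batch, n_heads, seq_len, d_v)
    phi = lambda t: F.elu(t) + 1  # positive feature map standing in for softmax
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v  # (d_k, d_v) summary: O(n * d^2), not O(n^2)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # per-query normalizer
    return (q @ kv) / (z + eps)
```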

The evolution of attention mechanisms reflects the fundamental tension between model expressiveness and computational practicality, with modern variants like GQA and FlashAttention enabling frontier-scale models to serve huge user populations at interactive speeds.
