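Multi-Query Attention (MQA) and Grouped Query Attention (GQA) are attention variants that share key-value heads across multiple query heads. This shrinks the KV cache by the ratio of query heads to key-value heads (8x for a typical GQA configuration with 64 query heads and 8 key-value heads, and by the full head count for MQA) and speeds up memory-bound decoding, while staying close to the quality of standard multi-head attention. Latency gains of 25-40% are often quoted, though the actual figure depends on hardware, batch size, and context length.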
Standard Multi-Head Attention Baseline:
- Head Structure: Q, K, V are each split into h heads (for example h=32 in Llama 2 7B and h=64 in Llama 2 70B) with per-head dimension d_k = d_model/h
- Attention Computation: each head independently computes Attention(Q_i, K_i, V_i) = softmax(Q_i·K_i^T/√d_k)·V_i
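- Parameter Count: the query, key, and value projections are each a d_model × (h×d_k) = d_model × d_model matrix, so each holds d_model² parameters per layer
- KV Cache Size: K and V for all previous tokens are stored as two tensors of shape [seq_len, h, d_k] per layer. A 70B-class model with 80 layers, h=64, and d_k=128 would need about 80 GiB per sequence at 32K context in fp16 if it used full multi-head attention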
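The cache size can be checked with a short back-of-the-envelope calculation. The sketch below assumes a 70B-class shape (80 layers, 64 heads, d_k = 128) and 2-byte fp16 elements; the function name and its defaults are illustrative and not taken from any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer, head, and token."""
    # The factor 2 covers the two cached tensors (K and V).
    return 2 * n_layers * n_kv_heads * d_k * seq_len * batch * bytes_per_elem


# Assumed 70B-class shape with full multi-head attention (64 KV heads).
mha_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_k=128, seq_len=32 * 1024)
print(f"{mha_bytes / 2**30:.1f} GiB per sequence")  # 80.0 GiB per sequence
```

Under these assumptions the cache alone fills an 80GB accelerator for a single 32K-token sequence, before any weights are loaded.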
Multi-Query Attention (MQA) Architecture:
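- Single KV Head: a single K, V head is shared by all query heads: Attention(Q_i, K, V) where K, V ∈ ℝ^(seq_len × d_k)
- Parameter Reduction: the output width of the K and V projections drops from h×d_k to d_k, an h-fold reduction (96x for a 96-head model, 64x for a 64-head model)
- KV Cache Reduction: memory per layer drops from [seq_len, h, d_k] to [seq_len, d_k], an h-fold reduction (for a 64-head 70B-class model at 32K context in fp16, roughly 80 GiB → 1.25 GiB)
- Quality Trade-off: typically a small quality loss relative to standard attention (figures of 1-2% on benchmarks are often quoted), which varies by model and task
- Inference Speedup: autoregressive decoding is usually memory-bandwidth-bound because every step re-reads the whole KV cache; shrinking the cache cuts that traffic, with latency improvements of around 25-35% commonly claimed and the largest gains on long sequences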
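The MQA computation above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming unbatched inputs and no causal mask; the function names and tensor layout are choices made for this example, not a reference implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(q, k, v):
    """q: [h, seq, d_k]; k, v: [seq, d_k], shared by all h query heads."""
    d_k = q.shape[-1]
    # The 2-D k.T broadcasts against the leading head axis of q.
    scores = q @ k.T / np.sqrt(d_k)      # [h, seq, seq]
    weights = softmax(scores, axis=-1)   # each head keeps its own weights
    return weights @ v                   # [h, seq, d_k]


rng = np.random.default_rng(0)
h, seq, d_k = 4, 6, 8
q = rng.standard_normal((h, seq, d_k))
k = rng.standard_normal((seq, d_k))
v = rng.standard_normal((seq, d_k))
out = multi_query_attention(q, k, v)
print(out.shape)  # (4, 6, 8)
```

Only `k` and `v` would be cached during decoding, so the cache holds one head's worth of data regardless of how many query heads there are.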
Grouped Query Attention (GQA) - Balanced Approach:
- Intermediate Grouping: the h query heads are divided into g groups, each sharing one key-value head, so each key-value head serves h/g query heads (h/g = 4-8 is typical)
- Flexibility: g ranges from 1 (MQA) to h (standard attention), giving a stepwise trade-off between KV cache size and quality
- Common Configurations: h=64 query heads with g=8 key-value heads (8x KV reduction) in Llama 2 70B; h=32 with g=8 (4x reduction) in Mistral 7B
- Quality Performance: with g=8, quality is reported to stay close to standard attention while the KV cache shrinks by h/g (8x at h=64); empirically better than MQA
- Adoption: Llama 2 70B uses GQA with 8 key-value heads, and GQA has become a common default for recent open models
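To make the trade-off concrete, the sketch below sweeps the number of key-value heads g for one model shape and reports the cache cost per token. It assumes a 70B-class shape (80 layers, 64 query heads, d_k = 128) and fp16 storage; the helper name is hypothetical.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_k, bytes_per_elem=2):
    """Bytes of K and V cached per token, summed over all layers."""
    return 2 * n_layers * n_kv_heads * d_k * bytes_per_elem


N_LAYERS, N_QUERY_HEADS, D_K = 80, 64, 128  # assumed 70B-class shape

for name, g in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    per_token = kv_bytes_per_token(N_LAYERS, g, D_K)
    reduction = N_QUERY_HEADS // g
    print(f"{name}: g={g:2d}  {per_token // 1024:5d} KiB/token  {reduction}x vs MHA")
```

For this shape the per-token cost works out to 2560 KiB for MHA, 320 KiB for GQA with 8 key-value heads, and 40 KiB for MQA.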
Mathematical Formulation:
- GQA Attention: Attention(Q_{i,j}, K_i, V_i) where i ∈ [0, g), j ∈ [0, h/g) groups queries by key-value head
- Broadcasting: each of g key-value heads broadcasts to h/g query heads — implemented as reshape and expand operations
- Gradient Flow: gradients from all query heads in group accumulate to single key-value head — implicit head collaboration
- Attention Pattern: query heads in a group attend over the same keys and values, but each computes its own attention weights; the sharing makes the layer less expressive than full multi-head attention and more expressive than MQA
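The reshape-and-broadcast formulation can be sketched in NumPy as follows. This is a minimal illustration, assuming unbatched inputs, no causal mask, and the head layout in which query head i·(h/g) + j uses key-value head i; the function names are choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """q: [h, seq, d_k]; k, v: [g, seq, d_k], with h divisible by g."""
    h, seq, d_k = q.shape
    g = k.shape[0]
    assert h % g == 0, "query heads must divide evenly into KV groups"
    # Query head i * (h // g) + j is assigned to key-value head i.
    q_grouped = q.reshape(g, h // g, seq, d_k)
    k_b = k[:, None]                               # [g, 1, seq, d_k]
    v_b = v[:, None]                               # [g, 1, seq, d_k]
    scores = q_grouped @ np.swapaxes(k_b, -1, -2)  # [g, h/g, seq, seq]
    weights = softmax(scores / np.sqrt(d_k), axis=-1)
    out = weights @ v_b                            # [g, h/g, seq, d_k]
    return out.reshape(h, seq, d_k)


rng = np.random.default_rng(0)
h, g, seq, d_k = 8, 2, 5, 4
q = rng.standard_normal((h, seq, d_k))
k = rng.standard_normal((g, seq, d_k))
v = rng.standard_normal((g, seq, d_k))
out = grouped_query_attention(q, k, v)

# Reference: materialise one K/V copy per query head, then run plain MHA.
k_full = np.repeat(k, h // g, axis=0)
v_full = np.repeat(v, h // g, axis=0)
ref = softmax(q @ np.swapaxes(k_full, -1, -2) / np.sqrt(d_k)) @ v_full
print(np.allclose(out, ref))  # True
```

Broadcasting avoids copying K and V h/g times; the `np.repeat` reference produces the same result but materialises the expanded tensors. Setting g = 1 recovers MQA and g = h recovers standard multi-head attention.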
Inference Optimization Impact:
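- Memory Bandwidth: decoding is limited by reading the KV cache and weights from HBM (about 2 TB/s on an 80GB A100) rather than by compute (312 TFLOPS peak fp16 on the same GPU); a smaller cache moves each decode step closer to the compute limit
- Batch Size Scaling: because the cache per sequence is h/g times smaller, the batch that fits before KV cache OOM grows by roughly the same factor (8x for GQA with h/g=8, h-fold for MQA), so a server can hold many more concurrent requests
- Prefill-Decode Overlap: a smaller cache leaves more memory and bandwidth headroom for scheduling compute-bound prefill alongside bandwidth-bound decode; throughput gains of 30-50% are claimed, but depend heavily on the serving stack and workload
- Long Context: GQA makes long contexts far cheaper. For a 70B-class model with 80 layers and d_k=128 in fp16, a 100K-token cache is about 31 GiB with 8 key-value heads versus about 244 GiB with 64, so only the former fits on a single 80GB A100 (model weights need memory in addition)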
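The batch-size and bandwidth effects can be estimated with simple arithmetic. The sketch below assumes a 70B-class shape (80 layers, d_k = 128, fp16), 40 GiB of accelerator memory left for the cache after weights, and about 2 TB/s of memory bandwidth; all three numbers and the helper names are assumptions for illustration.

```python
def per_sequence_cache_bytes(n_layers, n_kv_heads, d_k, context_len, bytes_per_elem=2):
    """K and V cache for one sequence, summed over all layers."""
    return 2 * n_layers * n_kv_heads * d_k * context_len * bytes_per_elem


def max_batch(cache_budget_bytes, per_sequence_bytes):
    """Largest number of sequences whose caches fit in the budget."""
    return cache_budget_bytes // per_sequence_bytes


def kv_read_ms(cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to stream the cache once per decode step."""
    return 1e3 * cache_bytes / bandwidth_bytes_per_s


BUDGET = 40 * 2**30   # assumed memory left for the cache after weights
BANDWIDTH = 2e12      # assumed ~2 TB/s of HBM bandwidth

for name, n_kv_heads in [("MHA", 64), ("GQA", 8)]:
    per_seq = per_sequence_cache_bytes(80, n_kv_heads, 128, context_len=4096)
    batch = max_batch(BUDGET, per_seq)
    print(f"{name}: batch {batch}, {kv_read_ms(per_seq, BANDWIDTH):.2f} ms/step per sequence")
```

With a 4096-token context this gives a maximum batch of 4 for MHA and 32 for GQA with 8 key-value heads, and the per-step cache read time falls by the same factor of 8. The estimate ignores weight reads and kernel overheads, so it is a lower bound rather than a latency prediction.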
Practical Deployment Benefits:
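- Latency Reduction: GQA mainly lowers per-token decode latency, since first-token (prefill) latency is compute-bound; figures such as 120ms falling to 80-90ms for Llama 2 70B are illustrative and vary with hardware, batch size, and context length, but the gain matters for interactive applications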
- Throughput: per-GPU serving throughput can rise by 3-4x, for example from 50 req/s to 150-200 req/s, once larger batches fit in memory
- Cost: fewer GPUs are needed for the same throughput (1000 req/s needs 20 GPUs at 50 req/s each but only 5-7 at 150-200 req/s each), a 65-75% reduction in GPU count
- Mobile Deployment: a smaller KV cache helps 13B-class models run on edge devices with 8GB of DRAM, provided the weights are also quantized to fit
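The GPU-count arithmetic behind the cost estimate is a single ceiling division. The sketch below takes the per-GPU throughput figures quoted above at face value; the helper name is hypothetical.

```python
import math


def gpus_needed(target_req_per_s, per_gpu_req_per_s):
    """Whole GPUs required to sustain a target request rate."""
    return math.ceil(target_req_per_s / per_gpu_req_per_s)


TARGET = 1000  # req/s, as in the cost example above
for rate in (50, 150, 200):
    print(f"{rate} req/s per GPU -> {gpus_needed(TARGET, rate)} GPUs")
```

This yields 20 GPUs at 50 req/s per GPU, 7 at 150 req/s, and 5 at 200 req/s.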
Model Architecture Adoption:
- Llama 2 Family: the 70B model (and the 34B variant) uses GQA with g=8 key-value heads, while the 7B and 13B models keep standard multi-head attention
- Mistral 7B: uses GQA (32 query heads, 8 key-value heads) for faster inference and a smaller KV cache
- Falcon: Falcon 7B uses MQA, and Falcon 40B uses a grouped variant with 8 key-value heads
- GPT-style Models: the attention details of closed API models such as OpenAI's are not publicly documented, so whether they use standard attention, MQA, or GQA is unknown
Advanced Techniques:
- Grouped Query with Recomputation: during training, keeping only the g key-value heads in memory and recomputing the expanded per-query-head tensors in the backward pass, which reduces activation memory further
- Dynamic Head Grouping: choosing the number of groups per layer adaptively, for example based on how sparse that layer's attention patterns are; a research direction rather than standard practice
- Cross-Attention Variants: applying GQA to encoder-decoder cross-attention for 4-8x reduction — enables larger batch sizes in sequence-to-sequence models
- Hybrid Approaches: using GQA in early layers, where coarser attention is assumed to be tolerable, and standard attention in final layers to balance quality and efficiency
Multi-Query and Grouped Query Attention have changed LLM inference economics: by shrinking the KV cache by the ratio of query heads to key-value heads (typically 4-8x for GQA, and the full head count for MQA), they make large models practical to deploy while staying close to the quality of standard multi-head attention.