Multi-Query and Grouped Query Attention (GQA) are attention variants that share key-value representations across multiple query heads → reducing KV cache memory by 8-16x and decoder-only inference latency by 25-40% while maintaining near-identical quality to standard multi-head attention.
Standard Multi-Head Attention Baseline:
- Head Structure: Q, K, V each split into h heads (e.g., h=32 for 7B-scale models, 64-96 for the largest models) with dimension d_k = d_model/h
- Attention Computation: each head independently computes Attention(Q_i, K_i, V_i) = softmax(Q_i·K_i^T/√d_k)·V_i
- Parameter Count: the Q, K, V projections each map d_model to h×d_k = d_model dimensions, i.e., a d_model×d_model weight matrix each → three full matrix multiplications per token
- KV Cache Size: storing K, V for all previous tokens creates a [seq_len, h, d_k] tensor per layer → a 70B Llama with 32K context needs roughly 78GB of fp16 KV cache per sequence (a minimal sketch follows this list)
Multi-Query Attention (MQA) Architecture:
- Single KV Head: a single K, V is shared across all Q heads: Attention(Q_i, K, V) where K, V ∈ ℝ^(seq_len × d_k)
- Parameter Reduction: the K and V projections shrink from d_model×(h×d_k) to d_model×d_k → 96x fewer KV projection parameters for a 96-head model
- KV Cache Reduction: memory drops from [seq_len, h, d_k] to [seq_len, d_k] → 96x reduction (78GB→0.8GB for 70B model)
- Quality Trade-off: 1-2% accuracy loss on benchmarks compared to standard attention → minimal impact on downstream performance
- Inference Speedup: decoding shifts from memory-bandwidth-bound toward compute-bound, cutting per-token latency by 25-35% → gains are largest for long sequences (see the sketch below)
Grouped Query Attention (GQA) - Balanced Approach:
- Intermediate Grouping: partitioning the h query heads into g groups (g=4-8 typical), each group sharing one key-value head
- Flexibility: scaling from MQA (g=1) to standard attention (g=h) with continuous parameter-quality trade-off
- Common Configurations: h=64 query heads with g=8 key-value heads (8x KV reduction), as in Llama 2 70B; Mistral 7B uses h=32 with g=8 (4x reduction)
- Quality Performance: with g=8, achieving 99.5% quality of standard attention while reducing KV cache 8x → empirically better than MQA
- Adoption: Llama 2 70B uses GQA with 8 key-value heads → now a production standard for modern models (a full GQA sketch follows)
Mathematical Formulation:
- GQA Attention: Attention(Q_{i,j}, K_i, V_i) where i ∈ [0, g) indexes key-value heads and j ∈ [0, h/g) indexes the query heads within each group → queries are grouped by the key-value head they share
- Broadcasting: each of g key-value heads broadcasts to h/g query heads → implemented as reshape and expand operations
- Gradient Flow: gradients from all query heads in a group accumulate into their single key-value head → implicit head collaboration
- Attention Pattern: all query heads in a group read the same K, V but compute their own attention weights over token positions → more expressive than MQA, slightly less than full multi-head attention
Inference Optimization Impact:
- Memory Bandwidth: the decode latency bottleneck shifts from KV cache reads (bounded by HBM bandwidth, roughly 2TB/s on an A100) toward compute (312 TFLOPS fp16 peak)
- Batch Size Scaling: with MQA/GQA, batch size can grow 8-16x before the KV cache hits OOM → servers handle roughly 10x more concurrent requests
- Prefill-Decode Overlap: GQA makes it easier to overlap compute-heavy prefill with bandwidth-light decode in the serving pipeline → 30-50% throughput improvement
- Long Context: GQA enables 100K+ context windows on a single GPU (Llama 2 Long on an 80GB A100) → infeasible with standard attention (see the calculator below)
Practical Deployment Benefits:
- Latency Reduction: 70B Llama 2 per-token decode latency drops from roughly 120ms to 80-90ms with GQA → critical for interactive applications
- Throughput: serving platform throughput increases from 50 req/s to 150-200 req/s per GPU → 3-4x improvement
- Cost: fewer GPUs needed for same throughput (200→50 GPUs for 1000 req/s) → 75% cost reduction
- Mobile Deployment: GQA enables running 13B models on edge devices with KV cache fitting in 8GB DRAM
Model Architecture Adoption:
- Llama 2 Family: only the 70B model adopts GQA (with 8 key-value heads); the 7B and 13B models retain standard multi-head attention → GQA became Meta's default for subsequent large models
- Mistral 7B: uses GQA (32 query heads, 8 key-value heads) alongside sliding-window attention, delivering strong performance with fewer parameters than larger Llama models
- Falcon 40B: adopts a multi-query/grouped scheme with a small number of shared KV heads, reaching quality competitive with 65B-class LLaMA models at roughly 40% fewer parameters
- GPT-style Models: OpenAI does not disclose attention details for its API models, so whether they use MQA or GQA internally is unconfirmed
Advanced Techniques:
- Grouped Query with Recomputation: storing only g key-value heads, recomputing intermediate query-head values during the backward pass → reduces cache memory further
- Dynamic Head Grouping: adaptively grouping based on attention pattern sparsity per layer → compute-aware optimization
- Cross-Attention Variants: applying GQA to encoder-decoder cross-attention for 4-8x reduction → enables larger batch sizes in sequence-to-sequence models
- Hybrid Approaches: using GQA in early layers (which tolerate the reduced KV resolution) and standard attention in final layers → balances quality and efficiency (a configuration sketch follows)
Multi-Query and Grouped Query Attention are transforming LLM inference economics → enabling practical deployment of large models through 8-16x KV cache reduction while maintaining 99%+ quality compared to standard multi-head attention.