Multi-Query Attention (MQA)

Keywords: multi-query attention, MQA, efficient attention

Multi-Query Attention (MQA) is an efficient attention variant that shares a single key-value head across all query heads, dramatically reducing KV cache memory requirements and accelerating inference. Standard multi-head attention keeps separate key and value projections for each head, so the KV cache grows linearly with the number of heads. MQA shares one key-value pair across all query heads, shrinking the KV cache by a factor equal to the number of heads (typically 8-32×). This enables larger batch sizes, longer sequences, and faster decoding, particularly for autoregressive generation, where the KV cache dominates memory traffic. The quality impact is small: models trained (or uptrained) with the shared head reach performance close to multi-head attention. Grouped-Query Attention (GQA) provides a middle ground, using several KV heads (fewer than the number of query heads) to balance quality and efficiency. MQA is especially valuable for inference serving, where memory bandwidth is the bottleneck. The technique has been adopted in models such as PaLM and Falcon, while Llama 2 uses the related GQA variant. MQA remains a key optimization for practical LLM deployment.
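To make the sharing concrete, below is a minimal sketch of an MQA layer in PyTorch. The module and parameter names are illustrative, not taken from any particular model: queries keep num_heads heads, while the key and value projections produce a single head_dim-sized head that every query head attends over.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiQueryAttention(nn.Module):
    """Illustrative multi-query attention: many query heads, one shared KV head."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # Queries: one projection per head (packed into d_model).
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Keys/values: a single head shared by all query heads.
        self.k_proj = nn.Linear(d_model, self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, self.head_dim, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.num_heads, self.head_dim)  # (b, s, h, d)
        k = self.k_proj(x)  # (b, s, d) -- single KV head, no head axis
        v = self.v_proj(x)  # (b, s, d)
        # Every query head attends over the same shared keys.
        scores = torch.einsum("bqhd,bkd->bhqk", q, k) / self.head_dim ** 0.5
        # Causal mask for autoregressive generation.
        mask = torch.triu(torch.ones(s, s, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = torch.einsum("bhqk,bkd->bqhd", attn, v).reshape(b, s, -1)
        return self.out_proj(out)
```

During incremental decoding, only k and v need to be cached per layer, so the cache holds one head's worth of keys and values instead of num_heads, which is where the roughly num_heads-fold memory saving over standard multi-head attention comes from.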
