Attention Mechanisms are the neural network operations that dynamically compute weighted combinations of value vectors based on query-key similarity, enabling each element in a sequence to gather information from every other element based on relevance. They form the computational core of transformer architectures and are arguably the single most impactful innovation in modern deep learning.
Scaled Dot-Product Attention
The fundamental operation: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
where Q (queries), K (keys), V (values) are linear projections of the input. The dot product QKᵀ computes pairwise similarity between all query-key pairs, softmax normalizes each row to a probability distribution, and the result weights the values. The √dₖ scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into saturated regions with vanishing gradients.
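As a concrete reference, here is a minimal NumPy sketch of the formula above; the random projection matrices stand in for learned weights, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (N, d_k); V: (N, d_v). Returns (N, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # attention-weighted values

rng = np.random.default_rng(0)
N, d = 4, 8
x = rng.normal(size=(N, d))
# In practice Wq, Wk, Wv are learned; random matrices keep the sketch self-contained.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)  # shape (4, 8)
```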
Multi-Head Attention
Instead of one attention function with d-dimensional keys, queries, and values, the computation splits into h parallel heads, each with dₖ = d/h dimensions. Each head can attend to different aspects of the input (syntactic structure, semantic similarity, positional relationships). The concatenated head outputs are linearly projected to produce the final output.
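A sketch of the head-splitting logic, reusing scaled_dot_product_attention from the block above; slicing the projected d-dimensional vectors into h chunks of size dₖ = d/h is the essential move, and Wo is an assumed output-projection matrix, not anything named in the text.

```python
def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """x: (N, d); weight matrices: (d, d); h must divide d."""
    N, d = x.shape
    d_k = d // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)  # this head's slice of the model dim
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    # Concatenate the h head outputs back to (N, d), then apply the final projection.
    return np.concatenate(heads, axis=-1) @ Wo
```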
Self-Attention vs. Cross-Attention
- Self-Attention: Q, K, V all derive from the same sequence. Each token attends to every other token in the same sequence. Used in encoder layers and decoder masked self-attention.
- Cross-Attention: Q comes from one sequence (decoder), K and V from another (encoder output). Enables the decoder to attend to relevant encoder positions. Used in encoder-decoder models, VLMs (text queries attend to visual features), and diffusion U-Nets (visual features attend to text conditioning).
- Causal (Masked) Attention: A mask prevents tokens from attending to future positions: attention_mask[i][j] = -∞ for j > i. Essential for autoregressive generation; see the sketch after this list.
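The sketch referenced above: a causal variant of the earlier NumPy function, where an upper-triangular -∞ mask zeroes out attention to future positions once the softmax is applied.

```python
def causal_mask(N):
    # -inf strictly above the diagonal: position i may attend only to 0..i.
    return np.triu(np.full((N, N), -np.inf), k=1)

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                        # exp(-inf) = 0: future masked out
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```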
KV Cache
During autoregressive inference, each new token only needs its own query vector; the keys and values from all previous tokens are cached and reused. This reduces per-token computation from O(N²) to O(N) but requires O(N × L × d) memory (sequence length N, layer count L, model dimension d) that grows with sequence length. KV cache memory management is the primary bottleneck for long-context LLM serving.
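A toy decoding loop, again building on the earlier sketch, shows why the cache helps: each step projects only the new token's query and appends one key/value row, so attention for that token touches the N cached rows rather than recomputing all pairs. The cache structure here is an illustrative assumption, not a real serving implementation.

```python
def decode_step(x_t, Wq, Wk, Wv, cache):
    """x_t: (1, d) embedding of the newest token."""
    q = x_t @ Wq                                    # query for the new token only
    cache["K"] = np.vstack([cache["K"], x_t @ Wk])  # append this token's key
    cache["V"] = np.vstack([cache["V"], x_t @ Wv])  # append this token's value
    # Causality is implicit: the cache holds only past and current tokens.
    return scaled_dot_product_attention(q, cache["K"], cache["V"])

cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
for t in range(N):
    y_t = decode_step(x[t:t + 1], Wq, Wk, Wv, cache)  # (1, d) output per step
```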
Efficient Attention Variants
- Flash Attention: Fuses the attention computation into a single GPU kernel that operates on tiles of Q, K, V in SRAM, avoiding materialization of the N×N attention matrix in HBM. Reduces memory from O(N²) to O(N) and achieves roughly 2-4× wall-clock speedup. The default attention implementation in most modern frameworks.
- Multi-Query Attention (MQA): All heads share a single K and V projection, reducing KV cache size by h× with minor quality loss.
- Grouped-Query Attention (GQA): Groups of heads share K/V projections (e.g., 8 KV groups for 32 heads = 4× KV cache reduction). Used in LLaMA 2 70B, Mistral, and most production LLMs as the sweet spot between MHA and MQA; a sketch of the KV-head sharing follows below.
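The sketch below illustrates the KV-head sharing at the heart of GQA: a small set of cached K/V heads is repeated to serve all query heads. Shapes and names are assumptions for illustration.

```python
def expand_kv(kv, n_heads, n_kv_heads):
    """kv: (n_kv_heads, N, d_k) -> (n_heads, N, d_k) by repeating each group."""
    group = n_heads // n_kv_heads        # query heads served by each KV head
    return np.repeat(kv, group, axis=0)

n_heads, n_kv_heads, d_k = 32, 8, 16                # 32/8 = 4x smaller KV cache
K_cache = rng.normal(size=(n_kv_heads, N, d_k))     # only 8 K heads are stored
K_all = expand_kv(K_cache, n_heads, n_kv_heads)     # viewed as 32 heads at compute time
```

Setting n_kv_heads = 1 recovers MQA, and n_kv_heads = n_heads recovers standard multi-head attention, which is why GQA interpolates between the two.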
Attention Mechanisms are the core computation that makes transformers transformers: the dynamic, content-dependent information routing that replaced fixed convolution kernels and recurrent state updates with a universally flexible mechanism for relating any part of the input to any other.