Attention Mechanisms are the neural network operations that dynamically compute weighted combinations of value vectors based on query-key similarity, enabling each element in a sequence to gather information from every other element based on relevance. They form the computational core of transformer architectures and are arguably the single most impactful innovation in modern deep learning.
Scaled Dot-Product Attention
The fundamental operation: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
where Q (queries), K (keys), V (values) are linear projections of the input. The dot product QKᵀ computes pairwise similarity between all query-key pairs, softmax normalizes each row to a probability distribution, and the result weights the values. The √dₖ scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into saturated regions with vanishing gradients.
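As a concrete reference, here is a minimal NumPy sketch of the formula above; the random projection matrices stand in for learned weights, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (N, d_k); V: (N, d_v). Returns (N, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # attention-weighted values

rng = np.random.default_rng(0)
N, d = 4, 8
x = rng.normal(size=(N, d))
# In practice Wq, Wk, Wv are learned; random matrices keep the sketch self-contained.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)  # shape (4, 8)
```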
Multi-Head Attention
Instead of one attention function with d-dimensional keys, queries, and values, the computation splits into h parallel heads, each with dₖ = d/h dimensions. Each head can attend to different aspects of the input (syntactic structure, semantic similarity, positional relationships). The concatenated head outputs are linearly projected to produce the final output.
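A sketch of the head-splitting logic, reusing scaled_dot_product_attention from the block above; slicing the projected d-dimensional vectors into h chunks of size dₖ = d/h is the essential move, and Wo is an assumed output-projection matrix, not anything named in the text.

```python
def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """x: (N, d); weight matrices: (d, d); h must divide d."""
    N, d = x.shape
    d_k = d // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)  # this head's slice of the model dim
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    # Concatenate the h head outputs back to (N, d), then apply the final projection.
    return np.concatenate(heads, axis=-1) @ Wo
```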
Self-Attention vs. Cross-Attention
- Self-Attention: Q, K, V all derive from the same sequence. Each token attends to every other token in the same sequence. Used in encoder layers and decoder masked self-attention.
- Cross-Attention: Q comes from one sequence (decoder), K and V from another (encoder output). Enables the decoder to attend to relevant encoder positions. Used in encoder-decoder models, VLMs (text queries attend to visual features), and diffusion U-Nets (visual features attend to text conditioning).
- Causal (Masked) Attention: A mask prevents tokens from attending to future positions: attention_mask[i][j] = -∞ for j > i. Essential for autoregressive generation; see the sketch after this list.
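The sketch referenced above: a causal variant of the earlier NumPy function, where an upper-triangular -∞ mask zeroes out attention to future positions once the softmax is applied.

```python
def causal_mask(N):
    # -inf strictly above the diagonal: position i may attend only to 0..i.
    return np.triu(np.full((N, N), -np.inf), k=1)

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                        # exp(-inf) = 0: future masked out
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```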
KV Cache
During autoregressive inference, each new token only needs its own query vector; the keys and values from all previous tokens are cached and reused. This reduces per-token computation from O(N²) to O(N) but requires O(N × L × d) memory (sequence length N, layer count L, model dimension d) that grows with sequence length. KV cache memory management is the primary bottleneck for long-context LLM serving.
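A toy decoding loop, again building on the earlier sketch, shows why the cache helps: each step projects only the new token's query and appends one key/value row, so attention for that token touches the N cached rows rather than recomputing all pairs. The cache structure here is an illustrative assumption, not a real serving implementation.

```python
def decode_step(x_t, Wq, Wk, Wv, cache):
    """x_t: (1, d) embedding of the newest token."""
    q = x_t @ Wq                                    # query for the new token only
    cache["K"] = np.vstack([cache["K"], x_t @ Wk])  # append this token's key
    cache["V"] = np.vstack([cache["V"], x_t @ Wv])  # append this token's value
    # Causality is implicit: the cache holds only past and current tokens.
    return scaled_dot_product_attention(q, cache["K"], cache["V"])

cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
for t in range(N):
    y_t = decode_step(x[t:t + 1], Wq, Wk, Wv, cache)  # (1, d) output per step
```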
Efficient Attention Variants
- Flash Attention: Fuses the attention computation into a single GPU kernel that operates on tiles of Q, K, V in SRAM, avoiding materialization of the N×N attention matrix in HBM. Reduces memory from O(N²) to O(N) and achieves roughly 2-4× wall-clock speedup. The default attention implementation in most modern frameworks.
- Multi-Query Attention (MQA): All heads share a single K and V projection, reducing KV cache size by h× with minor quality loss.
- Grouped-Query Attention (GQA): Groups of heads share K/V projections (e.g., 8 KV groups for 32 heads = 4× KV cache reduction). Used in LLaMA 2 70B, Mistral, and most production LLMs as the sweet spot between MHA and MQA; a sketch of the KV-head sharing follows below.
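The sketch below illustrates the KV-head sharing at the heart of GQA: a small set of cached K/V heads is repeated to serve all query heads. Shapes and names are assumptions for illustration.

```python
def expand_kv(kv, n_heads, n_kv_heads):
    """kv: (n_kv_heads, N, d_k) -> (n_heads, N, d_k) by repeating each group."""
    group = n_heads // n_kv_heads        # query heads served by each KV head
    return np.repeat(kv, group, axis=0)

n_heads, n_kv_heads, d_k = 32, 8, 16                # 32/8 = 4x smaller KV cache
K_cache = rng.normal(size=(n_kv_heads, N, d_k))     # only 8 K heads are stored
K_all = expand_kv(K_cache, n_heads, n_kv_heads)     # viewed as 32 heads at compute time
```

Setting n_kv_heads = 1 recovers MQA, and n_kv_heads = n_heads recovers standard multi-head attention, which is why GQA interpolates between the two.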
Attention Mechanisms are the core computation that makes transformers transformers: the dynamic, content-dependent information routing that replaced fixed convolution kernels and recurrent state updates with a universally flexible mechanism for relating any part of the input to any other.