Induction heads are two-layer attention-head circuits in transformer models that implement pattern matching: they search the prior context for an earlier occurrence of the current token and predict the token that followed it. They have been identified as the mechanistic foundation of in-context learning and represent one of the most significant discoveries in mechanistic interpretability research.
What Are Induction Heads?
- Definition: A circuit of two composed attention heads (a previous-token head in an earlier layer and an induction head in a later layer) that together implement the algorithm: "Search the current context for a previous occurrence of the current token, then predict the token that followed it."
- Pattern: Implements the rule [A][B]...[A] → predict [B]. If the model saw "Harry Potter" earlier, and now sees "Harry," it dramatically increases the probability of "Potter." (A naive version of this rule appears in the sketch after this list.)
- Discovery: First described by Elhage et al. (Anthropic, 2021) in "A Mathematical Framework for Transformer Circuits" and analyzed in depth by Olsson et al. (Anthropic, 2022) in "In-context Learning and Induction Heads," one of the first complete mechanistic accounts of a transformer capability.
- Universality: Induction heads form in virtually every transformer of at least two layers trained on next-token prediction, from small attention-only toy models to GPT-style production models. (One-layer attention-only models cannot form them, because the circuit requires composition of heads across layers.)
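To make the rule concrete, here is a minimal Python sketch of the induction rule as a literal lookup algorithm. The function name and token lists are illustrative inventions; real induction heads implement a soft, attention-weighted version of this search rather than an exact scan:

```python
def induction_predict(context, current):
    """Naive induction rule: find the most recent earlier occurrence of
    `current` in the context and return the token that followed it."""
    for i in range(len(context) - 2, -1, -1):  # scan right-to-left
        if context[i] == current:
            return context[i + 1]
    return None  # no earlier occurrence, so the rule has nothing to copy

# "Harry Potter ... Harry" -> the rule recovers "Potter"
context = ["Harry", "Potter", "went", "to", "the", "castle"]
print(induction_predict(context, "Harry"))  # -> Potter
```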
Why Induction Heads Matter
- In-Context Learning Mechanism: Olsson et al. argue that induction heads are the primary mechanism behind in-context learning (few-shot prompting), demonstrating that this capability has a specific, identifiable mechanical implementation rather than being mysterious emergent behavior.
- Phase Transition: Induction heads form during a sudden phase transition early in training: a narrow window in which in-context learning ability appears abruptly and the loss curve shows a visible bump. This phase transition is one of the clearest examples of capability emergence in neural network training.
- Universality: The fact that the same circuit forms independently in models of very different sizes and architectures demonstrates that transformers learn canonical algorithms, supporting the hope that interpretability findings generalize.
- Mechanistic Interpretability Proof of Concept: Induction heads demonstrated that it is possible to identify, understand, and formally describe a real computational mechanism inside a transformer, validating the mechanistic interpretability research program.
How Induction Heads Work: The Mechanism
The Two-Head Circuit:
Head 1: Previous-Token Head (layer L₁):
- Attends to the previous token in the sequence at each position.
- Copies information from position [t-1] to position [t].
- Creates a "shifted-by-one" key: K[t] contains information about token at position [t-1].
Head 2: Induction Head (layer L₂, where L₂ > L₁):
- Queries: "What token am I currently at?"
- Keys: Use output of Head 1 (shifted-by-one information).
- Match: Find positions j where K[j] matches Q[t], i.e., where the token that preceded position j matches the current token at position t.
- Value: Copy the token identity at position j, which is the token that followed the earlier occurrence of the current token.
- Result: Attend to position [j] where token[j-1] = token[t], and predict token[j] (demonstrated in the sketch below).
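The composition can be demonstrated end to end in a few lines of NumPy. This is a deliberately minimal sketch, not a real transformer: it assumes one-hot token embeddings, a perfect previous-token head, and hard argmax attention in place of softmax, but it shows how K-composition produces the induction pattern:

```python
import numpy as np

# Sequence "A B C A": at the final "A", the induction head should attend to
# the earlier "B" (position 1) and predict it.
tokens = [0, 1, 2, 0]            # toy vocabulary ids
T, V = len(tokens), 3
E = np.eye(V)[tokens]            # one-hot embeddings, shape (T, V)

# Head 1 (previous-token head): writes token[t-1]'s embedding into position t.
prev = np.zeros_like(E)
prev[1:] = E[:-1]                # the "shifted-by-one" signal

# Head 2 (induction head): queries with the current token; keys are built from
# Head 1's output (K-composition). scores[t, j] is 1 iff token[j-1] == token[t].
scores = E @ prev.T
causal = np.tril(np.ones((T, T)), k=-1)        # attend only to earlier positions
scores = np.where(causal > 0, scores, -np.inf)

t = T - 1                        # final position (the second "A")
j = int(np.argmax(scores[t]))    # hard attention for clarity
print(f"position {t} attends to position {j}; predicted token id: {tokens[j]}")
# -> position 3 attends to position 1; predicted token id: 1 (i.e., "B")
```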
In-Context Few-Shot Learning:
- When given examples (input₁, output₁), (input₂, output₂), ..., (input_test, ?):
- Induction heads match input_test to previous inputs in context and copy the associated outputs.
- This is mechanistically why few-shot prompting works: the model's attention circuitry pattern-matches to provided examples and copies their associated outputs (see the sketch below).
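Reusing the illustrative induction_predict sketch from earlier: when (input, output) pairs are flattened into one token sequence and the test input repeats an earlier input, the bare induction rule recovers the answer (single-token inputs and outputs assumed for simplicity):

```python
# Few-shot prompt flattened into tokens: input, output, input, output, ...
prompt = ["cat", "chat", "dog", "chien", "bird", "oiseau"]
# Test query: "dog" appears again at the end of the prompt.
print(induction_predict(prompt, "dog"))  # -> chien (the output paired with "dog")
```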
The Phase Transition
During transformer training, a clear phase transition occurs at a specific training step:
- Before: The model relies on simple local statistics (e.g., predicting the most common next tokens) and shows little in-context learning.
- During the phase transition: Induction heads form rapidly, within a narrow window of training.
- After: In-context learning improves dramatically; the model tracks patterns within its context window.
Evidence: Ablating the induction heads that form during the phase transition removes most of the in-context learning gains, confirming that these heads causally produce the capability. Olsson et al. quantify the capability with an in-context learning score, sketched below.
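One way this is quantified is the in-context learning score from Olsson et al. (2022): the loss on the 500th token of a context minus the loss on the 50th token, so a more negative value means the model benefits more from the preceding context. A sketch, with an illustrative synthetic loss curve:

```python
import numpy as np

def icl_score(per_token_loss, early=50, late=500):
    """In-context learning score (Olsson et al., 2022): loss at the 500th
    token minus loss at the 50th token of the context."""
    return per_token_loss[late] - per_token_loss[early]

# Synthetic per-token losses that decay with position, i.e., the model gets
# better at prediction the more context it has seen.
losses = 4.0 * np.exp(-np.arange(1000) / 300.0) + 2.0
print(icl_score(losses))  # negative: loss at token 500 is below loss at token 50
```

Tracked across training checkpoints, this score drops abruptly at the phase transition.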
Induction Head Variants
- Fuzzy Induction Heads: Match not on exact token identity but on semantic similarity, predicting tokens that follow semantically similar contexts (see the sketch after this list).
- Multi-step Induction: Generalized circuits that implement longer-range pattern completion.
- Translation Heads: In multilingual models, heads that map between languages using analogous induction-like pattern matching.
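As a minimal sketch of the fuzzy variant, replace the exact one-hot match from the earlier example with dense embeddings, so the match score becomes a similarity rather than an identity test (the embeddings below are random stand-ins, not learned representations):

```python
import numpy as np

def fuzzy_induction_scores(emb):
    """Fuzzy induction pattern: score each earlier position j by the similarity
    between the current embedding and the embedding of the token preceding j."""
    T, d = emb.shape
    prev = np.zeros_like(emb)
    prev[1:] = emb[:-1]                        # previous-token signal, as before
    scores = emb @ prev.T / np.sqrt(d)         # similarity instead of exact match
    causal = np.tril(np.ones((T, T)), k=-1)
    return np.where(causal > 0, scores, -np.inf)

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 16))                 # 6 tokens, 16-dim embeddings
emb[5] = emb[0] + 0.1 * rng.normal(size=16)    # final token is *similar* to token 0
scores = fuzzy_induction_scores(emb)
print(int(np.argmax(scores[5])))               # almost surely 1: the position after the near-match
```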
Implications for AI Safety
- Emergent Capability Mechanism: Phase transitions in AI capability may generally correspond to the formation of specific circuits: not mystical emergence but identifiable mechanical changes.
- In-Context Learning = Circuit: The fact that ICL is implemented by identifiable attention heads means we can potentially modify, amplify, or suppress in-context learning through targeted intervention.
- Research Template: The induction head discovery established the methodological template for identifying circuits: activation patching → attention pattern analysis → weight inspection → formal algorithm reconstruction. A toy patching loop is sketched below.
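The first step of that template can be sketched abstractly. The snippet below assumes a hypothetical model object with run_with_cache and run_with_hook methods (loosely in the spirit of libraries like TransformerLens, but these names and signatures are assumptions for illustration, not any specific library's API):

```python
def patch_head(model, clean_tokens, corrupt_tokens, layer, head):
    """Hypothetical activation-patching loop: run the model on a corrupted
    prompt, splice in one head's activation recorded from a clean prompt,
    and check how much of the clean behavior is restored."""
    # 1. Record the target head's output on the clean prompt (assumed interface).
    _, cache = model.run_with_cache(clean_tokens)
    clean_head_out = cache[(layer, head)]

    # 2. Re-run on the corrupted prompt, overwriting that head's output.
    def hook(head_out):
        return clean_head_out

    patched_logits = model.run_with_hook(corrupt_tokens, (layer, head), hook)

    # 3. If patching this one head restores the clean prediction, the head
    #    is causally implicated in the behavior under study.
    return patched_logits
```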
Induction heads are the Rosetta Stone of mechanistic interpretability: the first complete, formal account of a transformer capability. Their discovery validated the research program of treating neural networks as algorithms that can be reverse-engineered rather than as inscrutable black boxes, demonstrating that even seemingly mysterious capabilities like in-context learning have precise, understandable mechanical implementations.