Induction heads are two-layer attention-head circuits in transformer models that implement pattern matching: they search the prior context for an earlier occurrence of the current token and predict the token that followed it. They have been identified as the mechanistic foundation of in-context learning and represent one of the most significant discoveries in mechanistic interpretability research.
What Are Induction Heads?
- Definition: A circuit of two composed attention heads (a previous-token head in an earlier layer and an induction head in a later layer) that together implement the algorithm: "Search the current context for a previous occurrence of the current token, then predict the token that followed it."
- Pattern: Implements the rule [A][B]...[A] → predict [B]. If the model saw "Harry Potter" earlier, and now sees "Harry," it dramatically increases the probability of "Potter." (A naive version of this rule appears in the sketch after this list.)
- Discovery: First described by Elhage et al. (Anthropic, 2021) in "A Mathematical Framework for Transformer Circuits" and analyzed in depth by Olsson et al. (Anthropic, 2022) in "In-context Learning and Induction Heads," one of the first complete mechanistic accounts of a transformer capability.
- Universality: Induction heads form in virtually every transformer of at least two layers trained on next-token prediction, from small attention-only toy models to GPT-style production models. (One-layer attention-only models cannot form them, because the circuit requires composition of heads across layers.)
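To make the rule concrete, here is a minimal Python sketch of the induction rule as a literal lookup algorithm. The function name and token lists are illustrative inventions; real induction heads implement a soft, attention-weighted version of this search rather than an exact scan:

```python
def induction_predict(context, current):
    """Naive induction rule: find the most recent earlier occurrence of
    `current` in the context and return the token that followed it."""
    for i in range(len(context) - 2, -1, -1):  # scan right-to-left
        if context[i] == current:
            return context[i + 1]
    return None  # no earlier occurrence, so the rule has nothing to copy

# "Harry Potter ... Harry" -> the rule recovers "Potter"
context = ["Harry", "Potter", "went", "to", "the", "castle"]
print(induction_predict(context, "Harry"))  # -> Potter
```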
Why Induction Heads Matter
- In-Context Learning Mechanism: Olsson et al. argue that induction heads are the primary mechanism behind in-context learning (few-shot prompting), demonstrating that this capability has a specific, identifiable mechanical implementation rather than being mysterious emergent behavior.
- Phase Transition: Induction heads form during a sudden phase transition early in training: a narrow window in which in-context learning ability appears abruptly and the loss curve shows a visible bump. This phase transition is one of the clearest examples of capability emergence in neural network training.
- Universality: The fact that the same circuit forms independently in models of very different sizes and architectures demonstrates that transformers learn canonical algorithms, supporting the hope that interpretability findings generalize.
- Mechanistic Interpretability Proof of Concept: Induction heads demonstrated that it is possible to identify, understand, and formally describe a real computational mechanism inside a transformer, validating the mechanistic interpretability research program.
How Induction Heads Work: The Mechanism
The Two-Head Circuit:
Head 1: Previous-Token Head (layer L₁):
- Attends to the previous token in the sequence at each position.
- Copies information from position [t-1] to position [t].
- Creates a "shifted-by-one" key: K[t] contains information about token at position [t-1].
Head 2: Induction Head (layer L₂, where L₂ > L₁):
- Queries: "What token am I currently at?"
- Keys: Use output of Head 1 (shifted-by-one information).
- Match: Find positions j where K[j] matches Q[t], i.e., where the token that preceded position j matches the current token at position t.
- Value: Copy the token identity at position j, which is the token that followed the earlier occurrence of the current token.
- Result: Attend to position [j] where token[j-1] = token[t], and predict token[j] (demonstrated in the sketch below).
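The composition can be demonstrated end to end in a few lines of NumPy. This is a deliberately minimal sketch, not a real transformer: it assumes one-hot token embeddings, a perfect previous-token head, and hard argmax attention in place of softmax, but it shows how K-composition produces the induction pattern:

```python
import numpy as np

# Sequence "A B C A": at the final "A", the induction head should attend to
# the earlier "B" (position 1) and predict it.
tokens = [0, 1, 2, 0]            # toy vocabulary ids
T, V = len(tokens), 3
E = np.eye(V)[tokens]            # one-hot embeddings, shape (T, V)

# Head 1 (previous-token head): writes token[t-1]'s embedding into position t.
prev = np.zeros_like(E)
prev[1:] = E[:-1]                # the "shifted-by-one" signal

# Head 2 (induction head): queries with the current token; keys are built from
# Head 1's output (K-composition). scores[t, j] is 1 iff token[j-1] == token[t].
scores = E @ prev.T
causal = np.tril(np.ones((T, T)), k=-1)        # attend only to earlier positions
scores = np.where(causal > 0, scores, -np.inf)

t = T - 1                        # final position (the second "A")
j = int(np.argmax(scores[t]))    # hard attention for clarity
print(f"position {t} attends to position {j}; predicted token id: {tokens[j]}")
# -> position 3 attends to position 1; predicted token id: 1 (i.e., "B")
```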
In-Context Few-Shot Learning:
- When given examples (input₁, output₁), (input₂, output₂), ..., (input_test, ?):
- Induction heads match input_test to previous inputs in context and copy the associated outputs.
- This is mechanistically why few-shot prompting works: the model's attention circuitry pattern-matches to provided examples and copies their associated outputs (see the sketch below).
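Reusing the illustrative induction_predict sketch from earlier: when (input, output) pairs are flattened into one token sequence and the test input repeats an earlier input, the bare induction rule recovers the answer (single-token inputs and outputs assumed for simplicity):

```python
# Few-shot prompt flattened into tokens: input, output, input, output, ...
prompt = ["cat", "chat", "dog", "chien", "bird", "oiseau"]
# Test query: "dog" appears again at the end of the prompt.
print(induction_predict(prompt, "dog"))  # -> chien (the output paired with "dog")
```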
The Phase Transition
During transformer training, a clear phase transition occurs at a specific training step:
- Before: The model relies on simple local statistics (e.g., predicting the most common next tokens) and shows little in-context learning.
- During the phase transition: Induction heads form rapidly, within a narrow window of training.
- After: In-context learning improves dramatically; the model tracks patterns within its context window.
Evidence: Ablating the induction heads that form during the phase transition removes most of the in-context learning gains, confirming that these heads causally produce the capability. Olsson et al. quantify the capability with an in-context learning score, sketched below.
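One way this is quantified is the in-context learning score from Olsson et al. (2022): the loss on the 500th token of a context minus the loss on the 50th token, so a more negative value means the model benefits more from the preceding context. A sketch, with an illustrative synthetic loss curve:

```python
import numpy as np

def icl_score(per_token_loss, early=50, late=500):
    """In-context learning score (Olsson et al., 2022): loss at the 500th
    token minus loss at the 50th token of the context."""
    return per_token_loss[late] - per_token_loss[early]

# Synthetic per-token losses that decay with position, i.e., the model gets
# better at prediction the more context it has seen.
losses = 4.0 * np.exp(-np.arange(1000) / 300.0) + 2.0
print(icl_score(losses))  # negative: loss at token 500 is below loss at token 50
```

Tracked across training checkpoints, this score drops abruptly at the phase transition.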
Induction Head Variants
- Fuzzy Induction Heads: Match not on exact token identity but on semantic similarity, predicting tokens that follow semantically similar contexts (see the sketch after this list).
- Multi-step Induction: Generalized circuits that implement longer-range pattern completion.
- Translation Heads: In multilingual models, heads that map between languages using analogous induction-like pattern matching.
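As a minimal sketch of the fuzzy variant, replace the exact one-hot match from the earlier example with dense embeddings, so the match score becomes a similarity rather than an identity test (the embeddings below are random stand-ins, not learned representations):

```python
import numpy as np

def fuzzy_induction_scores(emb):
    """Fuzzy induction pattern: score each earlier position j by the similarity
    between the current embedding and the embedding of the token preceding j."""
    T, d = emb.shape
    prev = np.zeros_like(emb)
    prev[1:] = emb[:-1]                        # previous-token signal, as before
    scores = emb @ prev.T / np.sqrt(d)         # similarity instead of exact match
    causal = np.tril(np.ones((T, T)), k=-1)
    return np.where(causal > 0, scores, -np.inf)

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 16))                 # 6 tokens, 16-dim embeddings
emb[5] = emb[0] + 0.1 * rng.normal(size=16)    # final token is *similar* to token 0
scores = fuzzy_induction_scores(emb)
print(int(np.argmax(scores[5])))               # almost surely 1: the position after the near-match
```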
Implications for AI Safety
- Emergent Capability Mechanism: Phase transitions in AI capability may generally correspond to the formation of specific circuits: not mystical emergence but identifiable mechanical changes.
- In-Context Learning = Circuit: The fact that ICL is implemented by identifiable attention heads means we can potentially modify, amplify, or suppress in-context learning through targeted intervention.
- Research Template: The induction head discovery established the methodological template for identifying circuits: activation patching → attention pattern analysis → weight inspection → formal algorithm reconstruction. A toy patching loop is sketched below.
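The first step of that template can be sketched abstractly. The snippet below assumes a hypothetical model object with run_with_cache and run_with_hook methods (loosely in the spirit of libraries like TransformerLens, but these names and signatures are assumptions for illustration, not any specific library's API):

```python
def patch_head(model, clean_tokens, corrupt_tokens, layer, head):
    """Hypothetical activation-patching loop: run the model on a corrupted
    prompt, splice in one head's activation recorded from a clean prompt,
    and check how much of the clean behavior is restored."""
    # 1. Record the target head's output on the clean prompt (assumed interface).
    _, cache = model.run_with_cache(clean_tokens)
    clean_head_out = cache[(layer, head)]

    # 2. Re-run on the corrupted prompt, overwriting that head's output.
    def hook(head_out):
        return clean_head_out

    patched_logits = model.run_with_hook(corrupt_tokens, (layer, head), hook)

    # 3. If patching this one head restores the clean prediction, the head
    #    is causally implicated in the behavior under study.
    return patched_logits
```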
Induction heads are the Rosetta Stone of mechanistic interpretability: the first complete, formal account of a transformer capability. Their discovery validated the research program of treating neural networks as algorithms that can be reverse-engineered rather than as inscrutable black boxes, demonstrating that even seemingly mysterious capabilities like in-context learning have precise, understandable mechanical implementations.