Activation Patching (Causal Tracing) is the mechanistic interpretability technique that identifies which specific components of a neural network causally store particular knowledge — by systematically replacing (patching) activations from one model run into another and observing whether the target behavior is restored, enabling precise attribution of model behaviors to specific layers, attention heads, and neurons.
What Is Activation Patching?
- Definition: A causal intervention technique where activations computed during one forward pass (the "clean" run) are selectively injected into a different forward pass (the "corrupted" run) at specific components — measuring whether patching a component restores the corrupted model's correct behavior, identifying that component as causally responsible for the behavior.
- Also Called: Causal tracing, causal mediation analysis, interchange intervention.
- Publication: "Locating and Editing Factual Associations in GPT" (ROME paper) — Meng et al., MIT (2022). Demonstrated that factual knowledge is localized in specific MLP layers.
- Core Question: "Which neurons/attention heads/layers are causally necessary for this specific model behavior?"
Why Activation Patching Matters
- Causal vs. Correlational: Unlike probing (which finds where information is represented) or attention visualization (which shows where the model attends), activation patching reveals causal responsibility — which components actually produce the behavior.
- Knowledge Localization: Identify exactly which layers store specific factual associations — enabling targeted model editing without full retraining.
- Circuit Discovery: The core tool for identifying circuits — collections of components that jointly implement a specific algorithm.
- Debugging: Find exactly where incorrect reasoning or hallucinated facts originate in the computational graph.
- Model Editing: Knowledge of where facts are stored enables surgical editing of false beliefs (ROME, MEMIT model editing).
The Patching Procedure
Setup — Two Paired Prompts:
- Clean prompt: "The Eiffel Tower is located in [Paris]" — correct factual context.
- Corrupted prompt: "The Eiffel Tower is located in [Rome]" — incorrect context that leads to wrong output.
Step 1 — Clean Run:
- Forward pass on clean prompt; save all intermediate activations (every layer, every position, every component).
Step 2 — Corrupted Run:
- Forward pass on corrupted prompt; model outputs wrong token (e.g., "Rome").
Step 3 — Patching Sweep:
- For each component C (each layer × position × head combination):
- Replace activation at C during corrupted run with the saved activation from the clean run.
- Measure whether the model output shifts toward the correct token ("Paris").
- Record the "recovered probability" — how much of the correct behavior was restored.
Step 4 — Attribution Map:
- Components with high recovered probability are causally responsible for the target knowledge.
- Plot as a heatmap: layer × token position → recovered probability.
Key Discoveries from Activation Patching
Factual Knowledge in MLPs (ROME, 2022):
- Factual associations (Eiffel Tower → Paris) are stored in specific MLP layers in the middle of the network.
- Early layers process the subject ("Eiffel Tower"); middle MLP layers "look up" the fact; late layers output it.
- This enabled ROME (Rank-One Model Editing) — surgically overwrite a specific MLP's key-value memory to change a factual belief.
Subject Token Amplification:
- Attention heads in early layers attend to and amplify the subject token's representation.
- Middle-layer MLPs then query this amplified subject representation to retrieve stored knowledge.
Induction Head Circuits:
- Activation patching identified specific attention head pairs that implement in-context copying (induction heads).
- Patching individual heads revealed their specific causal roles.
Path Patching (Refined):
- Standard patching replaces full activations (including effects from previous components).
- Path patching isolates specific information pathways by holding other components constant.
- More precise attribution of information flow through specific network paths.
Activation Patching vs. Other Interpretability Methods
| Method | Type | What It Reveals | Limitation |
|---|---|---|---|
| Probing | Representational | What info is encoded | Not causal |
| Attention viz | Correlational | Where model attends | Not causal |
| Activation patching | Causal | Which components produce behavior | Expensive to run |
| Ablation | Causal | What model loses without component | Less precise |
| Gradient attribution | Approximate | Input importance | Not mechanistic |
Activation patching is the causal scalpel of mechanistic interpretability — by enabling precise, causal attribution of model behaviors to specific computational components rather than correlational patterns, patching transforms interpretability from observation into experimentation, enabling the kind of hypothesis testing that distinguishes genuine understanding from plausible storytelling.
Related Topics
Explore 500+ Semiconductor & AI Topics
From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.