Activation Patching (Causal Tracing)

Activation patching (causal tracing) is a mechanistic interpretability technique for identifying which components of a neural network are causally responsible for a particular behavior, such as recalling a stored fact. It works by systematically replacing (patching) activations from one model run into another and observing whether the target behavior is restored. This lets model behaviors be attributed to specific layers, attention heads, and neurons.

What Is Activation Patching?

Why Activation Patching Matters

The Patching Procedure

Setup — Two Paired Prompts:

Step 1 — Clean Run:

Step 2 — Corrupted Run:

Step 3 — Patching Sweep:

Step 4 — Attribution Map:
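The four steps above can be sketched on a toy network. This is a minimal illustration under stated assumptions, not the procedure of any particular paper or library: the two-layer tanh network, its weights, the input vectors standing in for the paired prompts, and the names `forward` and `patching_sweep` are all hypothetical. The score used here (the fraction of the clean-vs-corrupted output gap recovered by a patch) is one common choice of metric, assumed for the example.

```python
import math

# Hypothetical toy network: 2 inputs -> 3 tanh units -> 3 tanh units -> 1 output.
# The weights are arbitrary illustrative numbers, not taken from any real model.
W1 = [[1.0, -0.5], [0.3, 0.8], [-0.7, 0.2]]
W2 = [[0.6, -0.4, 0.1], [0.2, 0.9, -0.3], [-0.5, 0.3, 0.7]]
W_OUT = [1.2, -0.8, 0.5]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(x, patch=None):
    """Run the network, optionally overwriting activations.

    `patch` maps (layer, unit) -> value. Returns (output, cache), where
    cache[layer] is the list of activations after any patching.
    """
    patch = patch or {}
    cache = []
    h = x
    for layer, W in enumerate((W1, W2)):
        h = [math.tanh(v) for v in matvec(W, h)]
        # Overwrite selected activations before they feed the next layer.
        h = [patch.get((layer, i), v) for i, v in enumerate(h)]
        cache.append(h)
    out = sum(w * v for w, v in zip(W_OUT, h))
    return out, cache


def patching_sweep(clean_x, corrupted_x):
    """Return (clean_out, corrupted_out, effects) for single-unit patches."""
    clean_out, clean_cache = forward(clean_x)      # Step 1: clean run, cache activations
    corrupted_out, _ = forward(corrupted_x)        # Step 2: corrupted run
    gap = clean_out - corrupted_out
    if gap == 0:
        raise ValueError("clean and corrupted outputs are identical")
    effects = {}
    for layer, acts in enumerate(clean_cache):     # Step 3: patch one unit at a time
        for unit, clean_value in enumerate(acts):
            patched_out, _ = forward(corrupted_x, {(layer, unit): clean_value})
            # Fraction of the clean-vs-corrupted gap recovered by this patch.
            effects[(layer, unit)] = (patched_out - corrupted_out) / gap
    return clean_out, corrupted_out, effects       # Step 4: effects is the attribution map


if __name__ == "__main__":
    # Stand-ins for the paired prompts (assumed inputs for illustration).
    _, _, effects = patching_sweep([1.0, 0.5], [-1.0, 0.2])
    for (layer, unit), score in sorted(effects.items()):
        print(f"layer {layer} unit {unit}: restored fraction {score:+.3f}")
```

A score near 1 means that patching that single unit is enough to restore the clean output, and a score near 0 means the unit carries little of the difference between the two runs. In a real transformer the same loop would typically run over layers, token positions, and attention heads rather than individual units of a toy MLP.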

Key Discoveries from Activation Patching

Factual Knowledge in MLPs (ROME, 2022):

Subject Token Amplification:

Induction Head Circuits:

Path Patching (Refined):

Activation Patching vs. Other Interpretability Methods

Method               | Type             | What It Reveals                   | Limitation
Probing              | Representational | What info is encoded              | Not causal
Attention viz        | Correlational    | Where model attends               | Not causal
Activation patching  | Causal           | Which components produce behavior | Expensive to run
Ablation             | Causal           | What model loses without component | Less precise
Gradient attribution | Approximate      | Input importance                  | Not mechanistic
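One way to see the difference between the ablation and activation patching rows is a toy comparison. The sketch below is illustrative only: the linear readout, its weights, the activation values, and the function names are hypothetical, and zero-ablation is assumed as the ablation variant. Unit 1 has the same large activation in both runs, so zeroing it changes the output a lot, while patching it changes nothing because it does not distinguish the two prompts.

```python
# Hypothetical linear readout over three hidden activations.
W_OUT = [1.0, 2.0, -1.5]


def readout(h):
    """Weighted sum of hidden activations."""
    return sum(w * v for w, v in zip(W_OUT, h))


def zero_ablation_effects(clean_h):
    """Change in output when each unit of the clean run is set to zero."""
    base = readout(clean_h)
    effects = []
    for i in range(len(clean_h)):
        ablated = list(clean_h)
        ablated[i] = 0.0
        effects.append(base - readout(ablated))
    return effects


def patching_effects(clean_h, corrupted_h):
    """Change in output when each corrupted unit is replaced by its clean value."""
    base = readout(corrupted_h)
    effects = []
    for i in range(len(clean_h)):
        patched = list(corrupted_h)
        patched[i] = clean_h[i]
        effects.append(readout(patched) - base)
    return effects


# Assumed activations: only unit 0 differs between the two runs.
clean_h = [0.9, 1.0, 0.2]
corrupted_h = [0.1, 1.0, 0.2]

if __name__ == "__main__":
    print("zero-ablation:", zero_ablation_effects(clean_h))
    print("patching:     ", patching_effects(clean_h, corrupted_h))
```

In this toy case zero-ablation assigns its largest effect to unit 1, which matters for the output in general, whereas patching isolates unit 0, the only unit that carries the difference between the clean and corrupted runs.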

Activation patching is the causal scalpel of mechanistic interpretability. By attributing model behaviors to specific computational components causally, rather than through correlational patterns, it turns interpretability from observation into experimentation and supports the kind of hypothesis testing that distinguishes genuine understanding from plausible storytelling.

