Technique

Activation patching edits a model's internal activations to test the causal role of specific neurons, layers, or circuits.

Technique: Run the model on two inputs, a clean one and a corrupted one. At a chosen layer and token position, copy the activation from the clean run into the corrupted run, then measure how much the output changes.

Causal interpretation: If patching an activation restores the correct behavior, that is evidence the activation causally encodes the relevant information.

Path patching variant: Patch a specific edge between two components rather than a component's full activation, so that only the effect along that path is measured.

Use cases: Identify which layer encodes a specific feature, find the circuits responsible for a behavior, understand information flow, and validate mechanistic hypotheses.

Example: Patch the activations at the subject token positions to test whether the model uses name information from those positions when predicting the next token.

Tools: TransformerLens activation patching utilities, or custom PyTorch hooks.

Relationship to interventions: Generalizes ablation studies. Instead of only removing a component's contribution, it substitutes activations taken from another run, which allows continuous interventions.

Limitations: Computationally expensive, because many layer and position combinations must be patched; interpretation requires expertise; and it may miss representations that are distributed across many components.

Key research: Used extensively in Anthropic's circuit analysis and in the IOI (indirect object identification) paper. It is a central technique in mechanistic interpretability research.
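The clean/corrupted procedure described above can be sketched without any framework. The following is a minimal toy, not the TransformerLens API and not a real transformer: the two-layer "model", the function names (`layer0_embed`, `layer1_move`, `run_model`), and the token values are all hypothetical and chosen only to make the information flow easy to follow. It caches activations from a clean run, patches each (layer, position) into a corrupted run one at a time, and reports the fraction of the clean output that is restored.

```python
from typing import Dict, List, Optional, Tuple

Acts = List[List[float]]              # one feature vector per token position
Patch = Tuple[int, int, List[float]]  # (layer, position, replacement vector)


def layer0_embed(tokens: List[int]) -> Acts:
    """Hypothetical layer 0: each position stores [token id, empty slot]."""
    return [[float(t), 0.0] for t in tokens]


def layer1_move(acts: Acts) -> Acts:
    """Hypothetical layer 1: copy the first position's token feature
    into the empty slot of the last position."""
    out = [list(v) for v in acts]
    out[-1][1] = acts[0][0]
    return out


def run_model(tokens: List[int],
              patch: Optional[Patch] = None) -> Tuple[float, List[Acts]]:
    """Run the toy model, optionally overwriting one (layer, position) activation.

    Returns the scalar output and a cache of the activations after each layer.
    """
    cache: List[Acts] = []
    acts = layer0_embed(tokens)
    for layer in range(2):
        if layer == 1:
            acts = layer1_move(acts)
        if patch is not None and patch[0] == layer:
            acts[patch[1]] = list(patch[2])    # the patch itself
        cache.append([list(v) for v in acts])  # copy so later edits don't leak in
    return acts[-1][1], cache                  # readout: slot at the last position


# Clean and corrupted inputs differ only at the first ("subject") position.
clean_tokens = [7, 3, 5]
corrupted_tokens = [2, 3, 5]

clean_out, clean_cache = run_model(clean_tokens)
corrupted_out, _ = run_model(corrupted_tokens)

# Patch each clean activation into the corrupted run, one at a time.
results: Dict[Tuple[int, int], float] = {}
for layer in range(2):
    for pos in range(len(clean_tokens)):
        patched_out, _ = run_model(
            corrupted_tokens, (layer, pos, clean_cache[layer][pos])
        )
        # 1.0 = clean behavior fully restored, 0.0 = no effect.
        results[(layer, pos)] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

for key, value in sorted(results.items()):
    print(key, value)
```

In this toy, only the patches at layer 0, position 0 and at layer 1, position 2 restore the clean output: the subject information sits at the first position before the hypothetical "move" layer and at the last position after it. With a real model the same loop would typically be written with forward hooks, and the metric would be a logit difference or similar rather than a raw output value.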
