Steering Vectors (Activation Engineering)

Keywords: steering vector, activation, control

Steering Vectors (Activation Engineering) are an interpretability and control technique that modifies model behavior at inference time by adding direction vectors to a model's internal activations. By writing directly to the model's representations during forward passes, researchers can amplify, suppress, or redirect specific behaviors and internal states without retraining.

What Are Steering Vectors?

- Definition: Fixed direction vectors in a model's activation space that correspond to specific concepts, behaviors, or emotional states — added to or subtracted from layer activations during inference to push the model toward or away from that concept.
- Also Called: Activation engineering, activation addition, representation engineering, inference-time intervention.
- Mechanism: If concept X is linearly represented as direction v_X in activation space, then adding α × v_X to layer L activations makes the model "think" the concept X is more present, shifting its behavior accordingly.
- Key Papers: "Representation Engineering" (Zou et al., 2023), "Activation Addition" (Turner et al., 2023), Anthropic's steering vector experiments.
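The core mechanism above is just vector addition in activation space. A minimal numpy sketch, where the concept direction, hidden size, and coefficient are all made-up stand-ins for quantities you would extract from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16                      # hypothetical hidden size

v_x = rng.normal(size=d_model)    # stand-in for the "concept X" direction
v_x /= np.linalg.norm(v_x)        # unit-normalize the direction

h = rng.normal(size=d_model)      # an activation vector at some layer L
alpha = 8.0                       # steering coefficient

h_steered = h + alpha * v_x       # push the activation toward concept X

# The steered activation's component along v_x grows by exactly alpha,
# because v_x is unit-norm.
shift = (h_steered @ v_x) - (h @ v_x)
```

In a real model, `h` would be the residual-stream activation at layer L for a given token position, and the shift along `v_x` is what downstream layers "read" as increased presence of the concept.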

Why Steering Vectors Matter

- Inference-Time Control: Modify behavior without retraining or fine-tuning — change a deployed model's tendencies in real-time with a simple vector addition.
- Mechanistic Insight: If adding a vector produces the expected behavioral change, it validates that the concept is linearly represented in that direction — a strong interpretability finding.
- Safety Research: Test whether steering toward "deceptive" or "corrigible" directions produces corresponding behavioral changes — understanding how safety-relevant mental states are encoded.
- Alignment Tool: Potentially reduce harmful behaviors or amplify helpful ones by steering appropriate feature directions during inference.
- Cheap Experimentation: Test hypotheses about what concepts are encoded without expensive fine-tuning runs.

Finding Steering Vectors

Method 1 — Contrastive Activation Difference:
- Generate pairs of prompts that differ only in the target concept: ("I love bananas" / "I hate bananas").
- Extract activations for both sets; compute the mean difference vector.
- The difference vector approximates the "concept direction" in activation space.
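The contrastive recipe can be sketched in a few lines. Here the activations are synthetic (random vectors with a planted concept direction), standing in for hooked layer-L activations over real prompt pairs; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pairs = 16, 32         # hypothetical sizes

# Stand-ins for layer-L activations of positive / negative prompts
# (e.g. "I love bananas" vs. "I hate bananas"). In practice these
# come from forward passes over real contrastive pairs.
concept_dir = rng.normal(size=d_model)
pos_acts = rng.normal(size=(n_pairs, d_model)) + concept_dir
neg_acts = rng.normal(size=(n_pairs, d_model)) - concept_dir

# Mean difference of the paired activations approximates the
# concept direction in activation space.
steering_vec = (pos_acts - neg_acts).mean(axis=0)
steering_vec /= np.linalg.norm(steering_vec)

# Sanity check: cosine similarity with the planted direction.
cos = steering_vec @ concept_dir / np.linalg.norm(concept_dir)
```

Averaging over many pairs cancels prompt-specific noise, leaving the shared direction that distinguishes the two sets.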

Method 2 — Linear Probe Direction:
- Train a linear probe to predict the concept from activations.
- The probe's weight vector (normal to the decision boundary) is the steering vector.
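A minimal sketch of the probe approach, using a closed-form ridge-regression probe as a simple stand-in for the logistic-regression probes typically used in practice; the activations and planted direction are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n = 16, 200              # hypothetical sizes

# Synthetic activations: concept present (+1) vs. absent (-1),
# separated along a planted direction (a stand-in for real data).
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
labels = rng.choice([-1.0, 1.0], size=n)
acts = rng.normal(size=(n, d_model)) + 2.0 * np.outer(labels, true_dir)

# Ridge-regression probe: its weight vector is normal to the
# decision boundary and serves as the steering vector.
lam = 1e-2
w = np.linalg.solve(acts.T @ acts + lam * np.eye(d_model), acts.T @ labels)
steering_vec = w / np.linalg.norm(w)

cos = abs(steering_vec @ true_dir)
```

The probe's weight vector points in the direction along which the concept label is most linearly decodable, which is exactly the direction worth steering along.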

Method 3 — SAE Feature Directions:
- Identify the SAE feature corresponding to the target concept.
- Use the SAE decoder column for that feature as the steering vector.
- More precise than contrastive methods — SAE features are already decomposed from superposition.
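Extracting the vector from a trained SAE is a one-line indexing operation. A sketch with a random stand-in decoder matrix; the feature index and matrix orientation are assumptions, since SAE codebases differ on whether the decoder is stored as `(n_features, d_model)` or its transpose:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_features = 16, 64      # hypothetical SAE sizes

# Stand-in for a trained SAE's decoder matrix, here with one row
# per feature; each row is that feature's contribution direction
# in the residual stream. Check your SAE's storage convention.
W_dec = rng.normal(size=(n_features, d_model))

feature_idx = 7                   # hypothetical index of the target feature
steering_vec = W_dec[feature_idx].copy()
steering_vec /= np.linalg.norm(steering_vec)
```

Because the SAE was trained to decompose superposed activations, this direction is (ideally) already disentangled from co-occurring features, unlike a raw contrastive difference.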

Applying Steering Vectors

Addition: h_new = h_old + α × v_concept
- Positive α: amplify the concept.
- Negative α: suppress the concept.
- α (coefficient): typically 5–20 for noticeable effects; too large causes incoherent outputs.
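In PyTorch, the addition is typically injected with a forward hook on the chosen layer. A minimal sketch using a toy two-layer stack in place of a transformer; the layer choice, direction, and coefficient are all illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# A toy two-layer stack standing in for a transformer's layers.
model = nn.Sequential(nn.Linear(d_model, d_model),
                      nn.Linear(d_model, d_model))

v_concept = torch.randn(d_model)
v_concept /= v_concept.norm()
alpha = 10.0

# Forward hook implementing h_new = h_old + alpha * v_concept.
# Returning a value from the hook replaces the layer's output.
def steer(module, inputs, output):
    return output + alpha * v_concept

handle = model[0].register_forward_hook(steer)
x = torch.randn(1, d_model)
steered_out = model(x)

handle.remove()                   # steering only applies while hooked
baseline_out = model(x)
```

Removing the hook restores baseline behavior, which is what makes steering a per-inference intervention rather than a permanent change to the weights.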

Layer Selection:
- Middle layers (roughly 30–60% of the way through the network) generally give the strongest behavioral effects.
- Early layers: affect token-level processing; late layers: affect final token prediction distributions.

Demonstrated Results

- Banana Thought: Adding a "banana" steering vector to GPT-2 causes it to insert banana-related content into unrelated responses.
- Aggression: Steering toward "anger" concepts causes models to produce more aggressive text.
- Corrigibility: Anthropic experiments showed steering toward "Assistant" token directions affects compliance behaviors.
- Emotional States: Steered models produce self-reports of corresponding states (happiness, fear) when pushed along those activation directions — though such reports are behavioral outputs, not verified internal experiences.
- Sycophancy Reduction: Steering away from "agree with user" directions reduces sycophantic behavior.

Limitations and Challenges

- Superposition Interference: Steering vectors may activate multiple superposed features simultaneously — intended effect plus unintended side effects.
- Layer Sensitivity: The optimal layer for steering varies by concept and model — requires empirical search.
- Semantic Drift: Strong steering can produce incoherent text as the forced concept conflicts with coherent generation.
- Not Permanent: Steering vectors only affect inference sessions where they are actively applied — not a training-time fix.

Steering Vectors vs. Other Control Methods

| Method | Cost | Permanence | Precision | Safety Risk |
|--------|------|-----------|-----------|-------------|
| System prompt | Very low | Per-session | Low | Low |
| Fine-tuning | High | Permanent | Medium | Medium |
| RLHF | Very high | Permanent | High | Medium |
| Steering vectors | Very low | Per-inference | Medium | Low-Medium |
| SAE feature ablation | Low | Per-inference | High | Low |

Steering vectors are the first hint of a cognitive remote control for AI systems — by demonstrating that concepts, emotions, and behavioral tendencies can be reliably amplified or suppressed through activation manipulation, steering vector research is building the foundation for interpretability-based alignment tools that may one day enable precise, verifiable control over AI behavior without the opacity of behavioral fine-tuning.
