Linear probing is a diagnostic interpretability technique that trains a simple linear classifier on the frozen internal activations of a neural network to test whether a specific concept is linearly represented in a given layer. It reveals where and how information is encoded inside deep models without retraining or modifying the model, although it does require access to the model's internal activations and a labeled probing dataset.
What Is Linear Probing?
- Definition: Freeze a pre-trained neural network, extract activations from a specific internal layer for a dataset of examples, then train a simple linear classifier (logistic regression) on those activations to predict a target label — measuring whether the concept is "linearly separable" in that representation space.
- Hypothesis: If a neural network has learned to represent concept X in layer L, then the activation vectors at layer L should form linearly separable clusters corresponding to X — even though the network was never explicitly trained to predict X.
- Output: Classification accuracy of the linear probe on held-out data. Accuracy well above a baseline indicates the concept is linearly decodable from that layer; accuracy near chance indicates the probe could not find it, which rules out a linear encoding but not a non-linear one.
- Application: Understanding what information different layers encode, tracking how representations evolve across layers, and comparing what different architectures learn.
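The definition above can be written compactly for a binary concept. The notation is introduced here for illustration and is not from the original text: $h_L(x)$ is the frozen activation vector of input $x$ at layer $L$, $(w, b)$ are the probe parameters, $\sigma$ is the logistic function, $y_i \in \{0, 1\}$ are the concept labels, and $\lambda$ is an assumed L2 regularization strength.

```latex
\hat{y}(x) = \sigma\!\left(w^\top h_L(x) + b\right),
\qquad
\min_{w,\,b}\; -\frac{1}{N}\sum_{i=1}^{N}
\Big[\, y_i \log \hat{y}(x_i) + (1 - y_i)\log\big(1 - \hat{y}(x_i)\big) \Big]
+ \lambda \lVert w \rVert_2^2
```

Only $w$ and $b$ are optimized; the network that produces $h_L(x)$ stays fixed, which is what makes the probe a measurement of the representation and not a fine-tuning step.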
Why Linear Probing Matters
- Mechanistic Insight: Reveals the representational content of different network layers — "Layer 6 encodes syntactic information; Layer 12 encodes semantic content."
- Architecture Comparison: Compare what different pre-training objectives, datasets, or architectures learn to represent — does BERT layer 9 encode syntactic dependencies better than RoBERTa?
- Transfer Learning: Identify which layers contain representations most useful for downstream tasks — guides which layers to fine-tune vs. freeze for efficient transfer.
- Safety Applications: Probe for deceptive intent, harmful knowledge, or alignment-relevant representations — "Does layer 24 encode whether the model is being monitored?"
- Scientific Validation: Test whether models learn human-interpretable concepts (sentiment, syntax, entity type) rather than arbitrary statistical patterns.
The Probing Procedure
Step 1 — Dataset Preparation:
- Collect a dataset of examples with labels for the concept to probe (e.g., 1,000 sentences with positive/negative sentiment labels).
Step 2 — Activation Extraction:
- Run each example through the frozen target network.
- Save the activation vector at the layer(s) of interest.
- Typical: extract [CLS] token representation for BERT, or mean-pool all token representations.
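The two pooling choices above can be sketched with NumPy. This is a minimal illustration, assuming the layer's activations have already been extracted into an array of shape `(batch, seq_len, hidden_dim)` with a matching attention mask; the function names `mean_pool` and `first_token` are hypothetical, and the numbers are toy values, not real model activations.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per example, ignoring padded positions.

    hidden_states: (batch, seq_len, hidden_dim) activations from one layer.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against all-padding rows
    return summed / counts

def first_token(hidden_states: np.ndarray) -> np.ndarray:
    """Take the vector at position 0 (where BERT-style models place [CLS])."""
    return hidden_states[:, 0, :]

# Toy activations: 2 examples, 3 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(hidden, mask)   # [[2., 3.], [4., 2.]]
cls_vectors = first_token(hidden)  # [[1., 2.], [2., 0.]]
```

Masking matters for mean pooling: without it, the padded position in the first example would dominate the average.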
Step 3 — Probe Training:
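- Train logistic regression to predict the concept label from the activation vectors. A small MLP is sometimes used for harder concepts, but it is then no longer a linear probe and its accuracy says less about linear separability.
- Use a held-out split (e.g., 80/20 train/test) and apply L2 regularization so the probe does not overfit the training activations.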
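A minimal sketch of the probe-training step follows. It assumes the activations are already available as a matrix `X` with binary labels `y`; here both are synthetic stand-ins in which the concept is planted along one random direction. The helper names, the hyperparameters (`l2`, `lr`, `epochs`), and the plain gradient-descent optimizer are illustrative choices, not a prescribed recipe; in practice an off-the-shelf logistic regression implementation would usually be used.

```python
import numpy as np

def train_linear_probe(X, y, l2=1e-2, lr=0.1, epochs=500):
    """Fit logistic regression by full-batch gradient descent with an L2 penalty."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        error = probs - y                      # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ error / n + l2 * w)   # L2 term shrinks the weights
        b -= lr * error.mean()
    return w, b

def predict(X, w, b):
    """Predict class 1 when the logit is positive."""
    return (X @ w + b > 0).astype(int)

# Synthetic stand-in for extracted activations (assumption: 1,000 examples, 32 dims).
rng = np.random.default_rng(0)
n, d = 1000, 32
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
signs = 2 * y - 1
X = rng.normal(size=(n, d)) + 3.0 * signs[:, None] * direction[None, :]

# 80/20 train/test split on shuffled indices.
perm = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = perm[:split], perm[split:]

w, b = train_linear_probe(X[train_idx], y[train_idx])
test_accuracy = float((predict(X[test_idx], w, b) == y[test_idx]).mean())
```

Because the synthetic concept is planted along a single direction, the probe should recover it with high held-out accuracy; real activations will typically be noisier.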
Step 4 — Evaluation:
- Report probe accuracy on held-out test set.
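- Rough rule of thumb for a balanced binary task (chance = 50%): above 80% suggests the concept is clearly linearly decodable; 50–80% suggests partial encoding; near 50% suggests it is not linearly decodable. For imbalanced or multi-class tasks, compare against the majority-class baseline instead of these fixed thresholds.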
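Raw accuracy is easier to interpret next to a trivial baseline. The sketch below, with a hypothetical helper `probe_report`, reports held-out accuracy alongside the majority-class baseline; the labels and predictions are made up to show how an imbalanced dataset can make a useless probe look strong.

```python
import numpy as np

def probe_report(y_true, y_pred):
    """Return probe accuracy, the majority-class baseline, and their difference."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float((y_true == y_pred).mean())
    _, counts = np.unique(y_true, return_counts=True)
    majority_baseline = float(counts.max() / counts.sum())
    return {
        "accuracy": accuracy,
        "majority_baseline": majority_baseline,
        "gain_over_baseline": accuracy - majority_baseline,
    }

# Imbalanced toy test set: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)

# A "probe" that always predicts the majority class.
always_zero = np.zeros(100, dtype=int)
report = probe_report(y_true, always_zero)
# accuracy is 0.9, but the gain over the majority baseline is 0.0
```

Here 90% accuracy carries no evidence that the concept is encoded, since predicting the majority class achieves the same score.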
What Probes Have Discovered
- BERT Syntax: Lower layers (1–6) encode local syntactic structure (POS tags, dependency relations); upper layers encode semantic content.
- Part-of-Speech: Easily linearly separable in early transformer layers.
- Coreference: Encoded in middle layers — the model tracks which pronouns refer to which entities.
- Negation: Surprisingly hard to probe — models may not represent negation as a clean linear direction.
- World Knowledge: Entity properties (e.g., which country a president leads, a country's capital city) are strongly encoded in middle-to-late layers of large LLMs.
Probing vs. Mechanistic Interpretability
| Aspect | Linear Probing | Mechanistic Interpretability |
|---|---|---|
| What it shows | Whether info is present | How the computation works |
| Depth | Surface representation | Algorithmic mechanism |
| Technique | Train classifier on activations | Circuit analysis, activation patching |
| Faithfulness | Representational | Causal / mechanistic |
| Computational cost | Low | High |
| Insight quality | Correlational | Causal |
Probing Pitfalls
- Probing Accuracy ≠ Model Usage: High probe accuracy means information is linearly accessible in activations — not that the model actually uses it for its predictions. The model may encode a concept but route it through different computations.
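- Probe Capacity: An overly expressive probe (e.g., a large MLP) can learn the task itself, or extract information the model encodes only non-linearly, inflating the apparent degree of concept encoding. Comparing against a control task with random labels helps separate what the representation contributes from what the probe learns on its own.
- Confounds: Probing for sentiment may actually probe for topic if sentiment and topic are correlated in the dataset, so the dataset must be constructed to break such correlations.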
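One simple sanity check related to these pitfalls is to train the same probe on shuffled labels and compare. The sketch below is an illustration under stated assumptions: the activations are synthetic, the probe is a least-squares linear classifier with an L2 penalty (swapped in for logistic regression only because it has a closed-form solution), and `ridge_probe_accuracy` is a hypothetical helper name.

```python
import numpy as np

def ridge_probe_accuracy(X_train, y_train, X_test, y_test, l2=1.0):
    """Fit a least-squares linear classifier with L2 penalty; return test accuracy."""
    Xb_train = np.hstack([X_train, np.ones((len(X_train), 1))])  # add bias column
    Xb_test = np.hstack([X_test, np.ones((len(X_test), 1))])
    targets = 2.0 * y_train - 1.0                                # map {0,1} to {-1,+1}
    dim = Xb_train.shape[1]
    w = np.linalg.solve(Xb_train.T @ Xb_train + l2 * np.eye(dim),
                        Xb_train.T @ targets)
    preds = (Xb_test @ w > 0).astype(int)
    return float((preds == y_test).mean())

# Synthetic activations with the concept planted along one direction.
rng = np.random.default_rng(0)
n, d = 2000, 16
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + 2.0 * (2 * y - 1)[:, None] * direction[None, :]

# Control labels: same class balance, but unrelated to the activations.
control_y = rng.permutation(y)

split = int(0.8 * n)
real_acc = ridge_probe_accuracy(X[:split], y[:split], X[split:], y[split:])
control_acc = ridge_probe_accuracy(X[:split], control_y[:split],
                                   X[split:], control_y[split:])
selectivity = real_acc - control_acc  # large gap: accuracy comes from the representation
```

A large gap between the two accuracies suggests the probe is reading structure from the activations. This check does not address the first pitfall: it says nothing about whether the model itself uses the concept.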
Linear probing is the X-ray of neural network representations — by projecting internal activations onto human-interpretable concepts, probing reveals the hidden geometry of learned representations and enables systematic comparison of what different architectures and training regimes choose to encode in their internal states.