Superposition is the phenomenon where neural networks represent more features (concepts) than they have dimensions by encoding them as overlapping, nearly-orthogonal directions in activation space. This explains why individual neurons are polysemantic (responding to multiple unrelated concepts) and why direct neuron-level interpretability is so difficult in large models.
What Is Superposition?
- Definition: The strategy neural networks use to store N features in a d-dimensional space where N >> d, by placing feature vectors at nearly-orthogonal angles in high-dimensional space so that they minimally interfere with each other during computation.
- Polysemanticity: The observable consequence of superposition; individual neurons activate for multiple unrelated concepts because multiple features share the same neuron as part of their overlapping representations.
- Key Paper: "Toy Models of Superposition" (Elhage et al., Anthropic, 2022), a formal mathematical analysis of when and why superposition occurs.
- Example: A hypothetical neuron #4,721 in GPT-2 might activate for bananas, the Eiffel Tower, and references to the number 17: seemingly unrelated concepts, but each concept's feature vector happens to have a positive component along that neuron's direction (the numpy sketch below makes this concrete).
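To see how overlapping directions produce polysemantic neurons, here is a minimal numpy sketch; the dimensionality, feature names, and neuron index are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                  # activation-space dimensionality (illustrative)
banana = rng.standard_normal(d)          # invented "banana" feature direction
eiffel = rng.standard_normal(d)          # invented "Eiffel Tower" feature direction
banana /= np.linalg.norm(banana)
eiffel /= np.linalg.norm(eiffel)

# Flip signs so both features have a positive component on neuron 42's axis.
banana *= np.sign(banana[42])
eiffel *= np.sign(eiffel[42])

# The two feature directions barely overlap (cosine similarity ~ 1/sqrt(d))...
print("feature overlap:", banana @ eiffel)

# ...yet neuron 42 (a standard-basis direction) reads positive for both concepts.
print("neuron 42 reads banana:", banana[42])
print("neuron 42 reads eiffel:", eiffel[42])
```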
Why Superposition Matters
- Interpretability Challenge: If neurons are polysemantic, we cannot simply label each neuron with a single concept and call the network understood; the basic unit of neural network analysis becomes uninterpretable.
- Explains Mysterious Scaling: As models get larger, they don't just represent more features; denser superposition lets the number of represented features grow far faster than the number of neurons, partly explaining why scale produces unexpected capabilities.
- SAE Motivation: Superposition is precisely the problem sparse autoencoders address: by projecting activations into a higher-dimensional space under a sparsity constraint, SAEs disentangle the overlapping feature representations.
- Feature Competition: During training, features compete for dimensional 'slots'; less important features are pushed into more oblique directions, increasing interference. This is why some concepts are harder for models to represent cleanly.
- Safety Implications: If dangerous capabilities are encoded in superposition with innocuous ones, interventions that target the dangerous feature may inadvertently degrade unrelated behaviors, and interventions on benign features may partially disturb safety-relevant ones.
The Mathematics of Superposition
In a d-dimensional space with N features (N >> d):
- Perfect orthogonality: Can store at most d features with zero interference.
- Near-orthogonality: Can store N >> d features with small interference ε between feature pairs.
- In high dimensions, the number of directions with pairwise interference below a tolerance ε grows exponentially with d (a consequence of the Johnson-Lindenstrauss lemma), so even random near-orthogonal vectors let a d = 1,000-dimensional space hold far more than 1,000 features with manageable interference, as the experiment below demonstrates.
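A quick numpy experiment makes the claim concrete (sizes chosen for illustration): pack three times more random unit vectors than dimensions and measure the worst pairwise interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1_000, 3_000                     # three times more features than dimensions
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # n random unit vectors in R^d

G = np.abs(V @ V.T)                     # |cosine similarity| for every feature pair
np.fill_diagonal(G, 0.0)
print("mean interference:", G.mean())   # ~ sqrt(2 / (pi * d)) ≈ 0.025
print("max interference: ", G.max())    # stays well below 1 despite n >> d
```

Pushing n higher raises the worst-case overlap only slowly (roughly like sqrt(log n / d)), which is the exponential-capacity statement read in reverse.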
When Does Superposition Occur?
Neural networks "choose" superposition based on a cost-benefit trade-off:
- Benefit: Store more features → better predictions on diverse inputs.
- Cost: Interference between features → errors when features co-activate.
Superposition is preferred when:
- Features are sparse (rarely active): interference cost is low if features rarely co-activate (see the arithmetic sketch after this list).
- Feature importance varies: the highest-value features get dedicated dimensions, while lower-importance features are relegated to shared, interfering directions.
- Capacity is constrained: smaller networks must superpose more aggressively.
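The sparsity condition is simple probability arithmetic. A sketch, assuming independent features that each activate with probability p (values invented):

```python
# Interference only costs anything when two superposed features co-activate.
# If each feature is active independently with probability p, a given pair
# co-activates with probability p**2, so the expected interference cost
# shrinks quadratically while the benefit of representing a feature at all
# shrinks only linearly.
for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p = {p:>6}: benefit ~ p = {p:>8}, interference ~ p^2 = {p**2:g}")
```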
Toy Model Demonstration
Anthropic trained a simple model (5 inputs → 2D hidden layer → 5 outputs) and found the following (a training sketch follows this list):
- With few important features: each gets a dedicated dimension (no superposition).
- As features multiply: the model packs them into a pentagonal arrangement in 2D, fitting 5 features into 2 dimensions with directions spaced 72° apart, accepting modest interference in exchange for representing all 5.
- With many sparse features: dense superposition with many overlapping directions.
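A minimal training sketch in PyTorch, loosely following the ReLU-output toy model of Elhage et al. (2022); the activation probability, learning rate, and step count are illustrative, and the uniform-importance case is assumed:

```python
import torch

# Toy model: reconstruct 5 sparse features through a 2-dimensional bottleneck,
# x -> ReLU(W^T W x + b), in the spirit of "Toy Models of Superposition".
n_features, d_hidden = 5, 2
W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(20_000):
    # Sparse synthetic inputs: each feature active with probability 0.05.
    batch = torch.rand(1024, n_features)
    x = batch * (torch.rand(1024, n_features) < 0.05)
    recon = torch.relu(x @ W.T @ W + b)
    loss = ((recon - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# With enough sparsity the 5 columns of W typically settle into a pentagon:
# unit-ish 2D directions whose pairwise angles are 72° or 144°.
dirs = W.detach() / W.detach().norm(dim=0, keepdim=True)
print(dirs.T @ dirs)   # Gram matrix; off-diagonals near cos 72° ≈ 0.31 or cos 144° ≈ -0.81
```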
Polysemanticity in Practice
- Curve Detectors: Early-layer neurons in vision CNNs are largely monosemantic; each responds to curves at a specific orientation.
- Middle-Layer Neurons in LLMs: Highly polysemantic; a single neuron may respond to DNA sequences, legal language, and European cities.
- Residual Stream Superposition: The transformer residual stream is the most heavily superposed representation; different layers write different features to the same high-dimensional space (a toy read/write illustration follows this list).
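A toy numpy illustration of this read/write picture (the width and feature names are invented): two "layers" add their features into one shared residual vector, and a downstream reader recovers each with a dot product, up to small interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768                                    # residual stream width (illustrative)
w_syntax = rng.standard_normal(d)
w_entity = rng.standard_normal(d)
w_syntax /= np.linalg.norm(w_syntax)
w_entity /= np.linalg.norm(w_entity)       # two nearly-orthogonal write directions

residual = np.zeros(d)
residual += 2.0 * w_syntax                 # an early layer writes a "syntax" feature
residual += 1.5 * w_entity                 # a later layer writes an "entity" feature

# A downstream component projects onto each direction; the off-target term is
# small because random directions in high dimensions are nearly orthogonal.
print("syntax readout:", residual @ w_syntax)   # ≈ 2.0 plus small interference
print("entity readout:", residual @ w_entity)   # ≈ 1.5 plus small interference
```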
Superposition vs. Monosemanticity
| Representation | Concepts per unit | Interpretability | Information density |
|---------------|--------------------|-----------------|--------------------|
| Monosemantic | 1 | High | Low |
| Polysemantic (superposition) | Many | Low | High |
| SAE features | ~1 (decomposed) | High | Moderate |
Implications for Alignment and Safety
- Hidden Features: Important alignment-relevant features (deceptive intent, harmful knowledge) may be encoded in superposition with benign features, making them hard to find and hard to remove.
- Steering Difficulty: Adding a steering vector for one feature may unintentionally activate other features sharing those neural directions.
- SAE as Solution: Sparse autoencoders decompose superposed representations into interpretable, monosemantic features; they are the current best tool for working with superposition in production models (a minimal sketch follows).
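A minimal SAE sketch in PyTorch, assuming the common ReLU-plus-L1 recipe; the expansion factor, penalty coefficient, and random stand-in activations are illustrative, not any production configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dim activations into n_feat >> d_model sparse features."""
    def __init__(self, d_model: int, n_feat: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_feat)    # project up to feature space
        self.dec = nn.Linear(n_feat, d_model)    # project back down

    def forward(self, x: torch.Tensor):
        feats = torch.relu(self.enc(x))          # sparse, non-negative feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder(d_model=768, n_feat=768 * 16)   # 16x expansion (illustrative)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                  # sparsity pressure (illustrative)

acts = torch.randn(4096, 768)                    # stand-in for captured model activations
for _ in range(100):
    recon, feats = sae(acts)
    # Reconstruction keeps the features faithful to the activations; the L1 term
    # keeps them sparse, pushing each learned direction toward a single concept.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```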
Superposition is the fundamental reason neural networks are so difficult to interpret. By revealing that the basic unit of neural computation (the neuron) is not the basic unit of representation (the feature), superposition theory reframes the interpretability challenge and motivates the entire research agenda of sparse autoencoders and mechanistic feature analysis.