State Space Models (SSMs) such as Mamba are an alternative to transformers that process sequences in linear rather than quadratic time. They combine structured state spaces with selective (input-dependent) mechanisms to reach quality competitive with transformers, while keeping a fixed-size state during inference and generating tokens faster on long sequences.
What Are State Space Models?
- Definition: Sequence models based on continuous state space equations.
- Complexity: O(n) vs. transformer's O(n²) in sequence length.
- Memory: Fixed-size state at inference, independent of sequence length (no KV cache growth).
- Evolution: S4 (2022) → S5 → Mamba (2023) → Mamba-2 (2024).
Why SSMs Matter
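- Long Context: The inference state stays fixed-size, so very long inputs do not inflate memory; the Mamba paper reports quality gains on sequences up to about a million tokens.
- Efficiency: Compute scales linearly with sequence length, which makes very long sequences practical.
- Speed: Per-token generation cost does not grow with context length; the Mamba paper reports roughly 5× higher inference throughput than comparable transformers.
- Alternative Path: Scales sequence modeling through recurrence over a compressed state instead of attention over all past tokens.
- Hardware Friendly: The linear recurrence can be computed with a parallel scan that keeps the state in fast on-chip memory.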
From Transformers to SSMs
Transformer Attention:
Attention: O(n²) compute, O(n) memory per layer
Every token attends to every other token
Quality: Excellent for most tasks
Problem: Doesn't scale to very long sequences
State Space Model:
SSM: O(n) compute, O(1) memory per layer
Information flows through hidden state
Update state with each new token
Challenge: Can it match transformer quality?
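To make the memory contrast concrete, here is a rough back-of-the-envelope sketch. The function names and the model dimensions (32 layers, 32 KV heads of size 128, d_inner = 5120, d_state = 16, d_conv = 4) are illustrative assumptions, not figures from a specific model, and fp16 storage (2 bytes per value) is assumed.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: keys and values for every past token in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, d_conv=4, bytes_per_value=2):
    """Approximate recurrent state size, independent of the number of tokens.

    Assumes one (d_inner x d_state) SSM state and one (d_inner x d_conv)
    convolution buffer per layer; real implementations may differ.
    """
    per_layer = d_inner * d_state + d_inner * d_conv
    return n_layers * per_layer * bytes_per_value


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to show the scaling behaviour.
    ssm = ssm_state_bytes(n_layers=32, d_inner=5120, d_state=16)
    for n in (1_000, 100_000, 1_000_000):
        kv = kv_cache_bytes(n, n_layers=32, n_kv_heads=32, head_dim=128)
        print(f"{n:>9} tokens: KV cache {kv / 2**30:9.2f} GiB, SSM state {ssm / 2**20:6.2f} MiB")
```

The KV cache term grows in proportion to the number of tokens, while the SSM term has no token count in it at all, which is the O(n) versus O(1) distinction above.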
State Space Equations
Continuous Form:
h'(t) = Ah(t) + Bx(t) (state update)
y(t) = Ch(t) + Dx(t) (output)
Where:
- h: hidden state
- x: input
- y: output
- A, B, C, D: learned parameters (D is a direct input-to-output skip term)
Discrete Form (for sequences):
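h_t = Ā h_{t-1} + B̄ x_t
y_t = C h_t + D x_t
Here Ā and B̄ are discretized versions of A and B for a step size Δ (for example, Ā = exp(ΔA) under zero-order hold).
Computed efficiently as a convolution when the parameters are time-invariant (S4), or via a parallel scan (S5, Mamba).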
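The discrete recurrence and its scan-based evaluation can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the kernel any library uses: A is assumed diagonal with nonzero entries, the input is a single scalar channel, zero-order hold is used for discretization, and all function names (`discretize_zoh`, `ssm_recurrent`, `linear_scan`, `ssm_scan`) are hypothetical.

```python
import numpy as np


def discretize_zoh(A_diag, B, delta):
    """Zero-order-hold discretization for a diagonal A (all entries nonzero)."""
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar


def ssm_recurrent(A_bar, B_bar, C, D, x):
    """Step-by-step recurrence: h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C.h_t + D * x_t."""
    h = np.zeros_like(A_bar)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h + D * x_t
    return y


def linear_scan(a, b):
    """Inclusive scan for h_t = a_t * h_{t-1} + b_t with h_{-1} = 0.

    Pairs (a, b) compose associatively: applying (a1, b1) and then (a2, b2)
    gives (a1 * a2, a2 * b1 + b2), so the states can be built by doubling.
    """
    a, b = a.copy(), b.copy()
    n, d = a.shape[0], 1
    while d < n:
        # Combine each element with the partial result d positions earlier.
        b_new = a[d:] * b[:-d] + b[d:]
        a_new = a[d:] * a[:-d]
        a[d:], b[d:] = a_new, b_new
        d *= 2
    return b  # b[t] now equals h_t


def ssm_scan(A_bar, B_bar, C, D, x):
    """Same model as ssm_recurrent, with all hidden states computed by the scan."""
    a = np.tile(A_bar, (len(x), 1))   # (L, N): the same A_bar at every step
    b = B_bar[None, :] * x[:, None]   # (L, N): per-step input contribution
    h = linear_scan(a, b)             # (L, N): all hidden states
    return h @ C + D * x


rng = np.random.default_rng(0)
N, L = 4, 16                          # state size and sequence length (arbitrary)
A_diag = -np.arange(1.0, N + 1)       # negative entries keep the recurrence stable
B, C, D = rng.normal(size=N), rng.normal(size=N), 0.5
A_bar, B_bar = discretize_zoh(A_diag, B, delta=0.1)
x = rng.normal(size=L)
y_rec = ssm_recurrent(A_bar, B_bar, C, D, x)
y_scan = ssm_scan(A_bar, B_bar, C, D, x)
print(np.allclose(y_rec, y_scan))
```

The doubling scan shown here takes a logarithmic number of vectorized passes but does O(n log n) total work; work-efficient scans reach O(n) with more bookkeeping. The recurrent form is what inference uses, one cheap state update per new token.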
Mamba: Selective State Spaces
Key Innovation:
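- Make B, C, and the step size Δ input-dependent (selective); A itself stays a learned parameter, but its discretized form Ā = exp(ΔA) varies per token.
- The model can choose, token by token, what to write into the state and what to forget.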
- Bridges RNN flexibility with SSM efficiency.
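A sequential reference version of a selective SSM can be sketched as below. In the published Mamba formulation, selectivity enters through the step size Δ and the B and C projections, while A stays fixed and only its discretized form changes per token. The code is a simplified illustration under assumptions: the weight names are hypothetical, Δ uses a full-rank projection (Mamba uses a low-rank one), and B̄ uses the simplified Δ·B discretization.

```python
import numpy as np


def softplus(z):
    return np.logaddexp(0.0, z)  # numerically stable log(1 + exp(z))


def selective_ssm(x, A, W_delta, b_delta, W_B, W_C):
    """Sequential reference for a selective SSM over x of shape (L, D).

    A: (D, N) fixed negative parameters; W_delta: (D, D); b_delta: (D,);
    W_B and W_C: (D, N). All weights here are illustrative assumptions.
    """
    L, D = x.shape
    delta = softplus(x @ W_delta + b_delta)  # (L, D) per-token, per-channel step size
    B = x @ W_B                              # (L, N) per-token input projection
    C = x @ W_C                              # (L, N) per-token output projection
    h = np.zeros_like(A)                     # (D, N) one N-dimensional state per channel
    y = np.empty((L, D))
    for t in range(L):
        A_bar = np.exp(delta[t][:, None] * A)      # (D, N) decay factor between 0 and 1
        B_bar = delta[t][:, None] * B[t][None, :]  # (D, N) simplified discretization
        # Small delta keeps the old state; large delta decays it and writes the new input.
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C[t]                            # (D,) read out with the token's own C
    return y


rng = np.random.default_rng(0)
L, D, N = 12, 8, 4                                 # arbitrary toy sizes
x = rng.normal(size=(L, D))
A = -np.tile(np.arange(1.0, N + 1), (D, 1))        # fixed, negative, shared init per channel
params = dict(
    W_delta=rng.normal(size=(D, D)) * 0.1,
    b_delta=np.zeros(D),
    W_B=rng.normal(size=(D, N)) * 0.1,
    W_C=rng.normal(size=(D, N)) * 0.1,
)
y = selective_ssm(x, A, **params)
print(y.shape)
```

Because Ā and B̄ now differ at every step, the fixed-kernel convolution trick no longer applies, which is why selective models rely on a scan for parallel training.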
Mamba Block:
Input
↓
┌─────────────────────────────────────┐
│ Linear projection (expand dim) │
├─────────────────────────────────────┤
│ Conv1D (local context) │
├─────────────────────────────────────┤
│ Selective SSM │
│ - Input-dependent Δ, B, C │
│ - Selective scan (parallel) │
├─────────────────────────────────────┤
│ Linear projection (reduce dim) │
└─────────────────────────────────────┘
↓
Output
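The data flow in the diagram can be traced with a small NumPy sketch. It follows only the four stages shown; the published Mamba block also includes a multiplicative gating branch, which is left out here to match the diagram. The SSM stage is replaced by a fixed, non-selective stand-in (`toy_ssm`), the SiLU activation after the convolution is an assumption about where the nonlinearity sits, and all names and sizes are hypothetical.

```python
import numpy as np


def silu(z):
    return z / (1.0 + np.exp(-z))


def causal_depthwise_conv(x, kernel):
    """Per-channel causal convolution: the output at t sees only x[t-K+1 .. t]."""
    K, L = kernel.shape[0], x.shape[0]
    x_pad = np.vstack([np.zeros((K - 1, x.shape[1])), x])  # left-pad with zeros
    return sum(kernel[k] * x_pad[k:k + L] for k in range(K))


def toy_ssm(u, decay=0.9):
    """Stand-in for the selective SSM: a fixed per-channel recurrence (not selective)."""
    h = np.zeros(u.shape[1])
    out = np.empty_like(u)
    for t in range(u.shape[0]):
        h = decay * h + (1.0 - decay) * u[t]
        out[t] = h
    return out


def mamba_block_sketch(x, W_in, conv_kernel, W_out, ssm=toy_ssm):
    """Map x of shape (L, d_model) to (L, d_model) through the four diagram stages."""
    u = x @ W_in                                     # expand: (L, d_inner)
    u = silu(causal_depthwise_conv(u, conv_kernel))  # local context over a short window
    u = ssm(u)                                       # sequence mixing through the state
    return u @ W_out                                 # reduce: (L, d_model)


rng = np.random.default_rng(0)
L, d_model, expand, d_conv = 10, 16, 2, 4            # arbitrary toy sizes
d_inner = expand * d_model
W_in = rng.normal(size=(d_model, d_inner)) / np.sqrt(d_model)
conv_kernel = rng.normal(size=(d_conv, d_inner)) / d_conv
W_out = rng.normal(size=(d_inner, d_model)) / np.sqrt(d_inner)
x = rng.normal(size=(L, d_model))
y = mamba_block_sketch(x, W_in, conv_kernel, W_out)
print(y.shape)
```

Every stage is causal, so the output at position t depends only on inputs up to t, which is what allows the same block to run token by token at inference.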
SSM vs. Transformer Comparison
Aspect | Transformer | Mamba/SSM
------------------|------------------|------------------
Complexity | O(n²) | O(n)
Memory | O(n) KV cache | O(1) state
Long context | Expensive | Cheap
In-context recall | Excellent | Good (improving)
Ecosystem | Mature | Emerging
Training | Parallel | Parallel (scan)
Inference | KV cache | RNN-style
Mamba Models
Model | Params | Reported performance (approximate)
----------------|--------|----------------------------
Mamba-130M | 130M | Matches 350M transformer
Mamba-370M | 370M | Matches 1B transformer
Mamba-1.4B | 1.4B | Matches 3B transformer
Mamba-2.8B | 2.8B | Competitive with 7B
Jamba | 52B | Mamba + attention hybrid (MoE, 12B active)
Hybrid Architectures
Jamba (AI21):
- Mix Mamba and attention layers.
- Mamba handles long context cheaply.
- Attention provides in-context recall.
- Goal: combine the long-context efficiency of Mamba with the recall strength of attention.
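The interleaving idea can be illustrated with a small layer-schedule sketch. The Jamba paper describes a ratio of one attention layer to seven Mamba layers; the position of the attention layer within each group of eight (`attn_index`) and the function names below are assumptions made for illustration.

```python
def hybrid_layer_pattern(n_layers, period=8, attn_index=4):
    """Return a list of layer types with one attention layer per `period` layers."""
    return ["attention" if i % period == attn_index else "mamba" for i in range(n_layers)]


def kv_cache_fraction(pattern):
    """Fraction of layers that keep a KV cache, relative to an all-attention stack."""
    return pattern.count("attention") / len(pattern)


pattern = hybrid_layer_pattern(32)   # 32 layers is a hypothetical depth
print(pattern[:8])
print(kv_cache_fraction(pattern))
```

Only the attention layers hold a KV cache, so with this ratio the cache is about one eighth the size of an all-attention model with the same depth and head configuration.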
Mamba-2:
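- Reformulates the selective SSM through structured state space duality (SSD), which links SSMs to a form of linear attention.
- Better parallelization: the core computation becomes matrix multiplications that map well onto GPU tensor cores.
- Narrows the remaining quality gap to transformers at comparable scale.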
Limitations
In-Context Learning:
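- SSMs have historically been weaker at precise recall.
- The whole history is compressed into a fixed-size state, so the model can't easily look up specific earlier tokens.
- Mamba's selectivity improves this, but it may not fully match transformers on recall-heavy tasks.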
Ecosystem:
- Fewer optimized kernels and tools.
- Less community support.
- Rapidly improving but not at transformer level.
Inference Frameworks
- mamba-ssm: Official implementation.
- causal-conv1d: Efficient convolution kernel.
- Triton kernels: Custom GPU kernels.
- vLLM: Supports Mamba-family and hybrid (e.g., Jamba) models in recent releases.
State Space Models are a promising alternative to transformers. Transformers dominate today, but SSMs offer a fundamentally different approach with better theoretical scaling for long sequences, which makes them an important direction for future AI architectures.