Multi-Token Prediction and Parallel Decoding

Keywords: multi-token prediction, speculative decoding LLM, medusa heads, parallel decoding, lookahead decoding

Multi-Token Prediction and Parallel Decoding are inference acceleration techniques that generate multiple tokens per forward pass instead of the standard one-token-at-a-time autoregressive decoding. The family includes speculative decoding (draft-then-verify), Medusa (parallel prediction heads), and lookahead decoding, and these methods achieve 2-5× faster generation while keeping output quality identical or near-identical to vanilla autoregressive decoding.

The Autoregressive Bottleneck

```
Standard decoding: 1 token per forward pass
For 1000-token response: 1000 sequential LLM forward passes
Each pass is memory-bandwidth limited (loading all model weights)
GPU compute utilization: often <30% during decoding

Goal: Generate K tokens per forward pass → K× speedup potential
```
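
For reference, the baseline loop looks like this minimal sketch (Python/PyTorch; `model` is a stand-in that maps token ids to logits of shape (batch, seq, vocab), greedy sampling is used, and KV caching is omitted for brevity):

```python
import torch

def autoregressive_decode(model, input_ids, max_new_tokens=1000):
    """Greedy decoding: one full forward pass per generated token."""
    tokens = input_ids                                 # (batch, seq)
    for _ in range(max_new_tokens):                    # 1000 tokens -> 1000 sequential passes
        logits = model(tokens)                         # memory-bandwidth bound: all weights loaded each pass
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
    return tokens
```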

Speculative Decoding (Draft-then-Verify)

```
1. Draft: Small fast model generates K candidate tokens quickly
   Draft model: roughly 10-100× smaller (e.g., a 1B model drafting for a 70B target)

2. Verify: Large target model processes ALL K tokens in parallel
   (single forward pass with the K draft tokens appended to the context)
Compare: target probabilities vs. draft probabilities

3. Accept/Reject: Accept consecutive tokens that match
(using rejection sampling to guarantee identical distribution)
Typically accept 2-5 tokens per verification step

# Mathematically exact: output distribution = target model distribution
# Speedup grows with the acceptance rate and K, offset by draft + verification overhead
# Practical: 2-3× speedup
```
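
A simplified sketch of one draft-then-verify step, assuming `draft(ids)` and `target(ids)` return per-position next-token probability distributions of shape (seq, vocab); the bonus token sampled when all K drafts are accepted is omitted:

```python
import torch

def speculative_step(target, draft, tokens, K=4):
    """One speculative decoding step: draft K tokens, verify in one target pass."""
    # 1. Draft: small model proposes K tokens autoregressively
    drafted, q_probs = [], []
    ctx = tokens
    for _ in range(K):
        q = draft(ctx)[-1]                               # draft distribution at the last position
        x = torch.multinomial(q, 1)
        drafted.append(x); q_probs.append(q)
        ctx = torch.cat([ctx, x])

    # 2. Verify: single target forward pass over context + all K draft tokens
    p_all = target(ctx)

    # 3. Accept/reject (Leviathan et al. rule keeps the output distribution exact)
    accepted = []
    for i, (x, q) in enumerate(zip(drafted, q_probs)):
        p = p_all[len(tokens) + i - 1]                   # target distribution for this position
        if torch.rand(()) < (p[x] / q[x]).clamp(max=1.0):  # accept with prob min(1, p/q)
            accepted.append(x)
        else:
            residual = (p - q).clamp(min=0)              # on rejection: resample from norm(max(0, p - q))
            accepted.append(torch.multinomial(residual / residual.sum(), 1))
            break
    return torch.cat([tokens, *accepted])
```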

Medusa (Multiple Decoding Heads)

```
Add K extra prediction heads to the base model:
Head 0 (original): predicts token at position t+1
Head 1 (new): predicts token at position t+2
Head 2 (new): predicts token at position t+3
...
Head K (new): predicts token at position t+K+1

Each head is a small MLP (1-2 layers) trained to predict the token at its offset

Generation:
1. Forward pass → get top-k candidates from each head
2. Construct a tree of candidate sequences
3. Verify all candidates in parallel using tree attention
4. Accept longest valid prefix
```

Medusa advantages: no draft model needed, heads are tiny (<1% extra parameters), and can be trained with a few hours of fine-tuning on the original model's training data.
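
A minimal sketch of Medusa-style heads (Python/PyTorch): the residual-MLP block and the `top_candidates` helper are illustrative assumptions, and tree-attention verification is not shown:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """K lightweight heads added on top of a frozen backbone."""
    def __init__(self, hidden_size, vocab_size, num_heads=4):
        super().__init__()
        # Each head: one small residual block plus its own vocabulary projection
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU())
            for _ in range(num_heads))
        self.lm_heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size, bias=False)
            for _ in range(num_heads))

    def forward(self, last_hidden):                      # (batch, hidden) from the backbone
        # Head k here predicts the token at position t+2+k (the base LM head covers t+1)
        return [head(last_hidden + block(last_hidden))
                for block, head in zip(self.blocks, self.lm_heads)]

def top_candidates(head_logits, k=3):
    """Top-k token candidates from each head, used to build the candidate tree."""
    return [logits.topk(k, dim=-1).indices for logits in head_logits]
```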

Multi-Token Prediction (Training Objective)

Meta's multi-token prediction (2024) trains the model to predict the NEXT K tokens simultaneously:

```
Standard: P(x_{t+1} | x_{1:t}) (predict 1 token)
Multi: P(x_{t+1}, x_{t+2}, ..., x_{t+K} | x_{1:t}) (predict K tokens)

Implementation: shared backbone → K independent output heads
Training loss: sum of K next-token-prediction losses

Benefits beyond speed:
- Forces model to plan ahead (better representations)
- Stronger performance on code and reasoning benchmarks
- Can be used for parallel decoding at inference
```
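
A sketch of the training objective, assuming `heads` is a list of K vocabulary projections applied to the shared backbone's hidden states (head k predicts the token k+1 positions ahead):

```python
import torch
import torch.nn.functional as F

def multi_token_loss(hidden, heads, input_ids):
    """Sum of K shifted next-token cross-entropy losses over a shared backbone.

    hidden:    (batch, seq, hidden) backbone outputs
    heads:     list of K output projections; heads[k] predicts the token k+1 steps ahead
    input_ids: (batch, seq) token ids, reused as shifted targets
    """
    total = 0.0
    seq_len = input_ids.size(1)
    for k, head in enumerate(heads):
        logits = head(hidden[:, : seq_len - 1 - k])      # only positions with a valid target
        targets = input_ids[:, 1 + k :]                  # token k+1 steps ahead
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return total
```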

Lookahead Decoding

Uses the model itself as the draft source via Jacobi iteration:
```
Initialize: guess future tokens (e.g., random or n-gram based)
Iterate: each forward pass refines ALL guessed tokens in parallel
Convergence: fixed point where all positions are self-consistent
N-gram cache: store and reuse verified n-gram patterns
```
No separate draft model is needed, and it works with any model.
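
The core fixed-point iteration can be sketched as follows (Python/PyTorch; `model` is assumed to return the greedy next-token prediction for every position, the vocabulary size 32000 is a placeholder, and the n-gram cache and per-step verification of full lookahead decoding are omitted):

```python
import torch

def jacobi_decode_block(model, prefix, block_size=8, max_iters=16):
    """Refine a whole block of guessed tokens in parallel until self-consistent."""
    guess = torch.randint(0, 32000, (block_size,))       # initial guess (random or n-gram based)
    for _ in range(max_iters):
        preds = model(torch.cat([prefix, guess]))         # one parallel forward pass
        new_guess = preds[len(prefix) - 1 : len(prefix) - 1 + block_size]
        if torch.equal(new_guess, guess):                  # fixed point: all positions self-consistent
            break
        guess = new_guess
    return torch.cat([prefix, guess])
```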

Comparison

| Method | Speedup | Extra Params | Exact Output? | Requirements |
|--------|---------|-------------|---------------|-------------|
| Speculative (Leviathan) | 2-3× | Draft model | Yes | Compatible draft model |
| Medusa | 2-3× | <1% extra | Near-exact | Fine-tune heads |
| Multi-token (Meta) | 2-3× | K output heads | Yes (if trained) | Retrain from scratch |
| Lookahead | 1.5-2× | None | Near-exact | Nothing |
| Eagle | 2-4× | 0.5B extra | Yes | Train autoregression head |

Multi-token prediction and parallel decoding are transforming LLM inference economics. Autoregressive generation is memory-bandwidth bound, so GPU compute sits largely idle during single-token decoding; these techniques put that spare compute to work generating multiple tokens per pass, achieving multiplicative speedups that are essential for cost-effective LLM serving at scale.
