Multi-Token Prediction and Parallel Decoding

Multi-token prediction and parallel decoding are inference acceleration techniques that generate multiple tokens per forward pass of the target model, instead of one token at a time as in standard autoregressive decoding. They include speculative decoding (draft-then-verify), Medusa heads (parallel prediction heads), and lookahead decoding, with reported speedups of roughly 2-5× and output that is identical or near-identical to vanilla autoregressive decoding.

The Autoregressive Bottleneck

```
Standard decoding: 1 token per forward pass
For 1000-token response: 1000 sequential LLM forward passes
Each pass is memory-bandwidth limited (loading all model weights)
GPU compute utilization: often <30% during decoding

Goal: Generate K tokens per forward pass → K× speedup potential
```
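To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes every weight is read from memory once per forward pass and ignores KV-cache traffic and compute time. The parameter count, precision, and bandwidth figures are hypothetical round numbers chosen for illustration, not measurements.

```python
def decode_time_per_pass(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Lower bound on one forward pass when all weights are read once (assumption)."""
    return param_count * bytes_per_param / mem_bandwidth_bytes_per_s


def time_with_k_tokens_per_pass(num_tokens, k, per_pass_seconds):
    """Idealized total time if each pass costs the same but yields k tokens."""
    passes = -(-num_tokens // k)  # ceiling division
    return passes * per_pass_seconds


# Hypothetical setup: 70e9 parameters, 2 bytes each, 2e12 bytes/s of memory bandwidth.
per_pass = decode_time_per_pass(70e9, 2, 2e12)
baseline = time_with_k_tokens_per_pass(1000, 1, per_pass)  # one token per pass
with_k4 = time_with_k_tokens_per_pass(1000, 4, per_pass)   # four tokens per pass
print(f"{per_pass:.3f} s/pass, baseline {baseline:.1f} s, K=4 {with_k4:.1f} s")
```

Under these assumptions the K× figure is an upper bound. Real methods fall short of it because not every proposed token is accepted and because drafting and verification add overhead.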

Speculative Decoding (Draft-then-Verify)

```
1. Draft: Small fast model generates K candidate tokens quickly
   Draft model: 10× or more smaller (e.g., 1B drafting for 70B)
2. Verify: Large target model processes ALL K tokens in parallel
   (single forward pass over the context with the K draft tokens appended)
   Compare: target probabilities vs. draft probabilities
3. Accept/Reject: Accept draft tokens in order until the first rejection
   (rejection sampling guarantees the target model's distribution)
   Typically accept 2-5 tokens per verification step

# Mathematically exact: output distribution = target model distribution
# Speedup grows with acceptance rate and K, and shrinks with draft + verify overhead
# Practical: 2-3× speedup
```
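The accept/reject rule can be sketched over a toy vocabulary as follows. This is a minimal illustration of the standard speculative sampling rule, not any particular library's implementation. The function names and the list-of-probabilities representation are assumptions made for the example, and the model calls that would produce the distributions are left out.

```python
import random


def speculative_step(draft_tokens, draft_probs, target_probs, rng=random):
    """One verification step of speculative sampling (toy sketch).

    draft_tokens: K token ids proposed by the draft model.
    draft_probs:  K distributions q_i (lists over the vocabulary) they were sampled from.
    target_probs: K + 1 distributions p_i from the target model's single parallel pass.
    Returns the tokens emitted this step (between 1 and K + 1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # All K drafts accepted: sample one bonus token from the target's last distribution.
    last = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out


def expected_tokens_per_step(alpha, k):
    """Expected tokens per step for a constant acceptance rate 0 <= alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

The residual resampling is what keeps the output distribution equal to the target model's. The second helper assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification; with `alpha = 0.8` and `K = 4` it gives about 3.4 tokens per verification step.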

Medusa (Multiple Decoding Heads)

```
Add K extra prediction heads to the base model:
Head 0 (original): predicts token at position t+1
Head 1 (new): predicts token at position t+2
Head 2 (new): predicts token at position t+3
...
Head K (new): predicts token at position t+K+1

Each head is a small MLP (1-2 layers) trained to predict the token at its own offset

Generation:
1. Forward pass → get top-k candidates from each head
2. Construct a tree of candidate sequences
3. Verify all candidates in parallel using tree attention
4. Accept longest valid prefix
```

Medusa advantages: no separate draft model is needed, the heads are small relative to the base model, and they can be trained with a few hours of fine-tuning on data resembling the original model's training distribution.
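The generation loop above can be sketched with plain Python. The example below builds the candidate tree as a Cartesian product of per-head top-k lists and keeps the longest prefix that a greedy verifier agrees with. `verify_next` is a hypothetical stand-in for the base model, and the sequential checks here replace what a real system does in one pass with tree attention.

```python
from itertools import product


def build_candidates(head_topk):
    """One tuple per root-to-leaf path through the per-head top-k choices."""
    return [tuple(path) for path in product(*head_topk)]


def longest_accepted_prefix(candidates, verify_next, context):
    """Longest candidate prefix that matches the verifier's own greedy choices.

    verify_next(tokens) is a hypothetical callable returning the base model's
    greedy next token for the given token list.
    """
    best = ()
    for cand in candidates:
        accepted = []
        for tok in cand:
            if verify_next(list(context) + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = tuple(accepted)
    return best


# Toy verifier: the "model" always continues with (last token + 1) mod 10.
toy_model = lambda tokens: (tokens[-1] + 1) % 10
paths = build_candidates([[4, 7], [5, 9], [1, 6]])
print(longest_accepted_prefix(paths, toy_model, context=[3]))
```

Greedy matching is only one possible acceptance rule and is used here for simplicity. The full product tree grows as k to the power of the number of heads, so practical systems prune it to a smaller set of likely paths.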

Multi-Token Prediction (Training Objective)

Meta's multi-token prediction (2024) trains the model to predict the next K tokens simultaneously:

```
Standard: P(x_{t+1} | x_{1:t})                          (predict 1 token)
Multi:    P(x_{t+1}, x_{t+2}, ..., x_{t+K} | x_{1:t})   (predict K tokens)

Implementation: shared backbone → K independent output heads
Training loss: sum of K per-offset prediction losses (head i predicts x_{t+i})

Benefits beyond speed:
- Forces model to plan ahead (better representations)
- Stronger performance on code and reasoning benchmarks
- Can be used for parallel decoding at inference
```
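Written out, the training loss might look like the following. This assumes the K heads are independent given the shared backbone, so the joint probability is modeled as a product of per-offset terms. The superscript $(i)$ marking head $i$ and the symbol $\theta$ for the model parameters are notation introduced here for illustration.

```latex
\mathcal{L}_{K}
  = -\sum_{t} \log P_\theta\bigl(x_{t+1}, \ldots, x_{t+K} \mid x_{1:t}\bigr)
  = -\sum_{t} \sum_{i=1}^{K} \log P_\theta^{(i)}\bigl(x_{t+i} \mid x_{1:t}\bigr)
```

Setting $K = 1$ recovers the standard next-token loss.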

Lookahead Decoding

Uses the model itself as the draft source via Jacobi iteration:

```
Initialize: guess future tokens (e.g., random or n-gram based)
Iterate: each forward pass refines ALL guessed tokens in parallel
Convergence: fixed point where all positions are self-consistent
N-gram cache: store and reuse verified n-gram patterns
```

No separate draft model is needed, and it works with any autoregressive model.
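The Jacobi core of this idea can be sketched as below. It is a simplified illustration that covers only the fixed-point iteration and leaves out the n-gram cache and verification branch of lookahead decoding. `next_token` is a hypothetical stand-in for the model's greedy next-token choice, and the list comprehension stands in for a single batched forward pass over all guessed positions.

```python
def jacobi_decode(next_token, prompt, n, init_guess=0):
    """Decode n tokens by fixed-point iteration instead of n sequential steps.

    Each sweep recomputes every position from the previous sweep's guesses.
    Position i becomes correct once positions 0..i-1 are correct, so the loop
    reaches the greedy output in at most n sweeps, plus one sweep to confirm.
    """
    guess = [init_guess] * n
    sweeps = 0
    while True:
        sweeps += 1
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point: every position is self-consistent
            return guess, sweeps
        guess = new


# Toy model whose next token depends only on the sequence length.
by_length = lambda tokens: len(tokens) % 10
tokens, sweeps = jacobi_decode(by_length, prompt=[3], n=4)
print(tokens, sweeps)
```

The fixed point equals what sequential greedy decoding would produce. The speedup comes from cases where several positions settle in the same sweep, which is why reusing good guesses from an n-gram cache helps.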

Comparison

| Method | Speedup | Extra Params | Exact Output? | Requirements |
|--------|---------|-------------|---------------|-------------|
| Speculative (Leviathan) | 2-3× | Draft model | Yes | Compatible draft model |
| Medusa | 2-3× | K small heads | Near-exact | Fine-tune heads |
| Multi-token (Meta) | 2-3× | K output heads | Yes (if extra heads are verified by the next-token head) | Train with the multi-token objective |
| Lookahead | 1.5-2× | None | Yes | Nothing |
| Eagle | 2-4× | 0.5B extra | Yes | Train autoregression head |

Multi-token prediction and parallel decoding are changing LLM inference economics. Single-token autoregressive decoding is memory-bandwidth bound, so GPU compute is underutilized; these techniques use that spare compute to generate multiple tokens per pass, giving multiplicative speedups that matter for cost-effective LLM serving at scale.
