Multi-Token Prediction and Parallel Decoding

Multi-token prediction and parallel decoding are inference acceleration techniques that produce more than one token per forward pass of the target model, instead of standard one-token-at-a-time autoregressive decoding. The family includes speculative decoding (draft-then-verify), Medusa (parallel prediction heads), and lookahead decoding. Reported speedups are typically 2-5×, with output identical or near-identical to vanilla autoregressive decoding.

The Autoregressive Bottleneck

Standard decoding: 1 token per forward pass
  For 1000-token response: 1000 sequential LLM forward passes
  Each pass is memory-bandwidth limited (loading all model weights)
  GPU compute utilization: often <30% during decoding
  
Goal: Produce K tokens per target-model forward pass → up to K× speedup
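The bandwidth argument can be made concrete with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement: it assumes every weight is read from memory once per forward pass and ignores KV-cache traffic and compute time. The 70B parameter count, 2 bytes per parameter, and 2000 GB/s bandwidth are illustrative assumptions, and `decode_latency_ms` is a hypothetical helper name.

```python
def decode_latency_ms(params_billion: float, bytes_per_param: float,
                      bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per pass."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return 1000.0 * weight_gb / bandwidth_gb_s

# Assumed setup: 70B parameters, 16-bit weights, 2000 GB/s memory bandwidth.
per_token_ms = decode_latency_ms(70, 2, 2000)       # about 70 ms per pass
total_s_standard = 1000 * per_token_ms / 1000.0     # 1000 tokens, 1 token per pass
total_s_k3 = total_s_standard / 3                   # same response at 3 tokens per pass

print(f"{per_token_ms:.1f} ms/pass, {total_s_standard:.1f} s vs {total_s_k3:.1f} s")
```

Under these assumptions the pass time is set by weight traffic, so emitting several tokens per pass shortens the response almost proportionally.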

Speculative Decoding (Draft-then-Verify)

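1. Draft: Small fast model generates K candidate tokens autoregressively
   Draft model: much smaller than the target, often 10× or more (e.g., 1B drafting for 70B)
   
2. Verify: Large target model scores ALL K tokens in parallel
   (single forward pass over the context with the K draft tokens appended)
   Compare: target probability p(x) vs. draft probability q(x) at each drafted position
   
3. Accept/Reject: Accept each draft token with probability min(1, p(x)/q(x)), left to right
   At the first rejection, resample that position from the normalized residual max(0, p − q) and discard the remaining draft tokens
   If all K are accepted, sample one bonus token from the target
   (this rejection-sampling rule guarantees the target model's output distribution)
   Typically 2-5 tokens are produced per verification step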
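The accept/reject step can be sketched in a few lines of Python. This is a minimal illustration of the standard speculative sampling rule, assuming the draft and target distributions are already available as plain probability lists; `speculative_accept` and its argument names are hypothetical, not part of any library.

```python
import random

def speculative_accept(draft_tokens, q_dists, p_dists, rng):
    """Return the tokens kept from one draft-then-verify step.

    draft_tokens: K token ids proposed by the draft model.
    q_dists: K draft distributions (probability lists over the vocabulary).
    p_dists: K+1 target distributions from the single verification pass.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        q, p = q_dists[i], p_dists[i]
        # Accept the draft token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(rng.choices(range(len(p)), weights=residual)[0])
        return accepted
    # All K draft tokens accepted: take one bonus token from the target.
    last = p_dists[-1]
    accepted.append(rng.choices(range(len(last)), weights=last)[0])
    return accepted

rng = random.Random(0)
q = [[0.5, 0.5, 0.0]] * 3
print(speculative_accept([0, 1, 0], q, [[0.5, 0.5, 0.0]] * 4, rng))  # draft == target
print(speculative_accept([0, 1, 0], q, [[0.0, 0.0, 1.0]] * 4, rng))  # target disagrees
```

When draft and target agree, every step yields K+1 tokens; when they disagree completely, each step still yields one token drawn from the target, so the output stays correct and only the speedup is lost.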

# Mathematically exact: output distribution = target model distribution
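# Expected tokens per step = (1 − α^(K+1)) / (1 − α), where α is the per-token acceptance rate
# Speedup ≈ expected tokens per step / (K·c + 1), where c is the cost of one draft pass relative to one target pass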
# Practical: 2-3× speedup
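To see how acceptance rate and draft cost trade off, the estimate can be evaluated numerically. The sketch below assumes each draft token is accepted independently with probability `alpha` and that one draft pass costs a fraction `c` of a target pass; both parameters and the function names are assumptions for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with K draft tokens."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Tokens per step divided by step cost in units of one target pass."""
    return expected_tokens_per_step(alpha, k) / (k * c + 1.0)

# Assumed values: 80% acceptance, 4 draft tokens, draft pass at 5% of target cost.
print(round(expected_tokens_per_step(0.8, 4), 2))
print(round(expected_speedup(0.8, 4, 0.05), 2))
```

With these assumed values the estimate is roughly 3.4 tokens per step and a speedup near 2.8×. Raising K helps only while the acceptance rate stays high, because the K·c term grows linearly while expected tokens saturate.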

Medusa (Multiple Decoding Heads)

Add K extra prediction heads to the base model:
  Head 0 (original): predicts token at position t+1
  Head 1 (new):      predicts token at position t+2
  Head 2 (new):      predicts token at position t+3
  ...
  Head K (new):      predicts token at position t+K+1

Each new head is a small feed-forward block (1-2 layers) on top of the final hidden state, trained with cross-entropy to predict the token at its own offset

Generation:
  1. Forward pass → get top-k candidates from each head
  2. Construct a tree of candidate sequences
  3. Verify all candidates in parallel using tree attention
  4. Accept longest valid prefix
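The generation loop above can be illustrated with a toy example. The sketch makes several simplifying assumptions: the candidate tree is the full Cartesian product of each head's top-k tokens (real implementations typically prune it), verification is greedy, and `verify_next` is a hypothetical stand-in that is called sequentially here, whereas tree attention would score all candidates in one pass.

```python
from itertools import product

def build_candidates(head_topk):
    """One candidate path per leaf of the dense tree over per-head top-k tokens."""
    return [list(path) for path in product(*head_topk)]

def longest_accepted_prefix(candidates, verify_next, context):
    """Keep the longest candidate prefix that matches the base model's greedy choice."""
    best = []
    for cand in candidates:
        accepted = []
        for tok in cand:
            if verify_next(context + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best

def toy_verify_next(seq):
    """Hypothetical base model: the next token is always (last token + 1) mod 10."""
    return (seq[-1] + 1) % 10

head_topk = [[4], [5, 7], [9, 6]]  # top-k tokens from head 0, head 1, head 2
candidates = build_candidates(head_topk)
print(longest_accepted_prefix(candidates, toy_verify_next, [3]))
```

In this toy run the path 4, 5, 6 is fully accepted, so three tokens are emitted for one verification step.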

Medusa advantages: no separate draft model is needed, the heads are small relative to the base model, and they can be trained with a few hours of fine-tuning on top of the existing model.

Multi-Token Prediction (Training Objective)

Meta's multi-token prediction (2024) trains the model to predict the next K tokens at every position:

Standard: P(x_{t+1} | x_{1:t})                     (predict 1 token)
Multi:    P(x_{t+k} | x_{1:t}) for k = 1, ..., K   (predict K tokens, one head per offset)

Implementation: shared backbone → K independent output heads
Training loss: sum of K cross-entropy losses, one per future offset k
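The training loss for a single position can be written out directly. This is a minimal sketch assuming each head's output is already a normalized probability list; `multi_token_loss` is a hypothetical name, and the code uses 0-based head indices, so `head_probs[k]` corresponds to offset k+1 in the notation above.

```python
import math

def multi_token_loss(head_probs, targets):
    """Sum of K cross-entropy terms, one per output head.

    head_probs[k]: distribution predicted by head k for position t+k+1.
    targets[k]:    true token id at position t+k+1.
    """
    return sum(-math.log(probs[tok]) for probs, tok in zip(head_probs, targets))

# Two heads over a two-token vocabulary (illustrative numbers only).
head_probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
print(round(multi_token_loss(head_probs, targets), 4))
```

With K = 1 this reduces to the ordinary next-token cross-entropy, which is why the objective can be seen as a strict extension of standard training.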

Benefits beyond speed:
  - Encourages the model to plan ahead (better representations)
  - Reported gains on code and reasoning benchmarks
  - The extra heads can be reused for parallel (self-speculative) decoding at inference

Lookahead Decoding

Uses the model itself as the draft source via Jacobi iteration:

Initialize: guess future tokens (e.g., random or n-gram based)
Iterate: each forward pass refines ALL guessed tokens in parallel
Convergence: fixed point where all positions are self-consistent
N-gram cache: store and reuse verified n-gram patterns
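The Jacobi part of this procedure can be demonstrated on a toy model. The sketch below covers only the parallel refinement and fixed-point check, not the n-gram cache; `greedy_next` is a hypothetical deterministic stand-in for a model's greedy next-token choice, and each loop iteration plays the role of one forward pass.

```python
def greedy_next(prefix):
    """Hypothetical stand-in for a model's greedy next-token choice."""
    return (prefix[-1] * 3 + len(prefix)) % 7

def sequential_decode(prompt, n_new):
    """Baseline: one token per step."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(greedy_next(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n_new, max_iters=100):
    """Refine all guessed positions in parallel until nothing changes."""
    guess = [0] * n_new  # arbitrary initial guess
    for iteration in range(1, max_iters + 1):
        seq = list(prompt) + guess
        # Recompute every position from the (possibly wrong) tokens before it.
        new_guess = [greedy_next(seq[:len(prompt) + i]) for i in range(n_new)]
        if new_guess == guess:  # fixed point: every position is self-consistent
            return guess, iteration
        guess = new_guess
    return guess, max_iters

tokens, iters = jacobi_decode([2], 6)
print(tokens == sequential_decode([2], 6), iters)
```

Each iteration fixes at least one more leading position, so the fixed point is reached in at most n_new + 1 iterations and matches sequential greedy decoding. Any speedup comes from the cases where several positions become correct in the same pass.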

No separate draft model or extra training is needed, so it works with any autoregressive model.

Comparison

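Method                  | Speedup | Extra Params      | Exact Output?    | Requirements
------------------------|---------|-------------------|------------------|---------------------------------
Speculative (Leviathan) | 2-3×    | Draft model       | Yes              | Compatible draft model
Medusa                  | 2-3×    | Small extra heads | Near-exact       | Fine-tune heads
Multi-token (Meta)      | 2-3×    | K output heads    | Yes (if trained) | Retrain from scratch
Lookahead               | 1.5-2×  | None              | Yes              | None
EAGLE                   | 2-4×    | ~0.5B extra       | Yes              | Train autoregressive draft head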

Multi-token prediction and parallel decoding are changing LLM inference economics. Single-token decoding is memory-bandwidth bound and leaves GPU compute underutilized; these techniques spend that idle compute on producing several tokens per pass, giving multiplicative speedups that matter for cost-effective LLM serving at scale.
