Decoding strategies are the algorithms that determine how an LLM selects the next token during text generation. They range from greedy selection of the most probable token to sampling-based methods such as top-k and top-p that introduce controlled randomness, and they govern the creativity, diversity, and quality of the generated text.
What Are Decoding Strategies?
- Definition: Methods for selecting tokens from model output probabilities.
- Context: After the LLM computes logits, how should the next token be chosen?
- Trade-off: Determinism/quality vs. diversity/creativity.
- Control: Parameters like temperature, top-k, top-p tune behavior.
Why Decoding Strategy Matters
- Output Quality: A poorly matched strategy produces repetitive or nonsensical text.
- Creativity Control: More randomness for creative writing, less for factual tasks.
- Task Matching: Different tasks need different strategies.
- User Experience: Balance predictability with variability.
Decoding Methods
Greedy Decoding:
At each step, select: argmax(P(token|context))
Pros: Fast, deterministic, reproducible
Cons: Prone to repetition and bland output; can miss sequences with higher overall probability
Use: Testing, or when deterministic output is required
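The greedy rule above can be sketched in a few lines. This is a minimal illustration, not a real model interface: `toy_model` and its bigram table are hypothetical stand-ins for an LLM that returns next-token probabilities.

```python
def greedy_decode(next_token_probs, prompt, max_new_tokens, eos=None):
    """Repeatedly append the single most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # dict: token -> probability
        best = max(probs, key=probs.get)   # argmax over the vocabulary
        tokens.append(best)
        if best == eos:
            break
    return tokens


# Hypothetical stand-in for a model: a fixed bigram table (made-up numbers).
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"<eos>": 0.9, "down": 0.1},
}


def toy_model(tokens):
    # Only the last token matters in this toy setup.
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})


print(greedy_decode(toy_model, ["the"], 5, eos="<eos>"))
# ['the', 'cat', 'sat', '<eos>']
```

Because no randomness is involved, the same prompt always yields the same continuation.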
Beam Search:
Maintain the beam_width highest-scoring candidate sequences; at each step expand all of them, then keep the best beam_width
beam_width = 4:
Step 1: ["The", "A", "In", "It"]
Step 2: ["The cat", "The dog", "A cat", "A dog"]
...continue expanding and pruning...
Pros: Better than greedy, finds higher probability sequences
Cons: Still deterministic, expensive for long sequences
Use: Translation, summarization (shorter outputs)
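The expand-and-prune loop can be sketched as follows. The function assumes a callable that maps a token list to a dict of next-token probabilities; `toy_beam_model` and its table are hypothetical and chosen so that the greedy first choice is not the best overall sequence. Real implementations usually add length normalization and end-of-sequence handling, which this sketch omits.

```python
import math


def beam_search(next_token_probs, prompt, beam_width, steps):
    """Keep the `beam_width` highest log-probability sequences at each step."""
    beams = [(0.0, list(prompt))]  # (cumulative log-prob, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for tok, p in next_token_probs(tokens).items():
                if p > 0:
                    candidates.append((score + math.log(p), tokens + [tok]))
        # Prune: keep only the best `beam_width` expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical toy distribution (made-up numbers).
TABLE = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
}


def toy_beam_model(tokens):
    return TABLE.get(tokens[-1], {"<eos>": 1.0})


best_score, best_tokens = beam_search(toy_beam_model, ["<s>"], beam_width=2, steps=2)[0]
print(best_tokens, round(math.exp(best_score), 2))
```

With a width of 2 the search returns "B z" (probability 0.4 × 0.9 = 0.36), whereas a width of 1, which is equivalent to greedy decoding, commits to "A" first and ends at 0.30.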
Temperature Sampling:
Scale logits before softmax: softmax(logits / temperature)
Temperature = 1.0: Original distribution
Temperature < 1.0: Sharper (more deterministic)
Temperature > 1.0: Flatter (more random)
Temperature → 0: Approaches greedy
Temperature → ∞: Uniform random
Use: Primary creativity control knob
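The effect of the temperature parameter can be checked numerically. The sketch below implements softmax(logits / temperature) with the standard library only; the example logits are arbitrary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary example values
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

The probability of the top token rises as the temperature drops below 1.0 and falls as it rises above 1.0. A temperature of exactly 0 would divide by zero, which is why implementations typically treat it as a switch to greedy decoding.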
Top-K Sampling:
Only sample from top k highest probability tokens
Top-k = 50:
Original: [0.3, 0.2, 0.15, 0.1, 0.05, 0.05, ...]
Filtered: [0.3, 0.2, 0.15, 0.1, 0.05, ...] (only the 50 most probable tokens kept; all others set to 0)
Renormalize and sample
Pros: Prevents sampling rare/nonsensical tokens
Cons: Fixed k may be too restrictive or permissive
Use: Good default with k=40-100
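A minimal sketch of the filter-renormalize-sample sequence, operating on a probability list rather than real model output. The helper names `top_k_filter` and `sample` are illustrative, and the example uses a small k so the effect is visible.

```python
import random


def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


def sample(probs, rng=random):
    """Draw one token index according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.4, 0.25, 0.15, 0.1, 0.06, 0.04]  # made-up distribution
filtered = top_k_filter(probs, k=3)
print([round(p, 4) for p in filtered])
print(sample(filtered))
```

After filtering with k=3, only the first three indices can ever be sampled, and their probabilities are rescaled to sum to 1.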
Top-P (Nucleus) Sampling:
Sample from smallest set of tokens with cumulative probability ≥ p
Top-p = 0.9:
Sorted: [0.4, 0.3, 0.15, 0.1, 0.03, 0.02, ...]
Cumsum: [0.4, 0.7, 0.85, 0.95] ← stop here (0.95 ≥ 0.9)
Sample from first 4 tokens only
Pros: Adapts to distribution shape
Cons: Can be very narrow when the model is confident, and very wide when the distribution is flat
Use: Modern default, typically p=0.9-0.95
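The worked example above can be reproduced with a short sketch. The function name `top_p_filter` is illustrative; it keeps the smallest prefix of the sorted distribution whose cumulative probability reaches p and renormalizes that prefix.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / cumulative
    return filtered


probs = [0.4, 0.3, 0.15, 0.1, 0.03, 0.02]
nucleus = top_p_filter(probs, p=0.9)
print(sum(1 for x in nucleus if x > 0))  # 4 tokens survive
```

Unlike top-k, the number of surviving tokens changes with the shape of the distribution: a peaked distribution reaches the threshold after one or two tokens, a flat one after many.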
Combined Strategies
Modern LLM APIs typically combine:
1. Temperature scaling (creativity)
2. Top-p filtering (quality floor)
3. Top-k filtering (additional safety)
4. Repetition penalty (prevent loops)
Example:
temperature=0.7, top_p=0.9, top_k=50
→ Moderately creative, high quality outputs
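One way the three filters can be chained is sketched below. The order (temperature, then top-k, then top-p) and the choice to measure the nucleus mass on the probabilities before renormalizing are assumptions made for this illustration; libraries differ on both points, and `sample_next_token` is a hypothetical name.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Temperature -> top-k -> top-p -> sample. Returns a token index."""
    # 1. Temperature-scaled softmax (max subtracted for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-k: keep the k most probable token indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]

    # 3. Top-p: within those, keep the smallest prefix reaching mass top_p
    #    (assumption: mass is measured before any renormalization).
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # 4. Sample from the survivors; random.choices normalizes the weights.
    return rng.choices(keep, weights=[probs[i] for i in keep], k=1)[0]


logits = [5.0, 4.0, 0.0, -1.0, -2.0]  # arbitrary example values
print(sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9))
```

With these example logits, the two highest-scoring tokens already cover more than 0.9 of the probability mass, so only indices 0 and 1 can be drawn.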
Strategy Selection by Task
Task             | Strategy           | Settings
-----------------|--------------------|----------------------
Factual QA       | Low temp or greedy | temp=0, or temp=0.1
Code generation  | Low temperature    | temp=0.2, top_p=0.95
Creative writing | High temperature   | temp=0.9, top_p=0.95
Chat/dialogue    | Medium temperature | temp=0.7, top_p=0.9
Summarization    | Beam search        | beam=4, or temp=0.3
Brainstorming    | High temp, high p  | temp=1.0, top_p=0.95
Advanced Techniques
Repetition Penalty:
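- Reduce the probability of tokens that have already been generated.
- Prevents word and phrase repetition.
- Parameters: presence_penalty (a flat penalty once a token has appeared) and frequency_penalty (a penalty that grows with each occurrence).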
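A minimal sketch of the additive form of these penalties, applied to raw logits before softmax. The function name `apply_penalties` and the list-based logit layout are assumptions for illustration; some libraries use a multiplicative repetition penalty instead, which is not shown here.

```python
from collections import Counter


def apply_penalties(logits, generated, presence_penalty=0.0, frequency_penalty=0.0):
    """Subtract additive penalties from the logits of already-generated tokens.

    logits: list of floats indexed by token id.
    generated: list of token ids produced so far.
    """
    counts = Counter(generated)
    adjusted = list(logits)  # leave the caller's list untouched
    for token_id, count in counts.items():
        # presence: flat one-time cost; frequency: cost grows with each repeat.
        adjusted[token_id] -= presence_penalty + frequency_penalty * count
    return adjusted


logits = [2.0, 1.0, 0.5]
print(apply_penalties(logits, generated=[0, 0, 1],
                      presence_penalty=0.5, frequency_penalty=0.25))
# [1.0, 0.25, 0.5]
```

Token 0 appeared twice and loses 0.5 + 2 × 0.25 = 1.0; token 1 appeared once and loses 0.75; token 2 never appeared and is unchanged.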
Contrastive Search:
- Balance probability with diversity from previous tokens.
- Reduces degeneration without pure sampling.
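The selection step can be sketched under one commonly cited scoring rule: each top-k candidate is scored as (1 - alpha) × probability minus alpha × its maximum cosine similarity to the hidden states of earlier tokens. The hidden-state vectors, the names `cosine` and `contrastive_pick`, and the value alpha=0.6 are all assumptions for illustration; a real model would supply the hidden states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_pick(candidates, context_states, alpha=0.6):
    """candidates: list of (token, probability, hidden_state) for the top-k tokens.
    context_states: hidden states of the tokens generated so far.
    Returns the token maximizing (1 - alpha) * prob - alpha * max_similarity."""
    def score(candidate):
        _, prob, state = candidate
        penalty = max(cosine(state, h) for h in context_states)
        return (1 - alpha) * prob - alpha * penalty
    return max(candidates, key=score)[0]


# Made-up 2-D hidden states: "again" duplicates the context, "new" does not.
context = [[1.0, 0.0]]
candidates = [("again", 0.6, [1.0, 0.0]), ("new", 0.4, [0.0, 1.0])]
print(contrastive_pick(candidates, context, alpha=0.6))  # new
print(contrastive_pick(candidates, context, alpha=0.0))  # again
```

With alpha=0 the rule reduces to picking the most probable candidate; raising alpha shifts the choice toward tokens whose representation differs from what has already been generated.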
Speculative Decoding:
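- A small draft model generates candidate tokens quickly.
- The main model verifies them in parallel.
- Speeds up generation while preserving the main model's output distribution, provided the standard accept/reject verification rule is used.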
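The verification step for a single drafted token can be sketched as follows, assuming the usual accept/reject rule: accept with probability min(1, p/q), where p and q are the main-model and draft-model probabilities of the drafted token, and otherwise resample from the normalized positive part of p - q. The function name and the toy distributions are hypothetical; a real system verifies several drafted tokens per main-model pass.

```python
import random


def verify_draft_token(token, target_probs, draft_probs, rng=random):
    """Accept the drafted token with probability min(1, p/q); otherwise
    resample from the normalized residual max(0, p - q).
    Returns (token_index, accepted)."""
    p, q = target_probs[token], draft_probs[token]  # q > 0: token came from the draft
    if rng.random() < min(1.0, p / q):
        return token, True
    residual = [max(0.0, a - b) for a, b in zip(target_probs, draft_probs)]
    replacement = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return replacement, False


# Made-up distributions over a 3-token vocabulary.
target = [0.5, 0.3, 0.2]
draft = [0.2, 0.6, 0.2]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(20000):
    drafted = rng.choices(range(3), weights=draft, k=1)[0]
    final, _ = verify_draft_token(drafted, target, draft, rng)
    counts[final] += 1
print([round(c / 20000, 2) for c in counts])
```

The empirical frequencies should land close to the target distribution rather than the draft distribution, which is the sense in which the output distribution is preserved.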
Decoding strategies are the control panel for LLM generation behavior. Understanding and tuning these parameters lets developers match model outputs to task requirements, from deterministic factual responses to creative open-ended generation.