Decoding Strategies

Decoding strategies are algorithms that determine how LLMs select the next token during text generation. They range from greedy selection of the most probable token to sampling-based methods like top-k and top-p that introduce controlled randomness, and together they control the creativity, diversity, and quality of generated text.

What Are Decoding Strategies?

Why Decoding Strategy Matters

Decoding Methods

Greedy Decoding:

At each step, select: argmax(P(token|context))

Pros: Fast, deterministic, reproducible
Cons: Repetitive, can miss sequences with higher overall probability, tends toward bland output

Use: Testing, or whenever deterministic outputs are needed
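The argmax rule above can be sketched in a few lines. This is only an illustration: `TOY_MODEL`, `toy_next_token_probs`, and the `<eos>` marker are hypothetical stand-ins for a real model, which would return a probability for every token in its vocabulary.

```python
def greedy_decode(next_token_probs, context, max_steps, eos="<eos>"):
    """Repeatedly append the single most probable next token."""
    tokens = list(context)
    for _ in range(max_steps):
        probs = next_token_probs(tokens)  # dict: token -> probability
        best = max(probs, key=probs.get)  # argmax over P(token | context)
        if best == eos:
            break
        tokens.append(best)
    return tokens


# Hypothetical toy model: next-token probabilities depend only on the last token.
TOY_MODEL = {
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"<eos>": 0.9, "down": 0.1},
}


def toy_next_token_probs(tokens):
    # Unknown tokens end the sequence immediately.
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})


result = greedy_decode(toy_next_token_probs, ["The"], max_steps=10)
print(result)  # ['The', 'cat', 'sat']
```

Because nothing is random, the same context always produces the same output, which is what makes greedy decoding reproducible.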

Beam Search:

Maintain the k best candidate sequences (k = beam_width), expand each by one token, and keep the best k by cumulative probability

beam_width = 4:
Step 1: ["The", "A", "In", "It"]
Step 2: ["The cat", "The dog", "A cat", "A dog"]
...continue expanding and pruning...

Pros: Usually finds higher-probability sequences than greedy
Cons: Still deterministic, cost grows with beam width and sequence length

Use: Translation, summarization (shorter outputs)
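A minimal sketch of the expand-and-prune loop, assuming sequences are scored by summed log-probabilities and no length normalization is applied. The toy model below is hypothetical and is chosen so that the greedy first choice ("A") leads to a lower-probability sequence than the beam result.

```python
import math


def beam_search(next_token_probs, start, beam_width, steps):
    """Keep the beam_width highest-scoring sequences at every step."""
    beams = [(0.0, list(start))]  # (cumulative log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            # Expand every surviving sequence by every possible next token.
            for token, p in next_token_probs(tokens).items():
                candidates.append((score + math.log(p), tokens + [token]))
        # Prune: keep only the best beam_width candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical toy model keyed on the last token.
BEAM_TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
}


def beam_toy_probs(tokens):
    return BEAM_TOY_MODEL[tokens[-1]]


beams = beam_search(beam_toy_probs, ["<s>"], beam_width=2, steps=2)
best_score, best_tokens = beams[0]
print(best_tokens, round(math.exp(best_score), 2))  # ['<s>', 'B', 'z'] 0.36
```

Greedy decoding would pick "A" first (0.6) and end at a sequence probability of 0.3, while a beam of width 2 keeps "B" alive long enough to find the 0.36 sequence.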

Temperature Sampling:

Scale logits before softmax: softmax(logits / temperature)

Temperature = 1.0: Original distribution
Temperature < 1.0: Sharper (more deterministic)
Temperature > 1.0: Flatter (more random)
Temperature → 0: Approaches greedy
Temperature → ∞: Approaches uniform random

Use: Primary creativity control knob
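The scaling rule can be checked numerically with a small sketch. The logits here are made-up values for a three-token vocabulary, and the function name is illustrative rather than taken from any library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up example values
sharp = softmax_with_temperature(logits, 0.5)
original = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

for name, dist in (("T=0.5", sharp), ("T=1.0", original), ("T=2.0", flat)):
    print(name, [round(p, 3) for p in dist])
```

The probability of the top token is largest at T=0.5 and smallest at T=2.0, matching the sharper/flatter behavior listed above.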

Top-K Sampling:

Only sample from top k highest probability tokens

Top-k = 50:
Original: [0.3, 0.2, 0.15, 0.1, 0.05, 0.05, ...]
Filtered: [0.3, 0.2, 0.15, 0.1, 0.05, ...] (top 50 only)
Renormalize and sample

Pros: Prevents sampling rare/nonsensical tokens
Cons: Fixed k may be too restrictive or permissive

Use: Good default with k=40-100
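The filter-renormalize-sample steps might look like the following sketch. It uses k=3 on a nine-token distribution so the effect is visible; the probabilities and the function name are illustrative assumptions.

```python
import random


def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


probs = [0.3, 0.2, 0.15, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05]
filtered = top_k_filter(probs, k=3)
print([round(p, 3) for p in filtered])

# Sample a token index from the renormalized distribution.
rng = random.Random(0)  # seeded only to make the sketch repeatable
sampled = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(sampled)
```

Only the three most probable indices can ever be sampled, regardless of how much probability mass the tail originally held.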

Top-P (Nucleus) Sampling:

Sample from smallest set of tokens with cumulative probability ≥ p

Top-p = 0.9:
Sorted: [0.4, 0.3, 0.15, 0.1, 0.03, 0.02, ...]
Cumsum: [0.4, 0.7, 0.85, 0.95] ← stop here (0.95 ≥ 0.9)
Sample from first 4 tokens only

Pros: Adapts to distribution shape
Cons: Can be very narrow for confident predictions

Use: Modern default, typically p=0.9-0.95
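The worked example above can be reproduced with a short sketch. The function name is illustrative, and the last two probabilities are the ones shown in the sorted list; the elided tail is simply left out here.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the nucleus is complete
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / cumulative  # renormalize over the nucleus
    return filtered


probs = [0.4, 0.3, 0.15, 0.1, 0.03, 0.02]
nucleus = top_p_filter(probs, p=0.9)
kept = [i for i, q in enumerate(nucleus) if q > 0]
print(kept)  # [0, 1, 2, 3]
```

With a peaked distribution such as `[0.95, 0.03, 0.02]` the same call keeps a single token, which is the narrowing effect noted under Cons.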

Combined Strategies

Modern LLM APIs typically combine:
1. Temperature scaling (creativity)
2. Top-p filtering (quality floor)
3. Top-k filtering (additional safety)
4. Repetition penalty (prevent loops)

Example:
temperature=0.7, top_p=0.9, top_k=50
→ Moderately creative, high quality outputs
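One way the first three steps could be chained is sketched below. The order used here (temperature, then top-k, then top-p over the renormalized survivors) is an assumption, since libraries and APIs differ, and the repetition penalty is omitted because its formula is not specified here. `sample_next_token` and the example logits are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Return a sampled token index after temperature, top-k, and top-p."""
    # 1. Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-k: keep the k most probable indices (assumed to run before top-p).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total_k = sum(probs[i] for i in order)

    # 3. Top-p: smallest prefix of the survivors with cumulative mass >= top_p.
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i] / total_k
        if cumulative >= top_p:
            break

    # Sample from the survivors; choices() normalizes the weights itself.
    return rng.choices(keep, weights=[probs[i] for i in keep], k=1)[0]


example_logits = [5.0, 4.0, 1.0, 0.0, -2.0]  # made-up values
rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [sample_next_token(example_logits, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

For these logits the two most probable tokens already hold more than 90% of the mass at temperature 0.7, so only indices 0 and 1 are ever sampled.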

Strategy Selection by Task

Task               | Strategy           | Settings
-------------------|--------------------|-----------------------
Factual QA         | Low temp or greedy | temp=0, or temp=0.1
Code generation    | Low temperature    | temp=0.2, top_p=0.95
Creative writing   | High temperature   | temp=0.9, top_p=0.95
Chat/dialogue      | Medium temperature | temp=0.7, top_p=0.9
Summarization      | Beam search        | beam=4, or temp=0.3
Brainstorming      | High temp, high p  | temp=1.0, top_p=0.95

Advanced Techniques

Repetition Penalty:

Contrastive Search:

Speculative Decoding:

Decoding strategies are the control panel for LLM generation behavior. Understanding and tuning these parameters lets developers match model outputs to task requirements, from deterministic factual responses to creative open-ended generation.
