Greedy decoding is the simplest text generation strategy: at each step it selects the highest-probability token, the argmax of the model's output distribution. It is fast and deterministic, but because every choice is only locally optimal, it can produce repetitive or suboptimal text.
What Is Greedy Decoding?
- Definition: Select highest probability token at each step.
- Formula: y_t = argmax_y P(y | y_{<t}), where y_{<t} is the prompt plus all tokens generated before step t.
- Properties: Deterministic, fast, simple.
- Limitation: May miss globally better sequences.
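To make the limitation concrete, here is a minimal sketch using a made-up two-step distribution. The tokens and probabilities are invented for illustration and do not come from any real model. Greedy decoding commits to the most likely first token and ends up with a less probable sequence than an exhaustive search finds.

```python
from itertools import product

# Hypothetical toy model: P(first token) and P(second token | first token).
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.5, "y": 0.5},
    "B": {"x": 0.9, "y": 0.1},
}

def greedy_sequence():
    """Pick the argmax token at each step; return (tokens, joint probability)."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), FIRST[first] * SECOND[first][second]

def best_sequence():
    """Score every two-token sequence and return the most probable one."""
    candidates = {
        (f, s): FIRST[f] * SECOND[f][s]
        for f, s in product(FIRST, ("x", "y"))
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

greedy_tokens, greedy_prob = greedy_sequence()
best_tokens, best_prob = best_sequence()
print(greedy_tokens, round(greedy_prob, 2))  # ('A', 'x') 0.3
print(best_tokens, round(best_prob, 2))      # ('B', 'x') 0.36
```

Exhaustive search is only feasible here because the toy vocabulary is tiny; with a real vocabulary the number of sequences grows exponentially with length, which is why approximate methods such as beam search exist.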
Why Greedy Decoding
- Speed: Single forward pass per token.
- Simplicity: No hyperparameters to tune.
- Determinism: Same input gives the same output (barring floating-point nondeterminism on some hardware).
- Baseline: Reference point for other methods.
Algorithm
Step-by-Step:
Input: "The cat"
Step 1:
P(sat) = 0.3, P(jumped) = 0.2, P(ran) = 0.15...
Select: "sat" (highest)
Step 2: "The cat sat"
P(on) = 0.4, P(down) = 0.2, P(quietly) = 0.1...
Select: "on" (highest)
Step 3: "The cat sat on"
P(the) = 0.6, P(a) = 0.2...
Select: "the" (highest)
Continue until <eos> or max_length
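The walkthrough above can be reproduced with a small lookup table standing in for the model. This is only a sketch: the table `NEXT_TOKEN_PROBS` and the function `greedy_walkthrough` are hypothetical names, the probabilities are copied from the steps above, and the final `<eos>` entry is an added assumption so that the loop terminates.

```python
# Hypothetical next-token table keyed by the text generated so far.
NEXT_TOKEN_PROBS = {
    "The cat": {"sat": 0.3, "jumped": 0.2, "ran": 0.15},
    "The cat sat": {"on": 0.4, "down": 0.2, "quietly": 0.1},
    "The cat sat on": {"the": 0.6, "a": 0.2},
    "The cat sat on the": {"<eos>": 1.0},  # assumed, to end the example
}

def greedy_walkthrough(prompt, max_length=10):
    """Repeatedly append the highest-probability next token."""
    text = prompt
    for _ in range(max_length):
        probs = NEXT_TOKEN_PROBS.get(text)
        if probs is None:  # context not covered by the toy table
            break
        token = max(probs, key=probs.get)  # greedy choice
        if token == "<eos>":
            break
        text = f"{text} {token}"
    return text

result = greedy_walkthrough("The cat")
print(result)  # The cat sat on the
```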
Implementation
Basic Greedy:
import torch

def greedy_decode(model, input_ids, eos_token_id, max_length=50):
    """Greedily extend input_ids (shape [1, seq_len]) by up to max_length tokens."""
    generated = input_ids.clone()
    for _ in range(max_length):
        with torch.no_grad():
            outputs = model(generated)
        logits = outputs.logits[0, -1]  # Logits for the next token
        # Greedy: take argmax (softmax is monotonic, so the argmax of the
        # logits is also the argmax of the probabilities)
        next_token = logits.argmax(dim=-1)
        # Stop if EOS
        if next_token.item() == eos_token_id:
            break
        # Append token as a [1, 1] tensor
        generated = torch.cat([generated, next_token.view(1, 1)], dim=-1)
    return generated
Hugging Face:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")
# Greedy decoding (default when num_beams=1, no sampling)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,  # No sampling = greedy
)
print(tokenizer.decode(outputs[0]))
Greedy Decoding Problems
Common Issues:
Problem | Example
--------------------|----------------------------------
Repetition | "I like dogs. I like dogs. I like..."
Generic text | "It is important to note that..."
Missed alternatives | Ignores better sequences that start with a lower-probability token
Lack of creativity | Same response patterns
Why Repetition Occurs:
If "word X" has high probability given context,
and generating "word X" creates similar context,
then "word X" becomes high probability again.
Loop: context → high P(X) → generate X → similar context → ...
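The loop described above can be seen in a toy setting. The sketch below assumes a hypothetical bigram model (`BIGRAM`), where the next-token distribution depends only on the last token; all tokens and probabilities are invented. Because the greedy choice at each context is fixed, the output falls into a cycle as soon as a context repeats.

```python
# Hypothetical bigram model: next-token probabilities given the last token.
BIGRAM = {
    "I": {"like": 0.7, "am": 0.3},
    "like": {"dogs.": 0.6, "cats.": 0.4},
    "dogs.": {"I": 0.8, "They": 0.2},
    "cats.": {"I": 0.5, "They": 0.5},
    "am": {"happy.": 1.0},
    "happy.": {"I": 1.0},
    "They": {"like": 1.0},
}

def greedy_generate(start, steps):
    """Generate `steps` tokens, always taking the most probable successor."""
    tokens = [start]
    for _ in range(steps):
        probs = BIGRAM[tokens[-1]]
        tokens.append(max(probs, key=probs.get))
    return tokens

tokens = greedy_generate("I", 8)
print(" ".join(tokens))  # I like dogs. I like dogs. I like dogs.
```

A real language model conditions on the whole context rather than one token, so its loops are less mechanical, but the same feedback effect applies once the recent context starts to look like itself.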
Mitigations
Repetition Penalty:
outputs = model.generate(
    **inputs,
    do_sample=False,
    repetition_penalty=1.2,  # Reduce prob of seen tokens
    no_repeat_ngram_size=3,  # Block 3-gram repeats
)
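For intuition about what a repetition penalty does to the scores, here is a rough sketch. It assumes the rule commonly used for this setting (positive logits of already-generated tokens are divided by the penalty, negative ones are multiplied by it); `apply_repetition_penalty` and `argmax` are hypothetical helpers written for this example, not library functions, and the logits are made up.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, 1.9, -1.0]  # made-up scores for a 3-token vocabulary
seen = [0, 2]              # token ids 0 and 2 were already generated
penalized = apply_repetition_penalty(logits, seen)

print(argmax(logits), argmax(penalized))  # 0 1
```

With the penalty applied, the previously seen token 0 drops below token 1, so the greedy choice changes while decoding stays deterministic.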
Temperature (Makes It Sampling):
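# Temperature rescales the logits but never changes which token is the argmax,
# so it has no effect on greedy decoding. It only matters once do_sample=True
# (optionally combined with top_k for more control over diversity).
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,  # Now it's sampling, not greedy
)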
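A quick way to check that temperature leaves the argmax unchanged is to apply a temperature-scaled softmax to some made-up logits. The helpers `softmax` and `argmax` below are written for this sketch and the logits are invented; the point is only that dividing by a positive temperature preserves the ordering of the scores.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, 1.0, 0.5]  # made-up scores
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    # The winning index is the same for every temperature; only the
    # sharpness of the distribution changes.
    print(t, argmax(probs), [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token and higher ones flatten the distribution, which changes what sampling produces but not what greedy decoding selects.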
Comparison with Other Methods
Method | Deterministic | Diverse | Quality
----------------|---------------|---------|--------
Greedy | Yes | No | Medium
Beam search | Yes | Low | High
Top-k sampling | No | High | Variable
Top-p sampling | No | High | Variable
When to Use Greedy
✅ Good For:
- Factual QA (single correct answer)
- Translation (workable, though beam search usually does better)
- Code completion
- Fast inference
- Debugging/testing
❌ Avoid For:
- Creative writing
- Conversational AI
- Long-form generation
- When diversity matters
Greedy decoding is the simplest decoding baseline, but it is often insufficient. It is fast and deterministic, yet its tendency toward repetition and local optima makes it a poor fit for most creative or conversational applications, where beam search or sampling usually produces better results.