Greedy Decoding

Greedy decoding is the simplest text generation strategy: at each step it selects the highest-probability token, i.e. the argmax of the output distribution. It is fast and deterministic, but because each choice is only locally optimal, it can produce repetitive or suboptimal text.

What Is Greedy Decoding?

Why Greedy Decoding

Algorithm

Step-by-Step:

Input: "The cat"

Step 1:
  P(sat) = 0.3, P(jumped) = 0.2, P(ran) = 0.15...
  Select: "sat" (highest)

Step 2: "The cat sat"
  P(on) = 0.4, P(down) = 0.2, P(quietly) = 0.1...
  Select: "on" (highest)

Step 3: "The cat sat on"
  P(the) = 0.6, P(a) = 0.2...
  Select: "the" (highest)

Continue until <eos> or max_length
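
The walkthrough above can be sketched in plain Python. This is only an illustration: `TOY_MODEL` and `greedy_walkthrough` are hypothetical names, and the lookup table stands in for a real language model, holding just the (truncated) probabilities listed in the steps above.

```python
# Hypothetical stand-in for a language model: maps a context string to the
# next-token probabilities from the walkthrough (truncated, as in the text).
TOY_MODEL = {
    "The cat": {"sat": 0.3, "jumped": 0.2, "ran": 0.15},
    "The cat sat": {"on": 0.4, "down": 0.2, "quietly": 0.1},
    "The cat sat on": {"the": 0.6, "a": 0.2},
}


def greedy_walkthrough(context, max_steps=10):
    """Repeatedly append the highest-probability token to the context."""
    for _ in range(max_steps):
        probs = TOY_MODEL.get(context)
        if probs is None:  # the toy table has no entry for this context
            break
        next_token = max(probs, key=probs.get)  # argmax over the distribution
        context = f"{context} {next_token}"
    return context


print(greedy_walkthrough("The cat"))  # The cat sat on the
```

A real model would compute the distribution from the full context instead of looking it up, and would stop on `<eos>` or a length limit instead of a missing table entry.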

Implementation

Basic Greedy:

import torch

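def greedy_decode(model, input_ids, eos_token_id, max_length=50):
    # input_ids: LongTensor of shape (1, seq_len); batch size 1 is assumed.
    # max_length here counts newly generated tokens, not the total length.
    generated = input_ids.clone()

    for _ in range(max_length):
        with torch.no_grad():
            outputs = model(generated)
            logits = outputs.logits[0, -1]  # Logits for the next token

        # Greedy: take argmax. Softmax is monotonic, so the argmax of the
        # logits is also the argmax of the probabilities.
        next_token = logits.argmax(dim=-1)

        # Stop if EOS
        if next_token.item() == eos_token_id:
            break

        # Append token as a (1, 1) tensor along the sequence dimension
        generated = torch.cat([generated, next_token.view(1, 1)], dim=-1)

    return generated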

Hugging Face:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")

# Greedy decoding (default when num_beams=1, no sampling)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,  # No sampling = greedy
)

print(tokenizer.decode(outputs[0]))

Greedy Decoding Problems

Common Issues:

Problem             | Example
--------------------|----------------------------------
Repetition          | "I like dogs. I like dogs. I like..."
Generic text        | "It is important to note that..."
Missed alternatives | Misses better sequences that start with a lower-probability token
Lack of creativity  | Same response patterns
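
The "missed alternatives" row can be made concrete with a toy two-step model. Everything here is an assumption for illustration: the tables `FIRST` and `SECOND`, the function names, and all probabilities are invented. The sketch compares the sequence greedy decoding commits to with the most probable sequence found by exhaustive search.

```python
# Hypothetical two-step model. FIRST is P(first token); SECOND is
# P(second token | first token). All numbers are invented for illustration.
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}


def greedy_sequence():
    """Pick the best token at each step, never revisiting earlier choices."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), FIRST[first] * SECOND[first][second]


def best_sequence():
    """Score every two-token sequence and return the most probable one."""
    candidates = {
        (f, s): FIRST[f] * SECOND[f][s] for f in FIRST for s in SECOND[f]
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


print(greedy_sequence())  # greedy commits to "A", then takes "x"
print(best_sequence())    # the most probable sequence starts with "B"
```

Greedy picks "A" because 0.6 > 0.4, but the best continuation of "B" is strong enough that ("B", "x") has the higher joint probability (about 0.36 versus about 0.33).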

Why Repetition Occurs:

If "word X" has high probability given context,
and generating "word X" creates similar context,
then "word X" becomes high probability again.

Loop: context → high P(X) → generate X → similar context → ...
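
A small sketch of this feedback loop, assuming a hypothetical bigram model in which the next-token distribution depends only on the previous token. The `BIGRAM` table and its probabilities are invented; real models condition on the whole context, but the loop arises the same way.

```python
# Hypothetical bigram model: next-token probabilities depend only on the
# previous token. Numbers are invented to show the feedback loop.
BIGRAM = {
    "I": {"like": 0.7, "am": 0.3},
    "am": {"<eos>": 1.0},
    "like": {"dogs.": 0.6, "cats.": 0.4},
    "dogs.": {"I": 0.8, "<eos>": 0.2},
    "cats.": {"<eos>": 0.9, "I": 0.1},
}


def greedy_generate(start, max_new_tokens=8):
    """Greedy decoding over the toy bigram table."""
    tokens = [start]
    for _ in range(max_new_tokens):
        probs = BIGRAM[tokens[-1]]
        next_token = max(probs, key=probs.get)  # always the top choice
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return " ".join(tokens)


print(greedy_generate("I"))  # I like dogs. I like dogs. I like dogs.
```

Because "dogs." makes "I" the most likely next token, and "I" leads back to "like dogs.", the argmax path never reaches `<eos>` and only the length limit ends generation.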

Mitigations

Repetition Penalty:

outputs = model.generate(
    **inputs,
    do_sample=False,
    repetition_penalty=1.2,  # Reduce prob of seen tokens
    no_repeat_ngram_size=3,  # Block 3-gram repeats
)
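
To show how a repetition penalty can change a greedy choice, here is a minimal sketch that assumes the commonly described rule: logits of already generated tokens are divided by the penalty when positive and multiplied by it when negative. `apply_repetition_penalty` is a hypothetical helper written for this example, not the library's own code, and the logits are invented.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Return a copy of logits with previously generated tokens down-weighted."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # With penalty > 1, dividing a positive score or multiplying a
        # negative one both make the token less likely.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


logits = [2.4, 2.2, -1.0]  # invented scores for a 3-token vocabulary
adjusted = apply_repetition_penalty(logits, seen_token_ids=[0, 2])

print(argmax(logits))    # 0: token 0 wins without the penalty
print(argmax(adjusted))  # 1: token 0 was already generated, so token 1 wins
```

Decoding stays deterministic: the penalty only reshapes the scores before the argmax is taken.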

Temperature (Makes It Sampling):

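# Temperature rescales the logits but never changes which token is the
# argmax, so it has no effect on greedy decoding. It only matters once
# do_sample=True, where it can be combined with top-k or top-p for diversity.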
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,  # Now it's sampling, not greedy
)
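
The following sketch checks why temperature alone cannot change a greedy result: dividing logits by a positive temperature reshapes the distribution but preserves its ordering. The `softmax` and `argmax` helpers and the example logits are assumptions written for this illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature must be positive."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 1.0, 0.5]  # invented scores
for temperature in (0.5, 0.7, 1.0, 2.0):
    probs = softmax(logits, temperature)
    # The winning index is the same at every temperature; only the
    # sharpness of the distribution changes.
    print(temperature, argmax(probs), [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token and higher ones flatten the distribution, which affects sampling but leaves the argmax where it was.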

Comparison with Other Methods

Method          | Deterministic | Diverse | Quality
----------------|---------------|---------|--------
Greedy          | Yes           | No      | Medium
Beam search     | Yes           | Low     | High
Top-k sampling  | No            | High    | Variable
Top-p sampling  | No            | High    | Variable

When to Use Greedy

✅ Good For:
- Factual QA (single correct answer)
- Translation (workable baseline, though beam search usually does better)
- Code completion
- Fast inference
- Debugging/testing

❌ Avoid For:
- Creative writing
- Conversational AI
- Long-form generation
- When diversity matters

Greedy decoding is the simplest baseline, but it is often insufficient. It is fast and deterministic, yet its tendency toward repetition and local optima makes it a poor fit for most creative or conversational applications, where beam search or sampling usually produces better results.
