Repetition penalty is a decoding modification that reduces the probability of tokens that have already appeared in generated text — preventing the common failure mode where language models get stuck in loops, repeating the same phrases or patterns indefinitely.
What Is Repetition Penalty?
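- Definition: Rescaling of the logits of previously seen tokens so that their probabilities drop.
- Formula: logit_new = logit / penalty if logit > 0, otherwise logit * penalty (applied only to seen tokens).
- Parameters: penalty (1.0 = off, >1.0 = penalize, <1.0 = encourage).
- Scope: Depending on the implementation, applies to every token in the context (prompt plus generated text) or only to generated tokens.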
Why Repetition Occurs
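- Self-Reinforcing: Generated text becomes context, so a phrase that has already repeated tends to make the next repetition more likely.
- High Probability: Common phrases receive high probability, and likelihood-maximizing decoding keeps selecting them.
- Local Optima: Greedy decoding always takes the top token, so it has no way out once a loop forms.
- Training Data: Models reproduce patterns learned from repetitive training text.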
Example Problem:
Without penalty:
"I love AI. I love AI. I love AI. I love AI..."
With penalty:
"I love AI. It enables incredible applications,
from healthcare to creative writing..."
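The contrast above can be reproduced on a toy model. The sketch below is purely illustrative: the bigram table `NEXT_LOGITS`, the helper `greedy_decode`, and every logit value are invented for this example and do not come from any real model. It decodes greedily and optionally applies the penalty to tokens already in the context.

```python
# Hypothetical bigram "model": previous token -> logits for the next token.
NEXT_LOGITS = {
    "I": {"love": 3.0, "build": 1.0},
    "love": {"AI": 3.0, "code": 2.5},
    "AI": {".": 3.0, "tools": 1.0},
    ".": {"I": 3.0, "It": 2.8},
    "It": {"helps": 3.0},
    "helps": {".": 3.0},
    "build": {"AI": 3.0},
    "code": {".": 3.0},
    "tools": {".": 3.0},
}

def greedy_decode(start, steps, penalty=1.0):
    """Greedy decoding with a sign-aware repetition penalty on seen tokens."""
    tokens = [start]
    for _ in range(steps):
        logits = dict(NEXT_LOGITS[tokens[-1]])  # copy so the table is untouched
        for tok, value in logits.items():
            if tok in tokens:
                logits[tok] = value / penalty if value > 0 else value * penalty
        tokens.append(max(logits, key=logits.get))  # argmax = greedy choice
    return tokens

print(" ".join(greedy_decode("I", 8)))               # no penalty
print(" ".join(greedy_decode("I", 5, penalty=1.3)))  # with penalty
```

With a vocabulary this small the penalized run also revisits tokens eventually; the point is only that the penalty lets a close runner-up ("It") overtake the token that would restart the loop.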
How It Works
Algorithm:
For each next token prediction:
1. Get logits from model
2. For each token that appeared in context:
   - If logit > 0: logit = logit / penalty
   - If logit < 0: logit = logit * penalty (dividing a negative logit by a penalty above 1 would raise it instead)
3. Apply softmax
4. Sample or argmax
Implementation:
import torch

def apply_repetition_penalty(
    logits: torch.Tensor,
    input_ids: torch.Tensor,
    penalty: float = 1.2,
) -> torch.Tensor:
    """Apply repetition penalty to 1-D logits of shape (vocab_size,)."""
    # Get unique tokens that have appeared
    seen = input_ids.unique()
    scores = logits[seen]
    # Shrink positive logits and push negative logits further down
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    # Write into a copy so the caller's tensor is not modified in place
    penalized = logits.clone()
    penalized[seen] = scores
    return penalized
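A quick numeric check shows why the sign matters. This is a minimal sketch in plain Python (no PyTorch needed) on a made-up four-token vocabulary; `softmax`, `penalize`, and the logit values are assumptions for illustration only. It compares the sign-aware rule with a naive "always divide" rule.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalize(logits, seen_ids, penalty):
    """Return a copy of `logits` with the sign-aware penalty applied to `seen_ids`."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [2.0, -1.0, 0.5, -0.5]  # toy 4-token vocabulary
seen = [0, 1]                    # token 0 has a positive logit, token 1 a negative one

before = softmax(logits)
after = softmax(penalize(logits, seen, penalty=1.5))

# A naive rule that always divides moves the negative logit toward zero,
# which makes the seen token more attractive rather than less.
naive = [x / 1.5 if i in seen else x for i, x in enumerate(logits)]

for i in seen:
    print(f"token {i}: p before={before[i]:.4f}, p after={after[i]:.4f}, naive logit={naive[i]:.3f}")
```

Both seen tokens lose probability under the sign-aware rule, while the naive rule raises the logit of token 1.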
Hugging Face Usage:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    repetition_penalty=1.2,  # >1.0 penalizes repetition
    do_sample=True,
    top_p=0.92,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Related Techniques
No-Repeat N-gram:
outputs = model.generate(
    **inputs,
    no_repeat_ngram_size=3,  # Block any 3-gram from repeating
)
# Effect: "the big red" can only appear once
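The blocking rule can be sketched in a few lines. The helper below is a simplified illustration of the idea, not the library's actual implementation; the name `banned_next_tokens` is hypothetical. It returns every token that would complete an n-gram already present in the sequence, and a decoder would then set those tokens' logits to negative infinity before sampling.

```python
def banned_next_tokens(tokens, n):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()  # no complete n-gram exists yet
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])  # this continuation was already used
    return banned

history = "the big red dog saw the big".split()
print(banned_next_tokens(history, 3))
```

Here the sequence ends in "the big", which was earlier followed by "red", so "red" is the only banned continuation.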
Frequency/Presence Penalty (OpenAI-style):
# OpenAI API
import openai

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[...],
    frequency_penalty=0.5,  # Based on count
    presence_penalty=0.5,   # Binary: appeared or not
)
# frequency_penalty: Stronger for more frequent tokens
# presence_penalty: Same penalty regardless of count
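OpenAI's API documentation describes both penalties as a subtraction from a token's logit: the token's count times frequency_penalty, plus presence_penalty if the token has appeared at all. The sketch below applies that formula locally on a toy logit list; the function name `apply_additive_penalties` is hypothetical, and the hosted API performs this step server-side, so details may differ.

```python
from collections import Counter

def apply_additive_penalties(logits, generated_ids,
                             frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract count-based and presence-based penalties from seen tokens' logits."""
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        # Every token in `counts` has appeared at least once, so the
        # presence term applies in full; the frequency term scales with count.
        out[token_id] -= count * frequency_penalty + presence_penalty
    return out

logits = [1.0, 1.0, 1.0]
generated = [0, 0, 0, 1]  # token 0 appeared three times, token 1 once
print(apply_additive_penalties(logits, generated,
                               frequency_penalty=0.5, presence_penalty=0.5))
```

Unlike the multiplicative repetition penalty, the adjustment here does not depend on the sign or size of the original logit.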
Comparison:
Technique          | Mechanism
-------------------|----------------------------------
repetition_penalty | Multiplicative on seen tokens
frequency_penalty  | Additive based on count
presence_penalty   | Additive if seen at all
no_repeat_ngram    | Hard block on n-gram sequences
Parameter Tuning
Guidelines:
Value     | Effect
----------|----------------------------------
1.0       | No penalty (default/off)
1.1-1.2   | Light penalty (most uses)
1.2-1.5   | Moderate penalty
1.5-2.0   | Strong penalty
>2.0      | Very strong (may hurt quality)
By Use Case:
Use Case             | repetition_penalty
---------------------|--------------------
Conversational       | 1.1-1.2
Creative writing     | 1.0-1.15
Technical writing    | 1.15-1.3
Summarization        | 1.1-1.2
Code generation      | 1.0-1.1 (code repeats naturally)
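To get a feel for these ranges, it helps to see how the penalty value changes the probability of a token the model strongly wants to repeat. The sweep below is a minimal sketch on invented numbers: the logit 4.0 for the repeated token, the competing logits in `others`, and the helper `prob_of_repeat` are all assumptions for illustration, and real models will behave differently.

```python
import math

def prob_of_repeat(repeat_logit, other_logits, penalty):
    """Softmax probability of a seen token after the sign-aware penalty."""
    scaled = repeat_logit / penalty if repeat_logit > 0 else repeat_logit * penalty
    exps = [math.exp(scaled)] + [math.exp(x) for x in other_logits]
    return exps[0] / sum(exps)

others = [2.0, 1.0, 0.0]  # logits of three unseen competitor tokens
for penalty in (1.0, 1.1, 1.2, 1.5, 2.0):
    p = prob_of_repeat(4.0, others, penalty)
    print(f"penalty={penalty:<4} P(repeat)={p:.3f}")
```

Because the penalty rescales the logit rather than subtracting a constant, the size of the drop depends on how large the logit was to begin with, which is one reason the values above are starting points rather than fixed rules.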
Potential Issues
Issue                | Mitigation
---------------------|----------------------------------
Over-penalizing      | Use lower penalty value
Hurts coherence      | Limit to generated tokens only
Blocks needed words  | Use frequency_penalty instead
Affects stop words   | Exclude common tokens from penalty
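Two of these mitigations, limiting the penalty to generated tokens and exempting common tokens, can be combined in one helper. This is a sketch under stated assumptions: the function `scoped_repetition_penalty` and its parameters `prompt_len` and `exempt_ids` are hypothetical names, not part of any library, and the token ids are toy values.

```python
def scoped_repetition_penalty(logits, context_ids, prompt_len,
                              penalty=1.2, exempt_ids=frozenset()):
    """Penalize only tokens generated after the prompt, skipping exempt ids."""
    generated = set(context_ids[prompt_len:])  # ignore the prompt portion
    out = list(logits)
    for token_id in generated - set(exempt_ids):
        value = out[token_id]
        out[token_id] = value / penalty if value > 0 else value * penalty
    return out

logits = [2.0, 2.0, 2.0, -1.0]
context = [0, 1, 2, 3, 2]  # first two ids are the prompt, the rest were generated
print(scoped_repetition_penalty(logits, context, prompt_len=2,
                                penalty=2.0, exempt_ids={2}))
```

In this example tokens 0 and 1 are left alone because they occur only in the prompt, token 2 is left alone because it is exempt (standing in for a stop word), and only token 3 is penalized.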
Repetition penalty is a standard option in production generation pipelines. Without some form of repetition control, greedy and beam search decoding in particular tend to degenerate into loops, and sampling-based decoding can still drift into repetition on long outputs.