Top-p sampling (nucleus sampling) is a dynamic decoding strategy that samples from the smallest set of tokens whose cumulative probability reaches a threshold p. Because the candidate pool adapts to the model's confidence, top-p includes more options when the model is uncertain and fewer when it is confident, which tends to produce diverse yet coherent text.
What Is Top-p Sampling?
- Definition: Sample from smallest token set with cumulative prob ≥ p.
- Mechanism: Sort by probability, include tokens until sum reaches p.
- Parameter: p (nucleus) typically 0.9-0.95.
- Property: Dynamic vocabulary size based on distribution shape.
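The definition above can be written compactly. The notation below is an assumption introduced for illustration: P(x) is the model's next-token distribution, x_(1), x_(2), ... are the tokens sorted by decreasing probability, and V^(p) names the nucleus.

```latex
% Illustrative notation only; these symbols do not appear in the text above.
k^{*} = \min \Bigl\{ k : \sum_{i=1}^{k} P\bigl(x_{(i)}\bigr) \ge p \Bigr\},
\qquad
V^{(p)} = \bigl\{ x_{(1)}, \dots, x_{(k^{*})} \bigr\}

% Sampling distribution after renormalizing over the nucleus
P'(x) =
\begin{cases}
  \dfrac{P(x)}{\sum_{x' \in V^{(p)}} P(x')} & \text{if } x \in V^{(p)} \\
  0 & \text{otherwise}
\end{cases}
```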
Why Top-p Works
- Adaptive: Adjusts candidate pool to model confidence.
- Diverse: Allows multiple reasonable continuations.
- Coherent: Excludes low-probability nonsense tokens.
- Often better than top-k: Handles varying distribution shapes.
Algorithm
Step-by-Step:
p = 0.9
Token probabilities (sorted):
"sat": 0.35
"jumped": 0.25
"ran": 0.20
"walked": 0.10
"flew": 0.05
"danced": 0.03
"swam": 0.02
Cumulative:
"sat": 0.35 (< 0.9, include)
"jumped": 0.60 (< 0.9, include)
"ran": 0.80 (< 0.9, include)
"walked": 0.90 (= 0.9, include)
"flew": 0.95 (> 0.9, stop)
Nucleus = {sat, jumped, ran, walked}
Sample from these 4 tokens (renormalized)
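The worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the helper name `nucleus`, the `eps` tolerance (added so floating-point rounding does not push the cutoff past "walked"), and the fixed random seed are all assumptions made for illustration.

```python
import random


def nucleus(probs, p, eps=1e-9):
    """Return the renormalized nucleus as a dict of token -> probability."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        # eps absorbs floating-point error when the running sum lands on p.
        if cumulative >= p - eps:
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


probs = {"sat": 0.35, "jumped": 0.25, "ran": 0.20, "walked": 0.10,
         "flew": 0.05, "danced": 0.03, "swam": 0.02}

nucleus_probs = nucleus(probs, p=0.9)
print(list(nucleus_probs))  # ['sat', 'jumped', 'ran', 'walked']

# Draw one token from the renormalized nucleus (seeded only for repeatability).
rng = random.Random(0)
token = rng.choices(list(nucleus_probs), weights=list(nucleus_probs.values()), k=1)[0]
print(token)
```

After renormalization each kept probability is divided by 0.90, so "sat" is drawn with probability of roughly 0.39 instead of 0.35.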
Visual Comparison:
Flat distribution (uncertain):
████ ███ ███ ██ ██ ██ █ █ █ █
^------------------------^
Many tokens in nucleus (diverse)
Peaked distribution (confident):
████████████ ██ █
^--------^
Few tokens in nucleus (focused)
Implementation
Basic Top-p:
import torch
import torch.nn.functional as F


def top_p_sample(logits, p=0.9, temperature=1.0):
    # Apply temperature
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)

    # Sort probabilities from highest to lowest
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Mark positions where the cumulative probability has reached p
    cutoff_mask = cumulative_probs >= p

    # Shift the mask right so the token that first reaches p stays in the nucleus
    cutoff_mask[..., 1:] = cutoff_mask[..., :-1].clone()
    cutoff_mask[..., 0] = False

    # Zero out tokens beyond the nucleus
    sorted_probs[cutoff_mask] = 0

    # Renormalize
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)

    # Sample a position in sorted order, then map it back to a vocabulary index
    sampled_index = torch.multinomial(sorted_probs, 1)
    token = sorted_indices.gather(-1, sampled_index)
    return token
Hugging Face:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The story begins", return_tensors="pt")

# Top-p sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.92,       # Nucleus threshold
    temperature=0.8,  # Optional temperature
    top_k=0,          # Disable top-k (use only top-p)
)
print(tokenizer.decode(outputs[0]))
Top-p vs. Top-k
Scenario | Top-k (k=50) | Top-p (p=0.9)
---------------------|-----------------|----------------
Flat distribution | Uses 50 tokens | Uses many tokens
Peaked distribution | Uses 50 tokens | Uses few tokens
Very confident | Still 50 tokens | Maybe 1-5 tokens
Very uncertain | Only 50 tokens | Maybe 100+ tokens
Why Top-p Is Often Better:
Top-k problems:
- k=50 too many for confident predictions
- k=50 too few for uncertain predictions
- Fixed k doesn't adapt
Top-p advantages:
- Adapts to distribution shape
- Confident = focused, uncertain = diverse
- Single intuitive parameter
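The contrast in the table can be checked on two synthetic distributions. The sketch below is illustrative only: the 1,000-token vocabulary, the uniform "flat" distribution, the "peaked" distribution with 0.95 on one token, and the helper names `top_k_stats` and `top_p_size` are all assumptions, not measurements from a real model.

```python
def top_k_stats(probs, k):
    """Return (number of tokens kept, probability mass kept) for top-k."""
    kept = sorted(probs, reverse=True)[:k]
    return len(kept), sum(kept)


def top_p_size(probs, p, eps=1e-9):
    """Return how many tokens the nucleus contains for threshold p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        # eps absorbs floating-point error in the running sum.
        if cumulative >= p - eps:
            return count
    return len(probs)


vocab_size = 1000
# Flat: every token equally likely (a very uncertain model).
flat = [1.0 / vocab_size] * vocab_size
# Peaked: one token holds 0.95, the rest share the remaining 0.05.
peaked = [0.95] + [0.05 / (vocab_size - 1)] * (vocab_size - 1)

for name, dist in [("flat", flat), ("peaked", peaked)]:
    k_count, k_mass = top_k_stats(dist, k=50)
    p_count = top_p_size(dist, p=0.9)
    print(f"{name}: top-k keeps {k_count} tokens ({k_mass:.3f} of the mass), "
          f"top-p keeps {p_count} tokens")
```

On the flat distribution, k=50 covers only about 5% of the probability mass while the nucleus grows to 900 tokens. On the peaked one, top-k still admits 49 tail tokens while the nucleus shrinks to a single token.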
Combining with Temperature
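# Common combinations (reusing model and inputs from the Hugging Face example;
# do_sample=True is required, otherwise top_p and temperature are ignored)
# Creative writing
outputs = model.generate(**inputs, do_sample=True, top_p=0.95, temperature=1.0)
# Balanced
outputs = model.generate(**inputs, do_sample=True, top_p=0.92, temperature=0.8)
# More focused
outputs = model.generate(**inputs, do_sample=True, top_p=0.85, temperature=0.7)
# Very focused (almost greedy)
outputs = model.generate(**inputs, do_sample=True, top_p=0.5, temperature=0.5)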
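Temperature and top-p interact because temperature reshapes the distribution before the nucleus is cut. The sketch below uses made-up logits and hypothetical helpers (`softmax`, `nucleus_size`) to show that, with p held fixed, a lower temperature sharpens the distribution and shrinks the nucleus.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, p, eps=1e-9):
    """Return how many tokens the nucleus contains for threshold p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p - eps:
            return count
    return len(probs)


# Made-up logits for an 8-token vocabulary (illustrative values only).
logits = [4.0, 3.0, 2.5, 2.0, 1.0, 0.5, 0.0, -1.0]
temperatures = [0.5, 0.7, 1.0, 1.5]

sizes = [nucleus_size(softmax(logits, t), p=0.9) for t in temperatures]
for t, size in zip(temperatures, sizes):
    print(f"temperature={t}: nucleus holds {size} of {len(logits)} tokens")
```

Lowering both values at once therefore compounds: temperature concentrates the mass, and a smaller p then cuts the already concentrated distribution earlier.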
Parameter Guidelines
p Value | Effect | Use Case
----------|---------------------|------------------
0.99+ | Nearly full vocab | Maximum diversity
0.92-0.95 | Standard creative | Most applications
0.85-0.90 | More focused | Factual with variety
0.5-0.7 | Very focused | Near-deterministic
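To get a feel for the ranges in the table, one can sweep p over a fixed distribution and watch the nucleus grow. The sketch below assumes a Zipf-like distribution (probability proportional to 1/rank) over a hypothetical 1,000-token vocabulary. Real model distributions vary from step to step, so the printed sizes are only indicative.

```python
def nucleus_size(probs, p, eps=1e-9):
    """Return how many tokens the nucleus contains for threshold p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p - eps:
            return count
    return len(probs)


# Assumed Zipf-like distribution: weight 1/rank, normalized to sum to 1.
vocab_size = 1000
weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
total = sum(weights)
zipf = [w / total for w in weights]

p_values = [0.5, 0.7, 0.85, 0.9, 0.95, 0.99]
sizes = [nucleus_size(zipf, p) for p in p_values]
for p, size in zip(p_values, sizes):
    print(f"p={p}: nucleus holds {size} of {vocab_size} tokens")
```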
Top-p sampling is a common default for open-ended text generation. By adjusting the candidate pool to the model's confidence, it balances diversity and coherence in a way that a fixed-size cutoff such as top-k cannot.