Top-p Sampling

Top-p sampling (nucleus sampling) is a dynamic decoding strategy that samples from the smallest set of tokens whose cumulative probability reaches or exceeds a threshold p. Because the size of the candidate pool adapts to the model's confidence, top-p includes more options when the model is uncertain and fewer when it is confident, which tends to produce text that is diverse yet coherent.

What Is Top-p Sampling?

Why Top-p Works

Algorithm

Step-by-Step:

p = 0.9

Token probabilities (sorted):
  "sat":     0.35
  "jumped":  0.25
  "ran":     0.20
  "walked":  0.10
  "flew":    0.05
  "danced":  0.03
  "swam":    0.02

Cumulative:
  "sat":     0.35 (< 0.9, include and continue)
  "jumped":  0.60 (< 0.9, include and continue)
  "ran":     0.80 (< 0.9, include and continue)
  "walked":  0.90 (reaches 0.9, include and stop)
  "flew":    0.95 (threshold already reached, excluded)

Nucleus = {sat, jumped, ran, walked}
Sample from these 4 tokens (renormalized)
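
The worked example above can be reproduced in a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the helper name `nucleus` and the tolerance `eps` are assumptions introduced here, the latter only to absorb floating-point rounding when the running sum lands exactly on p.

```python
def nucleus(probs, p, eps=1e-9):
    """Return (token, renormalized_prob) pairs for the smallest prefix of the
    probability-sorted tokens whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        # eps (an assumption of this sketch) guards against float rounding
        # when the cumulative sum equals p exactly, as it does at "walked".
        if cumulative >= p - eps:
            break
    # Renormalize so the kept probabilities sum to 1.
    return [(token, prob / cumulative) for token, prob in kept]


probs = {"sat": 0.35, "jumped": 0.25, "ran": 0.20, "walked": 0.10,
         "flew": 0.05, "danced": 0.03, "swam": 0.02}
result = nucleus(probs, p=0.9)
for token, prob in result:
    print(f"{token:8s}{prob:.3f}")
```

After renormalization, each kept probability is simply divided by 0.9, so "sat" rises from 0.35 to roughly 0.389 and "walked" from 0.10 to roughly 0.111.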

Visual Comparison:

Flat distribution (uncertain):
  ████ ███ ███ ██ ██ ██ █ █ █ █
  ^------------------------^
  Many tokens in nucleus (diverse)

Peaked distribution (confident):
  ████████████ ██ █ 
  ^--------^
  Few tokens in nucleus (focused)
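
The same contrast can be checked numerically. The sketch below counts how many tokens are needed to reach p = 0.9 for two made-up distributions; the probabilities and the helper name `nucleus_size` are illustrative assumptions, not values from a real model.

```python
def nucleus_size(probs, p, eps=1e-9):
    """Count how many of the most likely tokens are needed to reach mass p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p - eps:  # eps absorbs float rounding
            return count
    return len(probs)


# Illustrative distributions (invented for this sketch), each summing to 1.
flat = [0.12, 0.11, 0.11, 0.10, 0.10, 0.10, 0.09, 0.09, 0.09, 0.09]
peaked = [0.92, 0.05, 0.02, 0.01]

flat_size = nucleus_size(flat, p=0.9)      # 9 of 10 tokens
peaked_size = nucleus_size(peaked, p=0.9)  # 1 of 4 tokens
print(flat_size, peaked_size)
```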

Implementation

Basic Top-p:

import torch
import torch.nn.functional as F

def top_p_sample(logits, p=0.9, temperature=1.0):
    # Apply temperature
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)
    
    # Sort probabilities
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    
    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    
    # Mark positions where the cumulative mass has reached p
    cutoff_mask = cumulative_probs >= p
    # Shift the mask right by one so the token that first reaches p is kept
    cutoff_mask[..., 1:] = cutoff_mask[..., :-1].clone()
    cutoff_mask[..., 0] = False
    
    # Zero out tokens beyond nucleus
    sorted_probs[cutoff_mask] = 0
    
    # Renormalize
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    
    # Sample
    sampled_index = torch.multinomial(sorted_probs, 1)
    token = sorted_indices.gather(-1, sampled_index)
    
    return token

Hugging Face:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("The story begins", return_tensors="pt")

# Top-p sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.92,         # Nucleus threshold
    temperature=0.8,    # Optional temperature
    top_k=0,            # Disable top-k (use only top-p)
)

print(tokenizer.decode(outputs[0]))

Top-p vs. Top-k

Scenario             | Top-k (k=50)    | Top-p (p=0.9)
---------------------|-----------------|----------------
Flat distribution    | Uses 50 tokens  | Uses many tokens
Peaked distribution  | Uses 50 tokens  | Uses few tokens
Very confident       | Still 50 tokens | Maybe 1-5 tokens
Very uncertain       | Only 50 tokens  | Maybe 100+ tokens

Why Top-p Is Often Better:

Top-k problems:
- k=50 too many for confident predictions
- k=50 too few for uncertain predictions
- Fixed k doesn't adapt

Top-p advantages:
- Adapts to distribution shape
- Confident = focused, uncertain = diverse
- Single intuitive parameter
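
To make the top-k problems concrete, the sketch below measures what a fixed k keeps in two invented distributions. The value k = 3 and the probabilities are assumptions chosen to keep the example small; the same effect applies to k = 50 over a real vocabulary.

```python
def top_k_pool(probs, k):
    """Return the k largest probabilities (the fixed-size top-k pool)."""
    return sorted(probs, reverse=True)[:k]


flat = [0.05] * 20                       # uncertain: 20 equally likely tokens
peaked = [0.92, 0.05, 0.01, 0.01, 0.01]  # confident: one dominant token
k = 3

flat_pool = top_k_pool(flat, k)
peaked_pool = top_k_pool(peaked, k)

print(f"flat:   top-{k} covers {sum(flat_pool):.2f} of the probability mass")
print(f"peaked: top-{k} covers {sum(peaked_pool):.2f}, "
      f"weakest kept token has p={min(peaked_pool):.2f}")
```

In the flat case the fixed cutoff discards 85% of the mass even though the dropped tokens are just as plausible as the kept ones. In the peaked case it keeps a token with probability 0.01. Top-p avoids both by deriving the cutoff from cumulative mass rather than from a count.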

Combining with Temperature

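# Common combinations
# (reuses model and inputs from the Hugging Face example;
#  do_sample=True is required for top_p and temperature to take effect)

# Creative writing
outputs = model.generate(**inputs, do_sample=True, top_k=0, top_p=0.95, temperature=1.0)

# Balanced
outputs = model.generate(**inputs, do_sample=True, top_k=0, top_p=0.92, temperature=0.8)

# More focused
outputs = model.generate(**inputs, do_sample=True, top_k=0, top_p=0.85, temperature=0.7)

# Very focused (almost greedy)
outputs = model.generate(**inputs, do_sample=True, top_k=0, top_p=0.5, temperature=0.5)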
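
Temperature and top-p interact because temperature reshapes the distribution before the nucleus is cut, which is the order used in `top_p_sample` above. The sketch below shows the effect on nucleus size for a hypothetical set of five logits. The logits and the helper names are assumptions for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, p, eps=1e-9):
    """Count how many of the most likely tokens are needed to reach mass p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p - eps:
            return count
    return len(probs)


logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # hypothetical logits
sizes = {t: nucleus_size(softmax(logits, t), p=0.9) for t in (0.5, 1.0, 2.0)}
print(sizes)  # {0.5: 2, 1.0: 3, 2.0: 4}
```

With p held at 0.9, a lower temperature sharpens the distribution and shrinks the nucleus, while a higher temperature flattens it and grows the nucleus. Lowering both parameters at once therefore compounds the narrowing.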

Parameter Guidelines

p Value   | Effect              | Use Case
----------|---------------------|------------------
0.99+     | Nearly full vocab   | Maximum diversity
0.92-0.95 | Standard creative   | Most applications
0.85-0.90 | More focused        | Factual with variety
0.5-0.7   | Very focused        | Near-deterministic
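
As a rough illustration of how the threshold drives the pool size, the sketch below sweeps p over a synthetic 1,000-token vocabulary with Zipf-like probabilities. Both the vocabulary size and the 1/rank shape are assumptions made for this example. Real model distributions vary from step to step, so the printed sizes are not a guide to any particular model.

```python
def nucleus_size(probs, p, eps=1e-9):
    """Count how many of the most likely tokens are needed to reach mass p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p - eps:
            return count
    return len(probs)


# Hypothetical vocabulary: probability proportional to 1/rank.
weights = [1.0 / rank for rank in range(1, 1001)]
total = sum(weights)
probs = [w / total for w in weights]

thresholds = [0.5, 0.7, 0.9, 0.95, 0.99]
sizes = [nucleus_size(probs, p) for p in thresholds]
for p, size in zip(thresholds, sizes):
    print(f"p={p:<5} nucleus size={size}")
```

On a long-tailed distribution like this one, the nucleus grows slowly at low p and then rapidly as p approaches 1, which is why values of 0.99 and above behave almost like sampling from the full vocabulary.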

Top-p sampling is a common default for open-ended text generation. By adjusting the candidate pool to the model's confidence at each step, it balances diversity and coherence in a way that a fixed-size cutoff such as top-k does not.
