Beam Search and Nucleus Sampling Decoding

Keywords: beam search decoding, nucleus sampling, temperature control, top-k sampling, generation quality

Beam Search and Nucleus Sampling Decoding are complementary strategies for generating high-quality text from language models, balancing quality against diversity — beam search explores the most likely continuation paths, while nucleus sampling maintains coherence through probabilistic token selection from an adaptively sized vocabulary.

Beam Search Algorithm:
- Multiple Hypotheses: maintaining B best partial sequences (beams) sorted by cumulative log probability — B=3-5 typical with diminishing returns beyond 10
- Expansion Step: extending each beam by one token, computing a softmax over the full vocabulary (e.g., ~50K tokens) — O(B×V) complexity per step where V is vocabulary size
- Pruning: keeping only top B hypotheses from B×V candidates using priority queue — reduces memory from exponential to linear in B
- Length Normalization: dividing scores by sequence length^α (α=0.6-0.7) to prevent bias toward short sentences — prevents algorithm favoring 1-2 word outputs
- Coverage Penalty: penalizing repeated coverage of same input tokens (for encoder-decoder models like T5) — improves summary diversity
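The expansion, pruning, and length-normalization steps above can be sketched in pure Python. The toy next-token distribution below stands in for a real language-model forward pass, and the vocabulary, probabilities, and B/α values are illustrative assumptions only:

```python
import math

# Toy "model": log-probabilities for the next token given a prefix.
# In practice this is a language-model forward pass (hypothetical stub).
VOCAB = ["<eos>", "the", "cat", "sat"]

def next_token_logprobs(prefix):
    # Strongly favor the sequence "the cat sat <eos>" (illustrative only).
    target = ["the", "cat", "sat", "<eos>"]
    probs = [0.05] * len(VOCAB)
    favored = target[len(prefix)] if len(prefix) < len(target) else "<eos>"
    probs[VOCAB.index(favored)] = 0.85
    total = sum(probs)
    return [math.log(p / total) for p in probs]

def beam_search(B=3, max_len=6, alpha=0.6):
    beams = [([], 0.0)]          # (token list, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:  # expansion: B x V candidates per step
            for tok, lp in zip(VOCAB, next_token_logprobs(seq)):
                candidates.append((seq + [tok], score + lp))
        # pruning: keep only the top-B hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:B]:
            if seq[-1] == "<eos>":
                # length normalization: score / len^alpha counters short-output bias
                finished.append((seq, score / (len(seq) ** alpha)))
            else:
                beams.append((seq, score))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])[0] if finished else beams[0][0]

print(beam_search())  # -> ['the', 'cat', 'sat', '<eos>']
```

With B=3 the loop tracks three hypotheses per step; the length-normalized score is what lets the four-token completion beat an early "<eos>".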

Beam Search Characteristics:
- Quality Improvement: 5-10 BLEU point improvement on machine translation vs greedy (e.g., 28.0→33.5 BLEU) — noticeable in benchmarks but marginal in human evaluation
- Computational Cost: B=5 increases per-step computation roughly 5x because five sequences must be scored in parallel — trading generation speed for modestly better quality
- Determinism: identical outputs for identical inputs, reproducible across runs without any random seed — useful for testing but unsuitable for creative tasks
- Hallucination Rate: 40-60% reduction in factual errors compared to greedy on QA tasks — especially beneficial for knowledge-critical applications

Nucleus (Top-P) Sampling:
- Cumulative Probability: selecting smallest vocabulary subset with cumulative probability >P (P=0.9 typical) — dynamically sized vocabulary per token
- Sorted Selection: ranking tokens by probability, accumulating until threshold P crossed — adaptive vocabulary 20-200 tokens depending on distribution
- Sampling: renormalizing probabilities within the nucleus and sampling proportionally (not uniformly); temperature scaling is typically applied to the logits beforehand — introduces beneficial stochasticity
- Temperature Interaction: combining nucleus (P) with temperature T for fine-grained control — P=0.9, T=0.8 balances quality and diversity
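A minimal sketch of the procedure above over raw logits, assuming temperature is applied before truncation (orderings vary across implementations; the example logits and P are illustrative):

```python
import math
import random

def nucleus_sample(logits, p=0.9, temperature=1.0, rng=random):
    # Temperature scaling before truncation (one common ordering; an assumption).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort tokens by probability and accumulate until the threshold p is crossed.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum > p:
            break
    # Renormalize within the nucleus and sample proportionally (not uniformly).
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]   # floating-point fallback

token = nucleus_sample([5.0, 1.0, 0.0], p=0.5)
```

With a sharply peaked distribution and a small P the nucleus collapses to a single token, so the call above always returns index 0; a flatter distribution yields a larger, more diverse nucleus.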

Top-K Sampling Approach:
- Fixed Vocabulary: sampling only from top K highest probability tokens (K=40-50 typical) — prevents sampling from extremely low probability tokens
- Hyperparameter Sensitivity: K=10 produces very focused outputs, K=100 allows more diversity — requires manual tuning per application
- Computational Simplicity: partial selection of the top K via a heap runs in O(V log K) vs a full sort's O(V log V) — marginal speedup compared to nucleus
- Comparison: nucleus sampling outperforms fixed top-K on diversity while maintaining quality (human preference 65-75% in studies)
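A short sketch of top-K sampling using heap-based partial selection, as described above (the K value and logits are illustrative):

```python
import heapq
import math
import random

def top_k_sample(logits, k=40, rng=random):
    # heapq.nlargest is an O(V log K) partial selection, cheaper than a full sort.
    top = heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])
    m = max(logits[i] for i in top)
    # Softmax weights restricted to the top-K tokens.
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

token = top_k_sample([0.1, 3.0, -1.0], k=2)
```

Note the contrast with nucleus sampling: here the candidate set is always exactly K tokens, regardless of how peaked or flat the distribution is.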

Temperature Scaling Impact:
- T=0: greedy decoding selecting arg-max token — deterministic, prone to repetition
- T=0.7: sharpened distribution suppressing rare tokens, reducing diversity — recommended for factual tasks (QA, summarization)
- T=1.0: no scaling; the model's probabilities are used as-is — baseline setting
- T=1.5: softened distribution emphasizing diversity — recommended for creative tasks (story generation, dialogue)
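The settings above can be illustrated with a small softmax helper; the T=0 branch hard-codes the greedy arg-max limit (example logits are illustrative):

```python
import math

def apply_temperature(logits, T):
    if T == 0:
        # T=0 degenerates to greedy decoding: all mass on the arg-max token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / T for l in logits]       # divide logits by temperature
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Lower T concentrates mass on the top token; higher T flattens the distribution.
for T in (0.7, 1.0, 1.5):
    print(T, [round(p, 3) for p in apply_temperature([2.0, 1.0, 0.0], T)])
```

Running this shows the top token's probability shrinking monotonically as T rises, which is exactly the sharpening/softening trade-off the settings above describe.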

Practical Decoding Strategies:
- Repetition Penalty: dividing logit of previously generated tokens by penalty parameter (1.0-2.0) — prevents repetitive sequences common in nucleus sampling
- Length Penalty: suppressing the end-of-sequence token's logit until a minimum length is reached, or rescaling scores by length — encourages longer generations (useful for minimum length requirements)
- Bad Words Filter: zeroing logits of inappropriate tokens before sampling — prevents toxic or off-topic outputs
- Constraint Satisfaction: modifying probabilities to steer toward particular semantic constraints (CommonSense reasoning, QA answer format)
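A sketch of how such logit adjustments compose before sampling, using a CTRL-style repetition penalty (divide positive logits, multiply negative ones) and a bad-words filter; the penalty value and token indices are illustrative assumptions:

```python
def process_logits(logits, generated_ids, repetition_penalty=1.2, bad_word_ids=()):
    out = list(logits)
    # Repetition penalty (CTRL-style): shrink logits of already-generated tokens.
    for i in set(generated_ids):
        out[i] = out[i] / repetition_penalty if out[i] > 0 else out[i] * repetition_penalty
    # Bad-words filter: push banned tokens to -inf so they can never be sampled.
    for i in bad_word_ids:
        out[i] = float("-inf")
    return out

adjusted = process_logits([2.0, -1.0, 0.5], generated_ids=[0, 1], bad_word_ids=[2])
```

Each adjustment is a pure function on the logit vector, so penalties, filters, and constraint masks can be chained in any order before the softmax and sampling step.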

Beam Search and Nucleus Sampling Decoding are complementary techniques — beam search provides quality improvements for deterministic tasks, while nucleus sampling enables diverse, creative text generation for conversational and open-ended applications.
