Beam Search and Nucleus Sampling Decoding

Keywords: beam search decoding, nucleus sampling, temperature control, top-k sampling, generation quality

Beam Search and Nucleus Sampling Decoding are complementary strategies for generating high-quality text from language models, balancing quality against diversity — beam search explores the most likely continuation paths, while nucleus sampling maintains coherence through probabilistic token selection from an adaptively sized vocabulary.

Beam Search Algorithm:
- Multiple Hypotheses: maintaining B best partial sequences (beams) sorted by cumulative log probability — B=3-5 typical with diminishing returns beyond 10
- Expansion Step: extending each beam by one token, computing a softmax over the full vocabulary (e.g., ~50K tokens) — O(B×V) complexity per step where V is vocabulary size
- Pruning: keeping only top B hypotheses from B×V candidates using priority queue — reduces memory from exponential to linear in B
- Length Normalization: dividing scores by sequence length^α (α=0.6-0.7) to prevent bias toward short sentences — prevents algorithm favoring 1-2 word outputs
- Coverage Penalty: penalizing repeated coverage of same input tokens (for encoder-decoder models like T5) — improves summary diversity
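The expansion, pruning, and length-normalization steps above can be sketched in pure Python. The toy next-token distribution below stands in for a real language-model forward pass, and the vocabulary, probabilities, and B/α values are illustrative assumptions only:

```python
import math

# Toy "model": log-probabilities for the next token given a prefix.
# In practice this is a language-model forward pass (hypothetical stub).
VOCAB = ["<eos>", "the", "cat", "sat"]

def next_token_logprobs(prefix):
    # Strongly favor the sequence "the cat sat <eos>" (illustrative only).
    target = ["the", "cat", "sat", "<eos>"]
    probs = [0.05] * len(VOCAB)
    favored = target[len(prefix)] if len(prefix) < len(target) else "<eos>"
    probs[VOCAB.index(favored)] = 0.85
    total = sum(probs)
    return [math.log(p / total) for p in probs]

def beam_search(B=3, max_len=6, alpha=0.6):
    beams = [([], 0.0)]          # (token list, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:  # expansion: B x V candidates per step
            for tok, lp in zip(VOCAB, next_token_logprobs(seq)):
                candidates.append((seq + [tok], score + lp))
        # pruning: keep only the top-B hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:B]:
            if seq[-1] == "<eos>":
                # length normalization: score / len^alpha counters short-output bias
                finished.append((seq, score / (len(seq) ** alpha)))
            else:
                beams.append((seq, score))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])[0] if finished else beams[0][0]

print(beam_search())  # -> ['the', 'cat', 'sat', '<eos>']
```

With B=3 the loop tracks three hypotheses per step; the length-normalized score is what lets the four-token completion beat an early "<eos>".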

Beam Search Characteristics:
- Quality Improvement: 5-10 BLEU point improvement on machine translation vs greedy (e.g., 28.0→33.5 BLEU) — noticeable in benchmarks but marginal in human evaluation
- Computational Cost: B=5 increases per-step computation roughly 5x because five sequences must be scored in parallel — trading generation speed for modestly better quality
- Determinism: identical outputs for identical inputs, reproducible across runs without any random seed — useful for testing but unsuitable for creative tasks
- Hallucination Rate: 40-60% reduction in factual errors compared to greedy on QA tasks — especially beneficial for knowledge-critical applications

Nucleus (Top-P) Sampling:
- Cumulative Probability: selecting smallest vocabulary subset with cumulative probability >P (P=0.9 typical) — dynamically sized vocabulary per token
- Sorted Selection: ranking tokens by probability, accumulating until threshold P crossed — adaptive vocabulary 20-200 tokens depending on distribution
- Sampling: renormalizing probabilities within the nucleus and sampling proportionally (not uniformly); temperature scaling is typically applied to the logits beforehand — introduces beneficial stochasticity
- Temperature Interaction: combining nucleus (P) with temperature T for fine-grained control — P=0.9, T=0.8 balances quality and diversity
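A minimal sketch of the procedure above over raw logits, assuming temperature is applied before truncation (orderings vary across implementations; the example logits and P are illustrative):

```python
import math
import random

def nucleus_sample(logits, p=0.9, temperature=1.0, rng=random):
    # Temperature scaling before truncation (one common ordering; an assumption).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort tokens by probability and accumulate until the threshold p is crossed.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum > p:
            break
    # Renormalize within the nucleus and sample proportionally (not uniformly).
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]   # floating-point fallback

token = nucleus_sample([5.0, 1.0, 0.0], p=0.5)
```

With a sharply peaked distribution and a small P the nucleus collapses to a single token, so the call above always returns index 0; a flatter distribution yields a larger, more diverse nucleus.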

Top-K Sampling Approach:
- Fixed Vocabulary: sampling only from top K highest probability tokens (K=40-50 typical) — prevents sampling from extremely low probability tokens
- Hyperparameter Sensitivity: K=10 produces very focused outputs, K=100 allows more diversity — requires manual tuning per application
- Computational Simplicity: partial selection of the top K via a heap runs in O(V log K) vs a full sort's O(V log V) — marginal speedup compared to nucleus
- Comparison: nucleus sampling outperforms fixed top-K on diversity while maintaining quality (human preference 65-75% in studies)
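A short sketch of top-K sampling using heap-based partial selection, as described above (the K value and logits are illustrative):

```python
import heapq
import math
import random

def top_k_sample(logits, k=40, rng=random):
    # heapq.nlargest is an O(V log K) partial selection, cheaper than a full sort.
    top = heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])
    m = max(logits[i] for i in top)
    # Softmax weights restricted to the top-K tokens.
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

token = top_k_sample([0.1, 3.0, -1.0], k=2)
```

Note the contrast with nucleus sampling: here the candidate set is always exactly K tokens, regardless of how peaked or flat the distribution is.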

Temperature Scaling Impact:
- T=0: greedy decoding selecting arg-max token — deterministic, prone to repetition
- T=0.7: sharpened distribution suppressing rare tokens, reducing diversity — recommended for factual tasks (QA, summarization)
- T=1.0: no scaling; the model's probabilities are used as-is — baseline setting
- T=1.5: softened distribution emphasizing diversity — recommended for creative tasks (story generation, dialogue)
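The settings above can be illustrated with a small softmax helper; the T=0 branch hard-codes the greedy arg-max limit (example logits are illustrative):

```python
import math

def apply_temperature(logits, T):
    if T == 0:
        # T=0 degenerates to greedy decoding: all mass on the arg-max token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / T for l in logits]       # divide logits by temperature
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Lower T concentrates mass on the top token; higher T flattens the distribution.
for T in (0.7, 1.0, 1.5):
    print(T, [round(p, 3) for p in apply_temperature([2.0, 1.0, 0.0], T)])
```

Running this shows the top token's probability shrinking monotonically as T rises, which is exactly the sharpening/softening trade-off the settings above describe.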

Practical Decoding Strategies:
- Repetition Penalty: dividing logit of previously generated tokens by penalty parameter (1.0-2.0) — prevents repetitive sequences common in nucleus sampling
- Length Penalty: suppressing the end-of-sequence token's logit until a minimum length is reached, or rescaling scores by length — encourages longer generations (useful for minimum length requirements)
- Bad Words Filter: zeroing logits of inappropriate tokens before sampling — prevents toxic or off-topic outputs
- Constraint Satisfaction: modifying probabilities to steer toward particular semantic constraints (CommonSense reasoning, QA answer format)
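A sketch of how such logit adjustments compose before sampling, using a CTRL-style repetition penalty (divide positive logits, multiply negative ones) and a bad-words filter; the penalty value and token indices are illustrative assumptions:

```python
def process_logits(logits, generated_ids, repetition_penalty=1.2, bad_word_ids=()):
    out = list(logits)
    # Repetition penalty (CTRL-style): shrink logits of already-generated tokens.
    for i in set(generated_ids):
        out[i] = out[i] / repetition_penalty if out[i] > 0 else out[i] * repetition_penalty
    # Bad-words filter: push banned tokens to -inf so they can never be sampled.
    for i in bad_word_ids:
        out[i] = float("-inf")
    return out

adjusted = process_logits([2.0, -1.0, 0.5], generated_ids=[0, 1], bad_word_ids=[2])
```

Each adjustment is a pure function on the logit vector, so penalties, filters, and constraint masks can be chained in any order before the softmax and sampling step.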

Beam Search and Nucleus Sampling Decoding are complementary techniques — beam search provides quality improvements for deterministic tasks, while nucleus sampling enables diverse, creative text generation for conversational and open-ended applications.
