Mechanism

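Beam search is a decoding algorithm that keeps several candidate sequences (hypotheses) alive at once in order to find a high-probability output.

Mechanism: At each step, extend each of the k current hypotheses with every possible next token, score the resulting continuations by cumulative log-probability, and keep the k best, where k is the beam width. Repeat until every beam has produced the end token or a maximum length is reached.

Hyperparameters: Beam width (typically 2-10), length normalization (counteracts the bias toward short sequences), and early stopping (the rule for ending the search, for example stopping once the top beam is complete).

Trade-offs: A higher beam width searches more thoroughly and often gives better output, but it is slower. Each step scores O(k × vocab_size) candidates, and the model has to be evaluated for each of the k hypotheses.

Length penalty: Score = log_prob / length^α. Log-probabilities are negative and accumulate with every token, so unnormalized scores favor short sequences. With α = 0 there is no normalization, with α = 1 the score is the average log-probability per token, and larger values of α favor longer sequences more strongly.

Diverse beam search: Adds a penalty for hypotheses that are too similar to one another, which encourages variety among the returned sequences.

Limitations: Computationally expensive, can produce generic or repetitive text on open-ended tasks, and does not explore low-probability but interesting paths.

Best use cases: Machine translation, summarization, and structured outputs, where quality matters more than diversity.

When to avoid: Creative writing, chatbots, and other tasks that need diverse outputs.

Modern alternatives: Sampling (for example top-k or nucleus sampling) is often preferred for LLMs because it gives more natural outputs at lower compute cost.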
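The expand, score, and prune loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a production decoder: `beam_search`, `toy_next_log_probs`, and the probability table are hypothetical names and made-up numbers standing in for a real model. The sketch prunes on raw cumulative log-probability and applies the length penalty only when choosing the final hypothesis, which is one simple choice among several.

```python
import math

EOS = "<eos>"

# Hypothetical toy "model": next-token probabilities for each prefix.
# Any prefix not listed here can only be followed by the end token.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(prefix):
    """Return {token: log-probability} for the given prefix (a tuple)."""
    probs = TABLE.get(prefix, {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(next_log_probs, beam_width=2, alpha=0.0, max_len=10):
    """Return the best (tokens, cumulative_log_prob) hypothesis found."""
    beams = [((), 0.0)]  # live hypotheses: (tokens, summed log-prob)
    finished = []        # hypotheses that have emitted EOS

    for _ in range(max_len):
        # Expand every live hypothesis with every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))

        # Keep only the beam_width best continuations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))

        if not beams:  # every surviving beam has reached the end token
            break

    finished.extend(beams)  # hypotheses cut off at max_len, if any

    def normalized(hypothesis):
        # Length penalty from the text: log_prob / length^alpha.
        tokens, score = hypothesis
        return score / (max(len(tokens), 1) ** alpha)

    return max(finished, key=normalized)


if __name__ == "__main__":
    for width in (1, 2):
        tokens, score = beam_search(toy_next_log_probs, beam_width=width)
        print(width, tokens, round(math.exp(score), 2))
```

With this toy table, a beam width of 1 (greedy decoding) commits to "a" first and ends with a sequence of probability 0.30, while a beam width of 2 keeps "b" alive long enough to find "b z" at probability 0.36. All complete sequences here have the same length, so α does not change the result in this example.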
