Definition
Beam search is a decoding algorithm that generates text by exploring multiple candidate sequences in parallel. At each step, it extends every current candidate with possible next tokens, scores the resulting sequences by cumulative log-probability, and keeps only the top-b highest-scoring candidates (where b is the beam width). This pruned breadth-first strategy often finds higher-probability sequences than greedy, single-path decoding.
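The core loop can be sketched in a few lines of Python. This is an illustrative toy, not any particular library's implementation; `next_token_logprobs`, `toy_table`, and `toy_model` are made-up stand-ins for a real language model's per-step distribution.

```python
import math
from typing import Callable

def beam_search(
    next_token_logprobs: Callable[[list[str]], dict[str, float]],
    beam_width: int = 2,
    max_len: int = 10,
    end_token: str = "<end>",
) -> list[str]:
    """Return the highest-scoring sequence found with a fixed beam width."""
    beams = [(["<start>"], 0.0)]   # (tokens, cumulative log-probability)
    completed = []

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # Expand every live beam with every possible next token.
            for token, logp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the top-b candidates by cumulative log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                completed.append((tokens, score))   # finished hypothesis
            else:
                beams.append((tokens, score))       # still growing
        if not beams:
            break

    completed.extend(beams)  # fall back to unfinished beams if nothing ended
    return max(completed, key=lambda c: c[1])[0]

# Toy next-token distribution, purely to exercise the loop.
toy_table = {
    "<start>": {"The": 0.4, "A": 0.35, "My": 0.25},
    "The": {"cat": 0.5, "dog": 0.3, "bird": 0.2},
    "A": {"cat": 0.5, "dog": 0.3, "man": 0.2},
}
def toy_model(tokens: list[str]) -> dict[str, float]:
    probs = toy_table.get(tokens[-1], {"<end>": 1.0})
    return {t: math.log(p) for t, p in probs.items()}

print(beam_search(toy_model, beam_width=2))
# ['<start>', 'The', 'cat', '<end>']
```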
Why it matters
Beam search balances quality and computational cost:
- Better sequences — explores multiple paths to find higher-probability outputs
- Fewer early commitments — less likely than greedy decoding to lock in a suboptimal token choice
- Controllable exploration — beam width tunes quality vs. speed tradeoff
- Translation standard — default in machine translation systems
- Deterministic output — produces consistent results for same input
Beam search is essential for tasks where sequence-level quality matters.
How it works
┌────────────────────────────────────────────────────────────┐
│                        BEAM SEARCH                         │
│                      (beam width = 2)                      │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Step 0: Start with <start> token                          │
│                                                            │
│                        [<start>]                           │
│                            │                               │
│  Step 1: Expand to vocabulary, keep top-2                  │
│                            │                               │
│            ┌───────────────┴───────────────┐               │
│            ▼                               ▼               │
│         ["The"]                          ["A"]             │
│         p=0.40                           p=0.35            │
│            │                               │               │
│  Step 2: Expand each, score all, keep top-2                │
│            │                               │               │
│      ┌─────┼─────┐                   ┌─────┼─────┐         │
│      ▼     ▼     ▼                   ▼     ▼     ▼         │
│    "The  "The  "The                 "A    "A    "A         │
│    cat"  dog"  bird"               cat"  dog"  man"        │
│    0.38  0.35  0.32                0.30  0.28  0.25        │
│                                                            │
│  SCORES (cumulative log-prob):                             │
│  ─────────────────────────────                             │
│    "The cat"  = -0.97   ◄── KEEP (best)                    │
│    "The dog"  = -1.05   ◄── KEEP (2nd best)                │
│    "The bird" = -1.14   pruned                             │
│    "A cat"    = -1.20   pruned                             │
│    ...                                                     │
│                                                            │
│  Step 3: Continue until <end> token...                     │
│                                                            │
│  ┌────────────────────────────────────────────────┐        │
│  │  BEAM WIDTH EFFECTS:                           │        │
│  │                                                │        │
│  │  b=1:    Greedy decoding (fast, lower quality) │        │
│  │  b=4-5:  Common default (balanced)             │        │
│  │  b=10+:  High quality (slower)                 │        │
│  └────────────────────────────────────────────────┘        │
│                                                            │
└────────────────────────────────────────────────────────────┘
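The two score columns in the diagram are the same information in different units: the per-sequence probabilities (0.38, 0.35, ...) are joint probabilities of the whole prefix, and each score is simply the natural log of that probability. A quick check (the implied per-step conditionals, e.g. p(cat | The) ≈ 0.38 / 0.40 = 0.95, are back-calculated assumptions, not values given in the diagram):

```python
import math

# Joint probabilities from the diagram; each score is the natural log.
prefixes = {"The cat": 0.38, "The dog": 0.35, "The bird": 0.32, "A cat": 0.30}
for seq, p in prefixes.items():
    print(f"{seq!r}: log p = {math.log(p):.2f}")
# 'The cat': log p = -0.97   <- kept
# 'The dog': log p = -1.05   <- kept
# 'The bird': log p = -1.14  <- pruned
# 'A cat': log p = -1.20     <- pruned
```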
Beam width comparison:
| Width | Behavior | Trade-off |
|---|---|---|
| 1 | Greedy decoding | Fast but may miss better sequences |
| 2-3 | Minimal exploration | Slight quality gain |
| 4-5 | Standard | Good balance for most tasks |
| 10+ | Extensive | Diminishing returns, slower |
| 50+ | Very wide | Rarely worth the added cost |
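In practice the beam width is usually a single decoding parameter rather than something you implement yourself. As one example, the Hugging Face transformers `generate` method exposes it as `num_beams` (argument names and defaults can differ across versions; `t5-small` is used here only as a convenient public checkpoint):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The cat sat on the mat.",
                   return_tensors="pt")

# num_beams=1 would be greedy decoding; 4 is a common default for translation.
outputs = model.generate(**inputs, num_beams=4, early_stopping=True,
                         max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```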
Common questions
Q: How is beam search different from sampling methods?
A: Beam search is deterministic: it always keeps the highest-scoring sequences. Sampling (top-k, top-p, temperature) introduces randomness for diversity. Beam search optimizes for probability; sampling trades some probability for variety and creativity.
Q: What’s a good beam width?
A: 4-5 is common for most tasks. Machine translation often uses 4-10. Higher values give diminishing returns while increasing computation. For creative writing, sampling methods are usually preferred over beam search.
Q: Does beam search guarantee finding the best sequence?
A: No, it is an approximation. Exhaustive search over every possible sequence is computationally infeasible for language models, and beam search can still miss the globally optimal sequence because of its fixed-width pruning.
Q: Why might beam search produce repetitive text?
A: Beam search optimizes for probability, and repeating common phrases can have high local probability. This is why length penalties and n-gram blocking are often added in practice.
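A minimal sketch of one common length penalty, assuming the simple formulation score / length^alpha (toolkits differ in the exact formula; `length_normalized` here is an illustrative helper, not a specific library function):

```python
def length_normalized(cum_logprob: float, length: int, alpha: float = 0.6) -> float:
    """Divide the cumulative log-prob by length**alpha so longer beams
    are not penalized just for having more (negative) terms."""
    return cum_logprob / (length ** alpha)

# A longer hypothesis can overtake a shorter one once scores are normalized.
print(length_normalized(-1.0, length=2))  # short beam  -> ~ -0.66
print(length_normalized(-1.5, length=6))  # longer beam -> ~ -0.51 (now preferred)
```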
Related terms
- Top-k Sampling — stochastic alternative
- Top-p Sampling — adaptive sampling
- Temperature — probability reshaping
- Inference — generation process
References
Graves (2012), “Sequence Transduction with Recurrent Neural Networks”, arXiv. [2,000+ citations]
Freitag & Al-Onaizan (2017), “Beam Search Strategies for Neural Machine Translation”, WNMT. [500+ citations]
Holtzman et al. (2020), “The Curious Case of Neural Text Degeneration”, ICLR. [2,500+ citations]
Meister et al. (2020), “If Beam Search is the Answer, What was the Question?”, EMNLP. [200+ citations]