
Greedy Decoding

A simple text generation strategy that always selects the highest-probability token at each step.

Also known as: Greedy search, Argmax decoding, Maximum likelihood decoding

Definition

Greedy decoding is the simplest text generation strategy where the model always selects the token with the highest probability at each generation step. It makes locally optimal choices without considering how current decisions affect future token possibilities, resulting in fast but potentially suboptimal sequences.
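
In code, greedy decoding is just an argmax inside a generation loop. The sketch below is a minimal illustration, assuming a hypothetical next_token_logits(tokens) function that stands in for a language-model forward pass and an eos_id marking the end-of-sequence token; neither name comes from any particular library.

    import numpy as np

    def greedy_decode(next_token_logits, prompt_tokens, eos_id, max_new_tokens=50):
        # Greedy decoding: at every step, append the single highest-scoring token.
        # next_token_logits(tokens) is a stand-in for a model call that returns
        # one unnormalized score per vocabulary entry.
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            logits = next_token_logits(tokens)   # shape: (vocab_size,)
            next_id = int(np.argmax(logits))     # locally optimal choice
            tokens.append(next_id)
            if next_id == eos_id:                # stop once end-of-sequence appears
                break
        return tokens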

Why it matters

Greedy decoding offers key advantages in specific scenarios:

  • Speed — one forward pass per token, with no extra hypotheses to score (unlike beam search)
  • Deterministic — same input always produces same output
  • Simplicity — no hyperparameters to tune
  • Baseline — standard comparison point for other methods
  • Structured tasks — works well for factual, constrained outputs

However, greedy decoding often produces repetitive or generic text.

How it works

┌────────────────────────────────────────────────────────────┐
│                     GREEDY DECODING                        │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  At each step: Pick argmax(probability)                    │
│                                                            │
│  Step 1: "The" → probabilities:                           │
│  ┌─────────────────────────────────────────────┐           │
│  │  cat: 0.35  ◄── SELECTED (highest)         │           │
│  │  dog: 0.25                                  │           │
│  │  man: 0.15                                  │           │
│  │  car: 0.10                                  │           │
│  │  ...                                        │           │
│  └─────────────────────────────────────────────┘           │
│                                                            │
│  Step 2: "The cat" → probabilities:                       │
│  ┌─────────────────────────────────────────────┐           │
│  │  sat: 0.40  ◄── SELECTED (highest)         │           │
│  │  ran: 0.20                                  │           │
│  │  is:  0.18                                  │           │
│  │  was: 0.12                                  │           │
│  └─────────────────────────────────────────────┘           │
│                                                            │
│  Result: "The cat sat..."                                 │
│                                                            │
│  ┌────────────────────────────────────────────────┐        │
│  │  PROBLEM: LOCAL VS GLOBAL OPTIMA              │        │
│  │                                                │        │
│  │  Greedy: "The cat sat" (p=0.35 × 0.40 = 0.14)│        │
│  │  Better: "The dog ran" (p=0.25 × 0.60 = 0.15)│        │
│  │                                                │        │
│  │  Second path might lead to better sequence!   │        │
│  │  Greedy can't see this—commits to "cat"      │        │
│  └────────────────────────────────────────────────┘        │
│                                                            │
│  GREEDY VS ALTERNATIVES:                                   │
│  ────────────────────────                                  │
│  Greedy:     Pick top-1 always     → deterministic        │
│  Top-k:      Sample from top-k     → diverse              │
│  Top-p:      Sample from nucleus   → adaptive             │
│  Beam:       Track multiple paths  → better sequences     │
│                                                            │
└────────────────────────────────────────────────────────────┘
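
To make the local-vs-global trade-off in the box above concrete, the snippet below multiplies the step probabilities along each path, using the same toy numbers as the diagram (illustrative values, not from a real model):

    # Toy step probabilities from the diagram above (illustrative values only).
    greedy_path = [("cat", 0.35), ("sat", 0.40)]   # what greedy commits to
    alt_path    = [("dog", 0.25), ("ran", 0.60)]   # weaker first step, stronger overall

    def joint_prob(path):
        p = 1.0
        for _, step_prob in path:
            p *= step_prob
        return p

    print("greedy:", round(joint_prob(greedy_path), 2))   # 0.14
    print("better:", round(joint_prob(alt_path), 2))      # 0.15, despite the weaker start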

When to use greedy decoding:

Scenario          Recommendation
────────────────  ──────────────────────────────────
Code generation   Often good (structured output)
Translation       Usually beam search preferred
Creative writing  Use sampling instead
Factual Q&A       Can work well
Classification    Appropriate
General chat      Use sampling

Common questions

Q: Why does greedy decoding produce repetitive text?

A: Once the model generates a common phrase, that phrase often has high probability of continuing. The model can get stuck in loops like “I think that I think that I think…” because each repetition is locally optimal.
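
A toy illustration of how such a loop forms, using made-up next-word probabilities that depend only on the previous word: once the argmax of each conditional distribution points back into the phrase, greedy decoding cycles indefinitely.

    # Hypothetical next-word distributions (conditioned only on the previous word).
    next_dist = {
        "I":     {"think": 0.6, "believe": 0.3, "know": 0.1},
        "think": {"that": 0.7, "so": 0.2, "about": 0.1},
        "that":  {"I": 0.5, "it": 0.3, "we": 0.2},   # argmax points back to "I"
    }

    word = "I"
    output = [word]
    for _ in range(9):
        word = max(next_dist[word], key=next_dist[word].get)  # greedy choice
        output.append(word)

    print(" ".join(output))  # I think that I think that I think that I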

Q: When should I use greedy decoding?

A: Use it for structured tasks with clear correct answers: code completion, classification, simple extraction. Avoid it for creative or open-ended generation where diversity matters.

Q: Is greedy decoding equivalent to temperature = 0?

A: Effectively yes. Temperature approaching 0 makes the probability distribution increasingly peaked at the highest-probability token, converging to greedy selection.
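
A small sketch with arbitrary example logits shows the effect: as the temperature shrinks, the softmax distribution piles nearly all of its mass onto the argmax token, which is the one greedy decoding would pick.

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        scaled = np.asarray(logits, dtype=float) / temperature
        scaled -= scaled.max()              # subtract max for numerical stability
        probs = np.exp(scaled)
        return probs / probs.sum()

    logits = [2.0, 1.5, 0.5]                # arbitrary example scores
    for t in (1.0, 0.5, 0.1, 0.01):
        print(t, np.round(softmax_with_temperature(logits, t), 4))
    # As t approaches 0, nearly all probability mass sits on index 0 (the argmax),
    # so sampling becomes indistinguishable from greedy selection.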

Q: How does greedy compare to beam search?

A: Greedy is beam search with beam width 1. Beam search explores multiple paths and often finds higher-probability complete sequences, at the cost of more computation.
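
The relationship is easy to see in a compact beam-search sketch over the same hypothetical next_token_logits interface used earlier; with beam_width=1 it reduces to the greedy loop.

    import numpy as np

    def beam_search(next_token_logits, prompt_tokens, eos_id,
                    beam_width=3, max_new_tokens=50):
        # Keep the beam_width highest log-probability partial sequences.
        # With beam_width=1 only the single best continuation survives each
        # step, which is exactly greedy decoding.
        beams = [(list(prompt_tokens), 0.0)]            # (tokens, total log-prob)
        for _ in range(max_new_tokens):
            candidates = []
            for tokens, score in beams:
                if tokens and tokens[-1] == eos_id:     # finished beams carry over
                    candidates.append((tokens, score))
                    continue
                logits = np.asarray(next_token_logits(tokens), dtype=float)
                m = logits.max()
                log_probs = logits - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
                for tok in np.argsort(log_probs)[-beam_width:]:
                    candidates.append((tokens + [int(tok)], score + float(log_probs[tok])))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam_width]             # prune to the best beams
            if all(t and t[-1] == eos_id for t, _ in beams):
                break
        return beams[0][0]                              # highest-scoring sequence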

