Definition
Top-p sampling (also called nucleus sampling) is a text generation strategy that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p. Unlike top-k, which uses a fixed candidate count, top-p adapts to the model's confidence: it selects fewer tokens when the model is confident and more when it is uncertain.
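A minimal sketch of the selection step, written in illustrative NumPy rather than any particular library's implementation: sort the probabilities, accumulate them, keep the smallest prefix whose mass reaches p, renormalize, and sample from that prefix.

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Illustrative helper: sample a token index from the nucleus of `probs` (sums to 1)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                    # token indices, most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix whose mass reaches p
    nucleus = probs[order[:cutoff]] / probs[order[:cutoff]].sum()  # renormalize the nucleus
    return int(order[rng.choice(cutoff, p=nucleus)])   # sample only from the nucleus
```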
Why it matters
Top-p provides intelligent control over output diversity:
- Adaptive selection — adjusts candidate pool based on model confidence
- Quality balance — excludes low-probability tokens that cause incoherence
- Flexibility — works across different contexts without manual tuning
- Complementary — combines well with temperature for fine control
- Production standard — default sampling method in most LLM APIs
Top-p often produces more natural text than fixed top-k sampling.
How it works
┌────────────────────────────────────────────────────────────┐
│ TOP-P (NUCLEUS) SAMPLING │
├────────────────────────────────────────────────────────────┤
│ │
│ Token probabilities (sorted high to low): │
│ │
│ Token Prob Cumulative │
│ ───────────────────────────── │
│ "the" 0.35 0.35 │
│ "a" 0.25 0.60 │
│ "this" 0.15 0.75 │
│ "that" 0.10 0.85 ◄── p=0.9 threshold │
│ "one" 0.08 0.93 ◄── included (crosses 0.9) │
│ "some" 0.04 0.97 excluded │
│ "my" 0.02 0.99 excluded │
│ "your" 0.01 1.00 excluded │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ TOP-P = 0.9 │ │
│ │ │ │
│ │ Selected nucleus: [the, a, this, that, one] │ │
│ │ Sample from these 5 tokens only │ │
│ │ │ │
│ │ ████████████████████████░░░░░░░░ │ │
│ │ ▲ ▲ │ │
│ │ Included (93%) Excluded (7%) │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ADAPTIVE BEHAVIOR: │
│ • Confident prediction → selects 2-3 tokens │
│ • Uncertain prediction → selects 10-20 tokens │
│ │
└────────────────────────────────────────────────────────────┘
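The snippet below reproduces the numbers in the diagram (the probabilities are the made-up values from the table above): the cumulative sum first reaches 0.9 at "one", so the nucleus is the first five tokens and keeps roughly 93% of the probability mass.

```python
import numpy as np

tokens = ["the", "a", "this", "that", "one", "some", "my", "your"]
probs = np.array([0.35, 0.25, 0.15, 0.10, 0.08, 0.04, 0.02, 0.01])

cumulative = np.cumsum(probs)                       # 0.35, 0.60, 0.75, 0.85, 0.93, ...
cutoff = int(np.searchsorted(cumulative, 0.9)) + 1  # first index where the cumsum reaches 0.9
print(tokens[:cutoff])                              # ['the', 'a', 'this', 'that', 'one']
print(round(float(cumulative[cutoff - 1]), 2))      # ~0.93 of the mass is kept
```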
Common top-p values:
| Value | Behavior | Use case |
|---|---|---|
| 0.1 | Very restrictive | Near-deterministic tasks |
| 0.5 | Moderately focused | Factual generation |
| 0.9 | Balanced (default) | General use |
| 0.95 | More diverse | Creative writing |
| 1.0 | All tokens | Maximum diversity |
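In practice these values are passed as a top_p parameter. A minimal sketch using Hugging Face transformers' generate (other APIs expose an equivalent setting, though parameter names and defaults vary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,       # sampling must be enabled, or top_p is ignored
    top_p=0.9,            # the balanced default from the table above
    max_new_tokens=30,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```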
Common questions
Q: What’s the difference between top-p and top-k?
A: Top-k always selects exactly k tokens regardless of their probabilities. Top-p selects a variable number based on cumulative probability. Top-p adapts: if one token has 95% probability, it selects just that one; if probabilities are spread, it selects many.
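A quick sketch of that behavior, using made-up distributions: the nucleus shrinks to a single token when the distribution is peaked and grows to cover nearly every candidate when it is flat, whereas top-k keeps the same k in both cases.

```python
import numpy as np

def nucleus_size(sorted_probs, p=0.9):
    """Illustrative helper: how many tokens top-p keeps for a descending-sorted distribution."""
    return int(np.searchsorted(np.cumsum(sorted_probs), p)) + 1

peaked = np.array([0.95, 0.02, 0.01, 0.01, 0.01])  # confident prediction
flat = np.full(16, 1 / 16)                         # uncertain: 16 equally likely tokens

print(nucleus_size(peaked))  # 1  -> only the dominant token
print(nucleus_size(flat))    # 15 -> nearly every candidate
# top-k with k=5 would keep exactly 5 tokens in both cases.
```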
Q: What’s a good default top-p value?
A: 0.9 is a common default. It includes most reasonable tokens while excluding the long tail of unlikely options. For more focused output, try 0.5-0.7; for more creative output, try 0.95.
Q: Should I use top-p with temperature?
A: Yes, they complement each other. Temperature reshapes the probability distribution; top-p then truncates the reshaped distribution before sampling. A common combination is temperature 0.7 with top-p 0.9.
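A sketch of that order of operations, assuming raw logits as input (illustrative code, not any specific library's implementation): divide the logits by the temperature, take a softmax, then apply the top-p cutoff and sample from the renormalized nucleus.

```python
import numpy as np

def sample(logits, temperature=0.7, p=0.9, rng=None):
    """Illustrative helper: temperature scaling followed by nucleus sampling."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature                    # temperature reshapes the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                             # softmax over the scaled logits
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1 # nucleus boundary
    nucleus = probs[order[:cutoff]] / probs[order[:cutoff]].sum()
    return int(order[rng.choice(cutoff, p=nucleus)])
```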
Q: Does top-p = 1.0 mean no filtering?
A: Effectively yes—all tokens are included since cumulative probability always reaches 1.0. This gives maximum diversity but may include nonsensical low-probability tokens.
Related terms
- Temperature — reshapes probability distribution
- Top-k Sampling — fixed-count alternative
- Beam Search — different decoding strategy
- Inference — generation process