Top-p Sampling

A sampling method that selects from the smallest set of tokens whose cumulative probability exceeds a threshold p.

Also known as: Nucleus sampling, Top-p decoding, Probability mass sampling

Definition

Top-p sampling (also called nucleus sampling) is a text generation strategy that draws the next token from the smallest possible set of tokens whose cumulative probability exceeds a threshold p. Unlike top-k, which keeps a fixed number of candidates, top-p adapts to the model's confidence: it keeps fewer tokens when the model is confident and more when it is uncertain.
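
A minimal sketch of the selection step, assuming the model's next-token probabilities are already available as a NumPy array (the function name and array layout are illustrative, not a specific library's API):

    import numpy as np

    def top_p_sample(probs, p=0.9, rng=None):
        """Sample a token id from the smallest set of tokens whose
        cumulative probability exceeds p (the 'nucleus')."""
        rng = rng or np.random.default_rng()
        order = np.argsort(probs)[::-1]            # token ids, most probable first
        cumulative = np.cumsum(probs[order])
        # Keep every token up to and including the first one that pushes
        # the cumulative probability past p.
        cutoff = int(np.searchsorted(cumulative, p, side="right")) + 1
        nucleus = order[:cutoff]
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalize
        return int(rng.choice(nucleus, p=nucleus_probs))

Renormalizing inside the nucleus keeps the relative probabilities of the surviving tokens unchanged; only the long tail is cut off.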

Why it matters

Top-p provides intelligent control over output diversity:

  • Adaptive selection — adjusts candidate pool based on model confidence
  • Quality balance — excludes low-probability tokens that cause incoherence
  • Flexibility — works across different contexts without manual tuning
  • Complementary — combines well with temperature for fine control
  • Production standard — default sampling method in most LLM APIs

Top-p often produces more natural text than fixed top-k sampling.
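
As noted above, top-p is exposed directly by most generation APIs, so using it is usually a single parameter. A hedged example with Hugging Face transformers (the model name and prompt are placeholders; check your library's documentation for exact parameter names):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The weather today is", return_tensors="pt")
    output = model.generate(
        **inputs,
        do_sample=True,       # sample instead of greedy decoding
        top_p=0.9,            # nucleus threshold
        temperature=0.7,      # commonly combined with top-p
        max_new_tokens=40,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))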

How it works

┌────────────────────────────────────────────────────────────┐
│                   TOP-P (NUCLEUS) SAMPLING                 │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Token probabilities (sorted high to low):                 │
│                                                            │
│  Token    Prob    Cumulative                               │
│  ─────────────────────────────                             │
│  "the"    0.35    0.35                                     │
│  "a"      0.25    0.60                                     │
│  "this"   0.15    0.75                                     │
│  "that"   0.10    0.85  ◄── p=0.9 threshold               │
│  "one"    0.08    0.93  ◄── included (crosses 0.9)        │
│  "some"   0.04    0.97      excluded                       │
│  "my"     0.02    0.99      excluded                       │
│  "your"   0.01    1.00      excluded                       │
│                                                            │
│  ┌────────────────────────────────────────────────┐        │
│  │  TOP-P = 0.9                                   │        │
│  │                                                │        │
│  │  Selected nucleus: [the, a, this, that, one]  │        │
│  │  Sample from these 5 tokens only              │        │
│  │                                                │        │
│  │  ████████████████████████░░░░░░░░             │        │
│  │  ▲                      ▲                     │        │
│  │  Included (93%)         Excluded (7%)         │        │
│  └────────────────────────────────────────────────┘        │
│                                                            │
│  ADAPTIVE BEHAVIOR:                                        │
│  • Confident prediction → selects 2-3 tokens               │
│  • Uncertain prediction → selects 10-20 tokens             │
│                                                            │
└────────────────────────────────────────────────────────────┘
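
The same walkthrough in code, using the illustrative token strings and probabilities from the diagram above:

    import numpy as np

    tokens = ["the", "a", "this", "that", "one", "some", "my", "your"]
    probs = np.array([0.35, 0.25, 0.15, 0.10, 0.08, 0.04, 0.02, 0.01])

    cumulative = np.cumsum(probs)            # 0.35, 0.60, 0.75, 0.85, 0.93, ...
    cutoff = int(np.searchsorted(cumulative, 0.9, side="right")) + 1
    print(tokens[:cutoff])                   # ['the', 'a', 'this', 'that', 'one']
    print(round(float(cumulative[cutoff - 1]), 2))   # 0.93 of the mass is kept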

Common top-p values:

Value    Behavior              Use case
────────────────────────────────────────────────
0.1      Very restrictive      Deterministic tasks
0.5      Moderately focused    Factual generation
0.9      Balanced (default)    General use
0.95     More diverse          Creative writing
1.0      All tokens            Maximum diversity
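
To see how these settings translate into candidate-pool size, the loop below reuses the eight-token toy distribution from the diagram (not real model output):

    import numpy as np

    probs = np.array([0.35, 0.25, 0.15, 0.10, 0.08, 0.04, 0.02, 0.01])
    cumulative = np.cumsum(probs)

    for p in (0.1, 0.5, 0.9, 0.95, 1.0):
        size = int(np.searchsorted(cumulative, p, side="right")) + 1
        size = min(size, len(probs))         # p = 1.0 keeps every token
        print(f"p={p:<4}  nucleus size: {size} of {len(probs)}")

With this distribution the nucleus grows from 1 token at p=0.1 to all 8 tokens at p=1.0.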

Common questions

Q: What’s the difference between top-p and top-k?

A: Top-k always selects exactly k tokens regardless of their probabilities. Top-p selects a variable number based on cumulative probability. Top-p adapts: if one token has 95% probability, it selects just that one; if probabilities are spread, it selects many.
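
A small illustration of that difference, using two hypothetical distributions (one peaked, one spread out):

    import numpy as np

    def nucleus_size(probs, p):
        cumulative = np.cumsum(np.sort(probs)[::-1])
        return min(int(np.searchsorted(cumulative, p, side="right")) + 1, len(probs))

    peaked = np.array([0.95, 0.02, 0.01, 0.01, 0.01])               # confident model
    spread = np.array([0.22, 0.20, 0.18, 0.15, 0.13, 0.07, 0.05])   # uncertain model

    k = 3
    print(f"top-k (k={k}):  {k} candidates from peaked, {k} from spread")
    print(f"top-p (p=0.9): {nucleus_size(peaked, 0.9)} candidate from peaked, "
          f"{nucleus_size(spread, 0.9)} from spread")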

Q: What’s a good default top-p value?

A: 0.9 is a common default. It includes most reasonable tokens while excluding the long tail of unlikely options. For more focused output, try 0.5-0.7; for more creative, 0.95.

Q: Should I use top-p with temperature?

A: Yes, they complement each other. Temperature reshapes the probability distribution; top-p then samples from the adjusted distribution. A common combination: temperature 0.7 + top-p 0.9.
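
A sketch of that order of operations on raw logits (the values are made up, and real implementations differ in details such as tie handling):

    import numpy as np

    def sample(logits, temperature=0.7, top_p=0.9, rng=None):
        rng = rng or np.random.default_rng()
        # 1. Temperature reshapes the distribution (lower -> sharper).
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # 2. Top-p keeps the smallest high-probability prefix exceeding top_p.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p, side="right")) + 1
        nucleus = order[:cutoff]
        # 3. Sample from the renormalized nucleus.
        return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

    print(sample(np.array([2.0, 1.5, 0.5, -1.0, -3.0])))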

Q: Does top-p = 1.0 mean no filtering?

A: Effectively yes—all tokens are included since cumulative probability always reaches 1.0. This gives maximum diversity but may include nonsensical low-probability tokens.


References

Holtzman et al. (2020), “The Curious Case of Neural Text Degeneration”, ICLR.

Fan et al. (2018), “Hierarchical Neural Story Generation”, ACL.

Radford et al. (2019), “Language Models are Unsupervised Multitask Learners”, OpenAI.

Welleck et al. (2020), “Neural Text Generation With Unlikelihood Training”, ICLR.