Definition
Perplexity is a measure of how well a probability model predicts a sample. For language models, it represents the model’s uncertainty when predicting the next token—lower perplexity means the model is less “perplexed” or more confident. Mathematically, perplexity is the exponentiated average negative log-likelihood per token.
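As a minimal sketch of that definition (plain Python, no particular library assumed), perplexity is just the exponentiated mean negative log-probability:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities log P(token_i | context)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)  # mean negative log-likelihood per token
    return math.exp(avg_nll)                                # exponentiate to get perplexity
```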
Why it matters
Perplexity is a fundamental evaluation metric:
- Model comparison — compare different models on the same dataset
- Training monitoring — track improvement during training
- Domain assessment — measure how well a model fits a specific text domain
- Quantization impact — evaluate quality loss from compression
- Interpretable scale — can be read as the effective number of equally likely choices per token
Perplexity helps answer: “How surprised is the model by this text?”
How it works
┌────────────────────────────────────────────────────────────┐
│ PERPLEXITY │
├────────────────────────────────────────────────────────────┤
│ │
│ Formula: PPL = exp(-1/N × Σ log P(token_i | context)) │
│ │
│ Example: "The cat sat" │
│ ────────────────────── │
│ │
│ Token P(token|context) log P │
│ ───────────────────────────────────────── │
│ "The" 0.10 -2.30 │
│ "cat" 0.25 -1.39 │
│ "sat" 0.40 -0.92 │
│ │
│ Average log P = (-2.30 + -1.39 + -0.92) / 3 = -1.54 │
│ Perplexity = exp(1.54) = 4.66 │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ INTERPRETATION: │ │
│ │ │ │
│ │ PPL ≈ "effective choices per position" │ │
│ │ │ │
│ │ PPL = 1: Model is 100% certain │ │
│ │ PPL = 10: ~10 equally likely options │ │
│ │ PPL = 50k: Random (vocab size) = no learning│ │
│ └────────────────────────────────────────────────┘ │
│ │
│ PERPLEXITY SCALE: │
│ ─────────────────── │
│ │◄────────────────────────────────────────────►│ │
│ 1 10 50 100 1000 50000 │
│ Perfect Great Good Okay Poor Random │
│ │
│ TYPICAL VALUES: │
│ ─────────────── │
│ GPT-4 on common text: ~10-20 │
│ Small model: ~50-100 │
│ Domain mismatch: ~100-500 │
│ Untrained model: ~vocab size │
│ │
└────────────────────────────────────────────────────────────┘
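To check the worked example above, the same numbers can be run through a few lines of Python (the probabilities are illustrative only):

```python
import math

# Illustrative next-token probabilities from the "The cat sat" example
probs = [0.10, 0.25, 0.40]
log_probs = [math.log(p) for p in probs]     # natural logs: -2.30, -1.39, -0.92

avg_log_p = sum(log_probs) / len(log_probs)  # ≈ -1.54
ppl = math.exp(-avg_log_p)                   # ≈ 4.64 (the box shows 4.66 because it rounds the logs first)
print(f"perplexity ≈ {ppl:.2f}")
```

Either way, the result reads as roughly 4–5 effective choices per position.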
Perplexity benchmarks:
| Range | Quality | Interpretation |
|---|---|---|
| 1-10 | Excellent | Highly predictable text |
| 10-30 | Very good | Typical for strong LLMs |
| 30-50 | Good | Reasonable model |
| 50-100 | Moderate | May need improvement |
| 100+ | Poor | Significant issues or domain mismatch |
Common questions
Q: What’s a “good” perplexity score?
A: It depends on the dataset, tokenizer, and model size. Strong models typically report perplexities in the teens to low twenties on standard benchmarks like WikiText, but numbers are not directly comparable across evaluation setups. Within a project, focus on relative improvements rather than absolute numbers.
Q: Can perplexity compare models with different tokenizers?
A: Not directly—different tokenizers produce different numbers of tokens for the same text. Compare models using the same tokenizer, or normalize by character/word count instead of token count.
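A hedged sketch of that normalization (the function name and the numbers are made up for illustration): sum the token log-probabilities for the whole text, then divide by the word or character count instead of the token count before exponentiating.

```python
import math

def unit_perplexity(total_log_prob, num_units):
    """Perplexity per chosen unit.

    total_log_prob: summed natural-log token probabilities for the whole text
    num_units:      token count for standard PPL, or word/character count when
                    comparing models whose tokenizers split the text differently
    """
    return math.exp(-total_log_prob / num_units)

# Hypothetical numbers: the same 12-word text scored by two models
# that tokenized it into 15 and 18 tokens respectively
ppl_a_per_word = unit_perplexity(total_log_prob=-30.0, num_units=12)
ppl_b_per_word = unit_perplexity(total_log_prob=-33.0, num_units=12)
```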
Q: Why might perplexity be low but generation quality poor?
A: Perplexity measures average prediction accuracy, not output quality. A model can have low perplexity by predicting common words well while failing at coherent long-form generation. Use perplexity alongside other metrics.
Q: How does perplexity relate to cross-entropy loss?
A: Perplexity = exp(cross-entropy), where cross-entropy is the average per-token loss in nats (natural log). They measure the same thing on different scales: cross-entropy is the quantity optimized during training, while perplexity is the more interpretable scale for reporting.
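A small sketch of that relationship, assuming PyTorch-style logits and the usual mean cross-entropy in nats (the shapes and vocabulary size below are arbitrary):

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 50_000
logits = torch.randn(2, 8, vocab_size)           # [batch, seq_len, vocab] from some model
targets = torch.randint(0, vocab_size, (2, 8))   # [batch, seq_len] next-token ids

# Mean per-token cross-entropy in nats: the usual language-model training loss
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

ppl = math.exp(loss.item())  # perplexity = exp(cross-entropy)
# Random logits give a loss of roughly log(vocab_size), so ppl lands on the
# order of the vocabulary size, consistent with the "untrained model" row above.
```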
Related terms
- Loss Function — training objective
- LLM — language models being evaluated
- Tokenization — affects perplexity calculation
- Fine-tuning — improves domain perplexity