Curiosity · Concept
Perplexity
Perplexity (PPL) is the exponential of the cross-entropy loss per token. Intuitively, a perplexity of K means the model behaves, on average, as if it were choosing uniformly among K equally likely next tokens. It is the canonical pre-training and language-modelling metric — you compute it on a held-out corpus such as WikiText or C4. Perplexity correlates reasonably well with downstream quality at scale but is not a substitute for task-level evaluation: a model with lower PPL is not automatically better at reasoning or instruction-following.
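The definition above can be sketched in a few lines. This is a minimal illustration, not a production evaluator: it assumes you already have per-token log-probabilities (natural log) from a model, and the toy numbers are invented to match the "uniform over K tokens" intuition.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: four tokens, each assigned probability 1/20 by the model.
# The mean NLL is log(20), so PPL ≈ 20 — the model is as uncertain as a
# uniform choice among 20 next tokens.
logps = [math.log(1 / 20)] * 4
print(perplexity(logps))  # ≈ 20.0
```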
Quick reference
- Proficiency
- Beginner
- Also known as
- PPL
- Prerequisites
- Language modelling, Cross-entropy
Frequently asked questions
What is perplexity?
Perplexity is exp(average negative log-likelihood per token) of a model on held-out text. If PPL = 20, the model is about as uncertain as picking uniformly from 20 possible next tokens at each step.
Why do we use perplexity instead of raw cross-entropy?
They carry the same information (PPL = exp(NLL)), but perplexity is on a scale humans find intuitive: 'choosing among K tokens'. Keep in mind that both are per-token quantities, so comparisons are only fair between models that share a tokenizer; otherwise, normalize to a tokenizer-agnostic unit such as bits per byte.
Is lower perplexity always better?
Lower PPL on a clean held-out set generally correlates with stronger language modelling, especially at scale. But PPL can be gamed (trained on or near the eval set), and it doesn't directly measure reasoning, instruction following, or truthfulness.
Can I compare perplexity across tokenizers?
Not directly — PPL depends on how text is segmented. Always compare models with the same tokenizer, or convert to bits-per-byte / bits-per-character for tokenizer-agnostic comparisons.
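The bits-per-byte conversion mentioned above can be sketched as follows. This is an illustrative calculation under assumed corpus statistics (the token counts, byte count, and NLL values are hypothetical): total nats over the corpus, divided by ln(2) to get bits, then normalized by the raw byte count, which removes the tokenizer from the comparison.

```python
import math

def bits_per_byte(nll_per_token, num_tokens, num_bytes):
    """Convert a per-token NLL (in nats) into bits per byte of raw text."""
    total_nats = nll_per_token * num_tokens
    return total_nats / (math.log(2) * num_bytes)

# Hypothetical: two models scored the same 4000-byte text with different
# tokenizers. Model B has lower per-token NLL, but over more tokens —
# bits per byte gives the fair, tokenizer-agnostic comparison.
bpb_a = bits_per_byte(nll_per_token=3.0, num_tokens=1000, num_bytes=4000)
bpb_b = bits_per_byte(nll_per_token=2.4, num_tokens=1300, num_bytes=4000)
```

Note that in this made-up case model B "wins" on per-token NLL yet loses on bits per byte, which is exactly why cross-tokenizer PPL comparisons mislead.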
Sources
- Brown, Della Pietra et al. — An Estimate of an Upper Bound for the Entropy of English (classical perplexity reference) — accessed 2026-04-20
- Hugging Face — Perplexity of fixed-length models — accessed 2026-04-20