Curiosity · Concept
Top-p (Nucleus) Sampling
Nucleus sampling, introduced by Holtzman et al. in 'The Curious Case of Neural Text Degeneration' (2019), fixes the main problem of top-k sampling: a fixed k is too tight when the model is uncertain and too loose when it is confident. Top-p instead keeps the smallest set of tokens whose probabilities sum to at least p (say 0.9), renormalizes, and samples from that nucleus. The effective candidate set grows and shrinks with the entropy of the distribution, giving fluent output on easy tokens and still allowing diversity when the model is uncertain.
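The procedure above can be sketched in a few lines of NumPy; this is a minimal illustration, not any library's implementation, and the function name and example distribution are made up:

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Sample one token id via nucleus (top-p) sampling."""
    rng = rng or np.random.default_rng()
    # Softmax over the raw logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sort descending; the nucleus is the smallest prefix with mass >= p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    # Renormalize within the nucleus and sample.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

With probabilities [0.5, 0.3, 0.15, 0.05] and p=0.9, the nucleus is the first three tokens (cumulative mass 0.95), so the 0.05 tail token is never sampled.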
Quick reference
- Proficiency
- Beginner
- Also known as
- nucleus sampling, top-p
- Prerequisites
- Temperature sampling
Frequently asked questions
What is top-p (nucleus) sampling?
Top-p sampling sorts next-token probabilities in descending order, keeps the smallest prefix whose cumulative probability reaches at least p, renormalizes, and samples from that reduced distribution. The number of candidates varies with the model's confidence.
Why is top-p better than top-k in most cases?
Top-k always keeps the same number of tokens, which is too aggressive when the distribution is flat (low-confidence step) and too permissive when the distribution is peaked. Top-p adapts: more candidates when uncertain, fewer when confident.
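The adaptivity claim is easy to check numerically; in this small sketch (the two distributions are made up for illustration), the same p keeps one candidate on a peaked step and nearly all candidates on a flat one:

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Size of the smallest prefix whose cumulative mass reaches p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p)) + 1

peaked = np.array([0.90, 0.05, 0.03, 0.01, 0.01])  # confident step
flat = np.full(50, 1 / 50)                         # uncertain step

# A fixed top-k treats both steps the same; top-p does not.
print(nucleus_size(peaked))  # 1 candidate survives
print(nucleus_size(flat))    # most of the 50 candidates survive
```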
What p value should I use?
p=0.9 to 0.95 is standard for chat and creative writing. Lower p (0.5-0.7) tightens output for more factual tasks. p=1.0 is equivalent to full sampling.
Can I combine top-p with temperature and top-k?
Yes, and it's common. Apply temperature first to reshape the distribution, then top-k truncates to the k most probable candidates, then top-p further restricts to the nucleus. Most inference stacks (Hugging Face Transformers, vLLM, the OpenAI API) support all three together.
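That order of operations can be sketched as below, assuming renormalization between stages (which is what masking logits to negative infinity before the final softmax amounts to); the default parameter values are illustrative, not recommendations:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
    """Temperature -> top-k -> top-p, then sample from what remains."""
    rng = rng or np.random.default_rng()
    # 1. Temperature reshapes the whole distribution before any truncation.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # 2. Top-k keeps the k most probable tokens, renormalized.
    order = np.argsort(probs)[::-1][:top_k]
    kept = probs[order] / probs[order].sum()
    # 3. Top-p shrinks the survivors further to the nucleus.
    cumulative = np.cumsum(kept)
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = kept[:cutoff] / kept[:cutoff].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

Applying temperature last instead would change the nucleus itself, which is why libraries scale logits before any truncation.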
Sources
- Holtzman et al. — The Curious Case of Neural Text Degeneration — accessed 2026-04-20
- Hugging Face — Text generation strategies docs — accessed 2026-04-20