Curiosity · Concept

Top-k Sampling

Top-k sampling, popularized by Fan et al.'s 2018 paper 'Hierarchical Neural Story Generation', is the older sibling of nucleus sampling. At each step you keep only the k tokens with the highest logits, renormalize their probabilities to sum to 1, and sample from that truncated distribution. Top-k is cheap and effective at cutting off the long tail of implausible tokens, but a fixed k is clumsy when the distribution's entropy varies from step to step, which is why most modern stacks prefer top-p or combine top-k with top-p.
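The keep-renormalize-sample loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production decoder; the function name and signature are ours, and `logits` is assumed to be a 1-D array of unnormalized scores, one per vocabulary token.

```python
import numpy as np

def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample one token id from the top-k truncated distribution."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Indices of the k highest-logit tokens (order among them is irrelevant).
    top_ids = np.argpartition(logits, -k)[-k:]
    # Softmax over the kept logits only: this is the renormalization step,
    # done in a numerically stable way by subtracting the max first.
    kept = logits[top_ids]
    probs = np.exp(kept - kept.max())
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))
```

Every token outside the top k has zero probability of being drawn, because it never enters `top_ids` in the first place.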

Quick reference

Proficiency
Beginner
Also known as
top-k
Prerequisites
Temperature sampling

Frequently asked questions

What is top-k sampling?

Top-k sampling keeps the k tokens with highest logits, renormalizes their probabilities, and samples from the truncated distribution. Tokens outside the top-k are set to zero probability.

How does top-k compare to top-p?

Top-k uses a fixed number of candidates regardless of confidence. Top-p (nucleus) picks a dynamic number based on cumulative probability. Top-p usually gives better results because it adapts to how peaky or flat the distribution is.
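The adaptivity difference is easy to see numerically. The helper below (an illustrative name of our own) counts how many tokens top-p would keep: the smallest prefix of the sorted distribution whose cumulative probability reaches p. A fixed k=40 would keep 40 candidates in both cases.

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Number of tokens top-p keeps: the shortest sorted prefix
    whose cumulative probability reaches p."""
    sorted_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), p) + 1)

peaked = np.array([0.95, 0.02, 0.01, 0.01, 0.005, 0.005])
flat = np.full(100, 0.01)

print(nucleus_size(peaked))  # 1: the model is confident
print(nucleus_size(flat))    # ~90 of 100: the model is uncertain
```

On the peaked distribution top-p keeps a single candidate; on the flat one it keeps nearly the whole vocabulary. A fixed k cannot do both.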

What k should I use?

k=40 and k=50 are classic defaults, inherited from early GPT-2-era work. In practice, most modern decoders set a large k (e.g., 40-100) purely as a safety truncation and rely on top-p for fine control.

Can top-k produce completely deterministic output?

Setting k=1 is equivalent to greedy decoding: only the argmax token survives truncation, so it is sampled with probability 1. This is deterministic on its own; temperature no longer matters, since rescaling logits by any positive temperature does not change the argmax.
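A quick sanity check of the k=1 case, using a tiny self-contained sampler (the function name is ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.2, 4.7, 0.3, 2.9])

def sample_top_k(k):
    # Keep the k highest-logit tokens, renormalize with a softmax, draw one.
    top = np.argsort(logits)[-k:]
    p = np.exp(logits[top] - logits[top].max())
    return int(rng.choice(top, p=p / p.sum()))

# With k=1 only the argmax token survives truncation, so it is drawn
# with probability 1 on every step, identical to greedy decoding.
draws = {sample_top_k(1) for _ in range(100)}
```

A hundred draws all land on the same token id as `np.argmax(logits)`.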

Sources

  1. Fan et al. — Hierarchical Neural Story Generation — accessed 2026-04-20
  2. Holtzman et al. — The Curious Case of Neural Text Degeneration — accessed 2026-04-20