Curiosity · Concept
Top-k Sampling
Top-k sampling, popularized by Fan et al.'s 2018 'Hierarchical Neural Story Generation' paper, is the older sibling of nucleus sampling. At each step you keep only the k tokens with the highest logits, renormalize their probabilities to sum to 1, and sample from that truncated distribution. Top-k is cheap and effective at avoiding the long tail of implausible tokens, but a fixed k is clumsy when the distribution's entropy varies step-to-step — which is why most modern stacks prefer top-p or combine top-k with top-p.
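The keep-renormalize-sample loop above can be sketched in a few lines of NumPy. This is an illustrative helper, not any library's API; the function name and toy logits are made up for the example.

```python
import numpy as np

def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample one token id from the top-k truncated distribution.

    `logits` is a 1-D array of raw (unnormalized) scores over the
    vocabulary. Illustrative sketch, not a production decoder.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Indices of the k highest logits (their relative order doesn't matter).
    top_idx = np.argpartition(logits, -k)[-k:]
    # Softmax over only the kept logits; everything else implicitly gets 0.
    kept = logits[top_idx] - logits[top_idx].max()  # subtract max for stability
    probs = np.exp(kept) / np.exp(kept).sum()
    return int(rng.choice(top_idx, p=probs))

vocab_logits = np.array([2.0, 1.0, 0.5, -3.0, -5.0])
token = top_k_sample(vocab_logits, k=2)
assert token in (0, 1)  # with k=2, only the two highest-logit tokens can be drawn
```

Note that the softmax is computed over just the k survivors, which is exactly the "renormalize to sum to 1" step: the truncated probabilities keep their relative ratios.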
Quick reference
- Proficiency
- Beginner
- Also known as
- top-k
- Prerequisites
- Temperature sampling
Frequently asked questions
What is top-k sampling?
Top-k sampling keeps the k tokens with highest logits, renormalizes their probabilities, and samples from the truncated distribution. Tokens outside the top-k are set to zero probability.
How does top-k compare to top-p?
Top-k uses a fixed number of candidates regardless of confidence. Top-p (nucleus) picks a dynamic number based on cumulative probability. Top-p usually gives better results because it adapts to how peaky or flat the distribution is.
What k should I use?
k=40 and k=50 are classic defaults inherited from early GPT-2-era work. In practice, most modern decoders set a large k (e.g., 40-100) purely as a safety truncation and rely on top-p for fine control.
Can top-k produce completely deterministic output?
Setting k=1 is equivalent to greedy decoding — you always pick the argmax token. Combined with temperature=0, decoding becomes fully deterministic.
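The k=1 case is easy to check concretely: the truncated distribution contains a single token with renormalized probability 1, so sampling always returns the argmax. A small sketch with made-up logits:

```python
import numpy as np

# With k=1 the truncated distribution holds only the argmax token,
# so "sampling" from it is deterministic -- i.e., greedy decoding.
logits = np.array([0.1, 3.2, -1.0, 2.9])
k = 1
top1 = np.argpartition(logits, -k)[-k:]          # index of the single highest logit
probs = np.ones(1)                               # renormalized probability: 1.0
token = int(np.random.default_rng().choice(top1, p=probs))
assert token == int(np.argmax(logits))  # holds for any seed
```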
Sources
- Fan et al. — Hierarchical Neural Story Generation — accessed 2026-04-20
- Holtzman et al. — The Curious Case of Neural Text Degeneration — accessed 2026-04-20