Curiosity · Concept
Temperature Sampling
Temperature is the simplest decoding parameter in an LLM. The model produces logits over the vocabulary; dividing those logits by a temperature T > 0 before applying softmax rescales the probability distribution. At T=1.0 you get the model's native distribution; as T approaches 0, sampling approaches greedy argmax (deterministic, often repetitive); T > 1 flattens the distribution, making rarer tokens more likely. Temperature is usually combined with top-k or top-p truncation — sampling from the truncated, temperature-scaled distribution gives the best quality/creativity trade-off.
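The mechanism fits in a few lines. A minimal sketch in plain Python (real inference stacks do this on GPU tensors, but the math is identical; the function name is ours):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample a token index from temperature-scaled logits."""
    if temperature <= 0:
        # T -> 0 limit: greedy argmax, always the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]        # divide logits by T
    m = max(scaled)                                   # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]                 # softmax
    # Inverse-CDF sampling from the categorical distribution.
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Greedy decoding always picks the max-logit token:
sample_with_temperature([2.0, 1.0, 0.1], temperature=0.0)  # -> 0
```

With logits [2.0, 1.0, 0.1], token 0 gets ~63% of the mass at T=1.0, ~88% at T=0.5, and only ~45% at T=2.0 — the same ordering, but sharper or flatter weighting.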
Quick reference
- Proficiency
- Beginner
- Also known as
- softmax temperature, decoding temperature
- Prerequisites
- Softmax, Autoregressive generation
Frequently asked questions
What does temperature do in LLM sampling?
Temperature rescales the logits before the softmax. Lower temperature concentrates probability on the top tokens (more deterministic output); higher temperature flattens the distribution (more random, diverse output).
What's a typical temperature value?
For factual tasks like classification, code generation, and tool calling, 0.0-0.3 is common. For open-ended writing and brainstorming, 0.7-1.0. Above ~1.3 text often becomes incoherent without careful truncation.
How does temperature interact with top-p and top-k?
Truncation (top-p / top-k) decides which tokens are candidates; temperature decides how sharply the remaining candidates are weighted. Using both is standard: e.g., top-p=0.9 with T=0.7.
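The two-stage split (truncate, then weight) can be sketched as follows; the helper names are ours, and production samplers fuse these steps, but the order of operations is the point:

```python
import math
import random

def top_p_candidates(logits, p=0.9):
    """Nucleus truncation: keep the smallest set of tokens whose
    cumulative probability (computed at T=1) reaches p."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for prob, i in ranked:
        kept.append(i)
        cum += prob
        if cum >= p:
            break
    return kept

def sample_top_p_temperature(logits, p=0.9, temperature=0.7, rng=random):
    # 1) truncation decides WHICH tokens are candidates...
    candidates = top_p_candidates(logits, p)
    # 2) ...temperature decides how sharply the survivors are weighted.
    scaled = [logits[i] / temperature for i in candidates]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r, cum = rng.random(), 0.0
    for idx, e in zip(candidates, exps):
        cum += e / total
        if r < cum:
            return idx
    return candidates[-1]
```

Note that truncation happens on the untemperatured distribution here; some implementations apply temperature first, which changes which tokens survive the p cutoff.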
Why do reasoning models sometimes use T=0?
Chain-of-thought math/code tasks often benefit from deterministic decoding, so a single unlucky sample doesn't derail a long trace. Many benchmark runs use T=0 for a reproducible greedy trace, or self-consistency (sampling several traces at higher T and majority-voting the final answer) for robustness.
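Self-consistency reduces to a majority vote over sampled answers. A sketch, where `generate` is a hypothetical stand-in for an LLM call that returns a final answer string:

```python
from collections import Counter

def self_consistency(generate, prompt, n=5, temperature=0.8):
    """Sample n answers at T > 0 and return the most common one.
    `generate` is a placeholder for a real model call."""
    answers = [generate(prompt, temperature=temperature) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

The intuition: individual traces at T > 0 are noisy, but correct reasoning paths tend to converge on the same final answer, so the mode of the answers is more reliable than any single sample.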
Sources
- Ackley, Hinton, Sejnowski — A Learning Algorithm for Boltzmann Machines (origin of softmax temperature) — accessed 2026-04-20
- Holtzman et al. — The Curious Case of Neural Text Degeneration — accessed 2026-04-20