Curiosity · Concept

Temperature Sampling

Temperature is the simplest decoding parameter in an LLM. The model produces logits over the vocabulary; dividing those logits by a temperature T > 0 before applying softmax reshapes the probability distribution. At T = 1.0 you sample from the model's native distribution; as T approaches 0, sampling approaches greedy argmax (deterministic and often repetitive); at T > 1 the distribution flattens, making rarer tokens more likely. Temperature is usually combined with top-k or top-p truncation — sampling from the truncated, temperature-scaled distribution gives a better quality/creativity trade-off than either knob alone.
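The scaling above can be sketched in a few lines of NumPy. The function name is illustrative, not from any particular library:

```python
import numpy as np

def temperature_softmax(logits, T=1.0):
    """Divide logits by T (> 0), then apply softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
# Low T concentrates mass on the top token; high T flattens toward uniform.
p_low = temperature_softmax(logits, T=0.2)
p_high = temperature_softmax(logits, T=5.0)
```

Note that dividing by T leaves the argmax unchanged — only the relative weights shift, which is why T near 0 recovers greedy decoding.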

Quick reference

Proficiency
Beginner
Also known as
softmax temperature, decoding temperature
Prerequisites
Softmax, Autoregressive generation

Frequently asked questions

What does temperature do in LLM sampling?

Temperature rescales the logits before the softmax. Lower temperature concentrates probability on the top tokens (more deterministic output); higher temperature flattens the distribution (more random, diverse output).

What's a typical temperature value?

For factual tasks like classification, code generation, and tool calling, 0.0-0.3 is common. For open-ended writing and brainstorming, 0.7-1.0. Above ~1.3 text often becomes incoherent without careful truncation.

How does temperature interact with top-p and top-k?

Truncation (top-p / top-k) decides which tokens are candidates; temperature decides how sharply the remaining candidates are weighted. Using both is standard: e.g., top-p=0.9 with T=0.7.
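A minimal sketch of the combined procedure — temperature-scale, keep the smallest set of tokens whose mass reaches top_p (the "nucleus"), renormalize, then sample. The function name and defaults are hypothetical, chosen to mirror the example values above:

```python
import numpy as np

def sample_top_p(logits, T=0.7, top_p=0.9, rng=None):
    """Sample a token index from the temperature-scaled, top-p-truncated distribution."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                          # numerical stability
    p = np.exp(z)
    p /= p.sum()
    order = np.argsort(p)[::-1]           # token indices by descending probability
    cdf = np.cumsum(p[order])
    cutoff = int(np.searchsorted(cdf, top_p)) + 1  # smallest prefix with mass >= top_p
    keep = order[:cutoff]
    q = p[keep] / p[keep].sum()           # renormalize over the nucleus
    return int(keep[rng.choice(len(keep), p=q)])
```

The order of operations matters: scaling before truncation means a high T can pull extra tokens into the nucleus, while a low T can shrink it to a single token.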

Why do reasoning models sometimes use T=0?

Chain-of-thought math/code tasks often benefit from deterministic decoding, so a single wrong token doesn't derail a long trace. Many benchmark runs therefore use T=0; others instead use self-consistency, sampling several traces at higher T and majority-voting the final answers for robustness.
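The self-consistency idea reduces to sampling several answers and majority-voting. A minimal sketch, where `sample_fn` is a hypothetical callable that runs one generation at T > 0 and returns the extracted final answer:

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    """Draw n sampled answers and return the most common one (majority vote)."""
    answers = [sample_fn() for _ in range(n)]  # each call: one generation at T > 0
    return Counter(answers).most_common(1)[0][0]
```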

Sources

  1. Ackley, Hinton, Sejnowski — A Learning Algorithm for Boltzmann Machines (origin of softmax temperature) — accessed 2026-04-20
  2. Holtzman et al. — The Curious Case of Neural Text Degeneration — accessed 2026-04-20