Curiosity · Concept
Speculative Decoding
Speculative decoding speeds up LLM inference without changing the output distribution. A small draft model quickly generates a few candidate tokens; the large target model scores them all in a single forward pass and accepts the longest prefix that passes a rejection-sampling check. When draft and target often agree, you generate several tokens per target-model step, typically a 2-3x wall-clock speedup while still sampling from exactly the target's distribution.
Quick reference
- Proficiency
- Advanced
- Also known as
- speculative sampling, draft-and-verify decoding, SpS
- Prerequisites
- Transformer architecture, LLM inference basics, autoregressive sampling
Frequently asked questions
What is speculative decoding?
An inference-time technique that pairs a fast small 'draft' model with the full-size 'target' model. The draft speculates several tokens ahead; the target verifies them all in one parallel forward pass, accepting any matching prefix. It yields the same output distribution as the target alone, but in fewer target-model calls.
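For greedy decoding, the draft-and-verify loop described above can be sketched in a few lines. This is a toy illustration, not a real inference API: `draft_next` and `target_next` are hypothetical stand-ins for the two models, each mapping a token sequence to its greedy next token.

```python
# Toy sketch of greedy draft-and-verify decoding (not a real LLM API).
# `draft_next` / `target_next` are hypothetical stand-ins for the two
# models: each maps a token sequence to its greedy next token.

def speculative_step(prefix, draft_next, target_next, k=4):
    """One round: draft k tokens, then verify them against the target.

    Returns the tokens accepted this round. In a real system the k+1
    target predictions come from a single batched forward pass; here we
    simply call target_next on each candidate prefix.
    """
    # 1. Draft speculates k tokens autoregressively (cheap model).
    draft_tokens = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    # 2. Target verifies the candidates: accept the longest prefix
    # where the target would have emitted the same token.
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: take the target's own token and stop.
            accepted.append(target_next(ctx))
            break
    else:
        # All k matched: the same verification pass yields a bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

If the draft matches the target perfectly, each round accepts k+1 tokens for one target "pass"; a draft that always disagrees degrades gracefully to one target token per round.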
How is the output guaranteed to match?
By a clever rejection-sampling rule (Leviathan et al., 2023). If the draft proposes token t with probability q(t) and the target assigns it probability p(t), you accept t with probability min(1, p(t)/q(t)). On rejection, you resample a replacement token from the corrected residual distribution, proportional to max(0, p − q). The net distribution is provably identical to sampling from the target alone.
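The acceptance rule for a single drafted token can be sketched directly. This is a minimal illustration under toy assumptions: `p` (target) and `q` (draft) are plain dicts mapping tokens to probabilities, standing in for real model outputs.

```python
import random

def accept_or_resample(token, p, q, rng=random):
    """Leviathan-style acceptance for one drafted token.

    `p` is the target's distribution, `q` the draft's (toy dicts,
    token -> probability). Accept with probability min(1, p/q); on
    rejection, resample from the residual max(0, p - q), normalized.
    The returned token is distributed exactly according to p.
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    # Rejected: sample from the residual distribution (p - q)+ / Z.
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    z = sum(residual.values())
    r = rng.random() * z
    for t, w in residual.items():
        r -= w
        if r <= 0:
            return t
    return t  # numerical-safety fallback
```

Drawing the proposal from q and then applying this rule produces samples from p, which is what makes the whole pipeline lossless with respect to the target model.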
What speedup does it give?
Typically 2-3x on text generation with a well-matched draft. Medusa pushes this higher with multiple parallel prediction heads on the target itself, and EAGLE with a lightweight autoregressive head, rather than a separate draft model. Code and templated outputs (where the draft matches easily) see larger speedups than open-ended creative writing.
What's the catch?
You need a compatible draft model, extra GPU memory to hold it, and careful engineering to overlap draft and target work. Acceptance rates also fall on distribution-shifted prompts. Inference frameworks such as vLLM, TensorRT-LLM, and llama.cpp ship built-in support.
Sources
- Leviathan et al. — Fast Inference from Transformers via Speculative Decoding — accessed 2026-04-20
- Chen et al. — Accelerating Large Language Model Decoding with Speculative Sampling — accessed 2026-04-20
- Cai et al. — Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — accessed 2026-04-20