Curiosity · Concept
Speculative Decoding
Speculative decoding speeds up LLM inference without changing the output distribution. A small draft model quickly generates a few candidate tokens; the large target model scores them all in a single forward pass and accepts the longest prefix that passes a rejection-sampling check. When draft and target often agree, you generate several tokens per target-model step, typically a 2-3x wall-clock speedup while still sampling from exactly the target's distribution.
Quick reference
- Proficiency
- Advanced
- Also known as
- speculative sampling, draft-and-verify decoding, SpS
- Prerequisites
- Transformer architecture, LLM inference basics, autoregressive sampling
Frequently asked questions
What is speculative decoding?
An inference-time technique that pairs a fast small 'draft' model with the full-size 'target' model. The draft speculates several tokens ahead; the target verifies them all in one parallel forward pass, accepting any matching prefix. It yields the same output distribution as the target alone, but in fewer target-model calls.
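For greedy decoding, the draft-and-verify loop described above can be sketched in a few lines. This is a toy illustration, not a real inference API: `draft_next` and `target_next` are hypothetical stand-ins for the two models, each mapping a token sequence to its greedy next token.

```python
# Toy sketch of greedy draft-and-verify decoding (not a real LLM API).
# `draft_next` / `target_next` are hypothetical stand-ins for the two
# models: each maps a token sequence to its greedy next token.

def speculative_step(prefix, draft_next, target_next, k=4):
    """One round: draft k tokens, then verify them against the target.

    Returns the tokens accepted this round. In a real system the k+1
    target predictions come from a single batched forward pass; here we
    simply call target_next on each candidate prefix.
    """
    # 1. Draft speculates k tokens autoregressively (cheap model).
    draft_tokens = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    # 2. Target verifies the candidates: accept the longest prefix
    # where the target would have emitted the same token.
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: take the target's own token and stop.
            accepted.append(target_next(ctx))
            break
    else:
        # All k matched: the same verification pass yields a bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

If the draft matches the target perfectly, each round accepts k+1 tokens for one target "pass"; a draft that always disagrees degrades gracefully to one target token per round.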
How is the output guaranteed to match?
By a clever rejection-sampling rule (Leviathan et al., 2023). If the draft proposes token t with probability q(t) and the target assigns it probability p(t), you accept t with probability min(1, p(t)/q(t)). On rejection, you resample a replacement token from the corrected residual distribution, proportional to max(0, p − q). The net distribution is provably identical to sampling from the target alone.
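The acceptance rule for a single drafted token can be sketched directly. This is a minimal illustration under toy assumptions: `p` (target) and `q` (draft) are plain dicts mapping tokens to probabilities, standing in for real model outputs.

```python
import random

def accept_or_resample(token, p, q, rng=random):
    """Leviathan-style acceptance for one drafted token.

    `p` is the target's distribution, `q` the draft's (toy dicts,
    token -> probability). Accept with probability min(1, p/q); on
    rejection, resample from the residual max(0, p - q), normalized.
    The returned token is distributed exactly according to p.
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    # Rejected: sample from the residual distribution (p - q)+ / Z.
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    z = sum(residual.values())
    r = rng.random() * z
    for t, w in residual.items():
        r -= w
        if r <= 0:
            return t
    return t  # numerical-safety fallback
```

Drawing the proposal from q and then applying this rule produces samples from p, which is what makes the whole pipeline lossless with respect to the target model.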
What speedup does it give?
Typically 2-3x on text generation with a well-matched draft. Medusa pushes this higher with multiple parallel prediction heads on the target itself, and EAGLE with a lightweight autoregressive head, rather than a separate draft model. Code and templated outputs (where the draft matches easily) see larger speedups than open-ended creative writing.
What's the catch?
You need a compatible draft model, extra GPU memory to hold it, and careful engineering to overlap draft and target work. Acceptance rates also fall on distribution-shifted prompts. Inference frameworks such as vLLM, TensorRT-LLM, and llama.cpp ship built-in support.
Sources
- Leviathan et al. — Fast Inference from Transformers via Speculative Decoding — accessed 2026-04-20
- Chen et al. — Accelerating Large Language Model Decoding with Speculative Sampling — accessed 2026-04-20
- Cai et al. — Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — accessed 2026-04-20