
Distillation vs Quantization

Distillation and quantization are the two main ways to make an LLM cheaper to run. Distillation trains a smaller model to mimic the behavior of a larger one: you pay a training cost once, then run a genuinely smaller model. Quantization reduces the numeric precision of an existing model's weights (e.g., from 16-bit to 4-bit): no training is required, but there is some quality loss.
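To make the size numbers concrete, here is a back-of-envelope calculation of weight memory at different precisions. This is a sketch covering weights only; real deployments also need memory for the KV cache and activations.

```python
# Back-of-envelope memory for model weights at different precisions.
# Weights only; KV cache and activations add more on top.

def weight_memory_gb(n_params: float, bits: int) -> float:
    """Weight bytes = params * bits / 8, converted to GB."""
    return n_params * bits / 8 / 1e9

llama_70b = 70e9
print(f"70B bf16: {weight_memory_gb(llama_70b, 16):.0f} GB")  # 140 GB
print(f"70B int8: {weight_memory_gb(llama_70b, 8):.0f} GB")   # 70 GB
print(f"70B int4: {weight_memory_gb(llama_70b, 4):.0f} GB")   # 35 GB

# A distilled 7B student at full bf16 precision is smaller still:
print(f"7B bf16:  {weight_memory_gb(7e9, 16):.0f} GB")        # 14 GB
```

This is why the two techniques compose: distilling 70B → 7B gives a 10x reduction, and quantizing the student bf16 → int4 gives another 4x.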

Side-by-side

| Criterion | Distillation | Quantization |
|---|---|---|
| What it changes | Number of parameters | Bit-precision of parameters |
| Training required | Yes (student trained on teacher outputs) | No (post-training quantization) or light calibration |
| Typical size reduction | 2-10x (e.g., 70B → 7B) | 2x (bf16 → int8) to 4x (bf16 → int4) |
| Quality retention | Depends heavily on training data | Near-lossless up to int8; measurable drop at int4 |
| Compute cost | Full training run ($10k-$100k+) | Hours on a single GPU (AWQ, GPTQ) |
| Flexibility | New smaller model with its own weights | Same model at lower precision |
| Deployment impact | Smaller model, faster inference | Smaller memory footprint, often faster (with quantized kernels) |
| Famous examples | R1-Distill, DistilBERT, TinyLlama | GGUF (llama.cpp), AWQ, GPTQ, BitsAndBytes |
| Combinable? | Yes (quantize the distilled student) | Yes (applies to any model, including distilled ones) |

Verdict

They solve different problems. Distillation is about making a new model that inherits behavior — good when you want a permanently smaller, faster model and you can afford training cost. Quantization is about making any existing model cheaper to serve — good when you want fast, cheap optimization with minimal quality impact. In production you often do both: distill to a smaller size, then quantize for deployment.

When to choose each

Choose Distillation if…

  • You want a fundamentally smaller model, not just a smaller memory footprint.
  • You have large amounts of teacher-generated data.
  • You need a specialized model (e.g., coding-only) from a generalist teacher.
  • You can afford a real training run.
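The standard distillation objective (Hinton et al., 2015, listed in the sources) trains the student to match the teacher's temperature-softened output distribution. A minimal pure-Python sketch; `kd_loss` and the example logits are illustrative, and a real training loop would combine this with a cross-entropy term on ground-truth labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions.

    Scaled by T^2, as in Hinton et al. (2015), so gradient magnitudes
    stay comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss:
print(kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
```

The temperature is the key knob: higher values soften both distributions, exposing the teacher's relative preferences among wrong answers ("dark knowledge") that hard labels discard.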

Choose Quantization if…

  • You just want cheaper inference of an existing model.
  • You don't want to retrain anything.
  • You need to fit a model into consumer GPU memory.
  • You're deploying to edge devices (phones, laptops).
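At its simplest, post-training quantization maps each weight onto a small integer grid and stores one scale factor for reconstruction. Below is a toy sketch of symmetric absmax int8 quantization; `quantize_int8` is an illustrative name, and real methods such as AWQ and GPTQ add per-group scales and calibration data on top of this idea.

```python
# Toy symmetric absmax int8 quantization: one scale per weight vector.
# Production quantizers use per-channel or per-group scales plus calibration.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] via the absolute maximum."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original floats."""
    return [qi * scale for qi in q]

w = [0.12, -0.5, 0.31, 0.07]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q)                                   # small integers in [-127, 127]
print(f"max reconstruction error: {err:.4f}")
```

The reconstruction error is the quality loss the table above refers to: at 8 bits the grid is fine enough to be near-lossless, while at 4 bits (16 levels) the rounding error becomes measurable on benchmarks.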

Frequently asked questions

Which preserves quality better?

Quantization at int8 is near-lossless on most LLMs; int4 typically loses 1-3 points on benchmarks. Distillation quality depends entirely on the training data: a poorly distilled model can score worse than an int4-quantized full-size model.

Can I quantize a distilled model?

Yes — this is common. R1-Distill-Qwen-32B-Q4 on a consumer card is a typical edge deployment recipe.

What's the best quantization method today?

AWQ and GPTQ for GPU serving (via vLLM), GGUF Q4_K_M for CPU inference with llama.cpp, and FP8 on H100-class hardware. For MoE models, int4 tends to cost more accuracy, so FP8 is usually the better tradeoff.

Sources

  1. Hinton, Vinyals, Dean (2015). Distilling the Knowledge in a Neural Network — accessed 2026-04-20
  2. Lin et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — accessed 2026-04-20