Distillation vs Quantization
Distillation and quantization are the two fundamental ways to make an LLM cheaper to run. Distillation creates a smaller model that mimics the behavior of a larger one — you pay training cost once, then run a genuinely smaller model. Quantization reduces the precision of an existing model's weights (e.g., from 16-bit to 4-bit) — no training, but some quality loss.
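The "mimics the behavior" part of distillation is usually implemented as a loss on temperature-softened teacher probabilities. A minimal sketch of that objective in pure Python (the function names and toy logits here are illustrative, not from any particular library):

```python
import math

def softmax(logits, temperature=1.0):
    # Soften (or sharpen) the distribution by scaling logits, then normalize.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions —
    # the core soft-label objective of Hinton-style distillation.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q)
    )
```

The loss is zero when the student's logits match the teacher's and grows as the distributions diverge; in practice it is mixed with a standard cross-entropy term on ground-truth labels.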
Side-by-side
| Criterion | Distillation | Quantization |
|---|---|---|
| What it changes | Number of parameters | Bit-precision of parameters |
| Training required | Yes — student trained from teacher outputs | No (post-training quantization) or light calibration |
| Typical size reduction | 2-10x in parameters (e.g., 70B → 7B) | 2x (bf16 → int8) to 4x (bf16 → int4) |
| Quality retention | Depends heavily on training data | Near-lossless up to int8; measurable drop at int4 |
| Compute cost | Full training run — $10k-$100k+ | Hours on a single GPU (AWQ, GPTQ) |
| Flexibility | New smaller model with its own weights | Same weights at lower precision |
| Deployment impact | Smaller and faster inference | Smaller memory, often faster (with quantized kernels) |
| Famous examples | R1-Distill, DistilBERT, TinyLlama | GGUF (llama.cpp), AWQ, GPTQ, BitsAndBytes |
| Combinable | Yes — quantize the distilled student | Yes — quantize any model including distilled |
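To make the "bit-precision" row concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, the simplest post-training scheme (real methods like AWQ and GPTQ add per-group scales and calibration; the function names are illustrative):

```python
def quantize_int8(weights):
    # Symmetric quantization: map [-max|w|, +max|w|] onto integers [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale  # store 8-bit ints plus one float scale

def dequantize(q, scale):
    # Recover approximate float weights at inference time.
    return [qi * scale for qi in q]
```

Each weight now needs 8 bits instead of 16, and the round-trip error is bounded by half a quantization step (`scale / 2`), which is why int8 is near-lossless while int4, with 16x fewer levels, shows a measurable drop.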
Verdict
They solve different problems. Distillation is about making a new model that inherits behavior — good when you want a permanently smaller, faster model and you can afford training cost. Quantization is about making any existing model cheaper to serve — good when you want fast, cheap optimization with minimal quality impact. In production you often do both: distill to a smaller size, then quantize for deployment.
When to choose each
Choose Distillation if…
- You want a fundamentally smaller model, not just a smaller memory footprint.
- You have large amounts of teacher-generated data.
- You need a specialized model (e.g., coding-only) from a generalist teacher.
- You can afford a real training run.
Choose Quantization if…
- You just want cheaper inference of an existing model.
- You don't want to retrain anything.
- You need to fit a model into consumer GPU memory.
- You're deploying to edge devices (phones, laptops).
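The "fit into consumer GPU memory" criterion is simple arithmetic on weight storage. A back-of-the-envelope helper (weights only; KV cache and activations add more on top):

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    # Bytes for weights alone: params * bits / 8, expressed in GB (10^9 bytes).
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 70B model needs ~140 GB at bf16 (16-bit) but ~35 GB at int4 —
# the difference between a multi-GPU server and a single 48 GB card.
bf16 = weight_memory_gb(70, 16)  # 140.0
int4 = weight_memory_gb(70, 4)   # 35.0
```

The same arithmetic explains the 7B sweet spot for laptops: at int4, a 7B model's weights take about 3.5 GB.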
Frequently asked questions
Which preserves quality better?
Quantization at int8 is near-lossless on most LLMs; int4 typically loses 1-3 points on benchmarks. Distillation quality depends entirely on training data — a bad distill is worse than int4 on a full model.
Can I quantize a distilled model?
Yes — this is common. R1-Distill-Qwen-32B-Q4 on a consumer card is a typical edge deployment recipe.
What's the best quantization method today?
AWQ and GPTQ for GPU serving (via vLLM), GGUF Q4_K_M for CPU/llama.cpp, FP8 on H100-class hardware. For MoE models, int4 has a bigger drop; FP8 is usually the right tradeoff.
Sources
- Distillation (Hinton et al., 2015) — accessed 2026-04-20
- AWQ paper — accessed 2026-04-20