
Distillation vs Quantization

Distillation and quantization are the two main ways to make an LLM cheaper to run. Distillation trains a smaller model to mimic the behavior of a larger one: you pay a training cost once, then run a genuinely smaller model. Quantization reduces the numeric precision of an existing model's weights (e.g., from 16-bit to 4-bit): no training is required, but there is some quality loss.
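To make the size numbers concrete, here is a back-of-envelope calculation of weight memory at different precisions. This is a sketch covering weights only; real deployments also need memory for the KV cache and activations.

```python
# Back-of-envelope memory for model weights at different precisions.
# Weights only; KV cache and activations add more on top.

def weight_memory_gb(n_params: float, bits: int) -> float:
    """Weight bytes = params * bits / 8, converted to GB."""
    return n_params * bits / 8 / 1e9

llama_70b = 70e9
print(f"70B bf16: {weight_memory_gb(llama_70b, 16):.0f} GB")  # 140 GB
print(f"70B int8: {weight_memory_gb(llama_70b, 8):.0f} GB")   # 70 GB
print(f"70B int4: {weight_memory_gb(llama_70b, 4):.0f} GB")   # 35 GB

# A distilled 7B student at full bf16 precision is smaller still:
print(f"7B bf16:  {weight_memory_gb(7e9, 16):.0f} GB")        # 14 GB
```

This is why the two techniques compose: distilling 70B → 7B gives a 10x reduction, and quantizing the student bf16 → int4 gives another 4x.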

Side-by-side

| Criterion | Distillation | Quantization |
|---|---|---|
| What it changes | Number of parameters | Bit-precision of parameters |
| Training required | Yes (student trained on teacher outputs) | No (post-training quantization) or light calibration |
| Typical size reduction | 2-10x (e.g., 70B → 7B) | 2x (bf16 → int8) to 4x (bf16 → int4) |
| Quality retention | Depends heavily on training data | Near-lossless up to int8; measurable drop at int4 |
| Compute cost | Full training run ($10k-$100k+) | Hours on a single GPU (AWQ, GPTQ) |
| Flexibility | New smaller model with its own weights | Same model at lower precision |
| Deployment impact | Smaller model, faster inference | Smaller memory footprint, often faster (with quantized kernels) |
| Famous examples | R1-Distill, DistilBERT, TinyLlama | GGUF (llama.cpp), AWQ, GPTQ, BitsAndBytes |
| Combinable? | Yes (quantize the distilled student) | Yes (applies to any model, including distilled ones) |

Verdict

They solve different problems. Distillation is about making a new model that inherits behavior — good when you want a permanently smaller, faster model and you can afford training cost. Quantization is about making any existing model cheaper to serve — good when you want fast, cheap optimization with minimal quality impact. In production you often do both: distill to a smaller size, then quantize for deployment.

When to choose each

Choose Distillation if…

  • You want a fundamentally smaller model, not just a smaller memory footprint.
  • You have large amounts of teacher-generated data.
  • You need a specialized model (e.g., coding-only) from a generalist teacher.
  • You can afford a real training run.
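The standard distillation objective (Hinton et al., 2015, listed in the sources) trains the student to match the teacher's temperature-softened output distribution. A minimal pure-Python sketch; `kd_loss` and the example logits are illustrative, and a real training loop would combine this with a cross-entropy term on ground-truth labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions.

    Scaled by T^2, as in Hinton et al. (2015), so gradient magnitudes
    stay comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss:
print(kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
```

The temperature is the key knob: higher values soften both distributions, exposing the teacher's relative preferences among wrong answers ("dark knowledge") that hard labels discard.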

Choose Quantization if…

  • You just want cheaper inference of an existing model.
  • You don't want to retrain anything.
  • You need to fit a model into consumer GPU memory.
  • You're deploying to edge devices (phones, laptops).
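At its simplest, post-training quantization maps each weight onto a small integer grid and stores one scale factor for reconstruction. Below is a toy sketch of symmetric absmax int8 quantization; `quantize_int8` is an illustrative name, and real methods such as AWQ and GPTQ add per-group scales and calibration data on top of this idea.

```python
# Toy symmetric absmax int8 quantization: one scale per weight vector.
# Production quantizers use per-channel or per-group scales plus calibration.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] via the absolute maximum."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original floats."""
    return [qi * scale for qi in q]

w = [0.12, -0.5, 0.31, 0.07]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q)                                   # small integers in [-127, 127]
print(f"max reconstruction error: {err:.4f}")
```

The reconstruction error is the quality loss the table above refers to: at 8 bits the grid is fine enough to be near-lossless, while at 4 bits (16 levels) the rounding error becomes measurable on benchmarks.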

Frequently asked questions

Which preserves quality better?

Quantization at int8 is near-lossless on most LLMs; int4 typically loses 1-3 points on benchmarks. Distillation quality depends entirely on the training data: a poorly distilled model can score worse than an int4-quantized full-size model.

Can I quantize a distilled model?

Yes — this is common. R1-Distill-Qwen-32B-Q4 on a consumer card is a typical edge deployment recipe.

What's the best quantization method today?

AWQ and GPTQ for GPU serving (via vLLM), GGUF Q4_K_M for CPU inference with llama.cpp, and FP8 on H100-class hardware. For MoE models, int4 tends to cost more accuracy, so FP8 is usually the better tradeoff.

Sources

  1. Hinton, Vinyals, Dean (2015). Distilling the Knowledge in a Neural Network — accessed 2026-04-20
  2. Lin et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — accessed 2026-04-20