Curiosity · Concept
Model Distillation
Distillation was introduced by Hinton, Vinyals, and Dean in 2015 and has become a core technique for compressing frontier LLMs into deployable sizes. The student learns from the teacher's full output distribution rather than hard labels alone, absorbing "dark knowledge" about how the teacher rates alternatives. Modern small models — Phi, Gemma, Llama 3.2, DeepSeek-R1-Distill — are trained largely via distillation.
Quick reference
- Proficiency
- Intermediate
- Also known as
- knowledge distillation, KD, teacher-student training
- Prerequisites
- Fine-tuning, Neural networks
Frequently asked questions
What is knowledge distillation?
It is training a smaller student model to reproduce a larger teacher model's behavior. Instead of learning from ground-truth labels alone, the student learns from the teacher's richer output distribution, which encodes more information per example than a hard label.
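The "richer output distribution" can be made concrete with temperature scaling: dividing logits by a temperature T > 1 before the softmax exposes how the teacher ranks the wrong answers, not just which answer wins. A minimal sketch (the logit values here are hypothetical, chosen only for illustration):

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for a 4-class example.
teacher_logits = [8.0, 2.0, 1.0, -1.0]

hard = softmax(teacher_logits, T=1.0)  # near one-hot: hides the alternatives
soft = softmax(teacher_logits, T=4.0)  # softened: exposes relative rankings
```

At T=1 the top class takes almost all the mass, so the distribution looks like a hard label; at T=4 the runner-up classes receive visible probability, which is the extra signal the student trains on.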
Why distill instead of just training a small model from scratch?
A small model trained directly on a dataset caps out at what that data supports. A student trained on a strong teacher's outputs benefits from the teacher's broader training and reasoning, often beating an equally-sized model trained on raw data. DeepSeek-R1-Distill-Qwen models showed this dramatically for reasoning.
What are the main types of distillation?
- Response distillation — the student is fine-tuned on (prompt, teacher-response) pairs.
- Logit/output distillation — the student matches the teacher's full probability distribution over tokens.
- Feature distillation — the student also matches the teacher's intermediate hidden states.
- Chain-of-thought distillation — the student learns to reproduce the teacher's reasoning traces.
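Logit distillation is typically trained with a blended objective in the style of Hinton et al. (2015): cross-entropy against the true label plus a KL term pulling the student toward the teacher's temperature-softened distribution. A self-contained sketch; the T, alpha, and logit values are illustrative assumptions, not fixed recommendations:

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_idx, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    temperature-softened distribution. The T**2 factor keeps soft-target
    gradients on a comparable scale as T grows (per Hinton et al.)."""
    p_student = softmax(student_logits)
    ce_hard = -math.log(p_student[true_idx])
    q_teacher = softmax(teacher_logits, T)
    q_student = softmax(student_logits, T)
    kl = sum(qt * math.log(qt / qs) for qt, qs in zip(q_teacher, q_student))
    return alpha * ce_hard + (1 - alpha) * (T * T) * kl
```

When the student's logits equal the teacher's, the KL term vanishes and only the small hard-label term remains; a student far from the teacher pays a much larger loss, which is the gradient signal that drives distillation.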
What are the downsides of distillation?
Students inherit teacher mistakes, biases, and hallucinations. They rarely exceed teacher quality on the teacher's strong domains. Distillation can also narrow the student's distribution compared to diverse pretraining, leading to mode collapse on edge-case inputs.
Sources
- Hinton et al. — Distilling the Knowledge in a Neural Network — accessed 2026-04-20
- DeepSeek-R1 Technical Report — accessed 2026-04-20
- Hugging Face — Knowledge Distillation — accessed 2026-04-20