TL;DR
- Knowledge distillation (Hinton, Vinyals, Dean, 2015, arXiv:1503.02531) trains a small student model to match the soft output distribution of a larger teacher model, transferring capabilities the student could not learn from labels alone.
- Two flavours dominate in LLMs: black-box distillation (student trains on teacher-generated text) and white-box distillation (student matches teacher logits or hidden states directly).
- The Orca, Phi, Gemma 2, and Llama 3.2 small models all rely on distillation or teacher-generated training data in some form. The technique is arguably one of the main reasons 1-8B open models perform as well as they do.
- Distillation is bounded by the teacher: the student approaches but generally does not exceed teacher quality on the distilled distribution.
The Original Formulation
Hinton, Vinyals, and Dean's 2015 paper framed distillation as a classification problem. A large teacher network produces logits — pre-softmax scores — over a class space. The 'dark knowledge' the authors identified is in the relative values of the non-top classes: the teacher's relative confidence between, say, 'dog' and 'cat' carries information the one-hot label does not.
The recipe: divide the teacher's logits by a temperature T > 1 to soften the distribution, divide the student's logits by the same T, and train the student to match the soft distribution via KL divergence. The student also trains on the hard labels via standard cross-entropy at T = 1. Because the gradients from the soft targets scale as 1/T², the paper multiplies the soft term by T² so the two terms stay balanced as T changes. The combined loss transfers the teacher's calibration along with its predictions.
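The recipe can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the mixing weight `alpha`, the default `temperature=4.0`, and the toy logits are assumptions chosen for the example, and a real training loop would use a tensor library and batches.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft (teacher-matching) term and a hard (label) term.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    # Soft term: KL between the temperature-softened teacher and student,
    # multiplied by T^2 to keep its gradient scale comparable to the hard term.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Hard term: ordinary cross-entropy against the true label at T = 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1.0 - alpha) * hard

# Toy logits over three classes: dog, cat, car (made up for illustration).
teacher_logits = [5.0, 3.0, -2.0]
student_logits = [2.0, 1.5, 0.5]

print("teacher at T=1:", softmax(teacher_logits))
print("teacher at T=4:", softmax(teacher_logits, 4.0))
print("loss:", distillation_loss(student_logits, teacher_logits, label=0))
```

Raising the temperature moves probability mass from 'dog' towards 'cat' and 'car', which is what exposes the relative ordering of the non-top classes to the student.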
Distillation in LLMs
For language models the same idea applies, but the 'class space' is the vocabulary, and the classification is repeated at every next-token prediction. Two practical approaches have emerged.
- Black-box (sequence-level) distillation: have the teacher generate text, train the student on that text with standard cross-entropy. Works with any teacher behind any API. This is what Alpaca, Orca, and most synthetic-data pipelines effectively do.
- White-box (token-level) distillation: train the student to match the teacher's full next-token probability distribution at each position, usually via forward or reverse KL. Requires logit access — possible only with open-weight teachers but materially more sample-efficient.
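The difference in training signal between the two approaches can be illustrated at a single token position. The sketch below is an assumption-laden toy: the four-token vocabulary, the probabilities, and the function names are all hypothetical, and forward KL stands in for whichever divergence a white-box method actually uses.

```python
import math

def sequence_level_ce(student_probs, teacher_token):
    """Black-box signal: cross-entropy against the one token the teacher emitted."""
    return -math.log(student_probs[teacher_token])

def token_level_forward_kl(teacher_probs, student_probs):
    """White-box signal: KL(teacher || student) over the full vocabulary."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0.0)

# Made-up next-token distributions over a four-token vocabulary.
teacher_probs = [0.60, 0.30, 0.08, 0.02]
teacher_token = 0  # the token id the teacher happened to generate

# Two hypothetical students that agree on token 0 but differ on the rest.
student_a = [0.40, 0.30, 0.20, 0.10]
student_b = [0.40, 0.10, 0.20, 0.30]

for name, probs in (("A", student_a), ("B", student_b)):
    print(name,
          "black-box:", round(sequence_level_ce(probs, teacher_token), 4),
          "white-box:", round(token_level_forward_kl(teacher_probs, probs), 4))
```

Both students receive the same black-box loss because only the sampled token is visible, while the white-box loss separates them by how well they match the teacher on the remaining tokens. This is one way to read the sample-efficiency claim: each position carries a full distribution rather than a single label.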
Loss Variants
| Loss | Direction | Use case |
|---|---|---|
| Forward KL | KL(teacher ‖ student) | Mode-covering — student spreads probability mass |
| Reverse KL | KL(student ‖ teacher) | Mode-seeking — student picks one mode (often better for chat) |
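| Jensen-Shannon | Average of KL(teacher ‖ M) and KL(student ‖ M), where M is the equal mixture of the two | Symmetric compromise between forward and reverse |
| MiniLLM (reverse KL with policy-gradient optimisation) | KL(student ‖ teacher) with training stabilisers | Strong reported results for white-box LLM distillation |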
| Sequence-level cross-entropy | Hard targets only | Black-box — text from teacher |
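Reverse KL is often the preferred direction for white-box LLM distillation, and it is the choice MiniLLM argues for. A student has less capacity than its teacher, so it usually cannot match the teacher's distribution exactly. Under forward KL it is pushed to put mass on every token the teacher considered plausible, which tends to lead to bland, hedging outputs. Reverse KL lets the student commit to the highest-probability mode.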
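The mode-covering versus mode-seeking behaviour can be checked numerically on a toy case. In the sketch below, the teacher distribution and the two candidate students are made-up values, chosen so that neither student can match the teacher exactly; the helper names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jensen_shannon(p, q):
    """Average of KL(p || m) and KL(q || m), where m is the equal mixture."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up teacher with two equally plausible tokens (ids 0 and 3).
teacher = [0.49, 0.01, 0.01, 0.49]

# Two hypothetical limited-capacity students.
candidates = {
    "spread": [0.25, 0.25, 0.25, 0.25],     # covers every token
    "committed": [0.97, 0.01, 0.01, 0.01],  # commits to one teacher mode
}

# Forward KL puts the teacher first; reverse KL puts the student first.
forward_best = min(candidates, key=lambda n: kl(teacher, candidates[n]))
reverse_best = min(candidates, key=lambda n: kl(candidates[n], teacher))

print("forward KL prefers:", forward_best)
print("reverse KL prefers:", reverse_best)
print("Jensen-Shannon:",
      {n: round(jensen_shannon(teacher, s), 4) for n, s in candidates.items()})
```

With these numbers forward KL selects the spread student and reverse KL selects the committed one, matching the table's description of the two directions.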
Notable Distilled LLMs
- DistilBERT (2019): 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance on GLUE. An early distillation success story for Transformer language models.
- Orca (2023): trained a 13B LLaMA student on GPT-4 explanation traces; demonstrated that reasoning could be transferred via step-by-step explanations.
- Phi-1, Phi-2, Phi-3, Phi-4 (Microsoft): small models trained primarily on synthetic and distilled data of carefully chosen quality.
- Gemma 2 9B and 2B: Google's open-weight models trained with knowledge distillation from a larger teacher model rather than plain next-token prediction.
- Llama 3.2 1B and 3B: Meta's edge-focused distillations using a combination of pruning and knowledge transfer.
- DeepSeek-R1-Distill series: distilled reasoning traces from DeepSeek-R1 into Llama and Qwen students.
Trade-offs
- Pro: cheap and effective — a small student can absorb most of a teacher's capability at a fraction of the inference cost.
- Pro: works as a compression technique — large internal model → small deployed model.
- Pro: composes with quantisation, pruning, and standard fine-tuning.
- Con: bounded by teacher quality, since the student generally does not exceed the teacher on the distilled distribution.
- Con: inherits teacher biases and hallucinations wholesale.
- Con: white-box distillation requires tokenizer alignment between teacher and student.
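The tokenizer-alignment constraint exists because a token-level loss compares teacher and student probabilities index by index, so each id must mean the same token in both models. A minimal pre-flight check might look like the sketch below, assuming each vocabulary is available as a plain token-to-id dictionary; the function name and the toy vocabularies are hypothetical.

```python
def vocab_mismatches(teacher_vocab, student_vocab):
    """Return tokens that are missing from one vocabulary or mapped to different ids.

    Each value is a (teacher_id, student_id) pair, with None for a missing token.
    """
    mismatches = {}
    for token in sorted(set(teacher_vocab) | set(student_vocab)):
        teacher_id = teacher_vocab.get(token)
        student_id = student_vocab.get(token)
        if teacher_id != student_id:
            mismatches[token] = (teacher_id, student_id)
    return mismatches

# Toy vocabularies, made up for illustration.
teacher_vocab = {"<s>": 0, "the": 1, "cat": 2, "dog": 3}
student_vocab = {"<s>": 0, "the": 1, "dog": 2, "cat": 3, "car": 4}

problems = vocab_mismatches(teacher_vocab, student_vocab)
if problems:
    print("Vocabularies are not aligned:", problems)
else:
    print("Vocabularies match id for id.")
```

An empty result is a necessary condition rather than a sufficient one: two tokenizers with identical vocabularies can still segment the same text differently, so comparing their outputs on sample text is a sensible additional check.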
When to Distil
Distil when you have a strong teacher (open weights or API access) and need a smaller, cheaper deployment target. Distil specifically when the student needs reasoning or behaviours that supervised data alone cannot easily produce — chain-of-thought, complex instruction following, agent loops. For straightforward task fine-tuning where labels are easy to collect, standard SFT is usually simpler and almost as effective.
References
- Distilling the Knowledge in a Neural Network · arXiv (Hinton, Vinyals, Dean, 2015)
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4 · arXiv (Mukherjee et al., 2023)
- MiniLLM: Knowledge Distillation of Large Language Models · arXiv (Gu et al., 2023)