Knowledge Distillation

TL;DR

Knowledge distillation (Hinton, Vinyals, Dean, 2015, arXiv:1503.02531) trains a small student model to match the soft output distribution of a larger teacher model, transferring capabilities the student could not learn from labels alone.
Two flavours dominate in LLMs: black-box distillation (student trains on teacher-generated text) and white-box distillation (student matches teacher logits or hidden states directly).
Orca, the Phi family, Gemma 2, and the small Llama 3.2 models all rely on distillation or teacher-generated data in some form. The technique is one of the main reasons open models in the 1B to 8B range perform as well as they do.
Distillation is largely bounded by the teacher: the student approaches, but generally does not exceed, teacher quality on the distilled distribution.

The Original Formulation

Hinton, Vinyals, and Dean's 2015 paper framed distillation as a classification problem. A large teacher network produces logits — pre-softmax scores — over a class space. The 'dark knowledge' the authors identified is in the relative values of the non-top classes: the teacher's relative confidence between, say, 'dog' and 'cat' carries information the one-hot label does not.

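The recipe: divide the teacher's logits by a temperature T > 1 to soften the distribution, divide the student's logits by the same T, and train the student to match the softened teacher distribution by minimising KL divergence (equivalently, cross-entropy against the soft targets). The student also trains on the hard labels via standard cross-entropy at T = 1. Because the soft-target gradients shrink as 1/T², the paper recommends multiplying the soft term by T² when the two terms are combined. The combined loss transfers the teacher's relative confidences along with its top predictions.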
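The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function names, the example logits, and the mixing weight `alpha` are assumptions introduced here, and a real training loop would compute the same quantities on batched tensors in a deep learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    `alpha` (hypothetical name) is the weight on the soft term.
    """
    # Soft term: teacher and student logits are softened with the same T.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = kl_divergence(soft_teacher, soft_student)
    # Hard term: ordinary cross-entropy against the true label at T = 1.
    hard_loss = -math.log(softmax(student_logits)[label])
    # T^2 rescales the soft term so its gradients stay comparable to the hard term.
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss

# Made-up logits over three classes, e.g. dog, cat, car.
teacher = [6.0, 3.0, 1.0]
student = [2.0, 1.5, 0.5]
loss = distillation_loss(student, teacher, label=0)
```

Raising `temperature` flattens both distributions, which gives the non-top classes more weight in the soft term; setting `alpha` to 0 recovers plain supervised training.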

Distillation in LLMs

For language models the same idea applies, but the 'class space' is the vocabulary of the next-token prediction. Two practical approaches have emerged.

Black-box (sequence-level) distillation: have the teacher generate text, train the student on that text with standard cross-entropy. Works with any teacher behind any API. This is what Alpaca, Orca, and most synthetic-data pipelines effectively do.
White-box (token-level) distillation: train the student to match the teacher's full next-token probability distribution at each position, usually via forward or reverse KL. Requires access to the teacher's logits, which in practice means an open-weight or self-hosted teacher, and is typically more sample-efficient because each position supplies a full distribution rather than a single hard label.
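The difference in training signal between the two approaches can be illustrated at a single token position. The sketch below uses a made-up four-token vocabulary and hypothetical probabilities: two students assign the same probability to the token the teacher emitted, so the black-box loss cannot tell them apart, while the white-box loss can.

```python
import math

def cross_entropy_hard(student_probs, token_id):
    """Black-box signal: only the token the teacher actually emitted is known."""
    return -math.log(student_probs[token_id])

def forward_kl(teacher_probs, student_probs):
    """White-box signal: KL(teacher || student) over the whole vocabulary."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

# Hypothetical next-token distributions over a toy vocabulary.
vocab = ["Paris", "Lyon", "France", "the"]
teacher_probs = [0.70, 0.20, 0.08, 0.02]
student_a = [0.40, 0.30, 0.20, 0.10]  # ranks the alternatives like the teacher
student_b = [0.40, 0.10, 0.20, 0.30]  # same mass on "Paris", different tail

sampled_token = vocab.index("Paris")  # what the teacher's generated text contained

# Identical hard-target losses: both students put 0.40 on "Paris".
black_box_a = cross_entropy_hard(student_a, sampled_token)
black_box_b = cross_entropy_hard(student_b, sampled_token)

# Different distribution-matching losses: student_a is closer to the teacher.
white_box_a = forward_kl(teacher_probs, student_a)
white_box_b = forward_kl(teacher_probs, student_b)
```

With enough sampled text the black-box student still sees the alternatives, but only one token per position at a time, which is one way to read the sample-efficiency gap.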

Loss Variants

Loss	Objective	Use case
Forward KL	KL(teacher ‖ student)	Mode-covering: student spreads probability mass over everything the teacher finds plausible
Reverse KL	KL(student ‖ teacher)	Mode-seeking: student commits to one mode (often better for chat)
Jensen-Shannon	Symmetrised KL against the teacher-student mixture	Compromise between forward and reverse
MiniLLM (reverse KL)	Reverse KL optimised with policy gradient plus stabilising techniques	Strong reported results for white-box LLM distillation
Sequence-level cross-entropy	Hard targets only	Black-box: trains on text from the teacher

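Reverse KL has become a common choice for white-box LLM distillation, and it is the direction MiniLLM argues for. A small student usually lacks the capacity to reproduce every mode of the teacher's distribution. Forward KL still pushes it to put mass on every token the teacher considered plausible, which can lead to bland, hedging outputs. Reverse KL penalises mass placed where the teacher has little, so the student can commit to a high-probability mode instead.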
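The mode-covering versus mode-seeking behaviour can be checked numerically on a toy case. The sketch below assumes a bimodal teacher over four tokens and a student that cannot represent it exactly, so it must choose between a spread-out distribution and a committed one; all probabilities and names are made up for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher, student):
    return kl(teacher, student)   # expectation under the teacher

def reverse_kl(teacher, student):
    return kl(student, teacher)   # expectation under the student

def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint mixture."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

teacher  = [0.48, 0.02, 0.02, 0.48]   # two equally good modes
covering = [0.25, 0.25, 0.25, 0.25]   # student spreads mass everywhere
seeking  = [0.94, 0.02, 0.02, 0.02]   # student commits to one mode

# Forward KL prefers the covering student (about 0.53 vs 1.20).
fwd = {"covering": forward_kl(teacher, covering),
       "seeking": forward_kl(teacher, seeking)}

# Reverse KL prefers the committed student (about 0.57 vs 0.94).
rev = {"covering": reverse_kl(teacher, covering),
       "seeking": reverse_kl(teacher, seeking)}

# Jensen-Shannon is symmetric in its arguments.
js_value = js(teacher, seeking)
```

The forward direction punishes the committed student heavily for ignoring the teacher's second mode, while the reverse direction punishes the spread-out student for placing mass on tokens the teacher considers unlikely.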

Notable Distilled LLMs

DistilBERT (2019): 40% smaller than BERT while retaining about 97% of its GLUE performance. An early success story for distilling pretrained language models.
Orca (2023): distilled GPT-4 explanation traces into a 13B LLaMA model; demonstrated that reasoning behaviour could be transferred via step-by-step explanations.
Phi-1, Phi-2, Phi-3, Phi-4 (Microsoft): small models trained primarily on carefully curated synthetic and distilled data.
Gemma 2 9B and 2B: open-weight models that Google trained with knowledge distillation from a larger teacher model rather than plain next-token prediction.
Llama 3.2 1B and 3B: Meta's edge-focused models, built by pruning Llama 3.1 8B and distilling from larger Llama 3.1 models.
DeepSeek-R1-Distill series: Llama and Qwen students fine-tuned on reasoning traces generated by DeepSeek-R1.

Trade-offs

Pro: cheap and effective — a small student can absorb most of a teacher's capability at a fraction of the inference cost.
Pro: works as a compression technique — large internal model → small deployed model.
Pro: composes with quantisation, pruning, and standard fine-tuning.
Con: bounded by teacher quality; the student rarely exceeds the teacher on the distilled distribution.
Con: inherits teacher biases and hallucinations wholesale.
Con: white-box distillation requires tokenizer alignment between teacher and student, since the two next-token distributions must be defined over the same vocabulary.
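The tokenizer-alignment constraint can be checked before committing to a white-box setup. The sketch below is a hypothetical helper, not part of any library: it assumes each tokenizer's vocabulary is available as a plain token-to-id dictionary and reports how much of the teacher's vocabulary the student shares.

```python
def vocab_alignment(teacher_vocab, student_vocab):
    """Compare two token -> id mappings and report how compatible they are."""
    shared = set(teacher_vocab) & set(student_vocab)
    same_id = {tok for tok in shared
               if teacher_vocab[tok] == student_vocab[tok]}
    size = max(len(teacher_vocab), 1)  # avoid division by zero
    return {
        # True only if logits can be compared index by index.
        "identical": teacher_vocab == student_vocab,
        # Fraction of teacher tokens that exist in the student at all.
        "shared_fraction": len(shared) / size,
        # Fraction of teacher tokens that also sit at the same index.
        "same_id_fraction": len(same_id) / size,
    }

# Made-up toy vocabularies for illustration.
teacher_vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3}
student_vocab = {"<s>": 0, "the": 1, "sat": 2, "dog": 3}
report = vocab_alignment(teacher_vocab, student_vocab)
```

When `identical` is false, token-level KL cannot be applied directly; the usual fallbacks are black-box distillation on generated text or picking a student that shares the teacher's tokenizer.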

When to Distil

Distil when you have a strong teacher (open weights or API access) and need a smaller, cheaper deployment target. Distil specifically when the student needs reasoning or behaviours that supervised data alone cannot easily produce — chain-of-thought, complex instruction following, agent loops. For straightforward task fine-tuning where labels are easy to collect, standard SFT is usually simpler and almost as effective.

References

Distilling the Knowledge in a Neural Network · arXiv (Hinton, Vinyals, Dean, 2015)
Orca: Progressive Learning from Complex Explanation Traces · arXiv (Mukherjee et al., 2023)
MiniLLM: Knowledge Distillation of Large Language Models · arXiv (Gu et al., 2023)
