Model Distillation for Inference

TL;DR

Compression technique where a small student model is trained on the outputs of a larger teacher model.
Standard recipes: response distillation (student matches teacher's text), logit distillation (student matches teacher's full probability distribution), and feature distillation (student matches teacher's hidden states).
Distilled models typically reach a higher quality-per-parameter ratio than same-size models trained from scratch on the same data.
Common pairing with pruning (prune then distill) and quantisation (distill then quantise) for production-ready small models.

Why Distill

A 70B-parameter model is expensive to serve at scale. An 8B model on the same workload runs at perhaps a tenth of the cost. The catch is that an 8B trained from scratch on the same data lags noticeably behind the 70B on most benchmarks. Distillation aims to close that gap: train the 8B to reproduce the 70B's behaviour, then serve the cheap 8B.

The technique was popularised by Hinton et al. (2015), who demonstrated it on image classification and speech recognition models, and it was adapted to LLMs throughout 2023-2024. By 2026, production playbooks commonly treat distillation as a standard step in the pipeline from research model to deployable endpoint.

Distillation Targets

Response distillation: generate text from the teacher, then fine-tune the student on the resulting prompt-response pairs. The simplest and most widely used recipe, often labelled a "synthetic data" approach.
Logit distillation: minimise the KL divergence between student and teacher token distributions at each position. Usually gives higher quality, but it requires the teacher's logits (so either both models in memory during training, or logits precomputed and stored) and in its basic form assumes the two models share a vocabulary.
Feature distillation: align the student's hidden states with the teacher's via learned projection matrices. Useful when student and teacher have similar architectures.
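To make the logit-distillation objective concrete, here is a minimal pure-Python sketch of the per-position KL loss. It is an illustration, not a training-ready implementation: the function names are hypothetical, the temperature parameter and its squared scaling follow the formulation in Hinton et al. (2015) rather than anything specified above, and a real pipeline would compute this on tensors in a deep-learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-position KL(teacher || student) over one sequence.

    Both arguments are lists of per-position logit lists over the SAME
    vocabulary. The temperature (an assumption, not from the text above)
    softens both distributions; multiplying by temperature**2 keeps gradient
    magnitudes comparable across temperatures.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same, non-empty positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)  # teacher distribution (the target)
        q = softmax(s_row, temperature)  # student distribution
        total += kl_divergence(p, q)
    return (temperature ** 2) * total / len(teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge. In practice it is often mixed with the ordinary next-token cross-entropy loss using a weighting factor chosen per project.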

Practical Pipeline

A production distillation pipeline typically: (1) selects or trains a teacher; (2) generates a large prompt-response dataset from the teacher across the target task distribution; (3) fine-tunes a smaller student model on this dataset, optionally with logit-matching loss; (4) quantises the student to FP8 or INT4 for deployment.
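Step (2) of the pipeline above can be sketched as follows. This is a hedged illustration only: `teacher_generate` is a hypothetical callable standing in for whatever API serves the teacher, and the deduplication and minimum-length filter are assumed hygiene steps rather than part of any specific published recipe.

```python
import json


def build_distillation_dataset(prompts, teacher_generate, min_chars=1):
    """Collect prompt-response pairs from the teacher.

    `teacher_generate` is a hypothetical callable (prompt -> response text);
    in practice it would wrap the endpoint that hosts the teacher model.
    """
    records = []
    seen = set()
    for prompt in prompts:
        if prompt in seen:  # skip duplicate prompts
            continue
        seen.add(prompt)
        response = teacher_generate(prompt).strip()
        if len(response) < min_chars:  # drop empty generations
            continue
        records.append({"prompt": prompt, "response": response})
    return records


def to_jsonl(records):
    """Serialise records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

The resulting JSONL can then feed whichever fine-tuning tool the team already uses for step (3); the field names `prompt` and `response` are placeholders and would need to match that tool's expected schema.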

When combined with pruning (first prune the teacher to a smaller intermediate model, then distill the original teacher into it), the result can recover a significant fraction of the original quality at a fraction of the parameter count. NVIDIA's Minitron line follows this pattern.

Quality Retention

Reported quality gaps vary widely by task. Code generation and structured tasks distill well; open-ended creative tasks and rare-knowledge questions distill poorly. Evaluation must cover the actual deployment workload — synthetic benchmarks frequently overstate distillation gains.
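One simple way to track this per task is to compute the student-to-teacher score ratio on each slice of the deployment workload. The sketch below assumes higher-is-better scores on the same scale for both models; the function names and the 0.9 threshold are hypothetical choices for illustration.

```python
def quality_retention(teacher_scores, student_scores):
    """Return the student/teacher score ratio for each task both were scored on.

    Both arguments map task name -> score (higher is better). Tasks where the
    teacher scored zero are skipped because the ratio is undefined.
    """
    retention = {}
    for task, teacher_score in teacher_scores.items():
        if task not in student_scores or teacher_score == 0:
            continue
        retention[task] = student_scores[task] / teacher_score
    return retention


def weakest_tasks(retention, threshold=0.9):
    """List tasks whose retention falls below the threshold, worst first."""
    flagged = [task for task, ratio in retention.items() if ratio < threshold]
    return sorted(flagged, key=lambda task: retention[task])
```

Reporting retention per task rather than as a single average makes it harder for strong categories to hide weak ones.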

Distilled students inherit teacher biases and hallucinations. If the teacher confidently confabulates, the student will learn to do so just as confidently — with the same authority but a fraction of the parameters.

Inference Implications

From a serving perspective, a distilled student is just a smaller LLM. It runs on the same runtimes (vLLM, TensorRT-LLM, SGLang, TGI), uses the same quantisation paths, and benefits from the same KV-cache optimisations. The benefit appears in the cost line: because token-by-token decoding is typically limited by memory bandwidth, a 4x smaller model on the same hardware can deliver up to roughly 4x the throughput at similar or lower latency, though the realised gain depends on batch size, sequence length, and the runtime.
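The throughput scaling can be sanity-checked with a back-of-envelope bound. The sketch below assumes single-stream decoding that is limited purely by reading the weights from memory once per generated token; it ignores KV-cache traffic, compute limits, and batching, so it gives a ceiling rather than a prediction. The function name and all input values are illustrative assumptions.

```python
def decode_tokens_per_second(num_params, bytes_per_param, memory_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Each generated token requires reading every weight once, so the ceiling
    is memory bandwidth divided by the model's size in bytes.
    """
    weight_bytes = num_params * bytes_per_param
    return memory_bandwidth_bytes_per_s / weight_bytes


def speedup(teacher_params, student_params, bytes_per_param=2):
    """Ratio of decode ceilings for two models at the same precision.

    The bandwidth term cancels, so the ratio depends only on model size.
    """
    bandwidth = 1.0  # any positive value; it cancels in the ratio
    student = decode_tokens_per_second(student_params, bytes_per_param, bandwidth)
    teacher = decode_tokens_per_second(teacher_params, bytes_per_param, bandwidth)
    return student / teacher
```

Under these assumptions, with a hypothetical 2 TB/s of memory bandwidth and 16-bit weights, a 32B model tops out near 31 tokens/s per stream and an 8B model near 125 tokens/s, a 4x ratio. Quantising the student lowers `bytes_per_param` and raises the ceiling further.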

When to Distill

Distillation pays off when (a) a specific task or domain needs to be served cheaply, (b) the teacher's quality on that task is substantially higher than what a small from-scratch model can reach, and (c) the team can maintain the data-generation and training pipeline. For general-purpose chat, picking an existing well-trained small model is usually faster than running a distillation pipeline from scratch.

References

Distilling the Knowledge in a Neural Network · arXiv (Hinton, Vinyals, Dean, 2015)
MiniLLM: Knowledge Distillation of Large Language Models · arXiv (Gu et al., 2023)
Compact Language Models via Pruning and Knowledge Distillation · arXiv (NVIDIA, 2024)
