LLM Distillation

Keywords: distilled model, model distillation llm, teacher student llm, distillation training data, distilled language model

LLM Distillation is the process of training a smaller student language model to mimic the behavior of a larger teacher model, using the teacher's output distributions, reasoning chains, or generated training data to transfer capabilities that would normally require massive scale. It enables models in the 1-10B parameter range to approach the performance of much larger 70B-400B models at a fraction of the inference cost, making distillation the primary technique behind efficient, deployment-ready models.

Distillation Approaches for LLMs

| Approach | What's Transferred | Data Required | Effectiveness |
|----------|-------------------|-------------|---------------|
| Logit distillation | Full output probability distribution | None (teacher forward passes) | Highest quality |
| Chain-of-thought distillation | Reasoning steps from teacher | Generated CoT data | Strong for reasoning |
| Synthetic data distillation | Teacher-generated training examples | Generated Q&A pairs | Most practical |
| Feature distillation | Intermediate layer representations | None (forward pass) | Moderate |
| Preference distillation | Teacher preference rankings | Pairwise comparisons | Good for alignment |

Logit-Based Distillation

```
Standard training:
Student loss = CrossEntropy(student_logits, hard_labels)
Only learns: correct answer = 1, everything else = 0

Knowledge distillation:
Student loss = α × CE(student_logits, hard_labels)
             + β × KL(softmax(student_logits/T), softmax(teacher_logits/T))
Learns: the full distribution ("cat" 70% likely, "kitten" 15%, "dog" 3%, ...)
Dark knowledge: relative probabilities of wrong answers carry structure
```
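
A minimal PyTorch sketch of this combined objective; the α/β weights and temperature T shown are illustrative defaults, not tuned values:

```
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, beta=0.5, T=2.0):
    """Hard-label cross-entropy plus temperature-softened KL to the teacher.

    student_logits, teacher_logits: (batch, vocab_size) tensors
    hard_labels: (batch,) ground-truth token ids
    """
    # Hard-label term: standard cross-entropy on the correct token
    ce = F.cross_entropy(student_logits, hard_labels)

    # Soft-label term: KL between temperature-softened distributions.
    # Multiplying by T^2 keeps the soft-loss gradients on the same
    # scale as the hard loss across temperatures (Hinton et al., 2015).
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)

    return alpha * ce + beta * kl
```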

Synthetic Data Distillation (Most Common for LLMs)

```
Step 1: Generate training data using teacher
Teacher (GPT-4 / Claude) generates:
- Instruction-response pairs
- Multi-turn conversations
- Chain-of-thought reasoning
- Code solutions with explanations

Step 2: Filter generated data
- Remove incorrect/low-quality responses
- Decontaminate for benchmark fairness
- Diverse topic sampling

Step 3: Fine-tune student on teacher data
Student (7B model) → SFT on teacher-generated data
Often 100K-1M examples sufficient
```
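
A Python sketch of this three-step pipeline; `query_teacher` is a hypothetical placeholder for whatever teacher API is used (check its terms of service first), and the filtering heuristics are illustrative:

```
import json

def query_teacher(prompt: str) -> str:
    # Placeholder: replace with a real call to the teacher model's API.
    raise NotImplementedError

def generate_example(seed_instruction: str) -> dict:
    # Step 1: the teacher answers the instruction, showing its reasoning
    response = query_teacher(
        "Answer the following instruction step by step:\n" + seed_instruction
    )
    return {"instruction": seed_instruction, "response": response}

def passes_filters(example: dict, benchmark_prompts: set) -> bool:
    # Step 2: drop low-quality responses and decontaminate
    if len(example["response"]) < 50:                # likely refusal or junk
        return False
    if example["instruction"] in benchmark_prompts:  # benchmark overlap
        return False
    return True

def build_sft_dataset(seeds, benchmark_prompts, path="distill_sft.jsonl"):
    with open(path, "w") as f:
        for seed in seeds:
            ex = generate_example(seed)
            if passes_filters(ex, benchmark_prompts):
                f.write(json.dumps(ex) + "\n")
    # Step 3: fine-tune the student (e.g. a 7B model) on `path`
    # with any standard instruction-tuning (SFT) trainer.
```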

Notable Distilled Models

| Student | Teacher | Size Ratio | Performance | Method |
|---------|---------|-----------|------------|--------|
| Alpaca (7B) | text-davinci-003 | 26× smaller | Good for chat | 52K synthetic examples |
| Vicuna (13B) | ChatGPT | 10× smaller | 90% of ChatGPT quality | 70K ShareGPT conversations |
| Phi-1.5 (1.3B) | GPT-4 (synthetic) | 1000× smaller | ≈ Llama-7B | 30B synthetic tokens |
| Orca 2 (7B) | GPT-4 | 200× smaller | ≈ ChatGPT | Explanation tuning |
| DeepSeek-R1-Distill | DeepSeek-R1 | 10-100× smaller | Strong reasoning | CoT distillation |

Chain-of-Thought Distillation

```
Teacher generates reasoning chains:
Q: "If a train travels 120 km in 2 hours, what is its average speed?"
Teacher CoT: "To find average speed, I divide total distance by total time.
120 km ÷ 2 hours = 60 km/h.
The average speed is 60 km/h."

Student learns to:
1. Generate similar step-by-step reasoning
2. Arrive at correct answers through explicit reasoning
3. Show its work (unlike direct answer training)

Result: Small models gain reasoning they couldn't learn from answers alone
```
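
A sketch of how a teacher-generated chain might be packed into a student training sequence; the prompt template is illustrative, not a fixed format:

```
def format_cot_example(question: str, teacher_cot: str) -> str:
    # The student is trained with next-token prediction on the full
    # sequence, so it learns to emit the reasoning steps, not just
    # the final answer.
    return "Question: " + question + "\nAnswer: " + teacher_cot

text = format_cot_example(
    "If a train travels 120 km in 2 hours, what is its average speed?",
    "To find average speed, I divide total distance by total time. "
    "120 km ÷ 2 hours = 60 km/h. The average speed is 60 km/h.",
)
# In practice, SFT loss is usually masked on the prompt tokens and
# applied only to the answer span, so the model learns to generate
# the chain rather than memorize the question.
```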

Distillation Scaling

| Teacher Size | Student Size | Quality Retention | Use Case |
|-------------|-------------|-------------------|----------|
| 70B → 7B | 10:1 | 85-95% | General deployment |
| 400B → 7B | 57:1 | 70-85% | Cost-sensitive |
| 70B → 1.5B | 47:1 | 65-80% | Edge/mobile |
| Ensemble → single | N:1 | 95-100% | Serving efficiency |
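
As a rough sanity check on these ratios: per-token inference compute for a dense transformer is commonly approximated as ~2 FLOPs per parameter, so the compute saving tracks the size ratio directly. A back-of-the-envelope sketch, ignoring memory bandwidth and batching effects:

```
def inference_flops_per_token(n_params: float) -> float:
    # Rule of thumb: ~2 FLOPs per parameter per generated token
    return 2.0 * n_params

teacher_params, student_params = 70e9, 7e9
ratio = (inference_flops_per_token(teacher_params)
         / inference_flops_per_token(student_params))
print(f"~{ratio:.0f}x fewer FLOPs per token")  # prints ~10x
```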

Limitations and Concerns

- Terms of service: Many API providers prohibit using outputs for competitive model training.
- Capability ceiling: Student rarely exceeds teacher quality on any individual task.
- Brittleness: Distilled models may lack robustness outside training distribution.
- Benchmark leakage: Teacher may have memorized benchmark answers → inflated student scores.

LLM distillation is the bridge between frontier model capabilities and practical deployment. By transferring knowledge from massive teacher models into efficient students through carefully curated synthetic data and reasoning chains, distillation lets organizations deploy models with near-frontier quality at 10-100× lower inference cost, making advanced AI capabilities accessible for production applications where running a 400B-parameter model is impractical.
