LLM Distillation

Keywords: distilled model, model distillation llm, teacher student llm, distillation training data, distilled language model

LLM Distillation is the process of training a smaller student language model to mimic the behavior of a larger teacher model, using the teacher's output distributions, reasoning chains, or generated training data to transfer capabilities that would normally require massive scale. It enables models in the 1-10B parameter range to approach the performance of much larger 70B-400B models at a fraction of the inference cost, making distillation the primary technique behind efficient, deployment-ready models.

Distillation Approaches for LLMs

| Approach | What's Transferred | Data Required | Effectiveness |
|----------|-------------------|-------------|---------------|
| Logit distillation | Full output probability distribution | None (teacher forward passes) | Highest quality |
| Chain-of-thought distillation | Reasoning steps from teacher | Generated CoT data | Strong for reasoning |
| Synthetic data distillation | Teacher-generated training examples | Generated Q&A pairs | Most practical |
| Feature distillation | Intermediate layer representations | None (forward pass) | Moderate |
| Preference distillation | Teacher preference rankings | Pairwise comparisons | Good for alignment |

Logit-Based Distillation

```
Standard training:
Student loss = CrossEntropy(student_logits, hard_labels)
Only learns: correct answer = 1, everything else = 0

Knowledge distillation:
Student loss = α × CE(student_logits, hard_labels)
             + β × KL(softmax(student_logits/T), softmax(teacher_logits/T))
Learns: the full distribution ("cat" 70% likely, "kitten" 15%, "dog" 3%, ...)
Dark knowledge: relative probabilities of wrong answers carry structure
```
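
A minimal PyTorch sketch of this combined objective; the α/β weights and temperature T shown are illustrative defaults, not tuned values:

```
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, beta=0.5, T=2.0):
    """Hard-label cross-entropy plus temperature-softened KL to the teacher.

    student_logits, teacher_logits: (batch, vocab_size) tensors
    hard_labels: (batch,) ground-truth token ids
    """
    # Hard-label term: standard cross-entropy on the correct token
    ce = F.cross_entropy(student_logits, hard_labels)

    # Soft-label term: KL between temperature-softened distributions.
    # Multiplying by T^2 keeps the soft-loss gradients on the same
    # scale as the hard loss across temperatures (Hinton et al., 2015).
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)

    return alpha * ce + beta * kl
```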

Synthetic Data Distillation (Most Common for LLMs)

```
Step 1: Generate training data using teacher
Teacher (GPT-4 / Claude) generates:
- Instruction-response pairs
- Multi-turn conversations
- Chain-of-thought reasoning
- Code solutions with explanations

Step 2: Filter generated data
- Remove incorrect/low-quality responses
- Decontaminate for benchmark fairness
- Diverse topic sampling

Step 3: Fine-tune student on teacher data
Student (7B model) → SFT on teacher-generated data
Often 100K-1M examples sufficient
```
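
A Python sketch of this three-step pipeline; `query_teacher` is a hypothetical placeholder for whatever teacher API is used (check its terms of service first), and the filtering heuristics are illustrative:

```
import json

def query_teacher(prompt: str) -> str:
    # Placeholder: replace with a real call to the teacher model's API.
    raise NotImplementedError

def generate_example(seed_instruction: str) -> dict:
    # Step 1: the teacher answers the instruction, showing its reasoning
    response = query_teacher(
        "Answer the following instruction step by step:\n" + seed_instruction
    )
    return {"instruction": seed_instruction, "response": response}

def passes_filters(example: dict, benchmark_prompts: set) -> bool:
    # Step 2: drop low-quality responses and decontaminate
    if len(example["response"]) < 50:                # likely refusal or junk
        return False
    if example["instruction"] in benchmark_prompts:  # benchmark overlap
        return False
    return True

def build_sft_dataset(seeds, benchmark_prompts, path="distill_sft.jsonl"):
    with open(path, "w") as f:
        for seed in seeds:
            ex = generate_example(seed)
            if passes_filters(ex, benchmark_prompts):
                f.write(json.dumps(ex) + "\n")
    # Step 3: fine-tune the student (e.g. a 7B model) on `path`
    # with any standard instruction-tuning (SFT) trainer.
```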

Notable Distilled Models

| Student | Teacher | Size Ratio | Performance | Method |
|---------|---------|-----------|------------|--------|
| Alpaca (7B) | text-davinci-003 | 26× smaller | Good for chat | 52K synthetic examples |
| Vicuna (13B) | ChatGPT | 10× smaller | 90% of ChatGPT quality | 70K ShareGPT conversations |
| Phi-1.5 (1.3B) | GPT-4 (synthetic) | 1000× smaller | ≈ Llama-7B | 30B synthetic tokens |
| Orca 2 (7B) | GPT-4 | 200× smaller | ≈ ChatGPT | Explanation tuning |
| DeepSeek-R1-Distill | DeepSeek-R1 | 10-100× smaller | Strong reasoning | CoT distillation |

Chain-of-Thought Distillation

```
Teacher generates reasoning chains:
Q: "If a train travels 120 km in 2 hours, what is its average speed?"
Teacher CoT: "To find average speed, I divide total distance by total time.
120 km ÷ 2 hours = 60 km/h.
The average speed is 60 km/h."

Student learns to:
1. Generate similar step-by-step reasoning
2. Arrive at correct answers through explicit reasoning
3. Show its work (unlike direct answer training)

Result: Small models gain reasoning they couldn't learn from answers alone
```
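
A sketch of how a teacher-generated chain might be packed into a student training sequence; the prompt template is illustrative, not a fixed format:

```
def format_cot_example(question: str, teacher_cot: str) -> str:
    # The student is trained with next-token prediction on the full
    # sequence, so it learns to emit the reasoning steps, not just
    # the final answer.
    return "Question: " + question + "\nAnswer: " + teacher_cot

text = format_cot_example(
    "If a train travels 120 km in 2 hours, what is its average speed?",
    "To find average speed, I divide total distance by total time. "
    "120 km ÷ 2 hours = 60 km/h. The average speed is 60 km/h.",
)
# In practice, SFT loss is usually masked on the prompt tokens and
# applied only to the answer span, so the model learns to generate
# the chain rather than memorize the question.
```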

Distillation Scaling

| Teacher Size | Student Size | Quality Retention | Use Case |
|-------------|-------------|-------------------|----------|
| 70B → 7B | 10:1 | 85-95% | General deployment |
| 400B → 7B | 57:1 | 70-85% | Cost-sensitive |
| 70B → 1.5B | 47:1 | 65-80% | Edge/mobile |
| Ensemble → single | N:1 | 95-100% | Serving efficiency |
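
As a rough sanity check on these ratios: per-token inference compute for a dense transformer is commonly approximated as ~2 FLOPs per parameter, so the compute saving tracks the size ratio directly. A back-of-the-envelope sketch, ignoring memory bandwidth and batching effects:

```
def inference_flops_per_token(n_params: float) -> float:
    # Rule of thumb: ~2 FLOPs per parameter per generated token
    return 2.0 * n_params

teacher_params, student_params = 70e9, 7e9
ratio = (inference_flops_per_token(teacher_params)
         / inference_flops_per_token(student_params))
print(f"~{ratio:.0f}x fewer FLOPs per token")  # prints ~10x
```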

Limitations and Concerns

- Terms of service: Many API providers prohibit using outputs for competitive model training.
- Capability ceiling: Student rarely exceeds teacher quality on any individual task.
- Brittleness: Distilled models may lack robustness outside training distribution.
- Benchmark leakage: Teacher may have memorized benchmark answers → inflated student scores.

LLM distillation is the bridge between frontier model capabilities and practical deployment. By transferring knowledge from massive teacher models into efficient students through carefully curated synthetic data and reasoning chains, distillation lets organizations deploy models with near-frontier quality at 10-100× lower inference cost, making advanced AI capabilities accessible for production applications where running a 400B-parameter model is impractical.
