LLM Distillation

LLM distillation is the process of training a smaller student language model to mimic the behavior of a larger teacher model. The student learns from the teacher's output distributions, reasoning chains, or generated training data, which transfers capabilities that would otherwise require massive scale. Models with 1-10B parameters can approach the performance of much larger 70B-400B models at a fraction of the inference cost, which makes distillation one of the main techniques behind efficient, deployment-ready models.

Distillation Approaches for LLMs

| Approach | What's Transferred | Data Required | Effectiveness |
|---|---|---|---|
| Logit distillation | Full output probability distribution | None (teacher forward pass) | Highest quality |
| Chain-of-thought distillation | Reasoning steps from teacher | Generated CoT data | Strong for reasoning |
| Synthetic data distillation | Teacher-generated training examples | Generated Q&A pairs | Most practical |
| Feature distillation | Intermediate layer representations | None (teacher forward pass) | Moderate |
| Preference distillation | Teacher preference rankings | Pairwise comparisons | Good for alignment |
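As a small illustration of the last row, a teacher's ranking of candidate responses can be expanded into the pairwise comparisons that preference-based training consumes. This is a minimal sketch: the function name `ranking_to_pairs` and the `prompt`/`chosen`/`rejected` field names are assumptions made for the example, not a format prescribed by the text above.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first teacher ranking into (chosen, rejected) pairs.

    Hypothetical helper: `ranked_responses` is assumed to be ordered from
    the response the teacher prefers most to the one it prefers least.
    """
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]

ranked = ["detailed, correct answer", "correct but terse answer", "wrong answer"]
pairs = ranking_to_pairs("What is distillation?", ranked)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

A ranking of n responses yields n(n-1)/2 pairs, so even a handful of ranked candidates per prompt produces a usable number of comparisons.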

Logit-Based Distillation

Standard training:
  Student loss = CrossEntropy(student_logits, hard_labels)
  Learns only: correct token = 1, everything else = 0

Knowledge distillation:
  Student loss = α × CE(student_logits, hard_labels)
                + β × T² × KL(softmax(teacher_logits/T) ‖ softmax(student_logits/T))
  T: temperature (T > 1 softens both distributions); α, β: loss weights
  Learns: full distribution, e.g. "cat" is 70% likely, "kitten" 15%, "dog" 3%...
  Dark knowledge: relative probabilities of wrong answers carry structure
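The loss above can be sketched for a single token position in plain Python. Several choices here are assumptions made for illustration rather than details from the text: the KL term is taken from the teacher distribution to the student distribution, the soft term is scaled by T² (a common convention for keeping gradient magnitudes comparable across temperatures), and the default values of `alpha`, `beta`, and `temperature` are arbitrary.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, stabilized by subtracting the max."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Hard-label loss: negative log-probability of the correct class (T = 1)."""
    return -math.log(softmax(student_logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, beta=0.5, temperature=2.0):
    """alpha * hard-label CE + beta * T^2 * KL(teacher || student)."""
    hard = cross_entropy(student_logits, label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # Assumed convention: multiply the soft term by T^2.
    return alpha * hard + beta * temperature ** 2 * soft

# Toy vocabulary: index 0 = "cat", 1 = "kitten", 2 = "dog" (illustrative logits)
teacher_logits = [4.0, 2.5, 0.5]
student_logits = [2.0, 2.0, 1.0]
loss = distillation_loss(student_logits, teacher_logits, label=0)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the same quantity would be computed over batches of token positions with an autodiff framework; the scalar version is only meant to make each term visible.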

Synthetic Data Distillation (Most Common for LLMs)

Step 1: Generate training data using teacher
  Teacher (GPT-4 / Claude) generates:
  - Instruction-response pairs
  - Multi-turn conversations
  - Chain-of-thought reasoning
  - Code solutions with explanations

Step 2: Filter generated data
  - Remove incorrect or low-quality responses
  - Decontaminate against benchmark test sets so evaluation stays fair
  - Sample across topics to keep the data diverse

Step 3: Fine-tune student on teacher data
  Student (7B model) → SFT on teacher-generated data
  Often 100K-1M examples are sufficient
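The filtering stage in Step 2 can be sketched as below. Everything specific in it is an assumption for illustration: the `instruction`/`response` field names, the word-count threshold used as a crude quality proxy, and the 8-gram overlap test used for decontamination. Checking whether a response is actually correct usually needs a task-specific verifier and is not shown.

```python
import re

def ngrams(text, n=8):
    """Set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def filter_examples(examples, benchmark_texts, n=8, min_response_words=3):
    """Drop duplicates, very short responses, and benchmark-overlapping examples."""
    benchmark_grams = set()
    for text in benchmark_texts:
        benchmark_grams |= ngrams(text, n)

    seen, kept = set(), []
    for ex in examples:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(ex["instruction"].lower().split())
        if key in seen:
            continue
        # Crude quality proxy (assumption): very short responses are discarded.
        if len(ex["response"].split()) < min_response_words:
            continue
        # Decontamination: skip anything sharing an n-gram with a benchmark item.
        if ngrams(ex["instruction"] + " " + ex["response"], n) & benchmark_grams:
            continue
        seen.add(key)
        kept.append(ex)
    return kept

examples = [
    {"instruction": "Explain what knowledge distillation is.",
     "response": "It trains a small student model to imitate a larger teacher."},
    {"instruction": "explain what knowledge  distillation is.",
     "response": "A student model learns from a teacher model's outputs."},
    {"instruction": "Name a prime number.", "response": "7"},
    {"instruction": "If a train travels 120 km in 2 hours, what is its average speed?",
     "response": "The average speed is 60 km/h."},
]
benchmark = ["If a train travels 120 km in 2 hours, what is its average speed?"]

kept = filter_examples(examples, benchmark)
print(len(kept), "of", len(examples), "examples kept")
```

The surviving examples would then feed the SFT step; the thresholds are worth tuning per dataset rather than taken as given.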

Notable Distilled Models

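| Student | Teacher | Size Ratio | Performance | Method |
|---|---|---|---|---|
| Alpaca (7B) | text-davinci-003 | 26× smaller | Good for chat | 52K synthetic examples |
| Vicuna (13B) | ChatGPT | 10× smaller | ~90% of ChatGPT quality (GPT-4-judged) | 70K ShareGPT conversations |
| Phi-1.5 (1.3B) | GPT-4 (synthetic) | 1000× smaller | ≈ Llama-7B | ~30B tokens, mostly synthetic |
| Orca 2 (7B) | GPT-4 | 200× smaller | ≈ ChatGPT | Explanation tuning |
| DeepSeek-R1-Distill (1.5B-70B) | DeepSeek-R1 (671B) | ≈10-450× smaller | Strong reasoning | CoT distillation |

Size ratios involving ChatGPT and GPT-4 are estimates, since their parameter counts have not been published.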

Chain-of-Thought Distillation

Teacher generates reasoning chains:
  Q: "If a train travels 120 km in 2 hours, what is its average speed?"
  Teacher CoT: "To find average speed, I divide total distance by total time.
                120 km ÷ 2 hours = 60 km/h.
                The average speed is 60 km/h."

Student learns to:
  1. Generate similar step-by-step reasoning
  2. Arrive at correct answers through explicit reasoning
  3. Show its work (unlike direct answer training)

Result: small models pick up reasoning skills that are hard to learn from final answers alone
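One way to turn teacher reasoning chains into student training data is to keep only the chains whose final answer matches a known reference, then format each as a prompt/completion pair. The sketch below makes several assumptions for illustration: a reference answer is available for each question, the final answer is the last number in the chain, and the prompt template (including the "Let's think step by step" cue) and field names are hypothetical.

```python
import re

def extract_final_number(rationale):
    """Return the last number mentioned in the rationale, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", rationale)
    return float(numbers[-1]) if numbers else None

def build_cot_examples(samples, tolerance=1e-6):
    """Keep teacher chains that reach the reference answer; format them for SFT."""
    examples = []
    for s in samples:
        predicted = extract_final_number(s["teacher_cot"])
        if predicted is None or abs(predicted - s["reference_answer"]) > tolerance:
            continue  # discard chains that end on a wrong or missing answer
        examples.append({
            "prompt": f"Q: {s['question']}\nA: Let's think step by step.\n",
            "completion": s["teacher_cot"],
        })
    return examples

question = "If a train travels 120 km in 2 hours, what is its average speed?"
samples = [
    {"question": question,
     "teacher_cot": ("To find average speed, I divide total distance by total time. "
                     "120 km ÷ 2 hours = 60 km/h. The average speed is 60 km/h."),
     "reference_answer": 60.0},
    {"question": question,
     "teacher_cot": "120 km × 2 hours = 240 km/h. The average speed is 240 km/h.",
     "reference_answer": 60.0},
]

cot_examples = build_cot_examples(samples)
print(len(cot_examples), "chain(s) kept for fine-tuning")
```

Because the completion contains the full chain, the student is trained to produce the reasoning steps and not only the final number.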

Distillation Scaling

| Teacher → Student | Size Ratio | Quality Retention (approx.) | Use Case |
|---|---|---|---|
| 70B → 7B | 10:1 | 85-95% | General deployment |
| 400B → 7B | 57:1 | 70-85% | Cost-sensitive |
| 70B → 1.5B | 47:1 | 65-80% | Edge/mobile |
| Ensemble → single | N:1 | 95-100% | Serving efficiency |
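The size ratios in the table are plain parameter-count quotients, which a few lines of Python can reproduce. The helper name `size_ratio` is hypothetical, and the ratio describes only relative model size; it says nothing by itself about quality retention.

```python
def size_ratio(teacher_params_b, student_params_b):
    """Teacher-to-student parameter ratio, with both sizes given in billions."""
    return teacher_params_b / student_params_b

pairs = [(70, 7), (400, 7), (70, 1.5)]
for teacher_b, student_b in pairs:
    ratio = size_ratio(teacher_b, student_b)
    print(f"{teacher_b}B -> {student_b}B: {ratio:.0f}:1")
# 70B -> 7B: 10:1
# 400B -> 7B: 57:1
# 70B -> 1.5B: 47:1
```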

Limitations and Concerns

LLM distillation is the bridge between frontier model capabilities and practical deployment. By transferring knowledge from massive teacher models into efficient students through carefully curated synthetic data and reasoning chains, it lets organizations deploy models with near-frontier quality at 10-100× lower inference cost. This makes advanced AI capabilities accessible for production applications where running a 400B-parameter model is impractical.
