Continual pretraining (also called domain-adaptive pretraining) is the technique of further training a general-purpose pretrained language model on a large corpus of domain-specific text, such as biomedical literature, legal documents, financial filings, or code. The goal is to adapt the model's representations and knowledge to the target domain before task-specific fine-tuning, which significantly improves performance on domain-specific tasks compared to using the general model directly.
Why Continual Pretraining?
```
General LLM (Llama, Mistral)
→ Good at general knowledge
→ Weak on specialized terminology, conventions, facts
Continual Pretraining on domain corpus:
→ Adapts vocabulary distribution to domain
→ Encodes domain-specific knowledge and reasoning patterns
→ Maintains general capabilities (with care)
Result: Domain-adapted base model → much better domain fine-tuning results
```
Evidence: DAPT (Gururangan et al., 2020)
Showed that continued pretraining on domain text before fine-tuning improves downstream task performance across domains:
- Biomedical: +3.2% on ChemProt, +3.8% on RCT
- Computer Science: +2.1% on SciERC, +2.9% on ACL-ARC
- Even when the downstream labeled data is limited
Practical Implementation
```
# Continual pretraining recipe
1. Corpus preparation:
- Collect large domain corpus (10B-100B+ tokens)
- Clean, deduplicate, quality filter
- Mix with small fraction of general data (5-20%) to prevent catastrophic forgetting
2. Training:
- Start from pretrained checkpoint
- Continue causal LM (next-token prediction) training
- Lower learning rate than original pretraining (10-50× lower)
- Typically 1-3 epochs over domain corpus
- Constant or cosine LR schedule with warmup
3. Post-training:
- Domain SFT on instruction data
- Optional domain RLHF/DPO alignment
```
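The training step above can be sketched with Hugging Face `transformers`. This is a minimal illustration, not a tested recipe: the checkpoint name, corpus path, and hyperparameters are placeholder assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "base-model-checkpoint"  # placeholder: e.g. a 7B general-purpose LLM
tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(BASE)

# Placeholder corpus: one document per line, already cleaned and deduplicated
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="domain-adapted-model",
    num_train_epochs=1,                  # typically 1-3 epochs over the domain corpus
    learning_rate=2e-5,                  # 10-50x lower than original pretraining
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,
    bf16=True,
    logging_steps=100,
    save_steps=2000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # mlm=False -> causal LM (next-token prediction) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, packing documents into fixed-length blocks is more token-efficient than truncating each line, and 7B+ models require distributed training (e.g. FSDP or DeepSpeed) rather than a single-process `Trainer` run.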
Key Design Decisions
| Decision | Options | Impact |
|----------|---------|--------|
| Data mix ratio | Pure domain vs. domain + general | Too much domain → catastrophic forgetting |
| Learning rate | 1e-5 to 5e-5 (much lower than pretraining) | Too high → forget, too low → slow adaptation (schedule sketch below) |
| Tokenizer | Keep original vs. extend vocabulary | Domain tokens may be poorly tokenized |
| Token budget | 10B-100B+ domain tokens | More = better adaptation, diminishing returns |
| Replay | Include general data replay | Critical for maintaining general skills |
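As one example of the learning-rate decision above, a custom training loop (outside `Trainer`) would typically pair a low peak rate with a short warmup and cosine decay. Below is a minimal sketch using the `transformers` scheduler helper; the step counts and rate are illustrative, and the small linear layer stands in for the actual pretrained model.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)   # stand-in for the pretrained LM in this sketch
total_steps = 25_000            # assumed: token budget / (global batch size * sequence length)
warmup_steps = 500              # short warmup re-stabilizes optimizer state after the restart

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# In the loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```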
Vocabulary Adaptation
Domain text may contain tokens poorly represented in the general tokenizer (e.g., chemical formulas, legal citations, code syntax). Options:
- Keep original tokenizer: Some domain tokens become multi-token sequences (inefficient but simple)
- Extend tokenizer: Add domain-specific tokens, initialize new embeddings (average of subword embeddings or random), and train longer (see the sketch after this list)
- Replace tokenizer: Retrain BPE on domain corpus — most disruptive, requires extensive continued pretraining
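A sketch of the "extend tokenizer" option, assuming a generic causal LM checkpoint and a hypothetical list of domain terms: new tokens are appended to the vocabulary, and each new embedding is initialized as the mean of the embeddings of the subwords the term previously split into.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-model-checkpoint"          # placeholder: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Hypothetical domain terms the general tokenizer fragments into many pieces
new_tokens = ["acetylcholinesterase", "§1983", "ERC-20"]

# Record each term's subword ids under the *original* vocabulary first
subword_ids = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))   # existing embedding rows are preserved
emb = model.get_input_embeddings().weight.data

with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Mean of the old subword embeddings (random init is the simpler alternative)
        emb[new_id] = emb[subword_ids[tok]].mean(dim=0)
        # If input/output embeddings are untied, initialize the LM head rows the same way
```

The new embedding rows still carry little signal on their own, which is why the extend option needs longer continued pretraining than keeping the original tokenizer.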
Notable Domain-Adapted Models
| Model | Base | Domain | Corpus |
|-------|------|--------|--------|
| BioMistral | Mistral-7B | Biomedical | PubMed Central |
| SaulLM | Mistral-7B | Legal | Legal-MC4, legal documents |
| CodeLlama | Llama 2 | Code | 500B code tokens |
| MedPaLM | PaLM | Medical | Medical textbooks, notes |
| BloombergGPT | Scratch (BLOOM-style) | Finance | Financial data (FinPile) + general text |
| StarCoder 2 | Scratch | Code | The Stack v2 |
Catastrophic Forgetting Mitigation
- Data replay: Mix 10-20% general data with domain data during continued pretraining (see the sketch after this list)
- Low learning rate: Limits how far weights move from the general checkpoint
- Elastic weight consolidation (EWC): Penalize large changes to parameters important for general tasks
- Progressive training: Gradually increase domain data ratio during training
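A minimal sketch of data replay using `datasets.interleave_datasets`, streaming both corpora and sampling roughly 15% general text; the file names and mixing ratio are illustrative assumptions.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder corpora; streaming avoids materializing tens of billions of tokens in memory
domain = load_dataset("text", data_files="domain_corpus.txt", split="train", streaming=True)
general = load_dataset("text", data_files="general_corpus.txt", split="train", streaming=True)

# ~85% domain / ~15% general replay to preserve general capabilities
mixed = interleave_datasets(
    [domain, general],
    probabilities=[0.85, 0.15],
    seed=42,
    stopping_strategy="all_exhausted",
)
```

The mixed stream then feeds the same causal-LM objective as the rest of the continued pretraining run.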
Continual pretraining is the standard recipe for building domain-specialist LLMs. By adapting the model's internal representations to domain-specific language, knowledge, and reasoning patterns before fine-tuning, it achieves substantially better domain performance than fine-tuning alone, while remaining far more cost-effective than training a domain model from scratch.