Continual pretraining (also called domain-adaptive pretraining) is the technique of further training a general-purpose pretrained language model on a large corpus of domain-specific text, such as biomedical literature, legal documents, financial filings, or code. The goal is to adapt the model's representations and knowledge to the target domain before task-specific fine-tuning, which significantly improves performance on domain-specific tasks compared to using the general model directly.
Why Continual Pretraining?
```
General LLM (Llama, Mistral)
→ Good at general knowledge
→ Weak on specialized terminology, conventions, facts
Continual Pretraining on domain corpus:
→ Adapts vocabulary distribution to domain
→ Encodes domain-specific knowledge and reasoning patterns
→ Maintains general capabilities (with care)
Result: Domain-adapted base model → much better domain fine-tuning results
```
Evidence: DAPT (Gururangan et al., 2020)
Showed that continued pretraining on domain text before fine-tuning improves downstream task performance across domains:
- Biomedical: +3.2% on ChemProt, +3.8% on RCT
- Computer Science: +2.1% on SciERC, +2.9% on ACL-ARC
- Even when the downstream labeled data is limited
Practical Implementation
```
# Continual pretraining recipe
1. Corpus preparation:
- Collect large domain corpus (10B-100B+ tokens)
- Clean, deduplicate, quality filter
- Mix with small fraction of general data (5-20%) to prevent catastrophic forgetting
2. Training:
- Start from pretrained checkpoint
- Continue causal LM (next-token prediction) training
- Lower learning rate than original pretraining (10-50× lower)
- Typically 1-3 epochs over domain corpus
- Constant or cosine LR schedule with warmup
3. Post-training:
- Domain SFT on instruction data
- Optional domain RLHF/DPO alignment
```
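The training step above can be sketched with Hugging Face `transformers`. This is a minimal illustration, not a tested recipe: the checkpoint name, corpus path, and hyperparameters are placeholder assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "base-model-checkpoint"  # placeholder: e.g. a 7B general-purpose LLM
tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(BASE)

# Placeholder corpus: one document per line, already cleaned and deduplicated
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="domain-adapted-model",
    num_train_epochs=1,                  # typically 1-3 epochs over the domain corpus
    learning_rate=2e-5,                  # 10-50x lower than original pretraining
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,
    bf16=True,
    logging_steps=100,
    save_steps=2000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # mlm=False -> causal LM (next-token prediction) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, packing documents into fixed-length blocks is more token-efficient than truncating each line, and 7B+ models require distributed training (e.g. FSDP or DeepSpeed) rather than a single-process `Trainer` run.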
Key Design Decisions
| Decision | Options | Impact |
|----------|---------|--------|
| Data mix ratio | Pure domain vs. domain + general | Too much domain → catastrophic forgetting |
| Learning rate | 1e-5 to 5e-5 (much lower than pretraining) | Too high → forget, too low → slow adaptation (schedule sketch below) |
| Tokenizer | Keep original vs. extend vocabulary | Domain tokens may be poorly tokenized |
| Token budget | 10B-100B+ domain tokens | More = better adaptation, diminishing returns |
| Replay | Include general data replay | Critical for maintaining general skills |
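As one example of the learning-rate decision above, a custom training loop (outside `Trainer`) would typically pair a low peak rate with a short warmup and cosine decay. Below is a minimal sketch using the `transformers` scheduler helper; the step counts and rate are illustrative, and the small linear layer stands in for the actual pretrained model.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)   # stand-in for the pretrained LM in this sketch
total_steps = 25_000            # assumed: token budget / (global batch size * sequence length)
warmup_steps = 500              # short warmup re-stabilizes optimizer state after the restart

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# In the loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```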
Vocabulary Adaptation
Domain text may contain tokens poorly represented in the general tokenizer (e.g., chemical formulas, legal citations, code syntax). Options:
- Keep original tokenizer: Some domain tokens become multi-token sequences (inefficient but simple)
- Extend tokenizer: Add domain-specific tokens, initialize new embeddings (average of subword embeddings or random), and train longer (see the sketch after this list)
- Replace tokenizer: Retrain BPE on domain corpus — most disruptive, requires extensive continued pretraining
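A sketch of the "extend tokenizer" option, assuming a generic causal LM checkpoint and a hypothetical list of domain terms: new tokens are appended to the vocabulary, and each new embedding is initialized as the mean of the embeddings of the subwords the term previously split into.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-model-checkpoint"          # placeholder: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Hypothetical domain terms the general tokenizer fragments into many pieces
new_tokens = ["acetylcholinesterase", "§1983", "ERC-20"]

# Record each term's subword ids under the *original* vocabulary first
subword_ids = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))   # existing embedding rows are preserved
emb = model.get_input_embeddings().weight.data

with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Mean of the old subword embeddings (random init is the simpler alternative)
        emb[new_id] = emb[subword_ids[tok]].mean(dim=0)
        # If input/output embeddings are untied, initialize the LM head rows the same way
```

The new embedding rows still carry little signal on their own, which is why the extend option needs longer continued pretraining than keeping the original tokenizer.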
Notable Domain-Adapted Models
| Model | Base | Domain | Corpus |
|-------|------|--------|--------|
| BioMistral | Mistral-7B | Biomedical | PubMed Central |
| SaulLM | Mistral-7B | Legal | Legal-MC4, legal documents |
| CodeLlama | Llama 2 | Code | 500B code tokens |
| MedPaLM | PaLM | Medical | Medical textbooks, notes |
| BloombergGPT | Scratch (BLOOM-style) | Finance | Financial data (FinPile) + general text |
| StarCoder 2 | Scratch | Code | The Stack v2 |
Catastrophic Forgetting Mitigation
- Data replay: Mix 10-20% general data with domain data during continued pretraining (see the sketch after this list)
- Low learning rate: Limits how far weights move from the general checkpoint
- Elastic weight consolidation (EWC): Penalize large changes to parameters important for general tasks
- Progressive training: Gradually increase domain data ratio during training
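A minimal sketch of data replay using `datasets.interleave_datasets`, streaming both corpora and sampling roughly 15% general text; the file names and mixing ratio are illustrative assumptions.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder corpora; streaming avoids materializing tens of billions of tokens in memory
domain = load_dataset("text", data_files="domain_corpus.txt", split="train", streaming=True)
general = load_dataset("text", data_files="general_corpus.txt", split="train", streaming=True)

# ~85% domain / ~15% general replay to preserve general capabilities
mixed = interleave_datasets(
    [domain, general],
    probabilities=[0.85, 0.15],
    seed=42,
    stopping_strategy="all_exhausted",
)
```

The mixed stream then feeds the same causal-LM objective as the rest of the continued pretraining run.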
Continual pretraining is the standard recipe for building domain-specialist LLMs. By adapting the model's internal representations to domain-specific language, knowledge, and reasoning patterns before fine-tuning, it achieves substantially better domain performance than fine-tuning alone, while remaining far more cost-effective than training a domain model from scratch.