Continual Pretraining (Domain-Adaptive Pretraining)

Continual pretraining, also called domain-adaptive pretraining (DAPT), further trains a general-purpose pretrained language model on a large corpus of domain-specific text, such as biomedical literature, legal documents, financial filings, or code. The goal is to adapt the model's representations and knowledge to the target domain before task-specific fine-tuning, which typically improves performance on domain-specific tasks compared with using the general model directly.

Why Continual Pretraining?

General LLM (Llama, Mistral)
  → Good at general knowledge
  → Weak on specialized terminology, conventions, facts
  
Continual Pretraining on domain corpus:
  → Adapts vocabulary distribution to domain
  → Encodes domain-specific knowledge and reasoning patterns
  → Maintains general capabilities (with care)

Result: Domain-adapted base model → much better domain fine-tuning results

Evidence: DAPT (Gururangan et al., 2020)

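Gururangan et al. showed that continued pretraining on domain text before fine-tuning improves downstream task performance across the four domains they studied: biomedical papers, computer science papers, news, and reviews.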

Practical Implementation

# Continual pretraining recipe
1. Corpus preparation:
   - Collect large domain corpus (10B-100B+ tokens)
   - Clean, deduplicate, quality filter
   - Mix with small fraction of general data (5-20%) to prevent catastrophic forgetting

2. Training:
   - Start from pretrained checkpoint
   - Continue causal LM (next-token prediction) training
   - Lower learning rate than original pretraining (10-50× lower)
   - Typically 1-3 epochs over domain corpus
   - Constant or cosine LR schedule with warmup

3. Post-training:
   - Domain SFT on instruction data
   - Optional domain RLHF/DPO alignment
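The training step of the recipe (a token budget, a low peak learning rate, and a cosine schedule with warmup) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the 2e-5 peak rate, the warmup length, and the decay floor are all assumptions chosen for the example.

```python
import math


def steps_for_budget(token_budget, tokens_per_step):
    """Number of optimizer steps needed to consume a token budget.

    tokens_per_step is batch size times sequence length (an assumed setting).
    """
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    return token_budget // tokens_per_step


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=100, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to peak_lr * min_lr_ratio.

    All default values are illustrative assumptions, not recommendations
    taken from the recipe above.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Warmup: ramp linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine


# Example: a 10B-token budget at 4M tokens per step.
total_steps = steps_for_budget(10_000_000_000, tokens_per_step=4_000_000)
schedule = [lr_at_step(s, total_steps) for s in range(total_steps)]
```

In a real training loop the value returned by `lr_at_step` would be written into the optimizer at every step; most frameworks ship an equivalent scheduler, so the function mainly serves to make the shape of the schedule explicit.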

Key Design Decisions

| Decision | Options | Impact |
| --- | --- | --- |
| Data mix ratio | Pure domain vs. domain + general | Too much domain → catastrophic forgetting |
| Learning rate | 1e-5 to 5e-5 (much lower than pretraining) | Too high → forget, too low → slow adaptation |
| Tokenizer | Keep original vs. extend vocabulary | Domain tokens may be poorly tokenized |
| Token budget | 10B-100B+ domain tokens | More = better adaptation, diminishing returns |
| Replay | Include general data replay | Critical for maintaining general skills |
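The data mix and replay decisions come down to how often a general-domain document is drawn instead of a domain one. The sketch below shows one simple way to do that with a seeded random draw per document. The function name `mix_with_replay` and the 10% default are assumptions for illustration, and the ratio here is over documents rather than tokens, so a production pipeline would likely weight by length.

```python
import random


def mix_with_replay(domain_docs, general_docs, replay_fraction=0.1, seed=0):
    """Yield (source, doc) pairs from two document streams.

    Each draw comes from general_docs with probability replay_fraction and
    from domain_docs otherwise. Iteration stops as soon as the selected
    stream is exhausted. Hypothetical helper; the ratio is per document,
    not per token.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    domain_iter = iter(domain_docs)
    general_iter = iter(general_docs)
    while True:
        if rng.random() < replay_fraction:
            source, stream = "general", general_iter
        else:
            source, stream = "domain", domain_iter
        try:
            doc = next(stream)
        except StopIteration:
            return
        yield source, doc


def general_share(pairs):
    """Fraction of yielded documents that came from the general stream."""
    sources = [source for source, _ in pairs]
    if not sources:
        raise ValueError("no documents were yielded")
    return sources.count("general") / len(sources)
```

Calling `general_share(mix_with_replay(domain, general, replay_fraction=0.1))` on two large streams gives a quick check that the realized mix is close to the intended one before a long training run is launched.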

Vocabulary Adaptation

Domain text may contain tokens poorly represented in the general tokenizer (e.g., chemical formulas, legal citations, code syntax). The options are to keep the original tokenizer or to extend the vocabulary with domain-specific tokens.
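One way to judge whether the original tokenizer is good enough is to measure fertility, the average number of tokens produced per word, on domain text versus general text. The sketch below assumes a `tokenize` callable that maps a string to a list of tokens. The `toy_tokenize` function is a stand-in invented for this example and does not behave like a real subword tokenizer; a real tokenizer's own tokenize method would be passed instead. Tokenizing word by word also ignores how real tokenizers treat leading whitespace, so the number is an approximation.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word.

    A value far higher on domain text than on general text suggests that
    domain terms are being split into many pieces.
    """
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words found in texts")
    return n_tokens / n_words


def toy_tokenize(word):
    """Toy stand-in tokenizer (assumption): fixed chunks of four characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


general_score = fertility(["the cat sat"], toy_tokenize)
domain_score = fertility(["acetylsalicylic acid"], toy_tokenize)
```

With the toy tokenizer the short general words each map to one token while the long chemical name is split into several, which is the pattern that would motivate extending the vocabulary.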

Notable Domain-Adapted Models

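| Model | Base | Domain | Corpus |
| --- | --- | --- | --- |
| BioMistral | Mistral-7B | Biomedical | PubMed Central (open-access subset) |
| SaulLM | Mistral-7B | Legal | Legal-MC4, legal documents |
| CodeLlama | Llama 2 | Code | 500B code tokens |
| Med-PaLM | Flan-PaLM | Medical | Medical QA exemplars (instruction prompt tuning rather than continued pretraining) |
| BloombergGPT | Scratch (BLOOM-style architecture) | Finance | Bloomberg financial data mixed with general-purpose data |
| StarCoder 2 | Scratch | Code | The Stack v2 |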

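Catastrophic Forgetting Mitigation

The levers named in the recipe and design table apply here: replay a small fraction of general data (5-20% of the mix) and keep the learning rate well below the original pretraining rate.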

Continual pretraining is the standard recipe for building domain-specialist LLMs. By adapting the model's internal representations to domain-specific language, knowledge, and reasoning patterns before fine-tuning, it achieves substantially better domain performance than fine-tuning alone, while costing far less than training a domain model from scratch.
