LLM alignment

LLM alignment is the process of training language models to behave in accordance with human values and intentions. Techniques such as RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are used to make models helpful, harmless, and honest, so that they do what users actually want rather than merely predicting the next token.

What Is Alignment?

Why Alignment Matters

The Alignment Problem

Base Model Behavior:

Prompt: "How do I pick a lock?"

Base Model (unaligned):
→ Has seen lockpicking instructions in training data
→ May helpfully provide detailed instructions
→ No concept of "should I answer this?"

Aligned Model:
→ Considers potential harm of response
→ May refuse or provide only legal context
→ Balances helpfulness with safety

Alignment Methods

Supervised Fine-Tuning (SFT):

RLHF (Reinforcement Learning from Human Feedback):

Step 1: Collect comparisons
   Prompt → Response A vs Response B
   Human labels which is better

Step 2: Train reward model
   Reward(prompt, response) → score
   Predicts human preference

Step 3: Optimize policy
   Use PPO to maximize reward
   Policy starts from the original (SFT) model; an added value head estimates expected reward
   Iterate with fresh feedback
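
Steps 2 and 3 can be made concrete with a small sketch. The text above does not say which loss the reward model uses or how the policy is kept close to the original model, so the following assumes two common choices: a pairwise (Bradley-Terry style) loss on the chosen and rejected scores, and a KL-style penalty subtracted from the reward during PPO. The function names and the `kl_coef` default are illustrative, not taken from any particular library.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed Step 2 loss: small when the reward model scores the
    human-preferred response above the rejected one."""
    return neg_log_sigmoid(score_chosen - score_rejected)


def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                     kl_coef: float = 0.1) -> float:
    """Assumed Step 3 shaping: subtract a penalty when the policy assigns
    the sampled response a higher log-probability than the reference model."""
    return reward - kl_coef * (logp_policy - logp_ref)


# The reward model is penalized more when it ranks the pair the wrong way round.
correct_order = pairwise_reward_loss(score_chosen=2.0, score_rejected=0.0)
wrong_order = pairwise_reward_loss(score_chosen=0.0, score_rejected=2.0)
```

In a real pipeline the scores would come from a neural reward model and the log-probabilities from the policy and reference networks; here they are plain floats so that the arithmetic is easy to follow.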

DPO (Direct Preference Optimization):

Insight: Skip reward model, directly use preferences

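Loss = -log σ(β × [log(π(y_w|x) / π_ref(y_w|x))
                   - log(π(y_l|x) / π_ref(y_l|x))])

π     = policy being trained
π_ref = frozen reference model (typically the SFT model)
β     = how strongly the policy is held close to the reference
σ     = logistic sigmoid
y_w   = preferred response
y_l   = dispreferred response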

Simpler, often matches RLHF quality
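
As a minimal sketch of the loss above for a single preference pair, the function below takes sequence log-probabilities (summed over tokens) from the policy and the reference model. The argument names and the `beta=0.1` default are assumptions made for illustration; a real implementation would compute these values in batches with an autodiff framework.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (preferred, dispreferred) pair.

    logp_*     : log pi(y|x) under the policy being trained
    ref_logp_* : log pi_ref(y|x) under the frozen reference model
    """
    # The log of a ratio is the difference of the logs.
    preferred_ratio = logp_w - ref_logp_w
    dispreferred_ratio = logp_l - ref_logp_l
    return neg_log_sigmoid(beta * (preferred_ratio - dispreferred_ratio))


# Policy identical to the reference: the margin is zero, so the loss is log(2).
at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy shifted toward the preferred response: the loss drops below log(2).
after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls only when the policy raises the preferred response relative to the reference more than it raises the dispreferred one, which is how the preference data is used without a separate reward model.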

Constitutional AI (CAI):

1. Generate response to harmful prompt
2. Critique: "Does this response violate [principle]?"
3. Revise: "Write a response that doesn't..."
4. Fine-tune on revised responses
5. RL from AI feedback (RLAIF): preference labels come from a model instead of humans

Principles: List of behavioral guidelines
Reduces need for human labeling
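
Steps 1 to 4 above can be sketched as a critique-and-revise loop. The `generate` argument is a hypothetical stand-in for whatever function calls the language model, and the prompt wording is illustrative rather than taken from any published constitution. The sketch only builds the (prompt, revised response) pairs used for fine-tuning in step 4; the RLAIF stage is not shown.

```python
from typing import Callable, List, Tuple


def critique_and_revise(prompt: str,
                        principles: List[str],
                        generate: Callable[[str], str]) -> Tuple[str, str]:
    """Return a (prompt, revised_response) pair for supervised fine-tuning.

    `generate` is a hypothetical text-in, text-out model call.
    """
    # Step 1: initial response to the (possibly harmful) prompt.
    response = generate(prompt)
    for principle in principles:
        # Step 2: ask the model to critique its own response.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Does this response violate the principle: '{principle}'? Explain."
        )
        # Step 3: ask for a revision that addresses the critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            f"Write a response that doesn't violate the principle: '{principle}'."
        )
    # Step 4 would fine-tune on many such pairs.
    return prompt, response


def stub_model(text: str) -> str:
    """Placeholder so the sketch runs without a real model."""
    return f"[model output for {len(text)} characters of input]"


example_pair = critique_and_revise(
    "How do I pick a lock?",
    ["Avoid helping with illegal activity."],
    stub_model,
)
```

Each principle costs two extra model calls (one critique, one revision), so the cost per example grows linearly with the number of principles applied.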

Alignment Comparison

Method       | Human Data  | Complexity | Quality
-------------|-------------|------------|----------
SFT          | Demos       | Simple     | Baseline
RLHF         | Comparisons | Complex    | Best
DPO          | Comparisons | Medium     | Near RLHF
CAI/RLAIF    | Principles  | Medium     | Good

Challenges in Alignment

Current State

LLM alignment is a critical challenge for beneficial AI. Getting powerful AI systems to reliably do what we want, avoid what we don't want, and behave ethically is essential for AI to be a positive force, which makes alignment research one of the most important areas in AI development.
