Reward Models and Preference Learning
What is a Reward Model?
A reward model is trained to predict human preferences. Its scores guide LLM training via reinforcement learning from human feedback (RLHF).
Preference Data Collection
Prompt: "Explain photosynthesis"
Response A: [detailed explanation]
Response B: [brief explanation]
Human preference: A > B (A is better)
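One way to store a comparison like the one above is as a (prompt, chosen, rejected) record. The sketch below is illustrative only: the `PreferencePair` class and `make_pair` helper are hypothetical names, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison: `chosen` was preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def make_pair(prompt: str, response_a: str, response_b: str, a_is_better: bool) -> PreferencePair:
    """Map an annotator's A-vs-B judgment to a (chosen, rejected) record."""
    if a_is_better:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


# The example from the text: the annotator judged A > B.
pair = make_pair(
    "Explain photosynthesis",
    "[detailed explanation]",
    "[brief explanation]",
    a_is_better=True,
)
```

Storing the winner and loser explicitly, rather than the raw "A > B" label, means downstream training code does not need to know which response was shown first.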
Training Reward Model
The reward model learns from pairwise comparisons:
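import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model
        # Assumes a Hugging Face-style config that exposes hidden_size
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids):
        # Hidden state of the last token; assumes sequences are not right-padded
        hidden = self.backbone(input_ids).last_hidden_state[:, -1]
        return self.reward_head(hidden).squeeze(-1)  # One scalar reward per sequence

# Bradley-Terry loss for pairwise preferences
def preference_loss(reward_chosen, reward_rejected):
    # logsigmoid is more numerically stable than log(sigmoid(x)); mean gives a scalar loss
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()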
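To make the loss concrete: under the Bradley-Terry model, the probability that the chosen response is preferred is the sigmoid of the reward margin, and the loss is the negative log of that probability. The following is a minimal pure-Python sketch for single pairs; the function names are illustrative and it is not a replacement for the batched loss.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed human preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# The loss shrinks as the reward model ranks the chosen response further ahead.
for margin in (-2.0, 0.0, 2.0):
    p = preference_probability(margin, 0.0)
    print(f"margin={margin:+.1f}  p(chosen preferred)={p:.3f}  loss={pairwise_loss(margin, 0.0):.3f}")
```

A zero margin gives a probability of 0.5 and a loss of log 2 (about 0.693). Only the difference between the two rewards matters, so the absolute scale of the reward is not pinned down by this loss.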
Data Collection Methods
| Method | Description |
|---|---|
| Pairwise comparison | A vs B, which is better |
| Rating scale | Rate 1-5 |
| Ranking | Order multiple responses |
| Best-of-N | Pick best from N options |
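Ranking data can be reduced to the pairwise format used for reward model training. The sketch below assumes a strict best-to-worst ranking with no ties; `ranking_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations() preserves input order, so the first item of each pair ranks higher.
    return list(combinations(ranked_responses, 2))


pairs = ranking_to_pairs(["resp_a", "resp_b", "resp_c"])
for chosen, rejected in pairs:
    print(f"{chosen} > {rejected}")
```

A ranking of K responses yields K(K-1)/2 pairs, so one ranking annotation carries more comparisons than a single A-vs-B judgment.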
Reward Model Training
# Training loop
for batch in dataloader:
    chosen = batch["chosen"]      # Preferred response
    rejected = batch["rejected"]  # Less preferred
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # .mean() guarantees a scalar loss for backward()
    loss = preference_loss(r_chosen, r_rejected).mean()
    optimizer.zero_grad()  # Clear gradients left over from the previous step
    loss.backward()
    optimizer.step()
Using Reward Model in RLHF
1. Generate response from LLM
2. Score with reward model
3. Use score as RL reward
4. Update LLM with PPO
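The data flow of the four steps above can be sketched with stand-in callables. This is not a PPO implementation: `rlhf_step` and its `generate`, `score`, and `policy_update` parameters are hypothetical names, and the toy lambdas exist only so the sketch runs end to end.

```python
from typing import Callable, List, Tuple

Rollout = Tuple[str, str, float]  # (prompt, response, reward)


def rlhf_step(
    prompts: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    policy_update: Callable[[List[Rollout]], None],
) -> List[Rollout]:
    """Run one generate -> score -> update round over a batch of prompts."""
    rollouts: List[Rollout] = []
    for prompt in prompts:
        response = generate(prompt)       # Step 1: generate a response from the LLM
        reward = score(prompt, response)  # Steps 2-3: the reward model score is the RL reward
        rollouts.append((prompt, response, reward))
    policy_update(rollouts)               # Step 4: update the LLM (PPO in the list above)
    return rollouts


# Toy stand-ins; none of these is a real model or optimizer.
seen: List[Rollout] = []
rollouts = rlhf_step(
    prompts=["Explain photosynthesis"],
    generate=lambda prompt: prompt + " -> [response]",
    score=lambda prompt, response: float(len(response)),
    policy_update=seen.extend,
)
```

Passing the components in as callables keeps the loop independent of any specific model or RL library.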
Challenges
| Challenge | Mitigation |
|---|---|
| Reward hacking | Regularize, diverse prompts |
| Annotation quality | Multiple annotators, guidelines |
| Distribution shift | Retrain on new model outputs |
| Mode collapse | KL penalty to reference model |
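The KL penalty in the table is commonly applied by subtracting a scaled log-probability gap from the reward model score. The sketch below works on sequence-level log-probabilities; the function name and the `beta=0.1` default are illustrative assumptions, not values from this article.

```python
def kl_penalized_reward(
    rm_score: float,
    policy_logprob: float,
    ref_logprob: float,
    beta: float = 0.1,  # Illustrative penalty strength; tune per setup
) -> float:
    """Reward model score minus a penalty for drifting from the reference model."""
    # log pi(y|x) - log pi_ref(y|x) is a single-sample estimate of the KL term.
    kl_estimate = policy_logprob - ref_logprob
    return rm_score - beta * kl_estimate


# No drift from the reference: the reward is the raw score.
print(kl_penalized_reward(rm_score=1.0, policy_logprob=-10.0, ref_logprob=-10.0))
# The policy assigns the response much higher probability than the reference: reward is reduced.
print(kl_penalized_reward(rm_score=1.0, policy_logprob=-5.0, ref_logprob=-10.0))
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower reward improvement.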
DPO Alternative
Direct Preference Optimization (DPO) skips the explicit reward model and trains the policy directly on preference pairs:
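import torch.nn.functional as F

# DPO loss (simplified); inputs are per-sequence log-probabilities, beta=0.1 is illustrative
def dpo_loss(policy_logp_chosen, policy_logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    log_ratio_chosen = policy_logp_chosen - ref_logp_chosen
    log_ratio_rejected = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected)).mean()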
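A worked single-pair example shows how the DPO loss behaves. This is a pure-Python sketch with made-up log-probabilities; `dpo_pair_loss` is a hypothetical name and `beta=0.1` is an assumed value.

```python
import math


def dpo_pair_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # Assumed value for illustration
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    log_ratio_chosen = policy_logp_chosen - ref_logp_chosen
    log_ratio_rejected = policy_logp_rejected - ref_logp_rejected
    margin = beta * (log_ratio_chosen - log_ratio_rejected)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: both log-ratios are zero.
at_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)
print(f"at_init={at_init:.3f}  improved={improved:.3f}")
```

When the policy equals the reference model, the loss is log 2 regardless of the log-probabilities themselves. It falls only as the policy shifts probability toward the chosen response relative to the reference. This has the same form as the Bradley-Terry loss, with the scaled log-ratio acting as an implicit reward.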
Best Practices
- Collect high-quality preference data
- Train on diverse prompts
- Monitor for reward hacking
- Combine with other alignment techniques
- Iterate on annotation guidelines