Reward Models and Preference Learning

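What is a Reward Model?

A reward model is a model trained to predict human preferences between responses. Its scalar score is used to guide LLM training via RLHF (reinforcement learning from human feedback).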

Preference Data Collection

Prompt: "Explain photosynthesis"

Response A: [detailed explanation]
Response B: [brief explanation]

Human preference: A > B (A is better)
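A judgment like the one above is usually stored as one record per comparison. The following is a minimal sketch, assuming a hypothetical `PreferencePair` record and `from_comparison` helper; the field names are illustrative and not tied to any particular dataset format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human judgment: `chosen` was preferred over `rejected` for `prompt`."""
    prompt: str
    chosen: str
    rejected: str


def from_comparison(prompt: str, response_a: str, response_b: str,
                    a_is_better: bool) -> PreferencePair:
    """Map an 'A vs B' judgment onto chosen/rejected fields."""
    if a_is_better:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


# The photosynthesis example, with A preferred over B
pair = from_comparison(
    "Explain photosynthesis",
    "[detailed explanation]",
    "[brief explanation]",
    a_is_better=True,
)
```

Storing the outcome as chosen/rejected rather than as "A"/"B" labels keeps the record independent of the order in which responses were shown to the annotator.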

Training the Reward Model

The reward model learns from pairwise comparisons:

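import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model
        # Assumes a Hugging Face-style backbone whose config exposes hidden_size
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids):
        # Hidden state at the final position; assumes unpadded or left-padded inputs
        hidden = self.backbone(input_ids).last_hidden_state[:, -1]
        # One scalar reward per sequence, shape (batch,)
        return self.reward_head(hidden).squeeze(-1)

# Bradley-Terry loss for pairwise preferences
def preference_loss(reward_chosen, reward_rejected):
    # logsigmoid is numerically more stable than log(sigmoid(x)).
    # Returns one loss per pair; average over the batch before calling backward().
    return -F.logsigmoid(reward_chosen - reward_rejected)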
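To see what this loss rewards, it helps to evaluate it on a few scalar values. The sketch below uses plain Python floats instead of tensors; the function names `preference_probability` and `pairwise_loss` are illustrative. Under the Bradley-Terry model, the probability that the chosen response wins is the sigmoid of the reward difference, and the loss is the negative log of that probability.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Equal rewards: the model is undecided, so the loss equals log(2).
tie_loss = pairwise_loss(0.0, 0.0)

# Ranking the pair correctly with a margin of 2 lowers the loss;
# ranking it the wrong way round by the same margin raises it.
good_loss = pairwise_loss(2.0, 0.0)
bad_loss = pairwise_loss(0.0, 2.0)
```

Only the difference between the two rewards matters, so the reward scale has no absolute meaning: adding a constant to every reward leaves the loss unchanged.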

Data Collection Methods

Method | Description
Pairwise comparison | A vs. B, which is better
Rating scale | Rate 1-5
Ranking | Order multiple responses
Best-of-N | Pick best from N options
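Rankings and best-of-N selections can be reduced to the same pairwise format used for the Bradley-Terry loss. The sketch below shows one way to do that; the helper names `ranking_to_pairs` and `best_of_n_to_pairs` are hypothetical, and it assumes rankings are given best to worst with no ties.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations() preserves input order, so the first element
    # of every pair is the higher-ranked response.
    return list(combinations(ranked_responses, 2))


def best_of_n_to_pairs(responses, best_index):
    """Pair the selected response against each of the other N-1 options."""
    best = responses[best_index]
    return [(best, other) for i, other in enumerate(responses) if i != best_index]


ranked_pairs = ranking_to_pairs(["A", "B", "C"])
best_pairs = best_of_n_to_pairs(["A", "B", "C"], best_index=1)
```

A ranking of K responses yields K*(K-1)/2 pairs, while a best-of-N choice yields only N-1, because it says nothing about the order of the losing responses.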

Reward Model Training

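# Training loop
for batch in dataloader:
    chosen = batch["chosen"]  # Preferred response (token IDs)
    rejected = batch["rejected"]  # Less preferred response (token IDs)

    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)

    # Reduce to a scalar so backward() can be called on it
    loss = preference_loss(r_chosen, r_rejected).mean()

    optimizer.zero_grad()  # Clear gradients left over from the previous step
    loss.backward()
    optimizer.step()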
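The loss value alone can be hard to interpret, so a loop like this is often monitored with pairwise accuracy on held-out comparisons: the fraction of pairs where the chosen response receives the higher reward. A minimal sketch, assuming rewards have already been collected as plain Python floats (the function name `pairwise_accuracy` and the sample values are illustrative):

```python
def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response gets the strictly higher reward."""
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("reward lists must have the same length")
    if not chosen_rewards:
        raise ValueError("need at least one pair")
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)


# Made-up rewards for four held-out pairs; the second pair is ranked incorrectly.
accuracy = pairwise_accuracy([1.2, 0.3, -0.5, 2.0], [0.4, 0.9, -1.0, 1.5])
```

An accuracy near 0.5 means the model is doing no better than guessing on those pairs.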

Using Reward Model in RLHF

1. Generate response from LLM
2. Score with reward model
3. Use score as RL reward
4. Update LLM with PPO
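Steps 1 to 3 above can be sketched as a small rollout-collection function. This is an outline under stated assumptions, not a full RLHF implementation: `generate` and `score` are hypothetical callables standing in for the LLM and the reward model, the toy versions exist only so the sketch runs, and the PPO update in step 4 is left out.

```python
from typing import Callable, List, Tuple


def collect_rollouts(
    prompts: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[Tuple[str, str, float]]:
    """Generate one response per prompt and attach its reward."""
    rollouts = []
    for prompt in prompts:
        response = generate(prompt)       # 1. generate a response from the LLM
        reward = score(prompt, response)  # 2. score it with the reward model
        rollouts.append((prompt, response, reward))  # 3. the score is the RL reward
    return rollouts


# Toy stand-ins (assumptions for illustration only, not real models)
def toy_generate(prompt: str) -> str:
    return prompt.upper()


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


rollouts = collect_rollouts(["hi", "hello"], toy_generate, toy_score)
```

In a real pipeline, the collected (prompt, response, reward) tuples would be passed to the policy-gradient update in step 4.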

Challenges

Challenge | Mitigation
Reward hacking | Regularize, diverse prompts
Annotation quality | Multiple annotators, guidelines
Distribution shift | Retrain on new model outputs
Mode collapse | KL penalty to reference model
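The KL penalty in the last row is commonly applied by subtracting a scaled estimate of the policy's divergence from the reference model from the reward-model score. The sketch below assumes per-token log-probabilities of the sampled response are available under both models; the function name `kl_shaped_reward`, the coefficient `kl_coef`, and the sample numbers are illustrative.

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, kl_coef):
    """Reward-model score minus a penalty for drifting from the reference model."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability lists must cover the same tokens")
    # The summed per-token log-ratio of a sampled response is a
    # single-sample estimate of KL(policy || reference).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - kl_coef * kl_estimate


# Policy assigns higher probability to its own sample than the reference does,
# so the shaped reward ends up below the raw reward of 1.0.
shaped = kl_shaped_reward(
    reward=1.0,
    policy_logprobs=[-0.5, -1.0],
    ref_logprobs=[-0.7, -1.4],
    kl_coef=0.1,
)
```

A larger `kl_coef` keeps the policy closer to the reference model at the cost of slower reward improvement.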

DPO Alternative

Direct Preference Optimization (DPO) skips the explicit reward model and optimizes the policy directly on preference pairs:

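import torch.nn.functional as F

# DPO loss (simplified). Each *_logp argument is the summed token
# log-probability of a response under the policy or the frozen reference model.
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta):
    log_ratio_chosen = policy_chosen_logp - ref_chosen_logp
    log_ratio_rejected = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected)).mean()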
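A scalar walk-through makes the behavior of this loss easier to check by hand. The sketch below re-implements it for a single pair with plain floats; `dpo_pair_loss`, the log-probability values, and `beta=0.1` are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    log_ratio_chosen = policy_chosen - ref_chosen
    log_ratio_rejected = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))


# At initialization the policy equals the reference, so both log-ratios
# are zero and the loss is log(2) regardless of beta.
start_loss = dpo_pair_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)

# Raising the chosen response's log-probability and lowering the
# rejected one's, relative to the reference, reduces the loss.
improved_loss = dpo_pair_loss(-9.0, -13.0, -10.0, -12.0, beta=0.1)
```

The term `beta * log_ratio` plays the role of an implicit reward, which is why the expression has the same shape as the Bradley-Terry loss used for the reward model.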

Best Practices
