Reward modeling

Keywords: reward modeling, RLHF

Reward modeling is the process of training a neural network to predict human preferences, producing a learned scoring function that can evaluate AI outputs the way a human evaluator would. It is a core stage of RLHF (Reinforcement Learning from Human Feedback), typically following supervised fine-tuning, and it provides the signal that guides the language model toward more helpful, harmless, and honest behavior.

How Reward Modeling Works

- Step 1 — Collect Comparisons: Human evaluators are shown pairs of model outputs for the same prompt and asked which response they prefer. This produces a dataset of (prompt, preferred response, rejected response) triples.
- Step 2 — Train the Reward Model: A neural network (typically initialized from the same pretrained LM as the policy) is trained to assign higher scores to preferred responses than to rejected ones, using a pairwise ranking loss; a minimal training sketch follows this list.
- Step 3 — Deploy as Reward: The trained reward model serves as the optimization objective for the next RLHF stage — the policy model is trained to maximize the reward model's scores.
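To make Step 2 concrete, below is a minimal PyTorch sketch of the standard pairwise (Bradley-Terry) ranking loss, -log sigmoid(r_chosen - r_rejected), which shrinks as the preferred response's score pulls ahead of the rejected one's. The TinyRewardModel class, vocabulary size, tensor shapes, and learning rate are illustrative stand-ins; a real reward model replaces the toy embedding with a full pretrained transformer.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a reward model. In practice this is a full pretrained
# transformer; only the scalar head on top is newly initialized.
class TinyRewardModel(torch.nn.Module):
    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.head = torch.nn.Linear(hidden, 1)  # scalar reward head

    def forward(self, token_ids):
        h = self.embed(token_ids)[:, -1, :]  # final token's representation
        return self.head(h).squeeze(-1)      # one scalar reward per sequence

model = TinyRewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A batch of preference pairs: token IDs for the preferred and rejected
# responses (each would normally be the prompt concatenated with a response).
chosen = torch.randint(0, 1000, (4, 16))
rejected = torch.randint(0, 1000, (4, 16))

r_chosen = model(chosen)
r_rejected = model(rejected)

# Bradley-Terry ranking loss: minimizing -log sigmoid(r_chosen - r_rejected)
# pushes preferred scores above rejected scores.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Under this loss, sigmoid(r_chosen - r_rejected) can be read as the model's predicted probability that a human prefers the first response, which is exactly the Bradley-Terry model of pairwise preferences.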

Key Design Decisions

- Architecture: Usually a transformer with the final token's representation fed through a linear head to produce a scalar reward (as in the sketch above).
- Data Quality: The quality of the reward model depends heavily on consistent, high-quality human annotations. Noisy or inconsistent preferences degrade the reward signal.
- Overoptimization: If the policy model is optimized too aggressively against the reward model, it can learn to exploit quirks in the reward model rather than genuinely improving quality. A KL divergence penalty against a frozen reference model is the standard mitigation; a sketch follows this list.
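To illustrate that KL penalty, here is one common way it is folded into the RL-stage reward: an estimate of the policy's KL divergence from a frozen reference model (usually the supervised fine-tuned model) is subtracted from the reward model's score. The shaped_reward function, the kl_coef value, and the random tensors below are assumptions for the sketch, not any particular lab's implementation.

```python
import torch

def shaped_reward(rm_score, logprobs_policy, logprobs_ref, kl_coef=0.1):
    # Monte Carlo estimate of KL(policy || reference) on the sampled tokens:
    # sum over the response of (log p_policy - log p_ref).
    kl_estimate = (logprobs_policy - logprobs_ref).sum(dim=-1)
    # Penalize drift from the reference model to discourage reward hacking.
    return rm_score - kl_coef * kl_estimate

# Toy usage: random values stand in for real model outputs.
rm_score = torch.tensor([1.2, -0.3])        # reward model scores, batch of 2
logprobs_policy = torch.randn(2, 16) - 2.0  # per-token log-probs (policy)
logprobs_ref = torch.randn(2, 16) - 2.0     # per-token log-probs (reference)
print(shaped_reward(rm_score, logprobs_policy, logprobs_ref))
```

A larger kl_coef keeps the policy closer to the reference and curbs reward hacking, at the cost of limiting how much the policy can improve its reward.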

Challenges

- Reward Hacking: The policy finds outputs that score high on the reward model but aren't actually good by human standards.
- Distribution Shift: The reward model is trained on outputs from the initial model but must evaluate outputs from the optimized policy, which may look very different.
- Scaling Annotations: Collecting high-quality human preferences is expensive and doesn't scale easily.

Reward modeling is used by OpenAI, Anthropic, Google, and virtually all major labs as the primary mechanism for aligning LLMs with human preferences.
