Reinforcement Learning from Human Feedback (RLHF) is the alignment training methodology that fine-tunes large language models to follow human instructions, be helpful, and avoid harmful outputs. It works by first training a reward model on human preference judgments, then using reinforcement learning (typically Proximal Policy Optimization, PPO) to optimize the LLM's policy to maximize the learned reward while staying close to its supervised fine-tuned starting point.
The Three Stages of RLHF
Stage 1: Supervised Fine-Tuning (SFT)
A pre-trained base model is fine-tuned on high-quality demonstrations of desired behavior — human-written responses to diverse prompts covering instruction following, question answering, creative writing, coding, and refusal of harmful requests. This gives the model basic instruction-following ability.
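A minimal sketch of the SFT loss, assuming a PyTorch-style causal LM that returns next-token logits; the function and tensor names here are illustrative, not from any specific library:

```python
import torch
from torch.nn import functional as F

# Assumes a causal LM where model(input_ids) -> logits of
# shape (batch, seq_len, vocab_size).

def sft_loss(logits, input_ids, prompt_lens):
    """Cross-entropy on response tokens only: the human-written
    demonstration is the training signal; prompt tokens are masked out."""
    # Shift so the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask targets that are still inside the prompt.
    for i, plen in enumerate(prompt_lens):
        shift_labels[i, : plen - 1] = -100  # ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```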
Stage 2: Reward Model Training
Human annotators compare pairs of model responses to the same prompt and indicate which response is better. A reward model (typically the same architecture as the LLM, with a scalar output head) is trained to predict human preferences using the Bradley-Terry model: P(y_w > y_l | x) = sigma(r(x, y_w) - r(x, y_l)), where y_w is the preferred response and y_l the rejected one. The reward model thus learns a numerical score that correlates with human quality judgments.
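The corresponding training loss is the negative log-likelihood of the preferences. A minimal sketch, assuming PyTorch and a model that already maps (prompt, response) pairs to scalar rewards (names are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood over a batch of preference pairs.
    r_chosen / r_rejected: scalar rewards r(x, y_w) and r(x, y_l), shape (batch,).
    Maximizing P(y_w > y_l) = sigmoid(r_w - r_l) is equivalent to
    minimizing -log sigmoid(r_w - r_l)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example with dummy rewards for a batch of 8 preference pairs:
loss = reward_model_loss(torch.randn(8), torch.randn(8))
```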
Stage 3: RL Optimization (PPO)
The SFT model is further trained using Proximal Policy Optimization to maximize the reward model's score while minimizing KL divergence from the SFT model (preventing the policy from "gaming" the reward model by generating adversarial outputs that score high but are low quality):
objective = E_{x ~ D, y ~ pi_rl}[ r_theta(x, y) ] - beta * KL(pi_rl(. | x) || pi_sft(. | x))
The coefficient beta controls the tradeoff between maximizing reward and staying close to the SFT distribution: too small and the policy drifts into reward hacking, too large and it barely improves over SFT.
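In practice the KL term is usually estimated per sample from log-probabilities under the two policies and folded into the reward before the PPO update. A minimal sketch assuming PyTorch tensors of per-token log-probs (names and the beta value are illustrative):

```python
import torch

def shaped_reward(reward, logprobs_rl, logprobs_sft, beta=0.1):
    """KL-penalized reward for one batch of sampled responses.
    reward: reward model scores, shape (batch,).
    logprobs_rl / logprobs_sft: per-token log-probs of the sampled tokens
    under the current policy and the frozen SFT policy, shape (batch, seq_len).
    beta=0.1 is an illustrative value, not a recommended setting."""
    # The summed per-token log-ratio is a sample estimate of KL(pi_rl || pi_sft).
    kl_estimate = (logprobs_rl - logprobs_sft).sum(dim=-1)
    return reward - beta * kl_estimate
```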
Why RLHF Works
Human preferences are easier to collect than demonstrations. It's hard for annotators to write a perfect response, but easy to say "Response A is better than Response B." This comparative signal, amplified through the reward model, teaches the LLM nuanced quality distinctions that demonstration data alone cannot capture — subtleties of tone, completeness, safety, and helpfulness.
Challenges
- Reward Hacking: The policy finds outputs that score high on the reward model but are not genuinely good (verbose, sycophantic, or repetitive responses). The KL constraint mitigates this but doesn't eliminate it.
- Annotation Quality: Human preferences are noisy, biased, and inconsistent across annotators. Inter-annotator agreement is often only 60-75%, putting a ceiling on reward model accuracy.
- Training Instability: PPO is notoriously sensitive to hyperparameters. The interplay between the policy, reward model, and KL constraint creates a complex optimization landscape.
Constitutional AI (CAI)
Anthropic's approach replaces human annotators with AI self-critique. The model generates responses, critiques them against a set of principles ("constitution"), and revises them. Preference pairs are generated by comparing original and revised responses. This scales annotation beyond human bandwidth while maintaining alignment with explicit principles.
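A hypothetical sketch of that critique-and-revise loop; `generate` stands in for sampling from the model, and the principle text is illustrative, not Anthropic's actual constitution:

```python
# `generate` is a hypothetical stand-in for sampling a completion
# from the model given a prompt string.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
]

def cai_preference_pair(generate, prompt: str, principle: str):
    """Produce one (rejected, chosen) preference pair without human labels."""
    original = generate(prompt)
    critique = generate(
        f"{prompt}\n\nResponse: {original}\n\n"
        f"Critique this response against the principle: {principle}"
    )
    revision = generate(
        f"{prompt}\n\nResponse: {original}\n\nCritique: {critique}\n\n"
        "Rewrite the response to address the critique."
    )
    # The original is treated as the rejected response, the revision as chosen.
    return original, revision
```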
Alternatives and Evolution
DPO, KTO, ORPO, and other methods simplify RLHF by removing the explicit reward model and/or RL loop. However, the full RLHF pipeline (with a trained reward model) remains the gold standard for the most capable frontier models.
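DPO, for instance, replaces the reward model and RL loop with a single classification-style loss on preference pairs, using beta * log(pi / pi_ref) as an implicit reward. A minimal sketch assuming PyTorch and precomputed sequence log-probabilities (names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss.
    logp_w / logp_l: summed log-probs of the chosen and rejected responses
    under the policy being trained, shape (batch,).
    ref_logp_w / ref_logp_l: the same under the frozen reference (SFT) model.
    beta=0.1 is an illustrative value."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```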
RLHF is the training methodology that transformed raw language models into the helpful, harmless assistants the world now uses daily — bridging the gap between "predicts the next token" and "answers your question thoughtfully and safely."