GRPO and RL for LLM reasoning form the reinforcement learning training paradigm that directly optimizes large language models on verifiable reasoning tasks, particularly mathematical problem solving and code generation, using reward signals derived from solution correctness rather than human preference ratings. GRPO (Group Relative Policy Optimization) emerged as a computationally efficient alternative to PPO that eliminates the value-function critic, enabling DeepSeek-R1 and similar models to reach frontier mathematical reasoning.
Motivation: Beyond RLHF for Reasoning
- Standard RLHF: Human rates responses → reward model → PPO → better responses.
- Problem: Human raters cannot reliably evaluate complex math proofs or long code.
- Reasoning RL: Use verifiable rewards, e.g. the math answer is correct or not, the code passes tests or not (reward sketch after this list).
- Key insight: Verifiable tasks have binary/objective rewards → no human bottleneck.
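To make the "verifiable reward" idea concrete, here is a minimal Python sketch of the two reward checks mentioned above: exact answer matching for math and unit-test execution for code. The function names, the exact-match comparison, and the subprocess-based test runner are illustrative assumptions rather than any particular framework's API; production systems normalize answers (e.g. parse \boxed{} expressions) and sandbox generated code.
````python
import subprocess
import tempfile

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    # Real verifiers normalize or symbolically compare answers; exact match is the simplest stand-in.
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(program: str, test_code: str, timeout_s: int = 10) -> float:
    """Binary reward: 1.0 if the generated program passes the unit tests, else 0.0."""
    # Caution: this runs model-generated code directly; real pipelines sandbox execution.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + test_code)  # tests raise (non-zero exit) on failure
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
````
Because these checks need no human in the loop, rewards can be computed automatically for millions of rollouts.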
GRPO (Group Relative Policy Optimization, DeepSeek)
- Eliminates value function (critic) network → reduces memory and compute.
- For each question q, sample G outputs {o_1, ..., o_G} from policy π_θ.
- Compute reward r_i for each output (rule-based: correct answer = +1, wrong = 0, format = small bonus).
- Group relative advantage: A_i = (r_i - mean(r)) / std(r) → normalize within group.
- Policy gradient with clipped objective (similar to PPO clip):
````
L_GRPO = E[ (1/G) Σ_i min(
    (π_θ(o_i|q) / π_θ_old(o_i|q)) × A_i,
    clip(π_θ(o_i|q) / π_θ_old(o_i|q), 1-ε, 1+ε) × A_i
) ] - β × KL(π_θ || π_ref)
````
- KL penalty: Prevents the policy from drifting too far from the SFT reference model.
- G = 8–16 outputs per question; normalizing advantages within the group keeps training stable (minimal loss sketch below).
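The pieces above fit together in a short loss function. Below is a minimal PyTorch-style sketch, assuming sequence-level log-probabilities for each of the G sampled outputs (real implementations use per-token log-probs with masking) and illustrative values for ε and β; none of the names come from a specific codebase.
````python
import torch

def grpo_loss(
    logprobs_new: torch.Tensor,  # (G,) log-probs of the G outputs under the current policy pi_theta
    logprobs_old: torch.Tensor,  # (G,) log-probs of the same outputs under pi_theta_old (the sampler)
    logprobs_ref: torch.Tensor,  # (G,) log-probs under the frozen SFT reference pi_ref
    rewards: torch.Tensor,       # (G,) rule-based rewards, e.g. 1.0 correct / 0.0 wrong (+ format bonus)
    clip_eps: float = 0.2,       # assumed value for the clip range epsilon
    kl_beta: float = 0.04,       # assumed value for the KL penalty coefficient beta
) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within the group of G outputs (no critic needed).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the probability ratio pi_theta / pi_theta_old.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped).mean()

    # KL penalty toward the SFT reference (simple log-ratio estimate; papers often use an unbiased estimator).
    kl = (logprobs_new - logprobs_ref).mean()

    # Maximize the surrogate minus the KL penalty, i.e. minimize its negative.
    return -(surrogate - kl_beta * kl)
````
Usage: for each question, sample G completions from π_θ_old, score them with the rule-based reward, compute the three log-prob vectors, and backpropagate through grpo_loss; no value network appears anywhere.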
DeepSeek-R1 Training Pipeline
1. Cold start: SFT on a small curated chain-of-thought dataset (a few thousand examples).
2. GRPO reasoning RL: Large-scale RL on math + code with rule-based rewards → "thinking" behavior emerges.
3. Rejection sampling SFT: Generate many outputs → keep correct ones → fine-tune on the correct trajectories (filtering sketch after this list).
4. RLHF stage: Add human preference rewards for safety + helpfulness → final model.
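Step 3 is essentially a filter. A minimal sketch of rejection-sampling SFT data construction, assuming generic generate_fn / reward_fn callables (hypothetical names, e.g. the reward checks sketched earlier):
````python
from typing import Callable, Dict, List

def build_rejection_sampling_sft_set(
    generate_fn: Callable[[str], str],       # hypothetical: samples one completion for a prompt
    reward_fn: Callable[[str, str], float],  # hypothetical: 1.0 if the completion is verified correct
    prompts: List[str],
    references: List[str],
    n_samples: int = 16,
) -> List[Dict[str, str]]:
    """Keep only verified-correct completions; the next SFT stage trains on these trajectories."""
    kept = []
    for prompt, ref in zip(prompts, references):
        for _ in range(n_samples):
            completion = generate_fn(prompt)
            if reward_fn(completion, ref) == 1.0:
                kept.append({"prompt": prompt, "completion": completion})
    return kept
````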
Emergent Thinking Behaviors
- Models trained with GRPO spontaneously learn to:
- Self-verify: "Let me check this answer..."
- Backtrack: "This approach doesn't work, let me try differently..."
- Explore alternatives: "Another way to solve this..."
- These reasoning patterns are NOT explicitly trained → emerge from reward signal alone.
- Analogous to how RL taught AlphaGo to discover novel Go strategies.
Process Reward Models (PRMs)
- Standard (outcome) reward: only a correct final answer earns reward → sparse signal.
- PRM: Reward each step of the reasoning process → dense signal → better credit assignment.
- PRM training: Label which reasoning steps are correct (human labelers or automatic via step-checking).
- Math-Shepherd: Generate many solution trees → label steps via outcome verification → train PRM (labeling sketch after this list).
- PRM advantage: Penalizes wrong reasoning steps even if final answer happens to be correct.
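A hedged sketch of Math-Shepherd-style automatic step labeling: a step is marked good if rollouts continued from that step's prefix still reach the reference answer often enough. The callables, the rollout count, and the success-rate threshold are illustrative assumptions, not the paper's exact procedure.
````python
from typing import Callable, List

def label_steps_by_rollout(
    prompt: str,
    steps: List[str],                      # one candidate solution, split into reasoning steps
    reference_answer: str,
    continue_fn: Callable[[str], str],     # hypothetical: completes a partial solution to a final answer
    check_fn: Callable[[str, str], bool],  # hypothetical: True if the completion matches the reference
    n_rollouts: int = 8,
    min_success_rate: float = 0.25,        # assumed threshold for calling a step "correct"
) -> List[int]:
    """Return a 0/1 label per step, usable as PRM training targets."""
    labels = []
    prefix = prompt
    for step in steps:
        prefix = prefix + "\n" + step
        successes = sum(check_fn(continue_fn(prefix), reference_answer) for _ in range(n_rollouts))
        labels.append(1 if successes / n_rollouts >= min_success_rate else 0)
    return labels
````
A PRM trained on such labels can then score every intermediate step during RL, providing the dense credit assignment described above.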
Comparison: PPO vs GRPO
| Aspect | PPO | GRPO |
|--------|-----|------|
| Critic network | Required (large memory) | Eliminated |
| Advantage estimation | GAE from value function | Group relative normalization |
| Compute | 2× model (actor + critic) | 1× model |
| Stability | Well-studied | Stable in practice for reasoning RL |
Results
- DeepSeek-R1 (671B MoE): Performance comparable to OpenAI o1 on AIME 2024 and MATH-500.
- DeepSeek-R1-Zero (RL only, no SFT): 71% on AIME → demonstrates reasoning emerges from RL alone.
- Smaller models (1.5B–32B) distilled from R1 → strong reasoning in efficient packages.
GRPO and RL for reasoning are the training paradigm that turns chain-of-thought reasoning into a learnable, improvable skill rather than a fixed capability. By giving models verifiable rewards for correct solutions and optimizing them with group-relative policy gradients, these methods produce models that spontaneously develop human-like problem-solving strategies, including self-correction and exploration of alternative approaches. This suggests that frontier-level mathematical reasoning can be reached through reinforcement learning at scale, without hard-coded reasoning algorithms or millions of human annotations.