GRPO and RL for LLM Reasoning

Keywords: GRPO, group relative policy optimization, LLM reward-free RL, process reward model training, math reasoning RL

GRPO and RL for LLM Reasoning is the reinforcement learning training paradigm that directly optimizes large language models for verifiable reasoning tasks, particularly mathematical problem solving and code generation, using reward signals derived from solution correctness rather than human preference ratings. GRPO (Group Relative Policy Optimization) has emerged as a computationally efficient alternative to PPO that eliminates the value-function critic, enabling DeepSeek-R1 and similar models to achieve frontier mathematical reasoning.

Motivation: Beyond RLHF for Reasoning

- Standard RLHF: Human rates responses → reward model → PPO → better responses.
- Problem: Human raters cannot reliably evaluate complex math proofs or long code.
- Reasoning RL: Use verifiable rewards — math answer correct or not, code passes tests or not.
- Key insight: Verifiable tasks have binary/objective rewards → no human bottleneck.
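
As a concrete illustration, here is a minimal sketch of a rule-based verifiable reward for a math task. It assumes the (hypothetical) convention that the model writes its final answer inside \boxed{...}; adapt the parser to whatever answer format the task actually uses:

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Rule-based reward: +1 if the final answer matches the gold answer,
    0 otherwise. Assumes the answer appears inside \\boxed{...} (a common
    convention on math benchmarks; other formats need a different parser)."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no parsable answer -> no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```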

GRPO (Group Relative Policy Optimization, DeepSeek)

- Eliminates value function (critic) network → reduces memory and compute.
- For each question q, sample G outputs {o_1, ..., o_G} from policy π_θ.
- Compute reward r_i for each output (rule-based: correct answer = +1, wrong = 0, format = small bonus).
- Group relative advantage: A_i = (r_i - mean(r)) / std(r) → normalize within group.
- Policy gradient with clipped objective (similar to PPO clip):

```
L_GRPO(θ) = E[ (1/G) Σ_{i=1..G} min(
    (π_θ(o_i|q) / π_θ_old(o_i|q)) · A_i,
    clip(π_θ(o_i|q) / π_θ_old(o_i|q), 1-ε, 1+ε) · A_i
) ] - β · KL(π_θ || π_ref)
```

- KL penalty: Prevents excessive deviation from the SFT reference model π_ref.
- G=8–16 outputs per question; advantage normalized across group → stable training.
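
The following is a minimal sketch of the group-relative advantage and clipped objective in PyTorch, assuming sequence-level log-probabilities for simplicity (real implementations apply the ratio and KL term per token):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """GRPO objective for one question with G sampled outputs (a sketch).

    logp_new, logp_old, logp_ref: log-probs of each sampled output under
    the current, rollout (old), and frozen reference policies; shape (G,).
    rewards: rule-based rewards per output, shape (G,).
    """
    # Group-relative advantage: normalize within the group of G outputs;
    # the group mean acts as the baseline, replacing PPO's value critic.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the probability ratio.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv).mean()

    # KL penalty toward the reference policy, using the low-variance
    # estimator exp(d) - d - 1 with d = logp_ref - logp_new.
    d = logp_ref - logp_new
    kl = (torch.exp(d) - d - 1.0).mean()

    # Maximize surrogate minus the KL penalty -> minimize the negative.
    return -(surrogate - beta * kl)
```

Because the baseline is just the group mean, a question where all G samples receive identical rewards contributes near-zero advantage; no value network is needed, but reward diversity within a group matters in practice.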

DeepSeek-R1 Training Pipeline

1. Cold start: SFT on a small curated chain-of-thought dataset (a few thousand examples).
2. GRPO reasoning RL: Large-scale RL on math + code with rule-based rewards → "thinking" behavior emerges.
3. Rejection sampling SFT: Generate many outputs → keep only the correct ones → fine-tune on those trajectories (see the sketch after this list).
4. RLHF stage: Add human preference rewards for safety + helpfulness → final model.
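
A minimal sketch of stage 3's rejection-sampling loop, with hypothetical `policy.generate` and `is_correct` helpers standing in for rollout and answer verification:

```python
def rejection_sample_sft_data(policy, questions, n_samples=16):
    """Collect verifiably correct trajectories for the next SFT round.
    `policy.generate` and `is_correct` are hypothetical helpers for
    sampling one output and checking it against the gold answer."""
    dataset = []
    for q in questions:
        outputs = [policy.generate(q.prompt) for _ in range(n_samples)]
        correct = [o for o in outputs if is_correct(o, q.gold_answer)]
        dataset.extend((q.prompt, o) for o in correct)
    return dataset  # (prompt, trajectory) pairs for supervised fine-tuning
```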

Emergent Thinking Behaviors

- Models trained with GRPO spontaneously learn to:
  - Self-verify: "Let me check this answer..."
  - Backtrack: "This approach doesn't work, let me try differently..."
  - Explore alternatives: "Another way to solve this..."
- These reasoning patterns are NOT explicitly trained → emerge from reward signal alone.
- Analogous to how RL taught AlphaGo to discover novel Go strategies.

Process Reward Models (PRMs)

- Standard reward: Only correct final answer gets reward → sparse signal.
- PRM: Reward each step of the reasoning process → dense signal → better credit assignment.
- PRM training: Label which reasoning steps are correct (human labelers or automatic via step-checking).
- Math-Shepherd: Generate many solution trees → label via outcome verification → train PRM.
- PRM advantage: Penalizes wrong reasoning steps even if final answer happens to be correct.
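
To make the sparse-vs-dense distinction concrete, here is a small sketch contrasting outcome and process rewards, assuming a hypothetical `prm_score(prefix, step)` that returns the PRM's estimated probability that a step is correct:

```python
def outcome_reward(final_answer: str, gold: str) -> float:
    # Sparse: one scalar for the entire trajectory.
    return 1.0 if final_answer == gold else 0.0

def process_rewards(steps, prm_score):
    # Dense: one score per reasoning step -> finer credit assignment.
    # prm_score(prefix, step) is a hypothetical PRM call returning
    # P(step is correct | solution so far).
    scores, prefix = [], ""
    for step in steps:
        scores.append(prm_score(prefix, step))
        prefix += step
    return scores  # aggregate (e.g., min or product) for a trajectory score
```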

Comparison: PPO vs GRPO

| Aspect | PPO | GRPO |
|--------|-----|------|
| Critic network | Required (large memory) | Eliminated |
| Advantage estimation | GAE from value function | Group relative normalization |
| Compute | 2× model (actor + critic) | 1× model |
| Stability | Well-studied | Stable in practice on reasoning tasks |

Results

- DeepSeek-R1 (671B MoE): Matches o1-preview on AIME 2024, MATH-500.
- DeepSeek-R1-Zero (RL only, no SFT): 71% on AIME → demonstrates reasoning emerges from RL alone.
- Smaller models (1.5B–32B) distilled from R1 → strong reasoning in efficient packages.

GRPO and RL for reasoning form the training paradigm that unlocks chain-of-thought reasoning as a learnable, improvable skill rather than a fixed capability. By providing models with verifiable rewards for correct reasoning and optimizing them with group-relative policy gradients, these methods produce models that spontaneously develop human-like problem-solving strategies, including self-correction and exploration of alternative approaches. This suggests that human-level mathematical reasoning is achievable through reinforcement learning at scale, without hard-coded reasoning algorithms or millions of human annotations.
