GRPO and RL for LLM Reasoning

GRPO and RL for LLM Reasoning refers to the reinforcement learning training paradigm that directly optimizes large language models for verifiable reasoning tasks, particularly mathematical problem solving and code generation. Reward signals are derived from solution correctness rather than from human preference ratings. GRPO (Group Relative Policy Optimization) has emerged as a computationally efficient alternative to PPO that eliminates the value-function critic, and it is the RL algorithm behind DeepSeek-R1 and similar models that reach frontier-level mathematical reasoning.

Motivation: Beyond RLHF for Reasoning

GRPO (Group Relative Policy Optimization, DeepSeek)

L_GRPO = E[min(
    (π_θ(o|q) / π_θ_old(o|q)) × A,
    clip((π_θ(o|q) / π_θ_old(o|q)), 1-ε, 1+ε) × A
)] - β × KL(π_θ || π_ref)
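
To make the objective concrete, here is a minimal sketch in plain Python, not a reference implementation. It follows the sequence-level form written above, with one log-probability per sampled output. The advantage A is assumed to be the group-relative normalization named in the comparison table: each output's reward minus the group mean, divided by the group standard deviation. The function names, the default values `clip_eps=0.2` and `beta=0.04`, and the precomputed `kl_to_ref` argument are illustrative assumptions, not values taken from this article.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Group mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_theta_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)

def grpo_objective(logp_new, logp_old, rewards, kl_to_ref,
                   clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus the KL penalty (a quantity to maximize).

    kl_to_ref is assumed to be a precomputed estimate of KL(pi_theta || pi_ref);
    clip_eps and beta are placeholder hyperparameters.
    """
    advantages = group_relative_advantages(rewards)
    surrogate = clipped_surrogate(logp_new, logp_old, advantages, clip_eps)
    return surrogate - beta * kl_to_ref

# Four sampled outputs for one question: two correct (reward 1), two wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group's own mean reward, no learned value function is needed. If every output in a group gets the same reward, all advantages are zero and that group contributes no gradient signal.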

DeepSeek-R1 Training Pipeline

1. Cold start: SFT on a small set of curated chain-of-thought data (a few thousand examples).
2. GRPO reasoning RL: large-scale RL on math and code with rule-based rewards, during which "thinking" behavior emerges.
3. Rejection sampling SFT: generate many outputs, keep the correct ones, and fine-tune on those trajectories.
4. RLHF stage: add human preference rewards for safety and helpfulness to produce the final model.
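
As an illustration of the rule-based reward in step 2 and the filtering in step 3, the sketch below scores a completion by exact match on its final answer and keeps only the correct completions as SFT examples. It rests on assumptions not stated in this article: the final answer is taken to be wrapped in `\boxed{...}`, correctness is a plain string comparison, and the helper names `extract_final_answer`, `rule_based_reward`, and `rejection_sample` are hypothetical.

```python
import re

# Assumed answer format: the final answer appears as \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_final_answer(completion):
    """Return the content of the last boxed answer, or None if there is none."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None

def rule_based_reward(completion, reference_answer):
    """1.0 if the extracted answer equals the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference_answer.strip():
        return 1.0
    return 0.0

def rejection_sample(prompt, completions, reference_answer):
    """Keep only the completions that earn full reward, as SFT examples."""
    return [
        {"prompt": prompt, "completion": c}
        for c in completions
        if rule_based_reward(c, reference_answer) == 1.0
    ]

# Hypothetical sampled outputs for a single prompt.
samples = [
    "6 * 7 = 42, so the answer is \\boxed{42}",
    "6 * 7 = 41, so the answer is \\boxed{41}",
    "I am not sure.",
]
kept = rejection_sample("What is 6 * 7?", samples, "42")
```

A practical checker would also need to normalize equivalent answers (for example `0.5` versus `1/2`) and, for code, run unit tests. Exact match is used here only to keep the sketch short.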

Emergent Thinking Behaviors

Process Reward Models (PRMs)

Comparison: PPO vs GRPO

| Aspect | PPO | GRPO |
|---|---|---|
| Critic network | Required (large memory) | Eliminated |
| Advantage estimation | GAE from value function | Group relative normalization |
| Compute | 2× model (actor + critic) | 1× model |
| Stability | Well-studied | Equally stable for reasoning |

Results

GRPO and RL for reasoning form the training paradigm that treats chain-of-thought reasoning as a learnable, improvable skill rather than a fixed capability. By giving models verifiable rewards for correct solutions and optimizing them with group-relative policy gradients, these methods produce models that spontaneously develop human-like problem-solving strategies, including self-correction and exploration of alternative approaches. This suggests that human-level mathematical reasoning is achievable through reinforcement learning at scale, without hard-coded reasoning algorithms or millions of human annotations.
