Curiosity · Concept
Group Relative Policy Optimization (GRPO)
Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath and scaled up in DeepSeek-R1, is a PPO-style RL algorithm specialised for language models. Instead of training a separate critic/value network, GRPO samples a group of responses for each prompt, scores them with a reward model or verifier, and normalizes each response's reward by the group's mean and standard deviation; the normalized reward serves as the advantage for every token of that response. This eliminates the memory cost of a value network while still giving stable, low-variance policy updates — and it is the algorithm that made R1's 'aha moment' reasoning RL feasible.
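The group-relative advantage described above can be sketched in a few lines. This is a minimal illustration of the idea, not DeepSeek's implementation; the function name and `eps` parameter are our own:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Z-normalize a group of per-response rewards to get advantages.

    Each response's advantage is (r_i - mean) / (std + eps); the same
    scalar is then applied to every token of that response.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled responses to one prompt, scored by a verifier.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are centered within the group, correct responses are pushed up and incorrect ones pushed down relative to each other, with no learned value function in the loop.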
Quick reference
- Proficiency
- Advanced
- Also known as
- GRPO, group relative PPO
- Prerequisites
- PPO, RLHF, policy gradient methods
Frequently asked questions
What is GRPO?
GRPO (Group Relative Policy Optimization) is an RL algorithm for LLM post-training that replaces PPO's value network with a group-relative baseline: for each prompt you sample K completions, z-normalize their rewards within the group, and use each response's normalized reward as its advantage.
How does GRPO differ from PPO?
PPO trains a separate value network (critic) alongside the policy, roughly doubling memory. GRPO drops the critic and instead uses the statistics of K sampled responses per prompt as a built-in baseline, making it more memory-efficient for huge LLMs.
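The policy update itself still uses PPO's clipped surrogate; only the source of the advantage changes. A hedged sketch of that objective with group-relative advantages plugged in (names and the per-token layout are ours; real implementations also add a KL penalty against a reference policy):

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, with group-relative advantages
    in place of critic-based ones (one advantage value per token).
    """
    ratio = np.exp(logp_new - logp_old)      # importance ratio per token
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic min of clipped/unclipped terms, averaged over tokens.
    return np.minimum(unclipped, clipped).mean()

# At the first update step logp_new == logp_old, so the surrogate
# reduces to the mean advantage.
val = grpo_surrogate(np.zeros(4), np.zeros(4), np.array([1.0, -1.0, 0.5, -0.5]))
```

Note there is no value-network forward pass or value loss anywhere in this update, which is where the memory saving comes from.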
Why does GRPO matter for reasoning models?
Reasoning RL (like DeepSeek-R1-Zero) needs long chains of thought and many rollouts. GRPO's lower memory footprint and stable group-relative advantages let DeepSeek run large-scale RL on reasoning traces where PPO would be too expensive.
What kind of reward does GRPO need?
Any per-response scalar reward: a reward model (RLHF style), a rule-based verifier (exact-match on math/code), or a judge model. DeepSeek-R1 famously used verifiable rule rewards for math and code.
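A rule-based verifier of the kind used for math can be as simple as an exact match on the final answer. This is a toy sketch under our own conventions (the `\boxed{}` extraction and function name are illustrative; real verifiers normalize expressions much more carefully):

```python
import re

def math_exact_match_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based verifier: 1.0 if the completion's final boxed
    answer matches the reference exactly, else 0.0."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0
```

Scoring each of the K sampled completions with such a function yields the per-response scalar rewards that the group normalization consumes.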
Sources
- Shao et al. — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — accessed 2026-04-20
- DeepSeek-AI — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL — accessed 2026-04-20