Curiosity · Concept
Group Relative Policy Optimization (GRPO)
Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath and scaled up in DeepSeek-R1, is a PPO-style RL algorithm specialised for language models. Instead of training a separate critic/value network, GRPO samples a group of responses for each prompt, scores them with a reward model or verifier, and normalizes each response's reward by the group's mean and standard deviation; the normalized reward serves as the advantage for every token of that response. This eliminates the memory cost of a value network while still giving stable, low-variance policy updates — and it is the algorithm that made R1's 'aha moment' reasoning RL feasible.
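The group-relative advantage described above can be sketched in a few lines. This is a minimal illustration of the idea, not DeepSeek's implementation; the function name and `eps` parameter are our own:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Z-normalize a group of per-response rewards to get advantages.

    Each response's advantage is (r_i - mean) / (std + eps); the same
    scalar is then applied to every token of that response.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled responses to one prompt, scored by a verifier.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are centered within the group, correct responses are pushed up and incorrect ones pushed down relative to each other, with no learned value function in the loop.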
Quick reference
- Proficiency
- Advanced
- Also known as
- GRPO, group relative PPO
- Prerequisites
- PPO, RLHF, policy gradient methods
Frequently asked questions
What is GRPO?
GRPO (Group Relative Policy Optimization) is an RL algorithm for LLM post-training that replaces PPO's value network with a group-relative baseline: for each prompt you sample K completions, z-normalize their rewards within the group, and use each response's normalized reward as its advantage.
How does GRPO differ from PPO?
PPO trains a separate value network (critic) alongside the policy, roughly doubling memory. GRPO drops the critic and instead uses the statistics of K sampled responses per prompt as a built-in baseline, making it more memory-efficient for huge LLMs.
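The policy update itself still uses PPO's clipped surrogate; only the source of the advantage changes. A hedged sketch of that objective with group-relative advantages plugged in (names and the per-token layout are ours; real implementations also add a KL penalty against a reference policy):

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, with group-relative advantages
    in place of critic-based ones (one advantage value per token).
    """
    ratio = np.exp(logp_new - logp_old)      # importance ratio per token
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic min of clipped/unclipped terms, averaged over tokens.
    return np.minimum(unclipped, clipped).mean()

# At the first update step logp_new == logp_old, so the surrogate
# reduces to the mean advantage.
val = grpo_surrogate(np.zeros(4), np.zeros(4), np.array([1.0, -1.0, 0.5, -0.5]))
```

Note there is no value-network forward pass or value loss anywhere in this update, which is where the memory saving comes from.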
Why does GRPO matter for reasoning models?
Reasoning RL (like DeepSeek-R1-Zero) needs long chains of thought and many rollouts. GRPO's lower memory footprint and stable group-relative advantages let DeepSeek run large-scale RL on reasoning traces where PPO would be too expensive.
What kind of reward does GRPO need?
Any per-response scalar reward: a reward model (RLHF style), a rule-based verifier (exact-match on math/code), or a judge model. DeepSeek-R1 famously used verifiable rule rewards for math and code.
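A rule-based verifier of the kind used for math can be as simple as an exact match on the final answer. This is a toy sketch under our own conventions (the `\boxed{}` extraction and function name are illustrative; real verifiers normalize expressions much more carefully):

```python
import re

def math_exact_match_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based verifier: 1.0 if the completion's final boxed
    answer matches the reference exactly, else 0.0."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0
```

Scoring each of the K sampled completions with such a function yields the per-response scalar rewards that the group normalization consumes.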
Sources
- Shao et al. — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — accessed 2026-04-20
- DeepSeek-AI — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL — accessed 2026-04-20