Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath and scaled up in DeepSeek-R1, is a PPO-style RL algorithm specialised for language models. Instead of training a separate critic/value network, GRPO samples a group of responses for each prompt, scores them with a reward model or verifier, and computes each response's advantage by subtracting the group's mean reward and dividing by the group's standard deviation; that per-response advantage is then applied to every token of the response. This eliminates the memory cost of a value network while still giving stable, low-variance policy updates, and it is the algorithm that made R1's 'aha moment' reasoning RL feasible.
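The group-relative baseline can be sketched in a few lines. This is an illustrative snippet, not DeepSeek's code; the function name and the epsilon constant are my own choices:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Z-normalize the scalar rewards of one prompt's sampled group.

    rewards: list of K per-response rewards for the same prompt.
    Returns one advantage per response; in GRPO that value is applied
    to every token of the corresponding response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled answers, two correct (reward 1) and two wrong (reward 0).
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers end up with positive advantages and wrong ones with negative advantages, so the policy gradient pushes probability mass toward the better responses without ever consulting a learned value function.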

Quick reference

Proficiency
Advanced
Also known as
GRPO, group relative PPO
Prerequisites
PPO, RLHF, Policy gradient methods

Frequently asked questions

What is GRPO?

GRPO (Group Relative Policy Optimization) is an RL algorithm for LLM post-training that replaces PPO's value network with a group-relative baseline: for each prompt you sample K completions, z-normalize their rewards, and use that as the advantage signal.

How does GRPO differ from PPO?

PPO trains a separate value network (critic) alongside the policy, roughly doubling memory. GRPO drops the critic and instead uses the statistics of K sampled responses per prompt as a built-in baseline, making it more memory-efficient for huge LLMs.

Why does GRPO matter for reasoning models?

Reasoning RL (like DeepSeek-R1-Zero) needs long chains of thought and many rollouts. GRPO's lower memory footprint and stable group-relative advantages let DeepSeek run large-scale RL on reasoning traces where PPO would be too expensive.

What kind of reward does GRPO need?

Any per-response scalar reward: a reward model (RLHF style), a rule-based verifier (exact-match on math/code), or a judge model. DeepSeek-R1 famously used verifiable rule rewards for math and code.
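A rule-based verifier can be as simple as extracting a final answer and comparing it to the reference. The sketch below is in the spirit of R1's verifiable math rewards but is a hypothetical simplification; DeepSeek's actual extraction and matching rules are more involved:

```python
import re

def math_reward(response: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the final \\boxed{...} answer matches, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable answer -> no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Because the reward is computed per response, the same GRPO machinery works unchanged whether the scalar comes from a learned reward model, a judge LLM, or a deterministic checker like this one.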

Sources

  1. Shao et al. — DeepSeekMath: Pushing the Limits of Mathematical Reasoning — accessed 2026-04-20
  2. DeepSeek-AI — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL — accessed 2026-04-20