Curiosity · Concept
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO), introduced by Schulman et al. in 2017, is the reinforcement-learning workhorse behind classic RLHF systems like InstructGPT, ChatGPT, and GPT-4. PPO is an actor-critic method: a policy network is nudged toward higher reward, a value network estimates expected return, and a clipped surrogate objective prevents any single update from pulling the policy too far from the previous step. In LLM post-training, PPO operates on generated completions, uses a reward model to score them, and adds a KL penalty against the reference model to preserve fluency.
Quick reference
- Proficiency
- Advanced
- Also known as
- PPO, clipped PPO
- Prerequisites
- Policy gradient methods, RLHF basics
Frequently asked questions
What is PPO?
PPO is an on-policy actor-critic RL algorithm. It maximises a clipped surrogate objective: the policy update is bounded so the new policy can't move too far from the old one in a single step, which keeps training stable.
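The clipped surrogate described above can be sketched in a few lines. This is a minimal, illustrative implementation over per-token log-probabilities; the function name and the default `clip_eps=0.2` are assumptions for the sketch, not a fixed standard.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate (negated, so minimising it maximises the objective).

    logp_new / logp_old: per-token log-probs under the current and old policy.
    advantages: per-token advantage estimates (e.g. from GAE).
    """
    ratio = np.exp(logp_new - logp_old)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Element-wise minimum: the update gets no extra credit once the ratio
    # leaves the [1 - eps, 1 + eps] trust region in the rewarded direction.
    return -np.mean(np.minimum(unclipped, clipped))
```

Taking the minimum of the clipped and unclipped terms is what bounds each update: a large probability ratio stops improving the objective, so the gradient on that token vanishes.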
Why is PPO used for RLHF?
RLHF needs stable updates on a huge language model with noisy, learned reward signals. PPO's clipping plus a KL penalty against the reference model provide the stability needed to fine-tune LLMs on preference rewards without collapsing their fluency.
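One common way to combine the reward-model score with the KL penalty, in the spirit of InstructGPT-style RLHF, is to shape per-token rewards before PPO sees them. The function below is a hedged sketch: the name, the `beta=0.02` coefficient, and crediting the reward-model score at the final token are illustrative choices, not the only possible setup.

```python
import numpy as np

def shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.02):
    """Per-token rewards for PPO on one sampled completion.

    rm_score: scalar reward-model score for the whole completion.
    logp_policy / logp_ref: per-token log-probs under the policy being
    trained and the frozen reference (SFT) model.
    beta: KL coefficient (illustrative value).
    """
    # Per-token KL estimate: log pi(a|s) - log pi_ref(a|s).
    kl = logp_policy - logp_ref
    rewards = -beta * kl          # penalise drifting from the reference model
    rewards[-1] += rm_score       # sequence-level RM score at the last token
    return rewards
```

The KL term acts as a regulariser at every token, while the preference signal arrives only at the end of the sequence, so PPO's advantage estimation has to propagate it backwards.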
What are PPO's drawbacks for LLMs?
PPO requires a value network (a significant memory cost at LLM scale), careful hyperparameter tuning, and many sampled rollouts. It is also prone to reward hacking, where the policy over-optimizes against flaws in the learned reward model. That's why alternatives like DPO (no RL loop) and GRPO (no value network) are gaining ground.
How does PPO compare to DPO and GRPO?
DPO reformulates preference learning as supervised learning with no sampling or reward model. GRPO keeps PPO-style RL but drops the value network, using group-relative advantages instead. Classical PPO is the most general but most expensive of the three.
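GRPO's replacement for the value network can be sketched directly: sample a group of completions per prompt, then standardise each completion's reward against its group. The function below is a simplified illustration of that idea; the normalisation details vary across implementations.

```python
import numpy as np

def group_relative_advantages(group_rewards):
    """GRPO-style advantages for one prompt's group of sampled completions.

    Each completion's reward is normalised against the group mean and
    standard deviation, so the group itself serves as the baseline that
    PPO would otherwise get from a learned value network.
    """
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards zero-variance groups
```

Because the baseline comes from sibling samples rather than a critic, GRPO trades the value network's memory footprint for the cost of sampling several completions per prompt.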
Sources
- Schulman et al. — Proximal Policy Optimization Algorithms — accessed 2026-04-20
- Ouyang et al. — Training language models to follow instructions with human feedback (InstructGPT) — accessed 2026-04-20