
Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is the classical recipe for turning a raw pretrained LLM into a helpful, harmless assistant. Humans rank candidate responses, a reward model learns to predict those rankings, and PPO (or a similar policy-gradient method) fine-tunes the LLM to generate responses the reward model scores highly. RLHF was the technique behind the original ChatGPT and the earliest aligned versions of Claude.

Quick reference

Proficiency
Intermediate
Also known as
RLHF, preference learning, alignment training
Prerequisites
Fine-tuning, Reinforcement learning (basics)

Frequently asked questions

What is RLHF?

Reinforcement Learning from Human Feedback is a training procedure where humans compare model outputs, a reward model learns their preferences, and the LLM is fine-tuned to maximize that reward. It converts a pretrained base model into a well-behaved chat assistant.

What are the three stages of RLHF?

1) Supervised fine-tuning (SFT) on human-written demonstrations. 2) Train a reward model on pairs where humans pick the better of two responses. 3) RL fine-tune the SFT policy against the reward model, usually with PPO, keeping a KL penalty to avoid drifting too far from the SFT model.
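The core math behind stages 2 and 3 is compact. A minimal numeric sketch, assuming scalar sequence-level reward scores and log-probabilities (function names and the `beta` value are illustrative, not from any particular library):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    # Stage 2: Bradley-Terry pairwise loss, -log sigmoid(r_chosen - r_rejected).
    # Minimized when the reward model scores the human-preferred response higher.
    return -math.log(sigmoid(r_chosen - r_rejected))

def kl_penalized_reward(r, logp_policy, logp_ref, beta=0.1):
    # Stage 3: the per-sequence reward the RL step actually maximizes.
    # The beta * (log pi - log pi_ref) term penalizes drift from the SFT model.
    return r - beta * (logp_policy - logp_ref)
```

Note how a wider reward margin drives the pairwise loss toward zero, and how the KL term subtracts from the reward exactly when the policy assigns a completion more probability than the frozen SFT reference does.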

Why is RLHF being replaced by DPO?

PPO is complex to implement and can be unstable to train. Direct Preference Optimization (DPO) skips the reward model entirely and directly optimizes a loss that makes preferred completions more likely than dispreferred ones. It achieves comparable quality with a simpler, more stable pipeline.
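The DPO loss can be sketched in a few lines. It treats beta-scaled log-probability ratios against the frozen reference model as implicit rewards and plugs them into the same pairwise form a reward model would be trained with (pure-stdlib sketch; argument names and `beta` are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward of a completion: beta * (log pi - log pi_ref).
    # DPO maximizes the margin between chosen and rejected implicit rewards.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Same -log sigmoid(margin) shape as the reward-model pairwise loss.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

No reward model, no sampling loop, no PPO machinery: the policy and a frozen reference are scored on static preference pairs, and a single supervised-style loss is minimized.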

What are common failure modes of RLHF?

Reward hacking (the model finds shortcuts the reward model likes but humans don't), sycophancy (agreeing with users regardless of truth), mode collapse (output diversity drops), and capability regressions where RLHF degrades skills present in the base model.

Sources

  1. Christiano et al. — Deep RL from Human Preferences — accessed 2026-04-20
  2. Ouyang et al. — InstructGPT — accessed 2026-04-20
  3. Hugging Face — Illustrating RLHF — accessed 2026-04-20
  4. Anthropic — Training a Helpful and Harmless Assistant with RLHF — accessed 2026-04-20