Curiosity · Concept
Direct Preference Optimization (DPO)
DPO, introduced by Rafailov et al. in 2023, reformulates preference learning as a supervised objective. Given pairs of (preferred, dispreferred) completions, you directly adjust the model so preferred ones become more likely relative to dispreferred ones. It delivers RLHF-quality alignment without training a separate reward model or running unstable RL.
Quick reference
- Proficiency
- Intermediate
- Also known as
- DPO, preference fine-tuning
- Prerequisites
- RLHF, Fine-tuning
Frequently asked questions
What is DPO?
Direct Preference Optimization is a fine-tuning method that aligns a model to human preferences using a simple classification-style loss on preferred vs dispreferred response pairs. It achieves results similar to RLHF without training an explicit reward model.
How is DPO different from RLHF?
RLHF first trains a reward model on preference data and then uses PPO to optimize the policy against it, which means juggling extra models and a brittle RL loop. DPO derives a closed-form loss that the policy minimizes directly on preference pairs, with a frozen reference model used only for a KL-style regularization term.
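That closed-form loss is simple enough to sketch in a few lines. A minimal per-pair version in pure Python, assuming you have already computed the summed token log-probabilities of each response under the policy and the frozen reference model (the function name and the beta default are illustrative, not from the paper's code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probs of each full response.
    beta scales the implicit reward; it controls how far the
    policy may drift from the reference model.
    """
    # Implicit rewards: log-ratio of policy to reference model
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log sigmoid(margin): small when the policy already prefers
    # the chosen response more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; the gradient then pushes the chosen response's likelihood up and the rejected one's down.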
When should I use DPO?
When you have preference data (pairs of 'better' vs 'worse' responses) and want to align a model. DPO is easier to implement, more stable, and has become the default alignment method for most open-source post-training pipelines.
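If you are assembling preference data for a DPO run, the standard pairwise format is one prompt with a chosen and a rejected completion — these are the column names Hugging Face TRL's DPOTrainer expects (the example texts here are made up):

```python
# One record in the pairwise preference format used by DPO pipelines
# such as TRL's DPOTrainer: prompt, chosen, rejected.
record = {
    "prompt": "Explain KL divergence in one sentence.",
    "chosen": "KL divergence measures how much one probability "
              "distribution diverges from another.",
    "rejected": "KL divergence is a type of neural network.",
}
```

A training set is just a list of such records; no scalar reward labels are needed, only the pairwise ranking.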
What are DPO variants I should know?
IPO adds a regularized objective that avoids DPO's tendency to overfit when preference pairs are perfectly separable. KTO uses a prospect-theory loss and works with unpaired thumbs-up/thumbs-down data. ORPO folds SFT and preference optimization into one stage. SimPO drops the reference model and length-normalizes the log-probabilities. Each trades off stability, data requirements, and compute differently.
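To make the SimPO contrast concrete, here is a minimal per-pair sketch in pure Python: the implicit reward is the length-normalized policy log-probability, no reference model appears, and a target margin gamma is subtracted inside the sigmoid (the beta and gamma values are illustrative; the SimPO paper tunes them per model):

```python
import math

def simpo_loss(chosen_logp, rejected_logp, chosen_len, rejected_len,
               beta=2.0, gamma=0.5):
    """SimPO loss for one pair: reference-free, length-normalized.

    chosen_logp / rejected_logp are summed token log-probs under the
    policy only; chosen_len / rejected_len are token counts.
    """
    # Implicit reward: average per-token log-prob, scaled by beta
    chosen_reward = beta * chosen_logp / chosen_len
    rejected_reward = beta * rejected_logp / rejected_len
    # gamma demands the chosen response win by a fixed margin
    margin = chosen_reward - rejected_reward - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Compared to the DPO loss, this needs only one model in memory, and the length normalization counters the tendency of sequence-level log-prob objectives to favor longer responses.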
Sources
- Rafailov et al. — Direct Preference Optimization — accessed 2026-04-20
- Hugging Face TRL — DPO Trainer — accessed 2026-04-20
- Ethayarajh et al. — KTO — accessed 2026-04-20