Curiosity · Concept
Direct Preference Optimization (DPO)
DPO, introduced by Rafailov et al. in 2023, reformulates preference learning as a supervised objective. Given pairs of (preferred, dispreferred) completions, you directly adjust the model so preferred ones become more likely relative to dispreferred ones. It delivers RLHF-quality alignment without training a separate reward model or running unstable RL.
Quick reference
- Proficiency
- Intermediate
- Also known as
- DPO, preference fine-tuning
- Prerequisites
- RLHF, Fine-tuning
Frequently asked questions
What is DPO?
Direct Preference Optimization is a fine-tuning method that aligns a model to human preferences using a simple classification-style loss on preferred vs dispreferred response pairs. It achieves results similar to RLHF without training an explicit reward model.
How is DPO different from RLHF?
RLHF first trains a reward model on preference data and then uses PPO to optimize the policy against it, which means juggling extra models and a brittle RL loop. DPO derives a closed-form loss that the policy minimizes directly on preference pairs, with a frozen reference model used only for a KL-style regularization term.
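That closed-form loss is simple enough to sketch in a few lines. A minimal per-pair version in pure Python, assuming you have already computed the summed token log-probabilities of each response under the policy and the frozen reference model (the function name and the beta default are illustrative, not from the paper's code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probs of each full response.
    beta scales the implicit reward; it controls how far the
    policy may drift from the reference model.
    """
    # Implicit rewards: log-ratio of policy to reference model
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log sigmoid(margin): small when the policy already prefers
    # the chosen response more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; the gradient then pushes the chosen response's likelihood up and the rejected one's down.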
When should I use DPO?
When you have preference data (pairs of 'better' vs 'worse' responses) and want to align a model. DPO is easier to implement, more stable, and has become the default alignment method for most open-source post-training pipelines.
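If you are assembling preference data for a DPO run, the standard pairwise format is one prompt with a chosen and a rejected completion — these are the column names Hugging Face TRL's DPOTrainer expects (the example texts here are made up):

```python
# One record in the pairwise preference format used by DPO pipelines
# such as TRL's DPOTrainer: prompt, chosen, rejected.
record = {
    "prompt": "Explain KL divergence in one sentence.",
    "chosen": "KL divergence measures how much one probability "
              "distribution diverges from another.",
    "rejected": "KL divergence is a type of neural network.",
}
```

A training set is just a list of such records; no scalar reward labels are needed, only the pairwise ranking.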
What are DPO variants I should know?
IPO adds a regularized objective that avoids DPO's tendency to overfit when preference pairs are perfectly separable. KTO uses a prospect-theory loss and works with unpaired thumbs-up/thumbs-down data. ORPO folds SFT and preference optimization into one stage. SimPO drops the reference model and length-normalizes the log-probabilities. Each trades off stability, data requirements, and compute differently.
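To make the SimPO contrast concrete, here is a minimal per-pair sketch in pure Python: the implicit reward is the length-normalized policy log-probability, no reference model appears, and a target margin gamma is subtracted inside the sigmoid (the beta and gamma values are illustrative; the SimPO paper tunes them per model):

```python
import math

def simpo_loss(chosen_logp, rejected_logp, chosen_len, rejected_len,
               beta=2.0, gamma=0.5):
    """SimPO loss for one pair: reference-free, length-normalized.

    chosen_logp / rejected_logp are summed token log-probs under the
    policy only; chosen_len / rejected_len are token counts.
    """
    # Implicit reward: average per-token log-prob, scaled by beta
    chosen_reward = beta * chosen_logp / chosen_len
    rejected_reward = beta * rejected_logp / rejected_len
    # gamma demands the chosen response win by a fixed margin
    margin = chosen_reward - rejected_reward - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Compared to the DPO loss, this needs only one model in memory, and the length normalization counters the tendency of sequence-level log-prob objectives to favor longer responses.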
Sources
- Rafailov et al. — Direct Preference Optimization — accessed 2026-04-20
- Hugging Face TRL — DPO Trainer — accessed 2026-04-20
- Ethayarajh et al. — KTO — accessed 2026-04-20