
Constitutional AI vs RLHF

RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI (CAI) are the two main approaches to aligning LLMs with human values. RLHF collects human preference comparisons and trains a reward model from them. Constitutional AI instead starts from a written set of principles and asks the model itself to critique and revise its outputs (RLAIF, reinforcement learning from AI feedback). Both are used in production frontier models, and they are complementary rather than mutually exclusive.
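The reward-model step of RLHF is usually trained with a Bradley-Terry preference loss over pairs of completions. A minimal sketch for a single pair of scalar rewards (function name and signature are illustrative, not from any particular library):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss for one human preference comparison:
    -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the reward model scores the human-preferred
    completion higher than the rejected one."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))
```

When the reward model agrees with the annotator (chosen scored higher), the loss is small; when it disagrees, the loss grows, which is what pushes the model toward the annotators' implicit values.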

Side-by-side

Criterion | Constitutional AI | RLHF
Originator | Anthropic (Bai et al., 2022) | OpenAI, DeepMind (Christiano et al., 2017; Ouyang et al., 2022)
Feedback source | AI self-critique against written principles | Human preference comparisons
Cost per label | Low (LLM inference) | High (paid human annotators)
Scalability | Scales with compute | Bottlenecked by human annotation
Transparency of values | High; principles are written and auditable | Low; values are implicit in annotator choices
Typical use | Claude family, some open-weights models | GPT family, Llama, most aligned models
Risk of bias | Reflects principle authors' and base-model biases | Reflects annotator-pool biases
Data efficiency | High; one principle covers many cases | Moderate; needs many comparisons per behavior
Combined with other methods | Often RLHF for some axes plus CAI for others | Often seeded by SFT; combined with DPO or CAI

Verdict

These aren't really competing methods — they're complementary levers that frontier labs mix. RLHF is the foundational technique: nothing beats real human judgment for nuanced preferences, tone, helpfulness. Constitutional AI extends it: once a model is already aligned enough to critique itself, you use AI feedback (cheaper, faster, more scalable) for additional axes — especially harmlessness. Anthropic's Claude models famously use Constitutional AI heavily; OpenAI's GPT family leans more on RLHF with DPO variants. For open-weights teams aligning a model post-hoc, DPO (a simpler cousin of RLHF) is the common starting point; add a CAI-style critique pass for specific harms.

When to choose each

Choose Constitutional AI if…

  • You want a transparent, auditable set of principles driving model behavior.
  • You're at scale where human annotation is a bottleneck.
  • You want to align on dimensions for which human data is hard to source (e.g., harmlessness).
  • You have a capable base model that can critique itself reliably.

Choose RLHF if…

  • You need to align on subtle human-preference dimensions (tone, helpfulness, style).
  • You have budget for human annotators and an annotation pipeline.
  • Your alignment effort is small enough in scale that human judgment can cover the space.
  • Your task has clear right/wrong signals humans reliably produce.

Frequently asked questions

Is DPO a replacement for RLHF?

DPO (Direct Preference Optimization) is a simpler, more stable alternative to PPO-based RLHF that uses the same human preference data directly. As of 2026 many teams use DPO in place of classic PPO RLHF. The 'HF' part (human feedback data) is still the same — DPO just changes the training math.
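The "changed training math" can be made concrete. DPO drops the reward model and PPO rollouts and optimizes a single loss over the same preference pairs, comparing the policy's log-probabilities against a frozen reference model. A per-pair sketch with scalar log-probs (names and the default beta are illustrative):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO objective for one preference pair:
    -log sigmoid(beta * margin), where the margin is the difference of
    the policy-vs-reference log-prob ratios for the chosen and rejected
    completions. No reward model or RL rollout is needed."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The loss falls as the policy, relative to the reference model, assigns more probability mass to the chosen completion than to the rejected one; `beta` controls how hard the policy is pulled away from the reference.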

Can I do Constitutional AI on a small open-weights model?

Yes — the Anthropic paper describes the recipe. Start from an instruction-tuned base, write principles, have the model critique its own outputs and revise, then use those revisions as preference data for DPO. Several open-source replications exist.
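The critique-and-revise step of that recipe can be sketched as a small pipeline. Here `generate` stands in for any text-in/text-out LLM call, and both the function name and the prompt templates are hypothetical, not from the Anthropic paper:

```python
from typing import Callable, Tuple

def make_cai_preference_pair(prompt: str, principle: str,
                             generate: Callable[[str], str]) -> Tuple[str, str]:
    """One critique-and-revise round in the CAI style.
    Returns (rejected, chosen): the original draft and its revision,
    usable directly as a preference pair for DPO training."""
    # 1. Draft an answer with the current model.
    draft = generate(prompt)
    # 2. Ask the model to critique its own draft against the principle.
    critique = generate(
        f"Principle: {principle}\nResponse: {draft}\n"
        "Identify ways the response violates the principle."
    )
    # 3. Ask for a revision that addresses the critique.
    revision = generate(
        f"Principle: {principle}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to comply with the principle."
    )
    return draft, revision
```

Running this over a prompt set with each written principle yields the AI-generated preference data the answer above describes; no human labels are involved after the principles are written.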

Which produces more 'helpful' models?

RLHF has an edge on subtle helpfulness because humans directly signal "this answer is more useful". CAI can be tuned for helpfulness with appropriate principles. In practice, frontier models blend both: helpfulness from RLHF, harmlessness from CAI.

Sources

  1. Anthropic — Constitutional AI paper — accessed 2026-04-20
  2. OpenAI — InstructGPT paper — accessed 2026-04-20