Curiosity · Concept

Constitutional AI (CAI)

Constitutional AI, introduced by Bai et al. (2022), has two phases. In the supervised phase, the model generates a response, critiques it against a list of natural-language principles ('be helpful and harmless,' 'avoid giving dangerous instructions'), revises accordingly, and the revised prompt–response pairs are used for supervised fine-tuning. In the RL phase, an AI preference model compares two candidate responses under those same principles, producing pairwise preference data that trains a reward model for RL, a setup known as RLAIF (RL from AI feedback). The technique scales alignment beyond the number of humans you can pay to label examples, and it is a core part of how Claude is trained.
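The supervised phase can be sketched as a small loop. This is an illustrative skeleton, not Anthropic's implementation: the `model()` function is a hypothetical stand-in for a real LLM call, and the two constitution entries are invented examples.

```python
import random

# Toy constitution: one principle is sampled per critique pass,
# mirroring the sampling described in Bai et al. (2022).
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Identify ways the response could be more honest about uncertainty.",
]

def model(prompt: str) -> str:
    # Hypothetical stand-in for an actual LLM API call.
    return f"<model output for: {prompt[:40]}>"

def critique_and_revise(user_prompt: str) -> dict:
    """One supervised-phase iteration: generate, critique, revise."""
    response = model(user_prompt)
    principle = random.choice(CONSTITUTION)
    critique = model(
        f"Critique the following response.\nPrinciple: {principle}\n"
        f"Response: {response}"
    )
    revision = model(
        f"Rewrite the response to address the critique.\n"
        f"Critique: {critique}\nOriginal response: {response}"
    )
    # The (prompt, revision) pair becomes SFT data; the critique itself
    # is discarded after it has shaped the revision.
    return {"prompt": user_prompt, "completion": revision}

sft_example = critique_and_revise("How do I pick a lock?")
```

In the paper, critique and revision can be iterated several times before the final pair is kept; a single pass is shown here for brevity.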

Quick reference

Proficiency
Advanced
Also known as
CAI, a form of RLAIF (RL from AI feedback)
Prerequisites
RLHF, fine-tuning, instruction-tuning

Frequently asked questions

What is Constitutional AI?

Constitutional AI is Anthropic's alignment technique where the model is trained to critique and revise its own outputs against a written set of principles. The revised responses form SFT data, and pairwise AI-judged preferences train the reward model used in RL — replacing most human harm-labeling with AI feedback.

What is the 'constitution'?

A list of natural-language principles the model uses to judge its own behavior: avoiding deception, avoiding dangerous instructions, being honest about uncertainty. Anthropic has published Claude's constitution, and later versions draw on sources such as the UN's Universal Declaration of Human Rights.

How does CAI differ from RLHF?

RLHF uses humans to compare responses; CAI uses an AI model prompted with the constitution. CAI is cheaper, faster, and easier to scale, but depends on the AI judge actually internalizing the principles — which is why the constitution and the judging prompts matter so much.
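The comparison step can be sketched as follows. This is a hedged illustration: in real RLAIF the judge is itself an LLM prompted with a constitutional principle, whereas here a trivial keyword heuristic (my invention) stands in so the data flow is runnable.

```python
# One constitutional principle guiding the pairwise comparison
# (wording is illustrative, not from the published constitution).
PRINCIPLE = "Choose the response that is less harmful and more honest."

def judge(prompt: str, response_a: str, response_b: str) -> int:
    """Return 0 if A is preferred, 1 if B is preferred.

    Stand-in heuristic: prefer the response that declines an unsafe
    request. A real judge would be an LLM prompted with PRINCIPLE.
    """
    refusal_markers = ("can't help", "not able to", "instead")
    a_safe = any(m in response_a.lower() for m in refusal_markers)
    b_safe = any(m in response_b.lower() for m in refusal_markers)
    if a_safe and not b_safe:
        return 0
    if b_safe and not a_safe:
        return 1
    return 0  # tie-break arbitrarily

def preference_pair(prompt: str, a: str, b: str) -> dict:
    winner = judge(prompt, a, b)
    chosen, rejected = (a, b) if winner == 0 else (b, a)
    # These (chosen, rejected) pairs train the reward model for RL.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = preference_pair(
    "Explain how to hotwire a car.",
    "Sure, first you strip the ignition wires and touch them together.",
    "I can't help with that, but I can explain how ignition systems work instead.",
)
```

The resulting (chosen, rejected) pairs have the same shape as human-labeled RLHF comparisons, which is why the downstream reward-model training is unchanged; only the source of the preference signal differs.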

Is CAI the same as a system prompt with rules?

No. A system prompt with rules is inference-time steering and can be overridden by clever prompts. CAI bakes the principles into the weights through training, producing a model that refuses or revises even when the system prompt is minimal.

Sources

  1. Bai et al. — Constitutional AI: Harmlessness from AI Feedback — accessed 2026-04-20
  2. Anthropic — Claude's Constitution — accessed 2026-04-20