Constitutional AI

Constitutional AI (CAI) is a technique developed by Anthropic that trains models to be helpful, harmless, and honest using AI-generated feedback guided by a written set of principles (the constitution), reducing reliance on human feedback for safety training. Two-stage process: (1) supervised learning on self-revised responses (the model critiques and rewrites its own outputs against constitutional principles, then is fine-tuned on the revisions), (2) reinforcement learning from AI feedback (RLAIF), in which a preference model is trained on AI judgments of which response better follows the principles and is then used as the reward signal. Constitution: an explicit set of principles such as "avoid harmful content," "be helpful," and "don't deceive"; the model can reason about them step by step (chain-of-thought) when critiquing or comparing responses. Self-critique: the model generates a response, critiques it against a principle, then generates a revised response, which creates training data without human annotation of individual outputs. CAI vs. standard RLHF: RLHF requires extensive human preference labels; CAI replaces many of those labels, particularly the harmlessness labels, with AI-generated preferences derived from the principles. Red teaming integration: collect prompts designed to elicit harmful output, generate responses, self-critique the dangerous ones, and train on the safer revisions. Transparency: explicit principles are auditable, so what the model is trained to value can be inspected and adjusted. Scalable oversight: as capabilities increase, human review becomes a bottleneck; CAI automates much of the safety feedback. Limitations: the quality of the feedback is bounded by the model's own understanding of the principles, and principles may conflict in edge cases. Claude: Anthropic's model family, trained using CAI methodology. An influential approach to scalable AI safety training through principle-guided self-improvement.
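
The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the names `CONSTITUTION`, `critique_and_revise`, `ai_preference`, and `toy_model`, as well as the prompt wording, are hypothetical, and the "model" is a canned stand-in so the example runs without any external service. In practice each call would go to a language model, and the collected pairs would feed fine-tuning and preference-model training.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical two-principle constitution (assumption: real ones are longer).
CONSTITUTION: List[str] = [
    "Avoid content that could help someone cause harm.",
    "Do not deceive the user.",
]

# A "model" is any function mapping a text prompt to a text completion.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rounds: int = 1,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Stage 1 sketch: return (prompt, revised response) for supervised training."""
    rng = rng or random.Random(0)
    response = model(prompt)
    for _ in range(rounds):
        # Sample one principle per round and critique the draft against it.
        principle = rng.choice(principles)
        critique = model(
            f"Critique the response using this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        # Ask for a rewrite that addresses the critique.
        response = model(
            f"Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
    return prompt, response


def ai_preference(
    model: Model, prompt: str, a: str, b: str, principle: str
) -> Tuple[str, str]:
    """Stage 2 sketch: return (chosen, rejected) according to the model's verdict."""
    verdict = model(
        f"Compare two responses using this principle: {principle}\n"
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\nAnswer A or B."
    )
    # Anything that does not start with "A" is treated as a vote for B.
    return (a, b) if verdict.strip().upper().startswith("A") else (b, a)


def toy_model(text: str) -> str:
    """Canned stand-in for a language model (assumption, for illustration only)."""
    if text.startswith("Critique"):
        return "The response provides unsafe detail."
    if text.startswith("Rewrite"):
        return "I can't help with that, but here is a safer alternative."
    if text.startswith("Compare"):
        return "B"
    return "Sure, here is how to do it..."


if __name__ == "__main__":
    question = "How do I pick a lock?"
    draft = toy_model(question)
    _, revised = critique_and_revise(toy_model, question, CONSTITUTION)
    chosen, rejected = ai_preference(
        toy_model, question, draft, revised, CONSTITUTION[0]
    )
    print("SL pair:", (question, revised))
    print("Preference pair:", (chosen, rejected))
```

The (prompt, revision) pairs correspond to the supervised stage, and the (chosen, rejected) pairs correspond to the AI-labeled comparisons used to train a preference model for the reinforcement learning stage.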
