Debate is an AI alignment approach in which two AI agents argue opposing sides of a question and a human judge selects the more compelling argument. The key insight is that even if the judge cannot solve the problem directly, they may still be able to tell which argument is more convincing, which could enable scalable oversight of superhuman AI.
Debate Framework
- Two Agents: Agent A and Agent B take opposing positions on a question.
- Arguments: Agents alternately present arguments, evidence, and counterarguments.
- Judge: A human (or simpler AI) evaluates the debate and selects the winner.
- Training: Agents are trained to win debates, which is intended to incentivize them to find and present truthful, compelling arguments.
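The framework above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: the names `run_debate`, `toy_agent_a`, `toy_agent_b`, and `toy_judge` are hypothetical, the agents and judge are plain Python callables standing in for models and a human, and the +1/-1 reward is one assumed way to make the game zero-sum.

```python
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]                    # (speaker, argument)
Agent = Callable[[str, List[Turn]], str]  # (question, transcript so far) -> argument
Judge = Callable[[str, List[Turn]], str]  # (question, full transcript) -> "A" or "B"


def run_debate(question: str, agent_a: Agent, agent_b: Agent, judge: Judge,
               rounds: int = 2) -> Tuple[str, Dict[str, int], List[Turn]]:
    """Alternate arguments for `rounds` rounds, then ask the judge for a winner."""
    transcript: List[Turn] = []
    for _ in range(rounds):
        # Each agent sees the question and everything said so far.
        transcript.append(("A", agent_a(question, transcript)))
        transcript.append(("B", agent_b(question, transcript)))
    winner = judge(question, transcript)
    if winner not in ("A", "B"):
        raise ValueError("judge must return 'A' or 'B'")
    # Assumed zero-sum training signal: winner gets +1, loser gets -1.
    rewards = {name: (1 if name == winner else -1) for name in ("A", "B")}
    return winner, rewards, transcript


# Toy stand-ins (hypothetical): real agents would be trained models.
def toy_agent_a(question: str, transcript: List[Turn]) -> str:
    return "Yes: 91 = 7 * 13, so 91 is composite."


def toy_agent_b(question: str, transcript: List[Turn]) -> str:
    return "No: 91 looks prime to me."


def toy_judge(question: str, transcript: List[Turn]) -> str:
    # Crude placeholder for human judgment: count arguments that contain
    # an equation, treating that as a step the judge could check.
    score = {"A": 0, "B": 0}
    for speaker, argument in transcript:
        if "=" in argument:
            score[speaker] += 1
    return "A" if score["A"] >= score["B"] else "B"


if __name__ == "__main__":
    winner, rewards, transcript = run_debate(
        "Is 91 composite?", toy_agent_a, toy_agent_b, toy_judge)
    print(winner, rewards)
```

In this sketch the judge never factors 91 on its own; it only checks what the debaters put in front of it, which is the division of labor the framework relies on.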
Why It Matters
- Scalable Oversight: The judge doesn't need to know the answer, only to evaluate the arguments. This is what could make oversight of superhuman AI feasible.
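- Truth-Seeking: In a zero-sum debate, the hoped-for optimal strategy is to tell the truth, because the opponent can expose lies. This holds only to the extent that a lie is harder to defend than to refute in front of the judge.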
- Alignment: If debate incentivizes truth-telling, it provides a scalable mechanism for aligning AI with human values.
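The truth-seeking argument can be made concrete with a toy expected-reward calculation. Everything in this sketch is an assumption made for illustration: the function name `expected_liar_reward`, the parameters `p_exposed` (probability the opponent exposes the lie) and `q_convincing` (probability an unexposed lie wins over the judge), the +1/-1 payoffs, and the rule that an exposed lie always loses.

```python
def expected_liar_reward(p_exposed: float, q_convincing: float) -> float:
    """Expected reward for a debater who defends a false answer.

    Toy model (assumed, not taken from the debate literature):
      - with probability p_exposed the opponent exposes the lie and the liar gets -1;
      - otherwise the judge sides with the liar with probability q_convincing
        (+1) and against them with probability 1 - q_convincing (-1).
    """
    for name, value in (("p_exposed", p_exposed), ("q_convincing", q_convincing)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    reward_if_not_exposed = 2.0 * q_convincing - 1.0
    return (1.0 - p_exposed) * reward_if_not_exposed - p_exposed


if __name__ == "__main__":
    # Sweep the exposure probability for a fairly convincing lie.
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, expected_liar_reward(p, q_convincing=0.9))
```

Under these assumptions, lying has negative expected reward once the opponent exposes lies often enough, and positive expected reward when exposure is rare. That is why the incentive toward truth depends on how reliably the opponent and judge can catch a false claim.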
Debate is adversarial truth-finding: it uses competitive argumentation to elicit truthful AI outputs that human judges can evaluate.