LegalBench is a collaborative benchmark of 162 legal reasoning tasks, assembled by legal scholars and NLP researchers to evaluate AI capability across the full spectrum of legal reasoning, from issue spotting and rule application to contract interpretation, statutory analysis, and professional responsibility. It is among the most rigorous tests of AI legal competence available.
What Is LegalBench?
- Origin: Guha et al. (2023), a collaborative effort involving 40+ contributors from law schools and legal organizations.
- Scale: 162 distinct tasks, ~90,000 total examples.
- Coverage: Tasks span six legal reasoning categories and multiple jurisdictions.
- Format: Most tasks are multiple-choice, binary classification, or short-text generation; a minimal evaluation sketch follows this list.
- Domains: Contract law, criminal law, civil procedure, constitutional law, administrative law, professional responsibility, tax law, and international law.
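For concreteness, here is a minimal sketch of running a single task. It assumes the public Hugging Face release at nguha/legalbench, where each task is a dataset config whose small train split holds few-shot demonstrations; the text/answer column names and loading arguments are assumptions that vary by task, so check the dataset card before relying on them.

```python
# Minimal few-shot evaluation loop for one LegalBench task (a sketch).
# Assumptions to verify: the dataset lives at "nguha/legalbench", each
# task is a config, the tiny "train" split holds in-context demos, and
# rows expose "text" and "answer" columns (schemas differ per task).
from datasets import load_dataset

def evaluate_task(task: str, predict) -> float:
    """Score a callable predict(prompt) -> label string on one task."""
    ds = load_dataset("nguha/legalbench", task)
    # Train examples serve as in-context demonstrations, not training data.
    shots = "\n\n".join(
        f"Text: {ex['text']}\nLabel: {ex['answer']}" for ex in ds["train"]
    )
    correct = 0
    for ex in ds["test"]:
        prompt = f"{shots}\n\nText: {ex['text']}\nLabel:"
        if predict(prompt).strip().lower() == ex["answer"].strip().lower():
            correct += 1
    return correct / len(ds["test"])

# Usage with a trivial baseline that always guesses one label:
# print(evaluate_task("abercrombie", lambda prompt: "generic"))
```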
The Six Legal Reasoning Categories
Issue Spotting:
- Identify which legal issues are raised by a given fact pattern.
- "A pedestrian is hit by a distracted driver on a public road. What legal theories are available?" — Negligence, vicarious liability, statutory violation.
Rule Recall:
- Retrieve specific legal rules from memory.
- "Under the UCC, when does title to goods pass from seller to buyer?" — Tests legal knowledge retrieval.
Rule Application (IRAC):
- Explain how a stated rule applies to given facts; the reasoning itself is evaluated.
- Given the hearsay rule + a scenario, work through whether each element of the rule is satisfied.
Interpretation:
- Interpret ambiguous statutory or contractual text.
- "Does 'motor vehicle' in this statute include a motorcycle?" — Requires canons of construction.
Rhetorical Understanding:
- Understand the legal weight and function of arguments.
- "Which argument is most persuasive for the defendant?" — Tests advocacy comprehension.
Rule Conclusion:
- State the outcome of applying a rule to given facts, without showing the intermediate reasoning.
- "Is this out-of-court statement admissible under the hearsay rule?" Only the final answer is scored.
Performance Results
| Model | LegalBench Average |
|-------|------------------|
| GPT-3.5 | 52.8% |
| Claude 2 | 57.3% |
| GPT-4 | 67.0% |
| Legal domain-adapted (LLaMA-2) | 58.4% |
| Human (bar-exam scores, for rough comparison) | ~75-85% |
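The averages above compress very different per-task results into a single number. As a rough illustration of how such aggregates are formed, and how they can hide uneven skill profiles (see the Task Diversity Effect below), here is a sketch that macro-averages per-task accuracies and regroups them by reasoning category; the task names echo LegalBench identifiers, but every score and category assignment is invented for illustration.

```python
# Macro-averaging hypothetical per-task scores, overall and by category.
# All numbers and category assignments below are invented examples.
from collections import defaultdict

scores = {  # task -> (category, accuracy)
    "hearsay": ("rule_application", 0.61),
    "ucc_v_common_law": ("rule_recall", 0.88),
    "abercrombie": ("issue_spotting", 0.74),
    "statutory_interpretation": ("interpretation", 0.58),
}

# Overall score: unweighted mean over tasks (macro-average).
overall = sum(acc for _, acc in scores.values()) / len(scores)
print(f"macro-average over tasks: {overall:.1%}")

# Regrouping by category exposes recall-vs-reasoning profiles.
by_category = defaultdict(list)
for category, acc in scores.values():
    by_category[category].append(acc)

for category, accs in sorted(by_category.items()):
    print(f"{category:>20}: {sum(accs) / len(accs):.1%}")
```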
Key Findings from the LegalBench Paper
- Rule Application Gap: Even GPT-4 performs significantly below human bar-exam level on rule application tasks — knowing legal rules does not automatically enable correct application to novel fact patterns.
- Jurisdiction Sensitivity: Models trained primarily on US legal text perform noticeably worse on UK, EU, or international law tasks within the same benchmark.
- IRAC Structure: Models explicitly prompted to follow Issue-Rule-Application-Conclusion structure outperform those asked to predict the conclusion directly; a prompt sketch follows this list.
- Task Diversity Effect: Averaging across 162 tasks reveals that some models excel at knowledge recall but fail at reasoning tasks — a profile invisible in single-task benchmarks.
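The paper's exact prompts vary by task; the scaffolds below are hypothetical, meant only to illustrate the contrast between IRAC-structured prompting and direct conclusion prediction.

```python
# Hypothetical prompt scaffolds (not the paper's prompts): the IRAC
# version forces the model to articulate issue, rule, and application
# before committing to a conclusion; the direct version asks only for
# the final label.
IRAC_TEMPLATE = """Analyze the scenario step by step.

Rule: {rule}
Scenario: {scenario}

Issue: State the precise legal question raised.
Rule: Restate the governing rule in your own words.
Application: Apply each element of the rule to the facts.
Conclusion: End with exactly one word: Yes or No."""

DIRECT_TEMPLATE = """Rule: {rule}
Scenario: {scenario}
Answer with exactly one word: Yes or No."""

# Example fill-in for a hearsay-style item (hypothetical):
prompt = IRAC_TEMPLATE.format(
    rule="Hearsay is an out-of-court statement offered to prove the "
         "truth of the matter asserted.",
    scenario="A witness repeats what a bystander said at the scene.",
)
print(prompt)
```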
Why LegalBench Matters
- Beyond the Bar Exam: The original "GPT-4 passes the bar exam" headline tested only a narrow slice of legal reasoning. LegalBench's 162 tasks reveal where AI legal competence genuinely fails.
- Legal AI Product Design: Tools like Harvey, CoCounsel, and Lexis+ AI need benchmark-driven understanding of which legal tasks they handle reliably vs. which require human oversight.
- Jurisdiction-Specific Deployment: LegalBench's multi-jurisdiction tasks inform deployment decisions — a model performing well on US contract law may fail on EU consumer protection law.
- Legal Education Tool: LegalBench tasks mirror the IRAC methodology taught in law school, making it a direct measure of how AI performs on the skills law students are trained to demonstrate.
- Accountability Standard: Legal professional responsibility rules require lawyers to supervise AI outputs. LegalBench provides a systematic standard for evaluating what supervision is needed.
LegalBench functions as a bar exam for AI lawyers: 162 carefully designed reasoning tasks that reveal whether AI can genuinely perform legal analysis across the full breadth of legal practice, moving past impressive but narrow headline benchmarks toward a comprehensive assessment of professional competence.