CaseHOLD

CaseHOLD is a legal NLP benchmark that asks models to identify the correct legal holding for a citing context drawn from US case law. It tests whether a model can recognize the precise legal proposition a court asserts as the controlling principle of a decision, a capability that matters for legal research tools, case citation verification, and judicial AI systems.

What Is CaseHOLD?

- Origin: Zheng et al. (2021) from Stanford, built on the Harvard Law School Library's Caselaw Access Project.
- Scale: 53,137 multiple-choice examples from US federal and state case law.
- Format: A citing statement from a case + 5 candidate holdings (one correct, four distractors that are holdings from other cases, selected for their similarity to the correct one) → select the correct holding.
- Source Cases: Published US court opinions from federal circuit courts and state supreme courts spanning 1950-2020.
- Task Difficulty: All 5 answer choices are real holdings from real cases, so every distractor is a legally plausible statement. A distractor is wrong only because it is not the holding the citing text relies on.
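
The format described above maps naturally onto a small record type plus an accuracy function. The following is a minimal sketch in Python; the names `HoldingExample` and `accuracy`, as well as the field names, are illustrative assumptions and not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

NUM_CHOICES = 5  # each example offers five candidate holdings


@dataclass(frozen=True)
class HoldingExample:
    """One multiple-choice item (hypothetical field names)."""
    citing_text: str             # citing statement with the holding masked out
    candidates: Tuple[str, ...]  # the five candidate holdings
    label: int                   # index of the correct holding

    def __post_init__(self) -> None:
        if len(self.candidates) != NUM_CHOICES:
            raise ValueError(f"expected {NUM_CHOICES} candidates")
        if not 0 <= self.label < NUM_CHOICES:
            raise ValueError("label must index one of the candidates")


def accuracy(examples: Sequence[HoldingExample],
             predict: Callable[[HoldingExample], int]) -> float:
    """Fraction of examples where `predict` returns the labelled index."""
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(predict(ex) == ex.label for ex in examples)
    return correct / len(examples)
```

A predictor that always returns the same index scores about 20% when the correct answers are spread evenly across the five positions, which is the random baseline for a five-way choice.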

What Is a Legal "Holding"?

The holding is the specific legal rule or proposition the court announces as the controlling principle of its decision:

Ratio Decidendi (Holding): "A warrantless search of a vehicle is permissible when officers have probable cause to believe the vehicle contains contraband."

Obiter Dicta (Not a Holding): "We note that the defendant appeared cooperative during the stop." — observation without legal force.

CaseHOLD tests whether models can pick out the proposition a cited case actually stands for. The distinction matters because only holdings create binding precedent; dicta are at most persuasive authority.

Example Task

Citing Statement: "In Smith v. Jones, the court applied the holding from Carroll v. United States that [MASK] to uphold the warrantless search of the defendant's vehicle after an officer smelled marijuana."

Candidate Holdings:
- A. "A warrantless search of a vehicle is permissible upon probable cause." ✓
- B. "An officer may conduct a pat-down search of a pedestrian stopped on reasonable suspicion."
- C. "The exclusionary rule applies to evidence obtained through police misconduct."
- D. "A defendant has a reasonable expectation of privacy in sealed containers within a vehicle."
- E. "Good faith reliance on a warrant saves evidence from suppression even if the warrant is defective."
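
To make the task concrete, the example above can be scored with a crude lexical baseline. This is a minimal sketch that counts shared content words between the citing statement and each candidate; it is a simplified stand-in for a retrieval-style baseline, not the method used in any published evaluation. The stopword list and the function names `content_words` and `rank_by_overlap` are assumptions made for illustration.

```python
import re
from typing import List, Sequence, Set, Tuple

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "in", "to", "on", "is", "from",
             "that", "if", "even", "upon", "may", "has", "s", "v"}


def content_words(text: str) -> Set[str]:
    """Lowercase alphabetic tokens with stopwords removed."""
    return {tok for tok in re.findall(r"[a-z]+", text.lower())
            if tok not in STOPWORDS}


def rank_by_overlap(citing_text: str,
                    candidates: Sequence[str]) -> Tuple[int, List[int]]:
    """Return (index of best candidate, per-candidate overlap counts)."""
    context = content_words(citing_text)
    scores = [len(context & content_words(c)) for c in candidates]
    # max() returns the first index among ties.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


citing = ("In Smith v. Jones, the court applied the holding from Carroll v. "
          "United States that [MASK] to uphold the warrantless search of the "
          "defendant's vehicle after an officer smelled marijuana.")
holdings = [
    "A warrantless search of a vehicle is permissible upon probable cause.",
    "An officer may conduct a pat-down search of a pedestrian stopped on "
    "reasonable suspicion.",
    "The exclusionary rule applies to evidence obtained through police "
    "misconduct.",
    "A defendant has a reasonable expectation of privacy in sealed "
    "containers within a vehicle.",
    "Good faith reliance on a warrant saves evidence from suppression even "
    "if the warrant is defective.",
]

best, scores = rank_by_overlap(citing, holdings)
print("ABCDE"[best], scores)  # A [3, 2, 0, 2, 0]
```

Word overlap happens to pick A here, but B and D also share vocabulary with the citing statement ("officer", "search", "defendant", "vehicle"). Distractors like these are why purely lexical methods fall well short of models that read the legal context.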

Performance Results

| Model | CaseHOLD Accuracy |
|-------|-----------------|
| Random baseline | 20.0% |
| TF-IDF retrieval | 46.8% |
| BERT-base | 70.3% |
| Legal-BERT | 75.0% |
| DeBERTa-large | 79.2% |
| GPT-4 (5-shot) | 83.1% |
| Human (law student) | ~87% |
| Human (practicing attorney) | ~92% |

Legal-BERT (pretrained on legal corpora) outperforms BERT-base by about 4.7 points in the table above (75.0% vs. 70.3%), which suggests that domain-specific pretraining helps on holding identification.
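
The arithmetic behind these comparisons is simple enough to check directly. The sketch below copies the model accuracies from the table above (the approximate human figures are left out) and computes the chance level for a five-way choice along with each model's margin over it; the helper name `points_over` is an assumption for illustration.

```python
NUM_CHOICES = 5

# Accuracies (%) copied from the table above.
reported = {
    "TF-IDF retrieval": 46.8,
    "BERT-base": 70.3,
    "Legal-BERT": 75.0,
    "DeBERTa-large": 79.2,
    "GPT-4 (5-shot)": 83.1,
}

# Uniform guessing among five candidates.
chance = 100.0 / NUM_CHOICES


def points_over(baseline: float, score: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(score - baseline, 1)


# Gain attributed to legal-domain pretraining in the table.
domain_gain = points_over(reported["BERT-base"], reported["Legal-BERT"])

# How far each system sits above chance.
margins = {name: points_over(chance, acc) for name, acc in reported.items()}

print(f"chance = {chance:.1f}%")
print(f"Legal-BERT over BERT-base: {domain_gain} points")
for name, margin in margins.items():
    print(f"{name}: +{margin} points over chance")
```

Reporting margins over the 20% chance level, and not only raw accuracy, makes it easier to compare CaseHOLD results with benchmarks that use a different number of answer choices.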

Why CaseHOLD Matters

- Legal Research Automation: Westlaw, LexisNexis, and competing legal research platforms automatically identify related cases by matching propositions of law — CaseHOLD directly evaluates this capability.
- Citator Verification: Legal citators (Shepard's, KeyCite) track whether cited holdings remain good law, and automated holding identification is a prerequisite for citation validation.
- Judicial Drafting Assistance: Courts can use CaseHOLD-capable systems to verify that cited holdings accurately support the propositions for which they are cited.
- Legal Precedent Mining: Identifying all cases asserting the same holding enables systematic mapping of legal doctrine development over time.
- Domain Adaptation Signal: The gap between BERT-base and Legal-BERT on CaseHOLD indicates that domain-adapted models (Legal-BERT, LegalBERT-SC) outperform general-purpose models of comparable size, although much larger general models such as GPT-4 score higher still in the table above.

Connection to Legal NLP Ecosystem

CaseHOLD is one task within the LexGLUE benchmark, but it is also studied independently because it tests holding comprehension, one of the most legally precise forms of legal document understanding.

CaseHOLD is the legal precedent comprehension test — determining whether AI can identify the precise controlling legal proposition from a body of case law, a foundational capability for any AI system that assists with the research, drafting, or review of legal documents that depend on accurate case citation.
