ContractNLI

ContractNLI is a natural language inference benchmark for automating contract review. Models must determine whether a non-disclosure agreement (NDA) entails, contradicts, or does not mention each of a fixed set of hypothesis statements about how confidential information may be used, retained, and shared. It targets the commercial need to audit thousands of contracts at once.

What Is ContractNLI?

- Origin: Koreeda & Manning (2021) from Stanford NLP.
- Scale: 607 NDAs with 17 pre-defined hypothesis types → 10,319 NLI examples.
- Format: (contract text + hypothesis) → label: Entailment / Contradiction / Not Mentioned.
- Document Length: Full NDAs that typically run to thousands of tokens, well beyond the 512-token input limit of standard BERT-style encoders, so long-context handling is required.
- Hypothesis Types: 17 fixed hypotheses about the handling of confidential information, including limited use (use only for the stated purpose?), sharing with employees or third parties, return or destruction of confidential information when the agreement ends, prohibition of reverse engineering, and survival of obligations after termination.
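
As a rough illustration of the format above, one example can be modeled as a contract paired with a hypothesis, a label, and the indices of the evidence spans. The class, field, and function names below are hypothetical and do not mirror the schema of the official release.

```python
from dataclasses import dataclass, field
from enum import Enum


class Label(Enum):
    ENTAILMENT = "Entailment"
    CONTRADICTION = "Contradiction"
    NOT_MENTIONED = "NotMentioned"


@dataclass
class NLIExample:
    # Hypothetical field names, chosen for illustration only.
    contract_spans: list[str]   # contract pre-split into candidate spans
    hypothesis: str
    label: Label
    evidence: list[int] = field(default_factory=list)  # indices into contract_spans


def build_examples(contracts, hypotheses, annotate):
    """Pair every contract with every hypothesis."""
    examples = []
    for spans in contracts:
        for hyp in hypotheses:
            label, evidence = annotate(spans, hyp)
            examples.append(NLIExample(spans, hyp, label, evidence))
    return examples


# Every NDA is paired with all 17 hypotheses, which yields the dataset size.
NUM_CONTRACTS, NUM_HYPOTHESES = 607, 17
print(NUM_CONTRACTS * NUM_HYPOTHESES)  # 10319
```

Storing evidence as span indices rather than raw text keeps the label and its justification tied to the same pre-split document.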

The Three Core Tasks

Document-Level NLI: Does this entire contract entail, contradict, or not address the hypothesis "The Receiving Party may share data with affiliates"?

Span Identification: Which specific sentences in the contract are the evidence for the NLI label? (Multi-span extraction task.)

Hypothesis Classification: Given the evidence spans, assign the entailment label. This step depends on interpreting the legal clause itself rather than on locating it.
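
The way these tasks fit together can be sketched with a toy pipeline: rank spans against the hypothesis, classify only the retrieved spans, then aggregate to a document-level label. This is a minimal sketch, not the method from the original paper. The token-overlap ranking, the stopword list, the aggregation rule, and the `classify_span` callback are all assumptions made for illustration.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "is", "by", "with", "its", "this"}


def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def rank_spans(spans, hypothesis, top_k=2):
    """Toy span identification: rank spans by Jaccard overlap with the hypothesis."""
    hyp = tokenize(hypothesis)
    scored = []
    for i, span in enumerate(spans):
        toks = tokenize(span)
        union = hyp | toks
        score = len(hyp & toks) / len(union) if union else 0.0
        scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored[:top_k] if score > 0]


def document_label(spans, hypothesis, classify_span):
    """Document-level NLI: classify only the retrieved evidence spans."""
    evidence = rank_spans(spans, hypothesis)
    if not evidence:
        return "NotMentioned", []
    votes = [classify_span(spans[i], hypothesis) for i in evidence]
    # In this sketch a contradiction in the evidence overrides entailment.
    for label in ("Contradiction", "Entailment"):
        if label in votes:
            return label, evidence
    return "NotMentioned", evidence


spans = [
    "This Agreement is governed by the laws of Delaware.",
    "Recipient may disclose Confidential Information to its Affiliates.",
]
hypothesis = "The Receiving Party may share data with affiliates"
stub = lambda span, hyp: "Entailment"  # stand-in for a trained span classifier
print(document_label(spans, hypothesis, stub))  # ('Entailment', [1])
```

A real system would replace both the overlap ranking and the stub with learned models, but the control flow stays the same: no evidence means Not Mentioned, otherwise the evidence decides the label.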

Why ContractNLI Is Technically Demanding

- Legal Language Structure: NDA clauses are written in complex passive voice with qualifications, exceptions, and cross-references: "Notwithstanding the foregoing, Recipient may disclose Confidential Information to its Affiliates who have a need to know... provided that such Affiliates are bound by written confidentiality obligations..."
- Implicit Entailment: A clause that prohibits all disclosure entails "data may not be shared with third parties" even though the contract never uses that exact phrase.
- Negation and Exceptions: In clauses such as "Data may be disclosed except when...", models must parse double negation, conditional exceptions, and scope qualifiers.
- Cross-Reference Resolution: "As defined in Section 2.1" requires retrieving the definition from elsewhere in the document.
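- Class Imbalance: Labels are unevenly distributed, and Contradiction is by far the rarest class, so a model can reach respectable accuracy while rarely or never predicting it. Per-class metrics are needed to expose this.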
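
The class-imbalance point can be made concrete by scoring a baseline that always predicts the most frequent label. The label counts below are made up for illustration and are not the dataset's real distribution. The sketch shows only that accuracy can look acceptable while macro-averaged F1 exposes the classes that are never predicted.

```python
from collections import Counter


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for cls in sorted(set(gold)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up label counts for illustration; not the dataset's real distribution.
gold = ["NotMentioned"] * 6 + ["Entailment"] * 3 + ["Contradiction"] * 1
majority = Counter(gold).most_common(1)[0][0]
pred = [majority] * len(gold)

print(accuracy(gold, pred))  # 0.6
print(macro_f1(gold, pred))  # 0.25
```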

Performance Results

| Model | 3-Class Accuracy | Span F1 |
|-------|----------------|---------|
| DeBERTa-large (fine-tuned) | 82.4% | 71.3% |
| Longformer (full document) | 85.1% | 73.8% |
| GPT-4 (zero-shot) | 77.3% | 62.1% |
| GPT-4 (few-shot + CoT) | 84.6% | 68.4% |
| Human expert (lawyer) | ~94% | ~88% |
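
The table does not define how Span F1 is computed. One plausible reading, sketched here as an assumption, is a set-based F1 between predicted and gold evidence span indices for each (contract, hypothesis) pair, averaged over pairs.

```python
def span_f1(gold_spans, pred_spans):
    """F1 between two sets of evidence span indices for one (contract, hypothesis) pair."""
    gold, pred = set(gold_spans), set(pred_spans)
    if not gold and not pred:
        return 1.0  # nothing to find and nothing predicted
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_span_f1(pairs):
    """Average span F1 over (gold_spans, pred_spans) pairs."""
    return sum(span_f1(g, p) for g, p in pairs) / len(pairs)


# One of two gold spans found, plus two spurious predictions:
# precision 1/3, recall 1/2, F1 of about 0.4.
print(span_f1([3, 7], [3, 9, 12]))
```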

Why ContractNLI Matters

- M&A Due Diligence: Acquiring companies review hundreds of target company contracts. Models trained on ContractNLI can flag confidentiality and data-handling issues in NDAs at scale. Clause types outside the 17 hypotheses, such as change-of-control or IP ownership, would need additional training data.
- Procurement Compliance: Enterprise procurement teams must verify that vendor NDAs meet corporate data retention and purpose limitation standards.
- GDPR/CCPA Audit: Flag existing contracts that lack the purpose-limitation and deletion terms that data protection regulations call for.
- Legal Risk Quantification: ContractNLI enables systematic risk scoring — "60% of reviewed contracts contain unrestricted affiliate sharing" — that is impossible with manual review at scale.
- Contract Drafting Assistance: Systems trained on ContractNLI can flag missing standard clauses during draft review.
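
The risk-scoring idea in the list above amounts to aggregating per-contract predictions by hypothesis. A minimal sketch follows, in which the contract ids, the hypothesis name, and the predictions are all hypothetical model outputs.

```python
from collections import defaultdict


def label_rates(predictions, label="Entailment"):
    """Share of contracts per hypothesis whose predicted label equals `label`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for contract_id, hypothesis, predicted in predictions:
        totals[hypothesis] += 1
        hits[hypothesis] += predicted == label
    return {h: hits[h] / totals[h] for h in totals}


# Hypothetical model outputs: (contract id, hypothesis, predicted label).
predictions = [
    ("nda-001", "affiliate sharing allowed", "Entailment"),
    ("nda-002", "affiliate sharing allowed", "Entailment"),
    ("nda-003", "affiliate sharing allowed", "NotMentioned"),
    ("nda-004", "affiliate sharing allowed", "Entailment"),
    ("nda-005", "affiliate sharing allowed", "Contradiction"),
]
print(label_rates(predictions))  # {'affiliate sharing allowed': 0.6}
```

Because the rates inherit the model's errors, they are best read as a screening signal for prioritizing review rather than as a final compliance figure.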

Connection to the Legal NLP Ecosystem

ContractNLI sits alongside other benchmarks and corpora in the legal NLP ecosystem:
- LexGLUE: General legal NLP benchmark across 7 tasks.
- CaseHOLD: Multiple-choice identification of the correct holding for a cited case.
- LegalBench: 162 reasoning tasks across legal domains.
- MultiLegalPile: Multilingual pretraining corpus for domain-adapted legal models.

ContractNLI frames contract compliance auditing as natural language inference: a model checks each contract against each hypothesis and points to the supporting evidence, which can turn weeks of manual first-pass review into hours of automated screening.
