Patent Similarity

Keywords: patent similarity, legal AI

Patent Similarity is the NLP task of computing semantic similarity between patent documents. It supports prior art search, patent clustering, portfolio analysis, and infringement detection by measuring how closely two patents cover the same technological concept, regardless of differences in claim language, inventor vocabulary, or jurisdiction-specific drafting conventions.

What Is Patent Similarity?

- Task Definition: Given two patent documents (or a query and a corpus), compute a similarity score capturing semantic and technical overlap.
- Granularity Levels: Abstract-level similarity (quick screening), claim-level similarity (legal overlap assessment), full-document similarity (comprehensive overlap).
- Applications: Prior art search, duplicate patent detection, patent clustering for landscape analysis, licensable patent identification, citation recommendation.
- Benchmark Datasets: CLEF-IP (patent prior art retrieval), BigPatent (a large US patent corpus built for abstractive summarization rather than similarity), PatentsView similarity tasks, WIPO IPC classification with similarity.

Why Patent Similarity Is Hard

Deliberate Claim Language Variation: Patent attorneys intentionally use different vocabulary for the same concept to achieve claim differentiation or breadth. "A system for processing data" and "an apparatus for information manipulation" may cover identical technology — surface similarity is insufficient.
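To make the point concrete, the two example phrases can be compared with a plain token-overlap measure. This is a minimal sketch, assuming Jaccard similarity over lowercase word sets as the "surface" measure; the helper names `tokens` and `jaccard` are illustrative and not from any particular library.

```python
import re

def tokens(text: str) -> set:
    """Lowercase alphabetic word tokens, as a set (illustrative helper)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets; 0.0 if both are empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

score = jaccard("A system for processing data",
                "An apparatus for information manipulation")
print(round(score, 3))  # 0.111: the only shared token is "for"
```

The two phrases may describe the same technology, yet the only token they share is a function word, so any purely lexical score treats them as nearly unrelated.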

Hierarchical Claim Structure: Claim 1 (broad, independent) may be similar to another patent's Claim 1 at a high level, but the dependent claims narrow the scope differently. True similarity requires analyzing the claim hierarchy.
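A first step toward analyzing the hierarchy is recovering which claim each dependent claim refers back to. The sketch below is a rough illustration under simplifying assumptions: claims are already split and numbered, each dependent claim cites exactly one parent with wording such as "of claim 1" or "according to claim 1", and multiple dependencies ("claims 1 or 2") are ignored. The function names are hypothetical.

```python
import re
from typing import Dict, List, Optional

def parse_claim_tree(claims: Dict[int, str]) -> Dict[int, Optional[int]]:
    """Map each claim number to its parent claim (None if independent)."""
    parents: Dict[int, Optional[int]] = {}
    for num, text in claims.items():
        m = re.search(r"\b(?:of|to|in)\s+claim\s+(\d+)", text, flags=re.IGNORECASE)
        parents[num] = int(m.group(1)) if m else None
    return parents

def scope_chain(num: int, parents: Dict[int, Optional[int]]) -> List[int]:
    """Claims whose limitations all apply to `num`, from broadest to narrowest.
    Assumes the references are well-formed (no cycles, no missing parents)."""
    chain = [num]
    while parents[chain[-1]] is not None:
        chain.append(parents[chain[-1]])
    return chain[::-1]

claims = {
    1: "A device comprising a processor and a memory.",
    2: "The device of claim 1, wherein the memory is non-volatile.",
    3: "The device of claim 2, wherein the memory is flash.",
    4: "A method comprising receiving data.",
}
parents = parse_claim_tree(claims)
print(parents)                  # {1: None, 2: 1, 3: 2, 4: None}
print(scope_chain(3, parents))  # [1, 2, 3]
```

Comparing two patents at the level of a dependent claim then means comparing the combined limitations along each scope chain, not the dependent claim text in isolation.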

Cross-Language Patents: The same invention is often filed in English, German, Japanese, Chinese, and Korean, so comparing across languages requires multilingual embeddings or machine translation.

Technical vs. Legal Similarity: Two patents may use the same technical concept (transformer neural networks) with entirely different claim scope — one covering a specific hardware implementation, another a training algorithm. Technical similarity ≠ legal overlap.

Figures and Formulas: Chemical patents encode the core invention in SMILES strings and structural formulas, and mechanical patents encode it in technical drawings, so full similarity requires multi-modal comparison.

Similarity Computation Approaches

Lexical Overlap (BM25 / TF-IDF):
- Fast baseline; misses synonym variations.
- Still competitive for within-domain prior art retrieval.
- CLEF-IP: BM25 achieves MAP@10 ~0.35.
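The lexical baseline can be sketched in a few lines of pure Python. This is an illustrative Okapi BM25 scorer, not the configuration behind the figures above: the `k1` and `b` defaults, the non-negative IDF variant, the simple regex tokenizer, and the three-document corpus are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25:
    """Minimal Okapi BM25 over a small in-memory corpus (non-empty list of strings)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.tf]
        self.avgdl = sum(self.doc_len) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tf for term in tf)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], self.doc_len[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue  # term absent from this document
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query, top_k=10):
        scored = [(self.score(query, i), i) for i in range(len(self.tf))]
        scored.sort(key=lambda p: (-p[0], p[1]))
        return [(i, s) for s, i in scored[:top_k]]

corpus = [
    "A system for processing data using a neural network accelerator",
    "An apparatus for information manipulation with a neural processor",
    "A method for brewing coffee with a pressurized water chamber",
]
bm25 = BM25(corpus)
ranking = bm25.rank("neural network for processing data")
for doc_index, s in ranking:
    print(doc_index, round(s, 3))
```

Document 0 ranks first because it shares most query terms verbatim. Document 1, which paraphrases the same idea, is credited only for "neural" and "for", which is the synonym blind spot noted above.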

Bi-Encoder Dense Retrieval (PatentBERT, AugPatentBERT):
- Encode patent sections to dense vectors; compute cosine similarity.
- PatentBERT (Lee and Hsiang): BERT fine-tuned for patent classification on more than 3M US patents, using claim text.
- Achieves MAP@10 ~0.44 on CLEF-IP.
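The encode-then-compare pattern can be sketched independently of any particular model. In the example below, `toy_encode` is a deliberately crude placeholder (a hashed bag-of-words vector) standing in for a trained patent language model; it is an assumption made so the sketch runs without downloads, and it captures no semantics. `BiEncoderIndex` and its methods are likewise hypothetical names. The point is the structure: each document is encoded once, independently of the query, and search reduces to cosine similarity against stored vectors.

```python
import math
import re
import zlib

def toy_encode(text, dim=64):
    """Placeholder encoder: hashed bag-of-words, L2-normalised.
    A real system would call a trained model here instead."""
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class BiEncoderIndex:
    """Stores one vector per document; queries are encoded separately."""

    def __init__(self, encode):
        self.encode = encode
        self.ids, self.vectors = [], []

    def add(self, doc_id, text):
        self.ids.append(doc_id)
        self.vectors.append(self.encode(text))  # computed once, offline

    def search(self, query, top_k=5):
        q = self.encode(query)
        scored = [(d, cosine(q, v)) for d, v in zip(self.ids, self.vectors)]
        return sorted(scored, key=lambda p: -p[1])[:top_k]

index = BiEncoderIndex(toy_encode)
index.add("US-A", "neural network accelerator for image recognition")
index.add("US-B", "pressurized water chamber for brewing coffee")
hits = index.search("neural network accelerator for image recognition")
print(hits[0][0])  # US-A
```

Because document vectors are precomputed, query cost is one encoder call plus vector comparisons, which is what makes bi-encoders practical for a first pass over a large corpus.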

Cross-Encoder Reranking:
- Take top-100 BM25 candidates; rerank with cross-encoder (full-interaction model).
- Most accurate but computationally expensive — suitable for final-stage legal review.
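The two-stage pattern described above can be sketched as follows. Both scorers are stand-ins chosen so the example is self-contained: `first_stage_score` counts shared terms in place of BM25, and `toy_pair_score` checks query bigrams in place of a trained cross-encoder. Neither is a real model, and all names and the three-document corpus are assumptions for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def first_stage_score(query: str, doc: str) -> float:
    """Cheap lexical score standing in for BM25: number of shared distinct terms."""
    return float(len(set(tokenize(query)) & set(tokenize(doc))))

def toy_pair_score(query: str, doc: str) -> float:
    """Placeholder for a cross-encoder: looks at query and document together
    and returns the fraction of query bigrams that also occur in the document."""
    q, d = tokenize(query), tokenize(doc)
    q_bi, d_bi = set(zip(q, q[1:])), set(zip(d, d[1:]))
    return len(q_bi & d_bi) / len(q_bi) if q_bi else 0.0

def retrieve_then_rerank(
    query: str,
    corpus: Dict[str, str],
    pair_score: Callable[[str, str], float],
    n_candidates: int = 100,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap scoring over the whole corpus, keep the best candidates.
    candidates = sorted(
        corpus, key=lambda pid: -first_stage_score(query, corpus[pid])
    )[:n_candidates]
    # Stage 2: expensive pairwise scoring, applied only to the candidates.
    reranked = sorted(
        ((pid, pair_score(query, corpus[pid])) for pid in candidates),
        key=lambda p: -p[1],
    )
    return reranked[:top_k]

corpus = {
    "P1": "network of neural sensors for a recognition image display",
    "P2": "neural network for image recognition on embedded hardware",
    "P3": "pressurized water chamber for brewing coffee",
}
results = retrieve_then_rerank(
    "neural network for image recognition", corpus, toy_pair_score, n_candidates=2
)
print(results)  # [('P2', 1.0), ('P1', 0.0)]
```

P1 and P2 tie in the first stage because they share the same five query terms, and only the pairwise scorer separates them. The expensive scorer runs on `n_candidates` documents rather than the whole corpus, which is the cost trade-off that motivates the design.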

Claim Decomposition + Matching:
- Parse claims into functional sub-elements.
- Match sub-elements between patents individually.
- More interpretable for freedom-to-operate (FTO) analysis, for example "4 of 7 claim elements overlap."
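The decompose-and-match steps can be illustrated with a rough sketch. It assumes claim elements are separated by semicolons after a "comprising" preamble, and it matches elements by Jaccard overlap of content words with an arbitrary 0.5 threshold. Both are simplifications: real claim parsing is considerably harder, and a production matcher would likely use a semantic scorer rather than word overlap. The stopword list and function names are hypothetical.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "for", "to", "and", "in", "on", "with",
             "said", "wherein", "configured"}

def split_elements(claim):
    """Split a claim into elements on ';' after the 'comprising' preamble."""
    body = re.split(r"\bcomprising\b:?", claim, maxsplit=1, flags=re.IGNORECASE)[-1]
    parts = [p.strip(" .,\n") for p in body.split(";")]
    parts = [re.sub(r"^and\s+", "", p) for p in parts]
    return [p for p in parts if p]

def content_tokens(text):
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def element_overlap(claim_a, claim_b, threshold=0.5):
    """For each element of claim_a, find its best match in claim_b.
    Returns (matches, number_of_elements_in_claim_a)."""
    elems_a, elems_b = split_elements(claim_a), split_elements(claim_b)
    matches = []
    for ea in elems_a:
        ta = content_tokens(ea)
        best, best_score = None, 0.0
        for eb in elems_b:
            union = ta | content_tokens(eb)
            s = len(ta & content_tokens(eb)) / len(union) if union else 0.0
            if s > best_score:
                best, best_score = eb, s
        if best is not None and best_score >= threshold:
            matches.append((ea, best, best_score))
    return matches, len(elems_a)

claim_a = ("A device comprising: a processor; a memory storing instructions; "
           "and a wireless transceiver coupled to the processor.")
claim_b = ("An apparatus comprising: a processor; a display panel; "
           "and a memory storing program instructions.")

matches, total = element_overlap(claim_a, claim_b)
print(f"{len(matches)} of {total} claim elements overlap")  # 2 of 3 claim elements overlap
for ea, eb, s in matches:
    print(f"  {ea!r} ~ {eb!r} ({s:.2f})")
```

The output names which elements matched and which did not (here the wireless transceiver has no counterpart), which is what makes this approach easier to review than a single document-level score.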

Performance Results (CLEF-IP Prior Art Retrieval)

| System | MAP@10 | Recall@100 |
|--------|--------|-----------|
| TF-IDF baseline | 0.31 | 0.54 |
| BM25 | 0.35 | 0.61 |
| PatentBERT bi-encoder | 0.44 | 0.71 |
| Cross-encoder reranking | 0.52 | 0.74 |
| GPT-4 reranker (top-10) | 0.55 | — |
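For readers unfamiliar with the two column metrics, the sketch below shows one common way to compute them from ranked result lists and relevance judgments. Definitions of AP@k vary between evaluations; this version normalises by min(k, number of relevant documents), which is an assumption and not necessarily the convention behind the table. The query IDs and document IDs are made up for the example.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """AP@k for one query. `ranked` is a list of doc IDs without duplicates,
    `relevant` is a set of relevant doc IDs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(k, len(relevant))

def recall_at_k(ranked, relevant, k=100):
    """Fraction of relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mean_over_queries(metric, runs, qrels, k):
    """Average a per-query metric over all judged queries."""
    return sum(metric(runs.get(q, []), qrels[q], k) for q in qrels) / len(qrels)

# Hypothetical system output and relevance judgments for two queries.
runs = {"q1": ["d1", "d7", "d3", "d9"], "q2": ["d5", "d2", "d8"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d2", "d4"}}

map10 = mean_over_queries(average_precision_at_k, runs, qrels, 10)
rec100 = mean_over_queries(recall_at_k, runs, qrels, 100)
print(round(map10, 4), rec100)  # 0.5417 0.75
```

MAP@10 rewards placing relevant documents near the top of the list, while Recall@100 only asks whether they appear anywhere in the first hundred, which is why a reranker can raise the first metric much more than the second.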

Commercial Patent Similarity Tools

- Derwent Innovation (Clarivate): AI-powered patent similarity with citation-network features.
- Innography (Clarivate): Semantic patent search with cluster visualization.
- PatSnap: Patent similarity plus automated landscape reporting.
- Ambercite: Citation-network-based patent similarity (network centrality as relevance proxy).

Why Patent Similarity Matters

- USPTO Examination: Examiners use automated similarity tools to identify prior art during examination; AI-assisted search aims to reduce examination time while improving prior art recall.
- Patent Invalidation: Defendants in IPR (Inter Partes Review) proceedings must find the most similar prior art under tight deadlines — semantic similarity search is essential.
- Portfolio De-Duplication: Large patent portfolios (IBM: 9,000+/year; Samsung: 8,000+/year) contain overlapping coverage that drives unnecessary maintenance fees — similarity-based clustering identifies rationalization opportunities.
- Licensing Efficiency: Technology licensors can identify potential licensees whose products may fall within patent scope by similarity-screening product descriptions against patent claims.

Patent Similarity is the semantic prior art compass — enabling precise navigation of the 110-million patent corpus to identify the documents that define, overlap, or anticipate any given patented invention, grounding every IP strategy decision in comprehensive knowledge of the existing intellectual property landscape.
