Paraphrase Detection

Keywords: paraphrase detection, nlp

Paraphrase Detection is the NLP task of determining whether two sentences or passages convey the same meaning despite using different words or syntactic structures. It tests a model's ability to abstract away from surface form and recognize semantic equivalence, and it serves both as a pre-training objective and as a benchmark for evaluating sentence-level semantic understanding.

Task Definition

Given two text spans A and B, the model outputs a binary classification:
- Paraphrase (1): "Apple acquired Beats Electronics." / "Beats was purchased by Apple." → Equivalent.
- Non-Paraphrase (0): "Apple acquired Beats Electronics." / "Apple released new AirPods." → Not equivalent.

The challenge lies in the continuum between clear paraphrase and clear non-paraphrase: near-paraphrases, entailments, and closely related statements occupy a gray zone that requires nuanced semantic judgment.
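A naive lexical-overlap baseline makes the difficulty concrete. The sketch below (plain Python, not drawn from any benchmark implementation) scores pairs by Jaccard word overlap; a true paraphrase can score low while a meaning-reversed pair scores perfectly:

```python
import re

def jaccard_overlap(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    tok_a = set(re.findall(r"[a-z]+", a.lower()))
    tok_b = set(re.findall(r"[a-z]+", b.lower()))
    return len(tok_a & tok_b) / len(tok_a | tok_b)

# True paraphrase, yet only 2 of 7 unique words are shared (~0.29):
print(jaccard_overlap("Apple acquired Beats Electronics.",
                      "Beats was purchased by Apple."))
# Meaning-reversed pair, yet identical word sets (1.0):
print(jaccard_overlap("Flights from New York to London",
                      "Flights from London to New York"))
```

Any fixed threshold on such a score misclassifies one of these two pairs, which is exactly the gap that learned semantic models are meant to close.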

Distinction from Related Tasks

Paraphrase detection is closely related to but distinct from:
- Textual Entailment (NLI): Entailment is asymmetric: "A entails B" does not imply "B entails A". "The dog bit the man" entails "a man was bitten", but the reverse is not guaranteed. Paraphrase is symmetric: both sentences must convey equivalent meaning.
- Semantic Textual Similarity (STS): STS produces a continuous score (0–5). Paraphrase detection is the binary version — converting a continuous similarity into a yes/no decision at a threshold.
- Duplicate Question Detection: An applied variant where the goal is to identify whether two forum questions are asking the same thing, crucial for Quora, Stack Overflow, and customer support systems.

Major Benchmark Datasets

MRPC (Microsoft Research Paraphrase Corpus): 5,801 sentence pairs from news articles, human-annotated for paraphrase equivalence. Used in GLUE as a standard evaluation benchmark. Baseline accuracy for the majority class is ~67%, making it a discriminating but tractable task.

QQP (Quora Question Pairs): Over 400,000 question pairs from Quora, labeled by human annotators for whether they ask the same question. Much larger than MRPC and drawn from a different domain (questions vs. news sentences). Used extensively in GLUE. Challenging because question phrasing varies enormously while underlying intent may be identical.

PAWS (Paraphrase Adversaries from Word Scrambling): Designed to fool models that rely on word overlap. Pairs are constructed by word swapping and back-translation, producing pairs with high lexical overlap, some of which are paraphrases and some of which are not. Tests genuine semantic understanding rather than surface matching.

Why Paraphrase Detection Matters

Semantic Deduplication: Search engines and knowledge bases must recognize that "climate change" and "global warming" queries seek the same information. Customer support systems must cluster "my order hasn't arrived" and "I haven't received my package" as the same complaint type.

Data Augmentation: Paraphrase pairs provide supervision for training robust models. Replacing training examples with their paraphrases teaches models that surface form is irrelevant to meaning — an explicit robustness signal.

Adversarial Robustness: Models that understand paraphrases resist synonym-substitution attacks: adversarially replacing "terrible" with "dreadful" should not change a sentiment classifier's output. Training with paraphrase pairs directly enforces this invariance.

Machine Translation Evaluation: BLEU score measures n-gram overlap, penalizing valid paraphrase translations. Paraphrase-aware metrics (METEOR, BERTScore) provide fairer evaluation by recognizing that different words can correctly translate the same source content.

Pre-training and Fine-tuning Applications

Paraphrase as Pre-training: SimCSE uses paraphrase pairs as positive examples for contrastive pre-training of sentence encoders — pulling paraphrase representations together and pushing non-paraphrase representations apart. This directly trains the sentence embedding space to represent semantic equivalence.

SBERT (Sentence-BERT): Fine-tunes BERT on NLI and STS data using siamese and triplet networks to produce sentence embeddings where cosine similarity correlates with semantic equivalence. Evaluated directly on paraphrase identification tasks.

T5 and Generation: Paraphrase generation — producing a paraphrase of an input sentence — is trained as a sequence-to-sequence task and used for data augmentation.
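In the text-to-text framing, paraphrase generation reduces to (source, target) string pairs. The sketch below shows that framing; the "paraphrase: " task prefix follows the T5 convention of prefixing inputs with a task description, but the exact prefix string is an illustrative assumption, not a fixed standard:

```python
def to_text2text(pairs):
    """Format paraphrase pairs as (source, target) strings for
    sequence-to-sequence training, T5-style.  The 'paraphrase: '
    task prefix is illustrative; any consistent prefix works."""
    return [("paraphrase: " + src, tgt) for src, tgt in pairs]

examples = to_text2text([
    ("Apple acquired Beats Electronics.", "Beats was purchased by Apple."),
])
print(examples[0][0])  # paraphrase: Apple acquired Beats Electronics.
```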

Model Approaches

Cross-Encoder (for Accuracy): Concatenate sentences A and B with a [SEP] token and feed the pair to BERT. The [CLS] representation sees both sentences simultaneously, enabling full cross-attention between them. Highest accuracy, but matching among n candidates requires O(n²) pairwise forward passes.
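A minimal sketch of the BERT-style pair packing (simplified: whitespace tokens and string markers stand in for a real subword tokenizer). The segment ids tell the model which sentence each token came from, while attention in every layer spans both segments:

```python
def pack_pair(sent_a: str, sent_b: str):
    """Pack two sentences into one cross-encoder input:
    [CLS] A [SEP] B [SEP], plus segment ids (0 = A side, 1 = B side).
    Whitespace splitting is a stand-in for a real WordPiece tokenizer."""
    tok_a, tok_b = sent_a.split(), sent_b.split()
    tokens = ["[CLS]"] + tok_a + ["[SEP]"] + tok_b + ["[SEP]"]
    segments = [0] * (len(tok_a) + 2) + [1] * (len(tok_b) + 1)
    return tokens, segments

tokens, segments = pack_pair("Apple acquired Beats", "Beats was purchased by Apple")
print(tokens)    # ['[CLS]', 'Apple', ..., '[SEP]', 'Beats', ..., '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```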

Bi-Encoder (for Scale): Encode sentences A and B independently into vectors and compute cosine similarity. O(n) scaling enables efficient retrieval over millions of candidates. Lower accuracy than cross-encoder but essential for large-scale duplicate detection.
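The retrieval pattern looks like the following, with random unit vectors standing in for embeddings from a real encoder such as SBERT (the query is constructed as a noisy copy of one corpus item, an illustrative assumption so that retrieval has a known answer):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in corpus embeddings; a real system would call encoder.encode(texts).
corpus = normalize(rng.normal(size=(1000, 384)))            # 384-dim, SBERT-like
query = normalize(corpus[42] + 0.01 * rng.normal(size=384))  # near-duplicate of item 42

# Bi-encoder scoring: one matrix-vector product over all candidates (O(n)),
# instead of n cross-encoder forward passes.
scores = corpus @ query
best = int(np.argmax(scores))
print(best, scores[best])  # item 42 wins with near-1.0 cosine similarity
```

In production the candidate embeddings are precomputed once, so each new query costs a single encoder pass plus a (possibly approximate) nearest-neighbor search.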

Contrastive Learning (SimCSE): Train using in-batch negatives — all other sentence pairs in the mini-batch serve as negative examples. Achieves strong performance without explicit paraphrase labels by using dropout as a data augmentation.
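The in-batch objective can be sketched in a few lines of NumPy. In real SimCSE the two views z1 and z2 come from two dropout-noised encoder passes over the same batch of sentences; random vectors stand in for them here:

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(z1, z2, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss: for each anchor z1[i], the
    positive is z2[i]; every other z2[j] in the batch is a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature  # (batch, batch) scaled cosine similarities
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # -log p(positive | anchor), averaged

# Stand-ins for two dropout-noised encodings of the same 8 sentences.
batch = rng.normal(size=(8, 64))
z1 = batch + 0.1 * rng.normal(size=batch.shape)
z2 = batch + 0.1 * rng.normal(size=batch.shape)
print(info_nce(z1, z2))  # small: matched views dwarf in-batch negatives
```

Shuffling z2 so the diagonal no longer pairs matching views drives the loss up sharply, which is exactly the gradient signal that pulls paraphrase representations together.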

PAWS and the Lexical Overlap Trap

PAWS revealed a fundamental weakness in pre-BERT models: they relied heavily on word overlap to identify paraphrases. "Flights from New York to London" was incorrectly classified as a paraphrase of "Flights from London to New York" by overlap-based models, which missed the reversal of meaning. BERT-era models perform substantially better on PAWS because attention mechanisms enable genuine semantic comparison rather than bag-of-words overlap.

Paraphrase Detection is recognizing the same thought in different words — the fundamental test of whether a model understands meaning rather than memorizes surface form, and the benchmark that distinguishes genuine semantic understanding from lexical pattern matching.
