Semantic Similarity Prediction is the NLP task of assigning a continuous score indicating how semantically similar two text segments are, ranging from 0 (completely unrelated) to 5 (equivalent in meaning). It is evaluated using the Semantic Textual Similarity (STS) benchmark family and serves as the primary evaluation of sentence embedding quality for retrieval, clustering, and search applications.
Task Definition
Given two text segments A and B, the model outputs a real-valued similarity score:
- Score 5.0: "A man is eating food." / "A man is eating a piece of food." (equivalent in meaning; only trivial wording differs)
- Score 3.5: "A man is eating pasta." / "A man is eating Chinese food." (related but not equivalent)
- Score 1.0: "A man is eating pasta." / "A woman is drinking coffee at a cafe." (same broad topic but not equivalent)
- Score 0.0: "A man is eating pasta." / "The cat sits on a cold roof." (no semantic overlap; completely unrelated)
The task requires the model to represent meaning as a point in geometric space where distance reflects semantic closeness, the foundation of embedding-based retrieval.
The STS Benchmark Ecosystem
STS-B (Semantic Textual Similarity Benchmark): A collection of ~8,600 sentence pairs from news headlines, image captions, and forum posts, human-annotated on a 0–5 scale by multiple annotators. Included in the GLUE benchmark as a standard evaluation. Performance is measured by Pearson and Spearman correlation between predicted and human scores.
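As a concrete illustration, both correlations can be computed with SciPy; the score lists below are made-up placeholders, not real benchmark data.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical gold (human-annotated) and model-predicted STS scores.
gold = [4.2, 3.5, 1.0, 0.0, 2.8]
pred = [4.0, 3.1, 0.7, 0.3, 3.0]

print("Pearson: ", pearsonr(pred, gold)[0])   # linear correlation
print("Spearman:", spearmanr(pred, gold)[0])  # rank (monotonic) correlation
```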
STS12–STS16: Annual SemEval competitions (2012–2016) providing domain-diverse STS test sets including news headlines, student answers, plagiarism detection, and Twitter posts. Evaluating across all domains tests model robustness to domain shift.
SICK (Sentences Involving Compositional Knowledge): 10,000 sentence pairs with both similarity scores and entailment labels, specifically constructed to require compositional understanding of negation, quantification, and argument structure.
Technical Implementation
Sentence Embedding Approach:
- Encode sentence A into vector u and sentence B into vector v using a shared encoder.
- Compute cosine similarity: sim(u, v) = (u · v) / (||u|| × ||v||).
- Scale to the [0, 5] range: score = 5 × (1 + sim(u, v)) / 2.
- Optimize by minimizing mean squared error between the predicted score and the human-labeled score. A minimal sketch of this pipeline follows.
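The sketch below covers the scoring and loss steps in NumPy; the function names are illustrative, and an external encoder is assumed to have already produced the vectors u and v.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # sim(u, v) = (u · v) / (||u|| × ||v||), in [-1, 1]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def to_sts_scale(cos_sim: float) -> float:
    # Map [-1, 1] to the 0-5 STS scale: 5 × (1 + cos) / 2
    return 5.0 * (1.0 + cos_sim) / 2.0

def mse_loss(predicted: np.ndarray, gold: np.ndarray) -> float:
    # Training objective: mean squared error against human-labeled scores
    return float(np.mean((predicted - gold) ** 2))
```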
SBERT (Sentence-BERT, 2019): The foundational architecture for STS. BERT used naively for sentence similarity requires passing every candidate sentence pair jointly through the model, which is computationally prohibitive for retrieval over large corpora: finding the most similar pair among 10,000 sentences requires about 50 million BERT inference passes. SBERT instead uses a siamese network architecture, two weight-sharing BERT encoders producing sentence embeddings that can be pre-computed independently and compared with cosine similarity. This reduced the pairwise comparison time for 10,000 sentences from roughly 65 hours to about 5 seconds.
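A usage sketch with the sentence-transformers library; the checkpoint name is one common choice, not the only option.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is eating food.", "A man is eating a piece of food."]
# One forward pass per sentence; embeddings can be cached and reused.
embeddings = model.encode(sentences)

cos = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine: {cos:.3f}, STS-scaled: {5 * (1 + cos) / 2:.2f}")
```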
SimCSE (Simple Contrastive Learning of Sentence Embeddings): Trains sentence encoders with a contrastive loss whose positive pairs come from passing the same sentence through the encoder twice with different dropout masks. The two slightly different representations of a sentence are pulled together, while all other sentences in the mini-batch serve as negatives. At publication it achieved state-of-the-art STS performance without any explicit similarity labels.
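A PyTorch sketch of the unsupervised SimCSE objective; `encoder` is an assumption standing in for any dropout-bearing sentence encoder in training mode.

```python
import torch
import torch.nn.functional as F

def simcse_loss(encoder, sentences, temperature: float = 0.05):
    # Two forward passes over the same batch: dropout yields two different
    # "views" of each sentence, which form the positive pairs.
    z1 = encoder(sentences)  # (batch, dim)
    z2 = encoder(sentences)  # (batch, dim), different dropout mask
    # Pairwise cosine similarities; diagonal entries are the positives,
    # every other sentence in the batch serves as an in-batch negative.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1)
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim / temperature, labels)
```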
The Isotropy Problem and Degenerate Embeddings
BERT sentence embeddings (obtained by averaging token representations or using the [CLS] token) perform poorly on STS tasks despite BERT's strong performance on classification. The reason: BERT's embedding space is anisotropic; representations cluster in a narrow cone, so even unrelated sentences receive high cosine similarity. When nearly every pair of sentence embeddings looks similar, the discriminative signal needed for STS is destroyed.
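A quick diagnostic for this is the mean pairwise cosine similarity of a sample of embeddings: in an isotropic space it should sit near 0 for random sentence pairs, while anisotropic spaces push it much higher. A minimal sketch:

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    # Normalize rows, then average all off-diagonal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = sims.shape[0]
    return float((sims.sum() - n) / (n * (n - 1)))  # drop self-similarity (= 1)
```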
Solutions:
- Whitening: Post-hoc transformation that decorrelates dimensions and normalizes variance, spreading embeddings uniformly across the space (see the sketch after this list).
- Contrastive Fine-tuning (SimCSE): Explicitly pushes unrelated sentence representations apart during training, recovering isotropy.
- Prompt-based Methods (PromCSE): Use task-specific soft prompts to condition the encoder toward producing isotropic, STS-calibrated representations.
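A sketch of the whitening transform in the spirit of BERT-whitening (Su et al., 2021): subtract the mean, then rotate and rescale using the SVD of the covariance so every dimension has unit variance.

```python
import numpy as np

def whiten(embeddings: np.ndarray) -> np.ndarray:
    # Center the embeddings.
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    # Covariance = U diag(s) U^T; the whitening map is W = U diag(s^-1/2).
    cov = np.cov(centered.T)
    u, s, _ = np.linalg.svd(cov)
    W = u @ np.diag(1.0 / np.sqrt(s))
    return centered @ W  # decorrelated dimensions, unit variance
```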
Applications of Semantic Similarity
Dense Retrieval: The foundation of semantic search. Query and document embeddings are pre-computed; at inference, nearest-neighbor search (FAISS, ScaNN, Annoy) retrieves the most semantically similar documents in milliseconds regardless of vocabulary overlap. Powers Google's MUM, Bing's semantic search, and enterprise document retrieval.
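A minimal FAISS sketch of this pattern; the random vectors are placeholders standing in for real encoder outputs.

```python
import faiss
import numpy as np

d = 384  # embedding dimension (e.g., a MiniLM-sized encoder)
doc_embeddings = np.random.randn(10_000, d).astype("float32")
faiss.normalize_L2(doc_embeddings)  # unit norm: inner product == cosine

index = faiss.IndexFlatIP(d)  # exact inner-product search
index.add(doc_embeddings)     # index the pre-computed document vectors

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar documents
```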
Duplicate Detection: Identify duplicate bug reports in issue trackers, duplicate questions in QA forums, and duplicate support tickets in customer service systems. Clustering by semantic similarity groups equivalent issues without requiring identical wording.
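With sentence-transformers, duplicate candidates can be mined directly over a collection; a sketch with made-up example tickets and an assumed similarity threshold:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
tickets = [
    "App crashes when I upload a photo",
    "Uploading an image makes the application crash",
    "How do I reset my password?",
]
# paraphrase_mining returns (score, i, j) triples, highest similarity first.
for score, i, j in util.paraphrase_mining(model, tickets):
    if score > 0.8:  # assumed duplicate threshold
        print(f"{score:.2f}: {tickets[i]!r} <-> {tickets[j]!r}")
```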
Recommendation Systems: Content-based recommendation computes similarity between item descriptions and user preference embeddings, surfacing semantically related content regardless of keyword overlap.
Cross-Lingual Retrieval: Multilingual sentence encoders (mSBERT, LaBSE) produce similarity-calibrated embeddings across 100+ languages. An English query retrieves relevant French or Chinese documents by comparing embeddings in a shared semantic space.
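A cross-lingual sketch using the LaBSE checkpoint published on sentence-transformers; the query and documents are made-up examples.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
query = model.encode("Where can I find cheap flights?")
docs = model.encode([
    "Où puis-je trouver des vols pas chers ?",  # French paraphrase
    "今天股市收盘上涨。",                         # Chinese, unrelated topic
])
# The French paraphrase scores far higher despite zero vocabulary overlap.
print(util.cos_sim(query, docs))
```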
Quality Benchmarking for Embeddings
STS correlation is the standard metric for evaluating sentence embedding quality. When selecting or training embedding models, the STS benchmark family provides:
- Domain diversity (news, captions, forum, student answers).
- Compositional challenge (SICK).
- Robustness measurement across domains (STS12–16).
- A continuous scale that reveals fine-grained distinctions between model capability levels.
Semantic Similarity Prediction is quantifying meaning distance in geometric space: the foundational capability that enables all embedding-based search, retrieval, and clustering applications where relevant content must be found regardless of surface vocabulary differences.