Rerankers and cross-encoders are the second-stage retrieval components that score candidate documents with high accuracy by jointly processing query-document pairs through a transformer model. They dramatically improve search precision over first-stage retrieval at the cost of higher latency, enabling the accuracy-speed trade-off central to production RAG and search systems.
What Is a Reranker?
- Definition: A model that takes a (query, document) pair as a single input and outputs a relevance score — enabling fine-grained relevance assessment that captures query-document interactions invisible to separate bi-encoder embeddings.
- Two-Stage Pipeline: Fast first-stage retrieval (BM25 or dense retrieval) generates N candidates (typically 100–1,000); a slower but more accurate reranker scores those candidates to select the final top-K (typically 3–10).
- Architecture: A cross-encoder concatenates the query and document with a [SEP] token and feeds them through a BERT-style transformer; a classification head on the [CLS] output produces the relevance score (see the sketch after this list).
- Improvement: A typical reranker adds a 5–20% NDCG@10 improvement over bi-encoder retrieval alone on the BEIR benchmark.
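A minimal sketch of that forward pass using the Hugging Face transformers API and the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint; the example query and document strings are illustrative:

````
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public MS MARCO cross-encoder with a single-logit relevance head
name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

query = "What is semiconductor yield?"
document = "Yield is the percentage of working chips produced per wafer."

# Passing a text pair builds the input: [CLS] query [SEP] document [SEP]
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)

with torch.no_grad():
    # The classification head on the [CLS] representation emits one score
    score = model(**inputs).logits.squeeze().item()

print(f"relevance score: {score:.3f}")
````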
Why Rerankers Matter
- Precision at Rank 1: For RAG systems, only the top 3–5 passages are fed to the LLM — even small improvements in precision at top ranks dramatically reduce hallucinations.
- Semantic Accuracy: Cross-encoders see both query and document together, allowing attention to flow between them — capturing negation, specificity, and contextual matching invisible to separate encoders.
- Query-Specific Ranking: Separate bi-encoders cannot model "how relevant is this specific document to this specific query" — cross-encoders can.
- Flexible Integration: Works with any first-stage retrieval (keyword, dense, or hybrid) as a modular plug-in component.
- Cost-Effective: Reranking only the top-N candidates (not the full corpus) keeps latency acceptable — typically adding 50–200ms for 100 candidates.
Bi-Encoder vs. Cross-Encoder Trade-offs
Bi-Encoder (First Stage):
- Encodes query and documents separately into vectors.
- Documents pre-computed offline; query encoded at runtime.
- Retrieves via fast ANN search — millions of documents in milliseconds.
- Cannot model query-document token interactions; less accurate for subtle relevance distinctions.
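A minimal bi-encoder sketch with sentence-transformers, assuming the public all-MiniLM-L6-v2 checkpoint and an illustrative two-document corpus; in production the document vectors would be computed once offline and stored in a vector index:

````
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Offline: embed the corpus once and store the vectors (e.g., in FAISS)
docs = [
    "Yield is the percentage of working chips produced per wafer.",
    "BM25 is a bag-of-words ranking function used by search engines.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

# Online: embed only the query, then rank documents by cosine similarity
query_embedding = model.encode("What is semiconductor yield?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
print(scores.tolist())
````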
Cross-Encoder (Reranker):
- Concatenates query + document as single input: "[CLS] query [SEP] document [SEP]".
- Attention flows freely between query and document tokens — captures fine-grained semantic alignment.
- Cannot be pre-computed; must run inference for every query-document pair at runtime.
- 10–100x slower than bi-encoder retrieval; only practical for small candidate sets.
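A back-of-the-envelope sketch of why cross-encoders are limited to small candidate sets; the throughput figure is an assumed, illustrative number for a MiniLM-class model on one GPU, not a measurement:

````
# Assumed cross-encoder throughput (illustrative, MiniLM-class on one GPU)
pairs_per_second = 1_000

corpus_size = 1_000_000
candidates = 100

# Scoring the whole corpus per query is hopeless; scoring 100 candidates is cheap
full_corpus_seconds = corpus_size / pairs_per_second
rerank_ms = candidates / pairs_per_second * 1_000

print(f"score full corpus: {full_corpus_seconds:,.0f} s per query")
print(f"rerank top-{candidates}: {rerank_ms:.0f} ms per query")
````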
Key Reranker Models
- MS MARCO Rerankers (Hugging Face): BERT, MiniLM, and DeBERTa-based cross-encoders trained on the MS MARCO passage-ranking dataset. Standard production baselines.
- Cohere Rerank: Commercial API reranker with multilingual support and strong performance on enterprise content types.
- Jina Reranker: Open-source cross-encoder with competitive performance and efficient inference.
- BGE Reranker (BAAI): Strong open-source cross-encoder; BGE-Reranker-v2 achieves near-commercial accuracy.
- ColBERTv2: Late interaction model; per-token MaxSim scoring balances accuracy and speed between the bi-encoder and cross-encoder extremes (see the MaxSim sketch after this list).
- RankGPT / LLM Reranking: Use an LLM (e.g., GPT-4 or Claude) to rank candidates listwise via prompting. Highest accuracy; highest cost.
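A minimal sketch of ColBERT-style late interaction in plain PyTorch, with random tensors standing in for the per-token embeddings a trained ColBERT encoder would produce; the score is the sum, over query tokens, of each token's best similarity to any document token:

````
import torch

# Stand-ins for per-token embeddings (real ColBERT uses a trained encoder)
query_tokens = torch.randn(8, 128)   # 8 query tokens, 128-dim each
doc_tokens = torch.randn(200, 128)   # 200 document tokens

# Normalize so dot products are cosine similarities
q = torch.nn.functional.normalize(query_tokens, dim=-1)
d = torch.nn.functional.normalize(doc_tokens, dim=-1)

# MaxSim: each query token keeps its best-matching document token,
# and the per-token maxima are summed into one relevance score
sim = q @ d.T                        # (8, 200) token-level similarities
score = sim.max(dim=-1).values.sum()
print(score.item())
````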
Complete Two-Stage Retrieval Pipeline
Stage 1 — Candidate Generation (fast):
- Hybrid retrieval: BM25 (Elasticsearch) + dense retrieval (FAISS/pgvector) → top 100 candidates via Reciprocal Rank Fusion (sketched after this pipeline).
- Latency: 10–50ms for million-document corpus.
Stage 2 — Reranking (accurate):
- Cross-encoder scores all 100 candidates.
- Select top-5 for LLM context.
- Latency: 50–200ms on GPU for 100 candidates with MiniLM.
Stage 3 — Generation:
- LLM generates response from top-5 reranked passages.
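A sketch of stages 1–2 under stated assumptions: the first-stage rankings arrive as precomputed lists of document IDs (the BM25 and dense searches themselves are out of scope here), Reciprocal Rank Fusion merges them, and a cross-encoder picks the final top-K to hand to the generation stage:

````
from collections import defaultdict
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query, corpus, bm25_ids, dense_ids, top_k=5):
    # Stage 1: fuse the two first-stage rankings, keep the top 100
    candidates = reciprocal_rank_fusion([bm25_ids, dense_ids])[:100]
    # Stage 2: cross-encoder scores every (query, candidate) pair
    scores = reranker.predict([(query, corpus[i]) for i in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]  # feed these to the LLM

# Toy usage: two documents and two disagreeing first-stage rankings
corpus = {0: "Yield is the share of working chips per wafer.",
          1: "BM25 ranks documents by term overlap."}
print(rerank("What is semiconductor yield?", corpus, [1, 0], [0, 1]))
````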
Performance Benchmark (BEIR)
| Method | NDCG@10 | Latency | Cost |
|--------|---------|---------|------|
| BM25 only | 43.5 | 10ms | Minimal |
| Dense (bi-encoder) | 47.2 | 30ms | Moderate |
| Hybrid | 50.1 | 40ms | Moderate |
| Hybrid + cross-encoder rerank | 56.8 | 200ms | Higher |
| Hybrid + LLM rerank | 59.3 | 2000ms | High |
Practical Implementation
````
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is semiconductor yield?"
docs = [
    "Yield is the percentage of working chips produced per wafer.",
    "A fab processes wafers through hundreds of lithography steps.",
]

# Score each (query, document) pair jointly through the cross-encoder
scores = model.predict([(query, doc) for doc in docs])

# Sort documents by descending relevance score
ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
````
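For larger candidate sets, model.predict accepts a batch_size argument, and running on a GPU is what keeps a 100-candidate rerank within the 50–200ms budget quoted above.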
Rerankers are the precision layer that separates good retrieval from great retrieval. As cross-encoder models shrink via distillation and run on-device, two-stage pipelines are poised to become the standard for production RAG systems that require high-accuracy, low-hallucination responses.