Definition
Reranking is a retrieval technique that applies a more powerful model to reorder an initial set of search results, improving the ranking of truly relevant documents. It typically follows a first-stage retrieval (like vector search) and uses cross-encoder models that consider query-document pairs together for more accurate relevance scoring.
Why it matters
Reranking bridges the gap between fast retrieval and accurate relevance:
- Quality improvement — pushes the most relevant results to the top
- Precision boost — cross-encoders understand context better than bi-encoders
- RAG enhancement — ensures the best documents enter the LLM context
- Cost-effective — applies expensive models only to top candidates, not entire corpus
- Latency balance — adds roughly 50-100 ms (with batched inference) for significantly better results
Reranking can improve ranking quality, as measured by metrics such as NDCG or MRR, by 10-30% with minimal latency impact.
How it works
┌────────────────────────────────────────────────────────────┐
│                    TWO-STAGE RETRIEVAL                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  STAGE 1: FAST RETRIEVAL (Bi-Encoder)                      │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Query ─────────────┐                                │  │
│  │                     ├───► Compare Embeddings         │  │
│  │  Doc Embeddings ────┘     (Approximate, Fast)        │  │
│  │                                                      │  │
│  │  Return: Top 100-500 candidates                      │  │
│  └──────────────────────────────────────────────────────┘  │
│                             │                              │
│                             ▼                              │
│  STAGE 2: RERANKING (Cross-Encoder)                        │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                                                      │  │
│  │  For each candidate:                                 │  │
│  │  ┌────────────────────────────────────────────────┐  │  │
│  │  │  [Query] [SEP] [Document] → Model → Score      │  │  │
│  │  └────────────────────────────────────────────────┘  │  │
│  │                                                      │  │
│  │  Considers full interaction (Accurate, Slower)       │  │
│  │                                                      │  │
│  │  Return: Reordered top 5-20                          │  │
│  └──────────────────────────────────────────────────────┘  │
│                             │                              │
│                             ▼                              │
│                    FINAL RANKED RESULTS                    │
└────────────────────────────────────────────────────────────┘
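A minimal sketch of this pipeline using the sentence-transformers library is shown below. The model names are common public checkpoints; the toy corpus, query, and k values are illustrative stand-ins. In production, stage 1 would typically query an ANN index (e.g., FAISS) rather than scoring the corpus exhaustively.

```python
# Minimal two-stage retrieval sketch with sentence-transformers.
# Corpus, query, and k values are toy stand-ins, not a production setup.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "Reranking reorders retrieved documents with a stronger model.",
    "Bi-encoders embed queries and documents independently.",
    "Cross-encoders jointly encode each query-document pair.",
    "The capital of France is Paris.",
]
query = "How does reranking improve search results?"

# Stage 1: bi-encoder retrieval (fast, approximate relevance).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
sim = util.cos_sim(query_emb, doc_embs)[0]        # cosine similarity per doc
candidate_ids = sim.topk(k=3).indices.tolist()    # candidate set for stage 2
candidates = [corpus[i] for i in candidate_ids]

# Stage 2: cross-encoder reranking (accurate, runs only on the candidates).
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = cross_encoder.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, pair_scores), key=lambda p: p[1], reverse=True)

for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```

Keeping the stage-2 candidate set small is what makes the cross-encoder affordable: its cost grows with the number of pairs scored, not with corpus size.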
Key differences:
| Aspect | Bi-Encoder (Stage 1) | Cross-Encoder (Stage 2) |
|---|---|---|
| Speed | Fast (~1 ms per query over 1M docs) | Slow (~10 ms per query-document pair) |
| Accuracy | Good | Excellent |
| Interaction | None (separate encoding) | Full (joint encoding) |
| Scale | Entire corpus | Top candidates only |
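The scaling difference in the table comes from what can be precomputed. A bi-encoder embeds documents once, offline, so query-time scoring reduces to vector math (and ANN indexes make it sub-linear); a cross-encoder can precompute nothing, because it must see the query and document together. A rough numpy illustration of the bi-encoder side, using random stand-in embeddings:

```python
import numpy as np

# Bi-encoder scoring: doc embeddings (random stand-ins here) are computed
# once, offline. At query time, relevance over the whole corpus is a single
# matrix-vector product -- no model forward pass per document.
doc_embeddings = np.random.randn(100_000, 384).astype(np.float32)
query_embedding = np.random.randn(384).astype(np.float32)
scores = doc_embeddings @ query_embedding   # one cheap pass over 100k docs

# Cross-encoder scoring has no analogous shortcut: every (query, doc) pair
# needs its own full transformer forward pass, so it is reserved for the
# small candidate set produced by stage 1.
top_candidates = np.argsort(-scores)[:100]  # indices handed to the reranker
```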
Common questions
Q: Why not just use cross-encoders for everything?
A: Cross-encoders are too slow for large-scale retrieval. They must run a full forward pass for each query-document pair, making query cost O(n) in the corpus size n. At ~10 ms per pair, exhaustively scoring a 1M-document corpus would take nearly three hours per query, versus about a second for 100 candidates. Two-stage retrieval provides the best of both worlds.
Q: What models are used for reranking?
A: Popular rerankers include Cohere Rerank, BGE Reranker, and cross-encoder models fine-tuned on MS MARCO. These are specifically trained to score query-document relevance.
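A hosted reranker takes only a few lines to call. The sketch below assumes the Cohere Python SDK's rerank endpoint, with a placeholder API key and model name (check Cohere's docs for current identifiers); BGE rerankers can instead be loaded locally through sentence-transformers' CrossEncoder, as in the pipeline sketch above.

```python
# Hedged sketch: hosted reranking via Cohere's rerank endpoint. The API key
# and model name are placeholders, not guaranteed-current values.
import cohere

co = cohere.Client("YOUR_API_KEY")
response = co.rerank(
    model="rerank-english-v3.0",
    query="How does reranking improve search results?",
    documents=[
        "Reranking reorders retrieved documents with a stronger model.",
        "The capital of France is Paris.",
    ],
    top_n=1,
)
for result in response.results:
    print(result.index, result.relevance_score)
```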
Q: How many documents should be reranked?
A: Typically 50-200 candidates from the first stage are reranked. Too few and you might miss relevant documents; too many adds unnecessary latency.
Q: Does reranking replace vector search?
A: No, it complements it. Vector search provides fast candidate retrieval; reranking improves the ordering. Both stages are needed for optimal performance.
Related terms
- RAG — pipeline that benefits from reranking
- Hybrid Search — first-stage approach that combines methods
- Semantic Search — embedding-based retrieval
- Cross-Encoder — model type used for reranking
References
Nogueira & Cho (2019), “Passage Re-ranking with BERT”, arXiv. [1,500+ citations]
Karpukhin et al. (2020), “Dense Passage Retrieval for Open-Domain Question Answering”, EMNLP. [3,500+ citations]
Humeau et al. (2020), “Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring”, ICLR. [700+ citations]
Glass et al. (2022), “Re2G: Retrieve, Rerank, Generate”, NAACL. [100+ citations]