Definition
A bi-encoder (also called dual encoder or two-tower model) is a neural architecture that encodes queries and documents independently into dense vector representations. Each input passes through a separate encoder (or shared-weight encoder) to produce a fixed-size embedding. Relevance is computed by measuring similarity (typically cosine or dot product) between these pre-computed vectors. This architecture enables scalable retrieval: document embeddings can be computed offline and indexed in approximate nearest neighbor (ANN) data structures, allowing sub-second search over millions of documents.
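Concretely, once the two towers have produced their vectors, relevance is plain vector arithmetic. A minimal NumPy sketch of just the scoring step (the encoding step is assumed to have already happened; the dimensions and random vectors are illustrative):

    import numpy as np

    def cosine_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
        # Cosine similarity: dot product of the L2-normalized vectors
        return float(np.dot(q_vec, d_vec) /
                     (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

    def dot_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
        # Raw dot product; identical to cosine when both vectors are unit-length
        return float(np.dot(q_vec, d_vec))

    # e.g. 768-dim vectors: documents encoded offline, the query at search time
    q, d = np.random.randn(768), np.random.randn(768)
    print(cosine_score(q, d), dot_score(q, d))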
Why it matters
Bi-encoders are the foundation of modern retrieval:
- Scalable search — pre-compute document embeddings once, reuse for all queries
- Real-time retrieval — millisecond latency over billion-scale corpora
- Semantic matching — understand meaning beyond keyword overlap
- Dense retrieval — the dominant paradigm replacing sparse methods
- RAG enabler — power retrieval in retrieval-augmented generation systems
- Hybrid systems — combine with BM25 and cross-encoders for best results
Without bi-encoders, semantic search at scale would be computationally infeasible.
How it works
┌────────────────────────────────────────────────────────────┐
│ BI-ENCODER │
├────────────────────────────────────────────────────────────┤
│ │
│ ARCHITECTURE: │
│ ───────────── │
│ │
│ TWO-TOWER DESIGN: │
│ │
│ Query Document │
│ │ │ │
│ ↓ ↓ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Query │ │ Document │ │
│ │ Encoder │ │ Encoder │ │
│ │ (Tower 1) │ │ (Tower 2) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ↓ ↓ │
│ ┌───────────┐ ┌───────────┐ │
│ │ Query Vec │ │ Doc Vec │ │
│ │ [768-dim] │ │ [768-dim] │ │
│ └───────────┘ └───────────┘ │
│ \ / │
│ \ / │
│ \ / │
│ → similarity(q, d) ← │
│ cosine / dot product │
│ │ │
│ ↓ │
│ Relevance Score │
│ │
│ │
│ ENCODER CONFIGURATIONS: │
│ ─────────────────────── │
│ │
│ 1. Shared weights (Siamese): │
│ Query ─┐ │
│ ├──→ Same Encoder ──→ Same vector space │
│ Doc ───┘ │
│ │
│ 2. Separate weights (Dual): │
│ Query ──→ Encoder A ──→ ┐ │
│ ├──→ Aligned space │
│ Doc ────→ Encoder B ──→ ┘ │
│ │
│ Most models use shared weights for simplicity │
│ │
│ │
│ INDEXING AND RETRIEVAL WORKFLOW: │
│ ──────────────────────────────── │
│ │
│ OFFLINE (Indexing Phase): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Document Collection │ │
│ │ ┌─────┬─────┬─────┬─────┬─────┬───────┐ │ │
│ │ │Doc 1│Doc 2│Doc 3│Doc 4│ ... │Doc N │ │ │
│ │ └─────┴─────┴─────┴─────┴─────┴───────┘ │ │
│ │ │ │ │ │ │ │ │
│ │ ↓ ↓ ↓ ↓ ↓ │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │ Document Encoder │ │ │
│ │ └─────────────────────────────────────────┘ │ │
│ │ │ │ │ │ │ │ │
│ │ ↓ ↓ ↓ ↓ ↓ │ │
│ │ [v₁] [v₂] [v₃] [v₄] ... [vₙ] │ │
│ │ │ │ │ │ │ │ │
│ │ └─────┴─────┴─────┴───────────┘ │ │
│ │ │ │ │
│ │ ↓ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ ANN Index │ │ │
│ │ │ (FAISS, HNSW) │ │ │
│ │ └─────────────────┘ │ │
│ │ │ │
│ │ One-time cost: N encoder passes │ │
│ │ │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ONLINE (Query Phase): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ User Query: "What is machine learning?" │ │
│ │ │ │ │
│ │ ↓ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ Query Encoder │ (~5-10ms on GPU) │ │
│ │ └─────────────────┘ │ │
│ │ │ │ │
│ │ ↓ │ │
│ │ [q_vec] │ │
│ │ │ │ │
│ │ ↓ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ ANN Search │ (~1-5ms) │ │
│ │ │ top-k nearest │ │ │
│ │ └─────────────────┘ │ │
│ │ │ │ │
│ │ ↓ │ │
│ │ Ranked results: Doc₄, Doc₁, Doc₇, ... │ │
│ │ │ │
│ │ Total latency: ~10-20ms for millions of docs! │ │
│ │ │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ │
│ COMPARISON: BI-ENCODER vs CROSS-ENCODER: │
│ ───────────────────────────────────────── │
│ │
│  ┌─────────────────┬───────────────┬──────────────────┐    │
│ │ Aspect │ Bi-Encoder │ Cross-Encoder │ │
│ ├─────────────────┼───────────────┼──────────────────┤ │
│ │ Encoding │ Separate │ Joint │ │
│ │ Pre-compute │ ✓ Yes │ ✗ No │ │
│ │ Latency (1M) │ ~10ms │ ~hours │ │
│ │ Accuracy │ Good │ Better │ │
│ │ Use case │ Retrieval │ Reranking │ │
│ │ Token interact. │ None │ Full attention │ │
│ └─────────────────┴───────────────┴──────────────────┘ │
│ │
│ │
│ POPULAR BI-ENCODER MODELS: │
│ ────────────────────────── │
│ │
│ ┌──────────────────────────────┬─────────────────────┐ │
│ │ Model │ Notes │ │
│ ├──────────────────────────────┼─────────────────────┤ │
│ │ all-MiniLM-L6-v2 │ Fast, 384-dim │ │
│ │ all-mpnet-base-v2 │ Better quality │ │
│ │ e5-large-v2 │ Strong general │ │
│ │ bge-large-en-v1.5 │ Top performance │ │
│ │ gte-large │ Alibaba model │ │
│ │ OpenAI text-embedding-3 │ API, 3072-dim │ │
│ │ Cohere embed-v3 │ API, multilingual │ │
│ │ voyage-2 │ API, high quality │ │
│ └──────────────────────────────┴─────────────────────┘ │
│ │
│ │
│ CODE EXAMPLE: │
│ ───────────── │
│ │
│ from sentence_transformers import SentenceTransformer │
│ import numpy as np │
│ │
│ # Load bi-encoder model │
│ model = SentenceTransformer('all-MiniLM-L6-v2') │
│ │
│ # Encode documents (offline) │
│ documents = [ │
│ "Machine learning is a subset of AI.", │
│ "Deep learning uses neural networks.", │
│ "The cat sat on the mat." │
│ ] │
│ doc_embeddings = model.encode(documents) │
│ │
│ # Encode query (online) │
│ query = "What is artificial intelligence?" │
│ query_embedding = model.encode(query) │
│ │
│ # Compute similarities │
│ from sklearn.metrics.pairwise import cosine_similarity │
│ similarities = cosine_similarity( │
│ [query_embedding], │
│ doc_embeddings │
│ )[0] │
│ │
│ # Results: [0.62, 0.45, 0.08] │
│ # Doc 1 most relevant (about AI) │
│ │
└────────────────────────────────────────────────────────────┘
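Putting the offline/online workflow above into code, here is a minimal sketch using sentence-transformers plus FAISS. It uses an exact flat inner-product index for clarity (a production system would typically use an approximate index such as HNSW or IVF), and the model and tiny corpus are only illustrative:

    from sentence_transformers import SentenceTransformer
    import numpy as np
    import faiss

    model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim embeddings

    # OFFLINE: encode and index the corpus once
    documents = [
        "Machine learning is a subset of AI.",
        "Deep learning uses neural networks.",
        "The cat sat on the mat.",
    ]
    doc_vecs = model.encode(documents, normalize_embeddings=True)
    doc_vecs = np.asarray(doc_vecs, dtype="float32")
    index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine on unit vectors
    index.add(doc_vecs)

    # ONLINE: encode the query and search the index
    query_vec = model.encode(["What is artificial intelligence?"],
                             normalize_embeddings=True)
    query_vec = np.asarray(query_vec, dtype="float32")
    scores, ids = index.search(query_vec, 2)  # top-2 nearest documents
    for score, doc_id in zip(scores[0], ids[0]):
        print(f"{score:.3f}  {documents[doc_id]}")

Because the document vectors are added to the index once, every subsequent query pays only for a single encoder pass plus the ANN lookup.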
Common questions
Q: Why are bi-encoders less accurate than cross-encoders?
A: Bi-encoders compress each text into a fixed-size vector independently, so there is no token-level interaction between query and document. A cross-encoder sees both texts together, allowing attention between “python” in the query and “snake” or “programming” in the document. This information bottleneck trades accuracy for efficiency.
Q: How do I choose between bi-encoder models?
A: Consider: (1) embedding dimension vs. storage cost, (2) inference speed requirements, (3) task/domain fit (check the MTEB leaderboard), (4) multilingual needs. Start with all-mpnet-base-v2 for general use, e5-large-v2 or bge-large-en-v1.5 for quality, and all-MiniLM-L6-v2 for speed.
Q: Can bi-encoders handle long documents?
A: Most bi-encoders have a 512-token input limit. For long documents: (1) chunk into passages and embed each chunk (sketched below), (2) use max-pooling or attentive pooling over the chunks, (3) use a model with a longer context window (e.g., Longformer-based), or (4) use a late-interaction model like ColBERT.
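For options (1) and (2), a common pattern is to embed fixed-size chunks and let the best-matching chunk stand in for the whole document. A rough sketch (the whitespace chunker is deliberately naive and the model name is only an example):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('all-MiniLM-L6-v2')

    def chunk(text, words_per_chunk=200):
        # Naive whitespace chunker; real pipelines usually split on
        # sentences or passages, often with overlap between chunks
        words = text.split()
        return [" ".join(words[i:i + words_per_chunk])
                for i in range(0, len(words), words_per_chunk)] or [text]

    def score_long_document(query, document):
        chunks = chunk(document)
        chunk_vecs = model.encode(chunks, normalize_embeddings=True)
        query_vec = model.encode(query, normalize_embeddings=True)
        # Max-pooling over chunk similarities: the best chunk represents the document
        return float(util.cos_sim(query_vec, chunk_vecs).max())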
Q: What’s late interaction?
A: Models like ColBERT keep token-level embeddings instead of single-vector representations. Query tokens match against document tokens via MaxSim. This preserves some cross-encoder accuracy while enabling pre-computation—a middle ground between bi-encoders and cross-encoders.
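The MaxSim operator itself is easy to state. Below is a NumPy sketch over hypothetical pre-computed token embeddings, not a real ColBERT model (which also handles tokenization, compression, and indexing):

    import numpy as np

    def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
        """ColBERT-style late interaction: for each query token take its
        best-matching document token, then sum over query tokens.
        query_tokens: (n_query_tokens, dim), doc_tokens: (n_doc_tokens, dim),
        both assumed L2-normalized so dot product equals cosine similarity."""
        sim_matrix = query_tokens @ doc_tokens.T       # (n_query_tokens, n_doc_tokens)
        return float(sim_matrix.max(axis=1).sum())     # max over doc tokens, sum over query tokens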
Related terms
- Cross-encoder — higher-accuracy alternative for reranking
- Dense retrieval — retrieval using bi-encoder embeddings
- Embedding — the vector representation each encoder tower produces
- Semantic search — primary application
References
Reimers & Gurevych (2019), “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, EMNLP. [Foundational bi-encoder paper]
Karpukhin et al. (2020), “Dense Passage Retrieval for Open-Domain Question Answering”, EMNLP. [DPR bi-encoder for QA]
Muennighoff et al. (2022), “MTEB: Massive Text Embedding Benchmark”, arXiv. [Bi-encoder evaluation benchmark]
Khattab & Zaharia (2020), “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT”, SIGIR. [Late interaction advancement]