
Retrieval recall

The fraction of all truly relevant documents that a retrieval system successfully returns.

Also known as: Recall@k, Search recall

Definition

Retrieval recall is the fraction of all truly relevant documents in a collection that the system successfully retrieves for a given query. If there are 10 relevant documents in the corpus and the system retrieves 7 of them, recall is 70%. Recall is typically measured at a cut-off point — Recall@10 (of the 10 relevant documents, how many appear in the top 10 results?) or Recall@100 (for initial candidate generation). In legal AI, high recall is critical because a missing relevant provision — an exception, an amendment, or a conflicting ruling — can fundamentally change the correct answer.

Why it matters

  • Completeness of legal analysis — tax law is full of exceptions, special regimes, and conflicting provisions across jurisdictions; missing even one relevant source can lead to an incomplete or incorrect answer
  • RAG answer quality ceiling — the language model can only reason over what the retrieval layer provides; if a relevant document is not retrieved, it cannot be included in the generated answer
  • Risk management — in professional tax advice, failing to consider a relevant provision is a liability risk; high recall reduces the chance of overlooking critical sources
  • Complementary to precision — precision measures result quality (how many returned documents are relevant); recall measures coverage (how many relevant documents were found); both are needed for effective retrieval

How it works

Recall is computed by dividing the number of relevant documents retrieved by the total number of relevant documents in the corpus:

Recall@k = (relevant documents in top k) / (total relevant documents)

Computing recall requires knowing the complete set of relevant documents for each query, which is established through human annotation of a test set. Annotators review the corpus and identify all documents relevant to each test query, creating the ground truth against which the system is measured.
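A minimal sketch of this computation, assuming the ground truth is a set of relevant document IDs per query and the system returns a ranked list (the function and document IDs are illustrative, not a specific library's API):

    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        """Fraction of the relevant documents that appear in the top-k results."""
        if not relevant:
            return 0.0
        hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
        return hits / len(relevant)

    # Worked example from the definition: 10 relevant documents, 7 of them in the top 10.
    relevant = {f"doc_{i}" for i in range(10)}
    retrieved = [f"doc_{i}" for i in range(7)] + ["other_1", "other_2", "other_3"]
    print(recall_at_k(retrieved, relevant, k=10))  # 0.7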

Recall-precision trade-off — increasing recall (finding more relevant documents) typically requires returning more results overall, which may include more irrelevant ones (reducing precision). The retrieval pipeline manages this trade-off through its staged architecture: early stages (BM25, dense retrieval) cast a wide net for high recall, while later stages (reranking, filtering) refine for precision.

Recall at different pipeline stages — recall is often measured at each stage to identify bottlenecks. If initial candidate retrieval achieves 95% Recall@100 but reranking drops to 70% Recall@10, the reranker is the bottleneck. If initial retrieval only achieves 60% Recall@100, the fundamental retrieval strategy needs improvement.
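A hedged sketch of stage-wise measurement on a toy annotated test set (the queries, document IDs, and stage outputs are made up for illustration); in practice the same averaging would be applied to the top-100 candidate lists and the top-10 reranked lists:

    def average_recall(results_per_query: dict, ground_truth: dict, k: int) -> float:
        """Mean Recall@k over all annotated test queries."""
        scores = []
        for query, retrieved in results_per_query.items():
            relevant = ground_truth[query]
            hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
            scores.append(hits / len(relevant) if relevant else 0.0)
        return sum(scores) / len(scores)

    # Toy ground truth and per-stage outputs for two test queries.
    ground_truth = {"q1": {"a", "b", "c"}, "q2": {"d", "e"}}
    candidates = {"q1": ["a", "x", "b", "y", "c"], "q2": ["d", "e", "z"]}  # wide candidate stage
    reranked = {"q1": ["a", "x"], "q2": ["d", "z"]}                        # after reranking

    print(average_recall(candidates, ground_truth, k=100))  # 1.0 at the candidate stage
    print(average_recall(reranked, ground_truth, k=10))     # ~0.42: the reranker drops relevant documents

Comparing the two numbers per stage makes the bottleneck visible without re-annotating anything.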

Improving recall involves multiple strategies:

  • Hybrid search (combining lexical and semantic retrieval) covers more matching strategies (see the sketch after this list)
  • Query expansion (adding synonyms and related terms) broadens the search
  • Cross-lingual retrieval (searching across language boundaries) finds sources in all three Belgian languages
  • Reducing chunk size can improve recall by allowing more fine-grained matching
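As one illustration of the first strategy, a minimal sketch of hybrid search using reciprocal rank fusion to merge a lexical and a semantic result list (the document IDs and the constant k=60 are illustrative assumptions, not a prescribed configuration):

    from collections import defaultdict

    def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
        """Merge ranked lists from different retrievers into one candidate pool.

        A document found by any retriever keeps a place in the merged list,
        which is what drives the recall gain of hybrid search."""
        scores: defaultdict[str, float] = defaultdict(float)
        for ranked in ranked_lists:
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical result lists from a lexical (BM25) retriever and a dense retriever.
    bm25_results = ["art_90", "circ_2021_07", "ruling_345"]
    dense_results = ["art_90ter", "art_90", "circ_2021_07"]
    print(reciprocal_rank_fusion([bm25_results, dense_results]))

Because the merged pool is the union of both result lists, a relevant document missed by one retriever can still be recovered by the other.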

Common questions

Q: What is a good recall score for legal search?

A: Recall@100 above 90% is generally expected for initial candidate retrieval. Recall@10 (after reranking) above 80% is considered strong. The acceptable threshold depends on the risk profile — higher-stakes applications demand higher recall.

Q: Why is recall harder to measure than precision?

A: Measuring recall requires knowing all relevant documents in the corpus for each query, which is expensive and labour-intensive to annotate. Precision only requires judging the returned results. This is why recall evaluation relies on carefully curated test sets rather than ad hoc testing.
