Definition
An inverted index is a data structure that maps content (typically words or terms) to their locations in documents. Instead of storing “document → words,” it stores “word → documents,” which is why it’s called “inverted.” This seemingly simple flip enables sub-millisecond search across billions of documents—when you search for “climate change,” the system looks up two posting lists and intersects them, rather than scanning every document. The inverted index is the fundamental data structure powering search engines from Google to Elasticsearch.
Why it matters
Inverted indexes make modern search possible:
- Speed — O(1) lookup per term vs O(n) document scanning
- Scale — search billions of documents in milliseconds
- Ubiquity — powers Google, Bing, Elasticsearch, Solr, Lucene, and virtually every major search engine
- Efficiency — only stores relevant document references, not full content
- Foundation — enables BM25, TF-IDF, and all sparse retrieval methods
- Combinability — boolean operations (AND, OR, NOT) are fast set operations
Understanding inverted indexes is essential for anyone building search systems.
How it works
┌────────────────────────────────────────────────────────────┐
│ INVERTED INDEX │
├────────────────────────────────────────────────────────────┤
│ │
│ THE FUNDAMENTAL INVERSION: │
│ ────────────────────────── │
│ │
│ FORWARD INDEX (how documents are stored): │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Doc 1: "The quick brown fox" │ │
│ │ Doc 2: "The lazy dog sleeps" │ │
│ │ Doc 3: "Quick brown dogs run" │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ To find "brown": scan ALL documents → O(n) = SLOW │
│ │
│ │
│ INVERTED INDEX (term → document mapping): │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Term │ Posting List (documents containing) │ │
│ ├───────────┼──────────────────────────────────────┤ │
│ │ "brown" │ → [Doc 1, Doc 3] │ │
│ │ "dog" │ → [Doc 2, Doc 3] │ │
│ │ "dogs" │ → [Doc 3] │ │
│ │ "fox" │ → [Doc 1] │ │
│ │ "lazy" │ → [Doc 2] │ │
│ │ "quick" │ → [Doc 1, Doc 3] │ │
│ │ "run" │ → [Doc 3] │ │
│ │ "sleeps" │ → [Doc 2] │ │
│ │ "the" │ → [Doc 1, Doc 2] │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ To find "brown": lookup hash table → O(1) = FAST! │
│ │
│ │
│ FULL INDEX STRUCTURE: │
│ ───────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ DICTIONARY (vocabulary) │ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ Term │ Doc Freq │ Pointer to Postings │ │ │
│ │ ├──────────┼──────────┼──────────────────────┤ │ │
│ │ │ "brown" │ 2 │ ────────┐ │ │ │
│ │ │ "climate"│ 156 │ ────────┼──┐ │ │ │
│ │ │ "dog" │ 3 │ ────────┼──┼──┐ │ │ │
│ │ └──────────┴──────────┴────────┼──┼──┼───────┘ │ │
│ │ │ │ │ │ │
│ │ POSTING LISTS ▼ ▼ ▼ │ │
│ │ ┌────────────────────────────────────────────┐ │ │
│ │ │ "brown": [D1:2, D3:1] │ │ │
│ │ │ (doc_id: term_freq) │ │ │
│ │ │ │ │ │
│ │ │ "climate": [D5:3, D12:1, D45:2, D89:1..] │ │ │
│ │ │ │ │ │
│ │ │ "dog": [D2:1, D3:2, D99:1] │ │ │
│ │ └────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ │
│ QUERY PROCESSING: │
│ ───────────────── │
│ │
│ Query: "brown dog" │
│ │
│ Step 1: Tokenize query → ["brown", "dog"] │
│ │
│ Step 2: Lookup posting lists │
│ "brown" → [D1, D3] │
│ "dog" → [D2, D3] │
│ │
│ Step 3: Boolean operations │
│ │
│ AND (both terms): │
│ [D1, D3] ∩ [D2, D3] = [D3] │
│ │
│ OR (either term): │
│ [D1, D3] ∪ [D2, D3] = [D1, D2, D3] │
│ │
│ Step 4: Score and rank (BM25, TF-IDF, etc.) │
│ │
│ │
│ POSTING LIST FORMATS: │
│ ───────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ BASIC: Just document IDs │ │
│ │ "climate" → [5, 12, 45, 89, 156, ...] │ │
│ │ │ │
│ │ WITH FREQUENCY: doc_id:term_freq pairs │ │
│ │ "climate" → [5:3, 12:1, 45:2, 89:1, ...] │ │
│ │ (needed for TF-IDF/BM25 scoring) │ │
│ │ │ │
│ │ WITH POSITIONS: doc_id:positions list │ │
│ │ "climate" → [5:[12,45,89], 12:[3], ...] │ │
│ │ (needed for phrase search "climate change") │ │
│ │ │ │
│ │ WITH OFFSETS: character positions │ │
│ │ (needed for highlighting search results) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ │
│ INDEX CONSTRUCTION PIPELINE: │
│ ──────────────────────────── │
│ │
│ Documents ──► Tokenize ──► Normalize ──► Index │
│ │
│ Tokenize: │
│ "The Quick-Brown fox!" → ["The", "Quick", "Brown", "fox"]│
│ │
│ Normalize: │
│ • Lowercase: "Quick" → "quick" │
│ • Remove punctuation │
│ • Stemming: "running" → "run" │
│ • Stopword removal: remove "the", "is", "at" │
│ │
│ Index: │
│ Add to posting lists + update dictionary │
│ │
│ │
│ COMPRESSION TECHNIQUES: │
│ ─────────────────────── │
│ │
│ Document IDs in posting lists are sorted, so store deltas:│
│ │
│ Original: [5, 12, 45, 89, 156] │
│ Deltas: [5, 7, 33, 44, 67] (differences) │
│ │
│ Delta encoding + variable-byte encoding = major savings │
│ Google's index: petabytes → manageable with compression │
│ │
│ │
│ SKIP LISTS FOR FASTER INTERSECTION: │
│ ─────────────────────────────────── │
│ │
│ "the" posting list: [1,2,3,4,5,...,999,1000] (huge!) │
│ │
│ Add skip pointers: │
│ Level 2: [1, ───────────► 100, ────────► 200, ...] │
│ Level 1: [1, ──► 10, ──► 20, ... 100, ──► 110, ...] │
│ Level 0: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...] │
│ │
│ Can skip ahead when intersecting with smaller list │
│ │
└────────────────────────────────────────────────────────────┘
Forward vs inverted index comparison:
| Aspect | Forward Index | Inverted Index |
|---|---|---|
| Search time | O(n × doc_length) | O(terms + matching postings) |
| Storage | Full documents | Term → doc mappings |
| Boolean ops | Slow (full scan) | Fast (set ops) |
| Update | Easy (append) | Harder (update lists) |
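The "fast set ops" advantage comes from keeping posting lists sorted: AND is then a linear merge of two sorted lists, sketched below.

```python
def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([1, 3], [2, 3]))      # [3]
```

Skip pointers speed this up further when one list is much longer than the other: instead of advancing the long list one entry at a time, the merge can jump ahead in blocks, as in the skip-list diagram above.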
Common questions
Q: How does the inverted index handle updates?
A: Modern systems use a “merge” strategy: new documents go into a small in-memory index, which periodically merges into larger on-disk segments. Deletes are handled via a deletion bitmap—documents are marked deleted but not physically removed until segment merge. This is how Elasticsearch/Lucene achieve both fast indexing and fast search.
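The segment-and-tombstone model can be sketched as follows. Class names and structure are illustrative only, not Lucene's actual API; real systems also merge segments in the background and persist them to disk.

```python
from collections import defaultdict

class Segment:
    """Immutable segment: term -> sorted doc IDs, plus a tombstone set
    marking logically deleted documents."""
    def __init__(self, postings):
        self.postings = postings      # {term: [doc_id, ...]}
        self.deleted = set()          # the "deletion bitmap"

    def search(self, term):
        return [d for d in self.postings.get(term, []) if d not in self.deleted]

class TinyIndex:
    """Illustrative update model: writes buffer in memory, flush() freezes
    the buffer into a segment, deletes only mark tombstones."""
    def __init__(self):
        self.segments = []
        self.buffer = defaultdict(list)

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.buffer[term].append(doc_id)

    def flush(self):
        if self.buffer:
            frozen = {t: sorted(ds) for t, ds in self.buffer.items()}
            self.segments.append(Segment(frozen))
            self.buffer.clear()

    def delete(self, doc_id):
        for seg in self.segments:
            seg.deleted.add(doc_id)   # marked, not physically removed

    def search(self, term):
        hits = []
        for seg in self.segments:
            hits.extend(seg.search(term))
        return sorted(hits)
```

A deleted document disappears from results immediately, but its postings are only reclaimed when segments are merged and rewritten.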
Q: What’s the difference between inverted index and vector index?
A: Inverted indexes map terms to documents (sparse, exact matching). Vector indexes (FAISS, HNSW) map embedding vectors to approximate nearest neighbors (dense, semantic matching). Modern search often uses both: inverted index for keyword/exact search, vector index for semantic search, combined in hybrid retrieval.
Q: How do phrase searches work with inverted indexes?
A: Position information is stored in posting lists. For “climate change,” the system finds documents containing both terms, then checks if “climate” appears at position N and “change” at position N+1 in the same document. This is more expensive than single-term search but still fast due to the inverted structure.
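Using the position-annotated posting format shown earlier, the adjacency check looks like this. The posting data is made up for illustration.

```python
# Positional postings: term -> {doc_id: [positions]} (illustrative data)
pos_index = {
    "climate": {5: [12, 45, 89], 12: [3]},
    "change":  {5: [13, 90], 12: [7]},
}

def phrase_search(index, first, second):
    """Find docs where `second` occurs at position N+1 right after `first`."""
    docs1 = index.get(first, {})
    docs2 = index.get(second, {})
    hits = []
    for doc_id in docs1.keys() & docs2.keys():  # docs containing both terms
        following = set(docs2[doc_id])
        if any(p + 1 in following for p in docs1[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

print(phrase_search(pos_index, "climate", "change"))  # [5]
```

Doc 12 contains both terms but not adjacently (positions 3 and 7), so only doc 5 matches the phrase.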
Q: How large are inverted indexes compared to the original documents?
A: Typically 10-30% of original document size, depending on compression and what’s stored (just doc IDs vs frequencies vs positions). Heavy compression can get to 5-10%. The index trades space for speed—it’s much smaller than documents but enables instant search.
Related terms
- BM25 — ranking algorithm using inverted index
- TF-IDF — weighting scheme for index scores
- Sparse retrieval — retrieval using inverted indexes
- Dense retrieval — vector-based alternative
References
Manning et al. (2008), “Introduction to Information Retrieval”, Cambridge University Press. [Definitive textbook on inverted indexes]
Zobel & Moffat (2006), “Inverted Files for Text Search Engines”, ACM Computing Surveys. [Comprehensive survey]
Büttcher et al. (2010), “Information Retrieval: Implementing and Evaluating Search Engines”, MIT Press. [Implementation details]
Dean (2009), “Challenges in Building Large-Scale Information Retrieval Systems”, WSDM Keynote. [Google’s index at scale]