Definition
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that measures how important a word is to a document within a collection. It multiplies two components: term frequency (how often a word appears in a document) and inverse document frequency (how rare the word is across all documents). Words that appear frequently in one document but rarely across the collection get high TF-IDF scores, making them useful for document characterization, search ranking, and feature extraction.
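A minimal sketch of that computation in plain Python, using a toy corpus, length-normalized TF, and the natural logarithm (the corpus and helper names here are illustrative):

```python
import math

# Toy corpus: each document is a list of lowercase tokens (illustrative only).
docs = [
    "climate change mitigation strategies for the region".split(),
    "climate adaptation and climate resilience in the region".split(),
    "the economics of renewable energy and storage".split(),
]

def tf(term, doc):
    # Length-normalized term frequency: count / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / df), natural log here.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("climate", docs[1], docs))  # ~0.10: frequent in this doc, absent from one doc
print(tf_idf("the", docs[1], docs))      # 0.0: appears in every doc, so IDF = log(1) = 0
```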
Why it matters
TF-IDF is foundational to information retrieval and NLP:
- Search engines — early web search relied heavily on TF-IDF-style term weighting, and it remains relevant in modern search
- Document similarity — compare documents using TF-IDF vectors
- Feature extraction — convert text to numerical features for ML models
- Keyword extraction — identify important terms in documents
- Text classification — traditional ML classifiers use TF-IDF features
- Foundation for BM25 — understanding TF-IDF makes modern ranking functions easier to follow
Despite being 50+ years old, TF-IDF concepts underpin modern search and NLP.
How it works
┌────────────────────────────────────────────────────────────┐
│ TF-IDF │
├────────────────────────────────────────────────────────────┤
│ │
│ THE TF-IDF FORMULA: │
│ ────────────────── │
│ │
│ TF-IDF(t, d, D) = TF(t, d) × IDF(t, D) │
│ │
│ Where: │
│ • t = term │
│ • d = document │
│ • D = document collection │
│ │
│ │
│ TERM FREQUENCY (TF): │
│ ──────────────────── │
│ │
│ How often does the term appear in THIS document? │
│ │
│ Common variants: │
│ │
│ 1. Raw count: │
│ TF(t,d) = count of t in d │
│ │
│ 2. Normalized (divide by doc length): │
│ TF(t,d) = count(t,d) / total_terms(d) │
│ │
│ 3. Log normalized: │
│ TF(t,d) = 1 + log(count(t,d)) if count > 0 │
│ 0 otherwise │
│ │
│ 4. Boolean: │
│ TF(t,d) = 1 if t in d, else 0 │
│ │
│ │
│ INVERSE DOCUMENT FREQUENCY (IDF): │
│ ───────────────────────────────── │
│ │
│ How RARE is this term across ALL documents? │
│ │
│ N │
│ IDF(t) = log ───── │
│ df(t) │
│ │
│ Where: │
│ • N = total number of documents │
│ • df(t) = number of documents containing term t │
│ │
│ Example (N = 10,000 documents): │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Term │ df(t) │ IDF = log(10000/df) │ │
│ ├─────────────┼───────┼────────────────────────────┤ │
│    │ "the"       │ 9,500 │ log(10000/9500) = 0.05     │   │
│ │ "climate" │ 500 │ log(10000/500) = 3.00 │ │
│ │ "mitigation"│ 50 │ log(10000/50) = 5.30 │ │
│ │ "anthropo.."│ 5 │ log(10000/5) = 7.60 │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Rare terms → high IDF → more discriminative │
│ Common terms → low IDF → less useful for ranking │
│ │
│ │
│ WHY MULTIPLY TF × IDF? │
│ ────────────────────── │
│ │
│ Document: "Climate change mitigation strategies for │
│ climate adaptation and climate resilience" │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Term │ TF │ IDF │ TF-IDF │ Interpretation │ │
│ ├────────────┼────┼──────┼────────┼─────────────────┤ │
│ │ "climate" │ 3 │ 3.0 │ 9.0 │ Important topic │ │
│ │ "change" │ 1 │ 2.5 │ 2.5 │ Relevant term │ │
│ │ "mitigation"│ 1 │ 5.3 │ 5.3 │ Key concept │ │
│ │ "for" │ 1 │ 0.1 │ 0.1 │ Stopword ignore │ │
│ │ "and" │ 1 │ 0.05 │ 0.05 │ Stopword ignore │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ "climate" scores highest: frequent AND meaningful │
│ "for/and" score near zero: common words = low IDF │
│ │
│ │
│ TF-IDF VECTOR REPRESENTATION: │
│ ───────────────────────────── │
│ │
│ Document becomes vector in vocabulary space: │
│ │
│ Vocab: [adapt, and, change, climate, for, mitig., ...] │
│ │
│ Doc Vector: [1.2, 0.05, 2.5, 9.0, 0.1, 5.3, ...] │
│ │
│ Properties: │
│ • High-dimensional (vocabulary size, often 30K+) │
│ • Sparse (most terms don't appear in document) │
│ • Interpretable (can see which terms contribute) │
│ │
│ │
│ DOCUMENT SIMILARITY WITH TF-IDF: │
│ ──────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Doc A: [0.2, 0, 3.1, 2.8, 0, 0, ...] │ │
│ │ Doc B: [0.1, 0, 2.9, 3.0, 0, 0, ...] │ │
│ │ Doc C: [0, 5.2, 0, 0, 4.1, 0, ...] │ │
│ │ │ │
│ │ Cosine Similarity: │ │
│ │ │ │
│ │ A · B │ │
│ │ cos = ─────────── = similarity score │ │
│ │ |A| × |B| │ │
│ │ │ │
│ │ sim(A, B) = 0.97 → Very similar topics │ │
│ │ sim(A, C) = 0.05 → Very different topics │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ │
│ TF-IDF LIMITATIONS (FIXED BY BM25): │
│ ─────────────────────────────────── │
│ │
│ 1. No TF saturation: │
│ Term appearing 100x = 100× the weight of 1x │
│ (keyword stuffing vulnerability) │
│ │
│ 2. Poor length normalization: │
│ Longer docs naturally have higher TF │
│ (unfair advantage) │
│ │
│ 3. No semantic understanding: │
│ "car" ≠ "automobile" (vocabulary mismatch) │
│ │
│ BM25 addresses #1 and #2 │
│ Dense retrieval addresses #3 │
│ │
└────────────────────────────────────────────────────────────┘
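The IDF example in the box above uses the natural logarithm; the figures can be checked in a couple of lines (document frequencies are the illustrative ones from the table):

```python
import math

N = 10_000  # total documents in the collection
for term, df in [("the", 9_500), ("climate", 500), ("mitigation", 50), ("anthropo..", 5)]:
    print(f"{term:12s} df={df:5d}  idf={math.log(N / df):.2f}")
# the          df= 9500  idf=0.05
# climate      df=  500  idf=3.00
# mitigation   df=   50  idf=5.30
# anthropo..   df=    5  idf=7.60
```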
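For the vector representation and cosine-similarity steps, one common route is scikit-learn's TfidfVectorizer, sketched here on a toy corpus. Note that scikit-learn applies a smoothed IDF and L2-normalizes each vector by default, so its weights differ slightly from the plain log(N/df) formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "climate change mitigation strategies",              # Doc A
    "strategies for climate mitigation and adaptation",  # Doc B
    "sourdough bread recipes for the home baker",        # Doc C
]

# Build the vocabulary and produce a sparse (n_docs x vocab_size) TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Pairwise cosine similarities between the TF-IDF document vectors.
print(cosine_similarity(X).round(2))
# Expect sim(A, B) to be much higher than sim(A, C) or sim(B, C).
```

Because each row is unit-length under the default L2 norm, the cosine similarity here reduces to a dot product of the two document vectors.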
TF-IDF variants:
| Variant | TF Component | IDF Component | Use Case |
|---|---|---|---|
| Classic | Raw count | log(N/df) | General purpose |
| Normalized | count/length | log(N/df) | Variable-length docs |
| Log-norm | 1 + log(count) | log(N/df) | Reduce high-TF impact |
| Sublinear | 1 + log(count) | log(N/df)+1 | ML feature extraction |
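Each variant in the table is a small change to the weighting function; a plain-Python sketch (helper names are illustrative):

```python
import math

# TF weighting variants from the table above.
def tf_raw(count):
    return float(count)                      # Classic: raw count

def tf_length_normalized(count, doc_len):
    return count / doc_len                   # Normalized: count / document length

def tf_log(count):
    return 1 + math.log(count) if count else 0.0   # Log-norm / sublinear: dampens high counts

def tf_boolean(count):
    return 1.0 if count else 0.0             # Boolean: presence only

# IDF variants from the table above.
def idf_classic(n_docs, df):
    return math.log(n_docs / df)             # log(N / df)

def idf_plus_one(n_docs, df):
    return math.log(n_docs / df) + 1.0       # log(N / df) + 1
```

In scikit-learn, for example, setting sublinear_tf=True on the vectorizer switches TF to the 1 + log(count) form.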
Common questions
Q: When should I use TF-IDF vs BM25?
A: Use BM25 for search ranking; it improves on raw TF-IDF by adding term frequency saturation and document length normalization. Use TF-IDF for feature extraction in ML pipelines, document vectorization, or when you need a simple, interpretable baseline that is easy to implement and explain.
Q: Should I remove stopwords before computing TF-IDF?
A: Often yes. Stopwords (the, is, at, which) have very low IDF anyway, so they contribute little to TF-IDF scores. Removing them reduces dimensionality and computation. However, for phrase search or when word order matters, keep them. Some modern approaches let the IDF naturally downweight stopwords.
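A small scikit-learn illustration of both options on a made-up corpus: passing stop_words="english" strips a built-in stopword list up front, while the default keeps every token and lets IDF downweight the common ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Option 1: drop scikit-learn's built-in English stopword list before weighting.
removed = TfidfVectorizer(stop_words="english").fit(docs)
print(sorted(removed.vocabulary_))   # ['cat', 'dog', 'log', 'mat', 'sat'] -- no 'the', 'on'

# Option 2: keep everything; terms that appear in every document get the minimum IDF.
kept = TfidfVectorizer().fit(docs)
print(sorted(kept.vocabulary_))      # includes 'the', 'on', which end up weighted low
```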
Q: How does TF-IDF compare to word embeddings?
A: TF-IDF creates sparse, interpretable vectors based on term statistics. Word embeddings (Word2Vec, BERT) create dense vectors that capture semantic meaning—they know “king” and “queen” are related. TF-IDF is faster to compute, needs no training, and is more interpretable. Embeddings capture semantics but are slower and less transparent.
Q: What’s the right vocabulary size for TF-IDF?
A: Depends on your use case. For search, include all terms (30K-100K+ vocabulary). For ML features, limit to top N by document frequency (often 5K-20K) to reduce dimensionality. Very rare terms add noise; very common terms add little discriminative power. Some implementations cap minimum/maximum document frequency.
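In scikit-learn terms, those caps correspond to the vectorizer's max_features, min_df, and max_df parameters; the numbers below are placeholders rather than recommendations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=20_000,  # keep only the 20K terms with the highest corpus frequency
    min_df=5,             # ignore terms appearing in fewer than 5 documents (too rare)
    max_df=0.8,           # ignore terms appearing in more than 80% of documents (too common)
)
# X = vectorizer.fit_transform(corpus)   # corpus: an iterable of document strings
```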
Related terms
- BM25 — improved ranking function based on TF-IDF
- Sparse retrieval — retrieval using TF-IDF-like vectors
- Inverted index — data structure for TF-IDF search
- Embedding — dense alternative to TF-IDF vectors
References
Salton & Buckley (1988), “Term-weighting approaches in automatic text retrieval”, Information Processing & Management. [Classic TF-IDF analysis]
Sparck Jones (1972), “A statistical interpretation of term specificity and its application in retrieval”, Journal of Documentation. [Original IDF concept]
Manning et al. (2008), “Introduction to Information Retrieval”, Cambridge University Press. [TF-IDF textbook treatment]
Ramos (2003), “Using TF-IDF to Determine Word Relevance in Document Queries”, ICML. [Practical TF-IDF tutorial]