Definition
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that measures how important a word is to a document within a collection. It multiplies two components: term frequency (how often a word appears in a document) and inverse document frequency (how rare the word is across all documents). Words that appear frequently in one document but rarely across the collection get high TF-IDF scores, making them useful for document characterization, search ranking, and feature extraction.
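A minimal sketch of that computation in plain Python, using a toy corpus, length-normalized TF, and the natural logarithm (the corpus and helper names here are illustrative):

```python
import math

# Toy corpus: each document is a list of lowercase tokens (illustrative only).
docs = [
    "climate change mitigation strategies for the region".split(),
    "climate adaptation and climate resilience in the region".split(),
    "the economics of renewable energy and storage".split(),
]

def tf(term, doc):
    # Length-normalized term frequency: count / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / df), natural log here.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("climate", docs[1], docs))  # ~0.10: frequent in this doc, absent from one doc
print(tf_idf("the", docs[1], docs))      # 0.0: appears in every doc, so IDF = log(1) = 0
```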
Why it matters
TF-IDF is foundational to information retrieval and NLP:
- Search engines — early web search relied heavily on TF-IDF-style term weighting, and it remains relevant in modern search
- Document similarity — compare documents using TF-IDF vectors
- Feature extraction — convert text to numerical features for ML models
- Keyword extraction — identify important terms in documents
- Text classification — traditional ML classifiers use TF-IDF features
- Foundation for BM25 — understanding TF-IDF makes modern ranking functions easier to follow
Despite being 50+ years old, TF-IDF concepts underpin modern search and NLP.
How it works
┌────────────────────────────────────────────────────────────┐
│ TF-IDF │
├────────────────────────────────────────────────────────────┤
│ │
│ THE TF-IDF FORMULA: │
│ ────────────────── │
│ │
│ TF-IDF(t, d, D) = TF(t, d) × IDF(t, D) │
│ │
│ Where: │
│ • t = term │
│ • d = document │
│ • D = document collection │
│ │
│ │
│ TERM FREQUENCY (TF): │
│ ──────────────────── │
│ │
│ How often does the term appear in THIS document? │
│ │
│ Common variants: │
│ │
│ 1. Raw count: │
│ TF(t,d) = count of t in d │
│ │
│ 2. Normalized (divide by doc length): │
│ TF(t,d) = count(t,d) / total_terms(d) │
│ │
│ 3. Log normalized: │
│ TF(t,d) = 1 + log(count(t,d)) if count > 0 │
│ 0 otherwise │
│ │
│ 4. Boolean: │
│ TF(t,d) = 1 if t in d, else 0 │
│ │
│ │
│ INVERSE DOCUMENT FREQUENCY (IDF): │
│ ───────────────────────────────── │
│ │
│ How RARE is this term across ALL documents? │
│ │
│ N │
│ IDF(t) = log ───── │
│ df(t) │
│ │
│ Where: │
│ • N = total number of documents │
│ • df(t) = number of documents containing term t │
│ │
│ Example (N = 10,000 documents): │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Term │ df(t) │ IDF = log(10000/df) │ │
│ ├─────────────┼───────┼────────────────────────────┤ │
│    │ "the"       │ 9,500 │ log(10000/9500) = 0.05     │   │
│ │ "climate" │ 500 │ log(10000/500) = 3.00 │ │
│ │ "mitigation"│ 50 │ log(10000/50) = 5.30 │ │
│ │ "anthropo.."│ 5 │ log(10000/5) = 7.60 │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Rare terms → high IDF → more discriminative │
│ Common terms → low IDF → less useful for ranking │
│ │
│ │
│ WHY MULTIPLY TF × IDF? │
│ ────────────────────── │
│ │
│ Document: "Climate change mitigation strategies for │
│ climate adaptation and climate resilience" │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Term │ TF │ IDF │ TF-IDF │ Interpretation │ │
│ ├────────────┼────┼──────┼────────┼─────────────────┤ │
│ │ "climate" │ 3 │ 3.0 │ 9.0 │ Important topic │ │
│ │ "change" │ 1 │ 2.5 │ 2.5 │ Relevant term │ │
│ │ "mitigation"│ 1 │ 5.3 │ 5.3 │ Key concept │ │
│ │ "for" │ 1 │ 0.1 │ 0.1 │ Stopword ignore │ │
│ │ "and" │ 1 │ 0.05 │ 0.05 │ Stopword ignore │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ "climate" scores highest: frequent AND meaningful │
│ "for/and" score near zero: common words = low IDF │
│ │
│ │
│ TF-IDF VECTOR REPRESENTATION: │
│ ───────────────────────────── │
│ │
│ Document becomes vector in vocabulary space: │
│ │
│ Vocab: [adapt, and, change, climate, for, mitig., ...] │
│ │
│ Doc Vector: [1.2, 0.05, 2.5, 9.0, 0.1, 5.3, ...] │
│ │
│ Properties: │
│ • High-dimensional (vocabulary size, often 30K+) │
│ • Sparse (most terms don't appear in document) │
│ • Interpretable (can see which terms contribute) │
│ │
│ │
│ DOCUMENT SIMILARITY WITH TF-IDF: │
│ ──────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Doc A: [0.2, 0, 3.1, 2.8, 0, 0, ...] │ │
│ │ Doc B: [0.1, 0, 2.9, 3.0, 0, 0, ...] │ │
│ │ Doc C: [0, 5.2, 0, 0, 4.1, 0, ...] │ │
│ │ │ │
│ │ Cosine Similarity: │ │
│ │ │ │
│ │ A · B │ │
│ │ cos = ─────────── = similarity score │ │
│ │ |A| × |B| │ │
│ │ │ │
│ │ sim(A, B) = 0.97 → Very similar topics │ │
│ │ sim(A, C) = 0.05 → Very different topics │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ │
│ TF-IDF LIMITATIONS (FIXED BY BM25): │
│ ─────────────────────────────────── │
│ │
│ 1. No TF saturation: │
│ Term appearing 100x = 100× the weight of 1x │
│ (keyword stuffing vulnerability) │
│ │
│ 2. Poor length normalization: │
│ Longer docs naturally have higher TF │
│ (unfair advantage) │
│ │
│ 3. No semantic understanding: │
│ "car" ≠ "automobile" (vocabulary mismatch) │
│ │
│ BM25 addresses #1 and #2 │
│ Dense retrieval addresses #3 │
│ │
└────────────────────────────────────────────────────────────┘
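The IDF example in the box above uses the natural logarithm; the figures can be checked in a couple of lines (document frequencies are the illustrative ones from the table):

```python
import math

N = 10_000  # total documents in the collection
for term, df in [("the", 9_500), ("climate", 500), ("mitigation", 50), ("anthropo..", 5)]:
    print(f"{term:12s} df={df:5d}  idf={math.log(N / df):.2f}")
# the          df= 9500  idf=0.05
# climate      df=  500  idf=3.00
# mitigation   df=   50  idf=5.30
# anthropo..   df=    5  idf=7.60
```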
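For the vector representation and cosine-similarity steps, one common route is scikit-learn's TfidfVectorizer, sketched here on a toy corpus. Note that scikit-learn applies a smoothed IDF and L2-normalizes each vector by default, so its weights differ slightly from the plain log(N/df) formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "climate change mitigation strategies",              # Doc A
    "strategies for climate mitigation and adaptation",  # Doc B
    "sourdough bread recipes for the home baker",        # Doc C
]

# Build the vocabulary and produce a sparse (n_docs x vocab_size) TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Pairwise cosine similarities between the TF-IDF document vectors.
print(cosine_similarity(X).round(2))
# Expect sim(A, B) to be much higher than sim(A, C) or sim(B, C).
```

Because each row is unit-length under the default L2 norm, the cosine similarity here reduces to a dot product of the two document vectors.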
TF-IDF variants:
| Variant | TF Component | IDF Component | Use Case |
|---|---|---|---|
| Classic | Raw count | log(N/df) | General purpose |
| Normalized | count/length | log(N/df) | Variable-length docs |
| Log-norm | 1 + log(count) | log(N/df) | Reduce high-TF impact |
| Sublinear | 1 + log(count) | log(N/df)+1 | ML feature extraction |
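Each variant in the table is a small change to the weighting function; a plain-Python sketch (helper names are illustrative):

```python
import math

# TF weighting variants from the table above.
def tf_raw(count):
    return float(count)                      # Classic: raw count

def tf_length_normalized(count, doc_len):
    return count / doc_len                   # Normalized: count / document length

def tf_log(count):
    return 1 + math.log(count) if count else 0.0   # Log-norm / sublinear: dampens high counts

def tf_boolean(count):
    return 1.0 if count else 0.0             # Boolean: presence only

# IDF variants from the table above.
def idf_classic(n_docs, df):
    return math.log(n_docs / df)             # log(N / df)

def idf_plus_one(n_docs, df):
    return math.log(n_docs / df) + 1.0       # log(N / df) + 1
```

In scikit-learn, for example, setting sublinear_tf=True on the vectorizer switches TF to the 1 + log(count) form.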
Common questions
Q: When should I use TF-IDF vs BM25?
A: Use BM25 for search ranking; it improves on raw TF-IDF by adding term frequency saturation and document length normalization. Use TF-IDF for feature extraction in ML pipelines, document vectorization, or when you need a simple, interpretable baseline that is easy to implement and explain.
Q: Should I remove stopwords before computing TF-IDF?
A: Often yes. Stopwords (the, is, at, which) have very low IDF anyway, so they contribute little to TF-IDF scores. Removing them reduces dimensionality and computation. However, for phrase search or when word order matters, keep them. Some modern approaches let the IDF naturally downweight stopwords.
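A small scikit-learn illustration of both options on a made-up corpus: passing stop_words="english" strips a built-in stopword list up front, while the default keeps every token and lets IDF downweight the common ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Option 1: drop scikit-learn's built-in English stopword list before weighting.
removed = TfidfVectorizer(stop_words="english").fit(docs)
print(sorted(removed.vocabulary_))   # ['cat', 'dog', 'log', 'mat', 'sat'] -- no 'the', 'on'

# Option 2: keep everything; terms that appear in every document get the minimum IDF.
kept = TfidfVectorizer().fit(docs)
print(sorted(kept.vocabulary_))      # includes 'the', 'on', which end up weighted low
```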
Q: How does TF-IDF compare to word embeddings?
A: TF-IDF creates sparse, interpretable vectors based on term statistics. Word embeddings (Word2Vec, BERT) create dense vectors that capture semantic meaning—they know “king” and “queen” are related. TF-IDF is faster to compute, needs no training, and is more interpretable. Embeddings capture semantics but are slower and less transparent.
Q: What’s the right vocabulary size for TF-IDF?
A: Depends on your use case. For search, include all terms (30K-100K+ vocabulary). For ML features, limit to top N by document frequency (often 5K-20K) to reduce dimensionality. Very rare terms add noise; very common terms add little discriminative power. Some implementations cap minimum/maximum document frequency.
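In scikit-learn terms, those caps correspond to the vectorizer's max_features, min_df, and max_df parameters; the numbers below are placeholders rather than recommendations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=20_000,  # keep only the 20K terms with the highest corpus frequency
    min_df=5,             # ignore terms appearing in fewer than 5 documents (too rare)
    max_df=0.8,           # ignore terms appearing in more than 80% of documents (too common)
)
# X = vectorizer.fit_transform(corpus)   # corpus: an iterable of document strings
```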
Related terms
- BM25 — improved ranking function based on TF-IDF
- Sparse retrieval — retrieval using TF-IDF-like vectors
- Inverted index — data structure for TF-IDF search
- Embedding — dense alternative to TF-IDF vectors
References
Salton & Buckley (1988), “Term-weighting approaches in automatic text retrieval”, Information Processing & Management. [Classic TF-IDF analysis]
Sparck Jones (1972), “A statistical interpretation of term specificity and its application in retrieval”, Journal of Documentation. [Original IDF concept]
Manning et al. (2008), “Introduction to Information Retrieval”, Cambridge University Press. [TF-IDF textbook treatment]
Ramos (2003), “Using TF-IDF to Determine Word Relevance in Document Queries”, ICML. [Practical TF-IDF tutorial]