Definition
Embeddings are dense, continuous vector representations of discrete data (words, sentences, images, etc.). Unlike sparse representations such as one-hot encoding, which need one dimension per vocabulary item, embeddings compress information into fixed-size vectors of a few hundred to a few thousand dimensions, where similar items sit close together in the embedding space. This enables mathematical operations on semantic concepts.
Why it matters
Embeddings are foundational to modern AI systems:
- Semantic similarity — similar meanings map to nearby vectors, enabling similarity search
- Transfer learning — pre-trained embeddings capture general knowledge usable across tasks
- Dimensionality reduction — millions of possible words compress into hundreds of dimensions
- Mathematical operations — vector arithmetic reveals semantic relationships (king - man + woman ≈ queen)
Every RAG system, search engine, and recommendation system relies on embeddings to understand content.
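The vector-arithmetic bullet above can be illustrated with hand-crafted toy vectors (real embeddings are learned by a model and have hundreds of dimensions; the analogy only holds approximately in practice):

```python
import math

# Toy 3-D "embeddings": the dimensions loosely mean (royalty, male, female).
# These are hand-placed for illustration, not learned.
vectors = {
    "king":  [1.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman, computed element-wise
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest vocabulary word to the result, by cosine similarity
best = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(best)  # queen
```

With these toy vectors the arithmetic is exact; with real learned embeddings the result vector only lands *near* "queen", which is why the relation is written with ≈.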
How it works
┌──────────────────────────────────────────────────────────┐
│                    EMBEDDING PROCESS                     │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Input Text ─────→ Tokenize ─────→ Model ─────→ Vector   │
│                                                          │
│  "tax law" → [123, 456] → Neural  → [0.12,               │
│                           Network    0.45,               │
│                                     -0.23,               │
│                                      ...]                │
│                                     (768-D)              │
│                                                          │
│  Semantic space:                                         │
│    "tax law" ●────────● "fiscal regulation"              │
│                 close                                    │
│    "weather" ●                                           │
│       far                                                │
└──────────────────────────────────────────────────────────┘
- Tokenization — input text is split into tokens
- Model encoding — neural network processes tokens
- Pooling — token representations are combined (mean, CLS token, etc.)
- Output vector — fixed-size dense vector (e.g., 384, 768, or 1536 dimensions)
The embedding model is trained so that semantically similar inputs produce vectors with high cosine similarity.
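The four steps above can be sketched in miniature. The vocabulary, token IDs, and 4-D token vectors below are made up for illustration; a real model uses a learned tokenizer and a neural encoder producing e.g. 384–1536 dimensions:

```python
# Hypothetical token vocabulary and per-token vectors (stand-ins for a
# learned tokenizer and neural network).
vocab = {"tax": 123, "law": 456}
token_vectors = {
    123: [0.2, 0.8, -0.4, 0.1],   # "tax"
    456: [0.0, 0.1,  0.0, 0.3],   # "law"
}

def embed(text):
    # 1. Tokenization: split text into token IDs.
    ids = [vocab[t] for t in text.lower().split()]
    # 2. Model encoding: one vector per token (here, a table lookup).
    vecs = [token_vectors[i] for i in ids]
    # 3. Pooling: mean over token vectors...
    # 4. ...yields one fixed-size output vector.
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

print([round(x, 2) for x in embed("tax law")])  # [0.1, 0.45, -0.2, 0.2]
```

Mean pooling is only one option; models such as BERT often take the CLS token's vector instead, as noted above.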
Common questions
Q: What embedding dimensions are common?
A: Typical sizes range from 384 (lightweight) to 1536 (OpenAI) to 4096 (large models). Higher dimensions can capture more nuance but require more storage and computation.
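The storage side of that trade-off is easy to quantify. A back-of-the-envelope calculation, assuming float32 vectors (4 bytes per dimension) and a hypothetical corpus of one million chunks:

```python
# Storage cost per vector and per corpus, assuming float32 storage.
BYTES_PER_FLOAT32 = 4
corpus_size = 1_000_000  # hypothetical: one million stored chunks

for dims in (384, 768, 1536, 4096):
    per_vector = dims * BYTES_PER_FLOAT32          # bytes for one vector
    total_gb = per_vector * corpus_size / 1e9      # GB for the whole corpus
    print(f"{dims:>4} dims: {per_vector:>6} B/vector, "
          f"{total_gb:.2f} GB per 1M vectors")
```

So moving from 384 to 1536 dimensions quadruples both storage and the work done per similarity comparison.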
Q: How do sentence embeddings differ from word embeddings?
A: Word embeddings (Word2Vec, GloVe) represent individual words. Sentence embeddings (from models like sentence-transformers) capture entire sentence meaning, handling context and word order.
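One way to see the difference: averaging word vectors (a simple bag-of-words sentence representation) is blind to word order, so order-dependent meanings collapse to the same vector. A toy sketch with made-up 2-D word vectors:

```python
# Made-up word vectors; real ones would come from Word2Vec, GloVe, etc.
word_vecs = {"man": [1.0, 0.0], "bites": [0.0, 1.0], "dog": [0.5, 0.5]}

def mean_pool(sentence):
    # Average the word vectors: the result ignores word order entirely.
    vecs = [word_vecs[w] for w in sentence.split()]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

# Identical vectors despite opposite meanings:
print(mean_pool("man bites dog") == mean_pool("dog bites man"))  # True
```

A contextual sentence encoder (e.g., Sentence-BERT) processes the whole sequence and can assign these two sentences different vectors.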
Q: What are bilingual/multilingual embeddings?
A: These models map multiple languages into a shared embedding space, so “legal advice” and “juridisch advies” produce similar vectors, enabling cross-lingual search.
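A toy illustration of cross-lingual search in a shared space, with hand-placed 2-D vectors standing in for a real multilingual model's output:

```python
import math

# Hand-placed vectors; a real multilingual model learns this alignment.
shared_space = {
    "legal advice":     [0.90, 0.10],  # English
    "juridisch advies": [0.88, 0.12],  # Dutch, near its English counterpart
    "weather":          [0.10, 0.90],  # unrelated topic, far away
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Rank all entries by similarity to the English query.
query = shared_space["legal advice"]
ranked = sorted(shared_space,
                key=lambda k: cosine(query, shared_space[k]), reverse=True)
print(ranked)  # ['legal advice', 'juridisch advies', 'weather']
```

The Dutch phrase ranks above the unrelated English one, which is exactly what makes cross-lingual retrieval work.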
Q: Do embeddings drift over time?
A: A deployed embedding model is static, so stored vectors do not drift on their own. However, if you switch or retrain the embedding model, all stored vectors must be regenerated, since different models produce incompatible vector spaces.
Related terms
- RAG — uses embeddings for retrieval
- Vector Database — stores and searches embeddings
- Semantic Similarity — measured via embedding distance
- LLM — uses embeddings internally
References
Mikolov et al. (2013), “Efficient Estimation of Word Representations in Vector Space”, arXiv. [40,000+ citations]
Reimers & Gurevych (2019), “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, EMNLP. [8,000+ citations]
Pennington et al. (2014), “GloVe: Global Vectors for Word Representation”, EMNLP. [35,000+ citations]
Muennighoff et al. (2022), “MTEB: Massive Text Embedding Benchmark”, arXiv. [700+ citations]