Definition
Tokenization is the process of converting raw text into discrete units called tokens that language models can process. These tokens can be words, subwords, characters, or byte-level units. Modern LLMs typically use subword tokenization (like BPE or SentencePiece), which balances vocabulary size with the ability to handle rare and compound words by breaking them into meaningful pieces.
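A quick way to see this in practice is to encode a sentence and decode each token ID back into its text piece. The sketch below assumes OpenAI's tiktoken library, which is not required by the definition; any subword tokenizer shows the same behavior.

```python
# Sketch: encode a sentence with a BPE tokenizer and inspect the pieces.
# Requires the tiktoken package (pip install tiktoken); the exact pieces and
# IDs depend on the learned vocabulary, so treat the output as illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

ids = enc.encode("Tokenization handles unfamiliar words")
pieces = [enc.decode([i]) for i in ids]

print(ids)     # list of integer token IDs
print(pieces)  # subword pieces: common words stay whole, rare words split up
```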
Why it matters
Tokenization is the critical first step in all language model operations:
- Model input — LLMs don’t see text; they see sequences of token IDs
- Context limits — token count (not character count) determines what fits in the context window
- Cost calculation — API pricing is based on tokens consumed
- Multilingual support — tokenizer design affects how efficiently different languages are processed
- Model capabilities — poor tokenization of code, math, or non-English text degrades performance
The choice of tokenizer fundamentally shapes what a model can understand and how efficiently it processes text.
How it works
┌────────────────────────────────────────────────────────────┐
│                        TOKENIZATION                        │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Input: "Tokenization handles unfamiliar words"             │
│                         │                                  │
│                         ▼                                  │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Subword tokenization (BPE example):                    │ │
│ │                                                        │ │
│ │ "Token" "ization" " handles" " un" "familiar" " words" │ │
│ │                                                        │ │
│ │ Token IDs: [14402, 2065, 17082, 653, 74, 4339]         │ │
│ └────────────────────────────────────────────────────────┘ │
│                                                            │
│ Different tokenizers for the same text:                    │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ GPT-4: "Hello" → [15496]                               │ │
│ │ Llama: "Hello" → [12345]                               │ │
│ │ Different models = different token IDs                 │ │
│ └────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────┘
- Pre-processing — text is normalized (Unicode, whitespace handling)
- Segmentation — text is split using learned rules (BPE, WordPiece, etc.)
- Mapping — each token is converted to a unique integer ID
- Special tokens — markers like <BOS>, <EOS>, and <PAD> are added as needed (see the sketch below)
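The sketch below walks through segmentation, mapping, and special tokens using the Hugging Face transformers library and BERT's WordPiece tokenizer; both are assumptions for illustration, and BERT's markers are [CLS] and [SEP] rather than <BOS> and <EOS>.

```python
# Sketch of the pipeline steps above with an off-the-shelf tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization handles unfamiliar words"

# Pre-processing: this uncased model lowercases and normalizes inside tokenize().
# Segmentation: split into learned subword pieces (no special tokens yet).
pieces = tokenizer.tokenize(text)

# Mapping: each piece becomes an integer ID from the vocabulary.
ids = tokenizer.convert_tokens_to_ids(pieces)

# Special tokens: encode() wraps the sequence with this model's markers,
# [CLS] and [SEP] here; other models use <BOS>/<EOS>/<PAD> style names.
ids_with_special = tokenizer.encode(text)

print(pieces)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids_with_special))
```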
Common algorithms:
- BPE (Byte Pair Encoding) — GPT models; merges frequent character pairs (toy sketch after this list)
- WordPiece — BERT, similar to BPE with different scoring
- SentencePiece — language-agnostic, treats text as raw Unicode
- tiktoken — OpenAI’s fast BPE implementation (a library rather than a separate algorithm)
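The core of BPE training is small enough to sketch. The toy function below follows the merge loop described in Sennrich et al. (2016) on a handful of made-up words; real tokenizers operate on bytes, handle whitespace markers, and are heavily optimized.

```python
# Toy BPE training: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_bpe(words, num_merges):
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, replacing the best pair with a merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe(["low", "lower", "lowest", "newer", "wider"], 5)
print(merges)        # learned merge rules, e.g. ('l', 'o') then ('lo', 'w')
print(list(corpus))  # words segmented into the learned subword units
```

At encoding time, the learned merges are applied to new text in the order they were learned, which is how unseen words end up segmented into known subwords.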
Common questions
Q: Why not just split on spaces and words?
A: Word-level tokenization creates huge vocabularies and can’t handle unknown words. Subword tokenization balances vocabulary size (~50K-100K tokens) with coverage of any text, including rare words and neologisms.
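A toy comparison, with a made-up four-word vocabulary on the word-level side and tiktoken (assumed installed) on the subword side:

```python
import tiktoken

# Word-level splitting with a tiny made-up vocabulary: unknown words collapse
# to <UNK> and their content is lost.
word_vocab = {"the", "model", "reads", "text"}
sentence = "the model reads hyperconnectivity".split()
print([w if w in word_vocab else "<UNK>" for w in sentence])

# Byte-level BPE never needs <UNK>: the unfamiliar word falls back to
# smaller known pieces instead.
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([i]) for i in enc.encode("hyperconnectivity")])
```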
Q: Why do non-English languages use more tokens?
A: Tokenizers trained primarily on English may split non-English words into more pieces. This means the same content costs more tokens and uses more context window in other languages.
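One way to observe this, assuming tiktoken and a few arbitrary sample sentences; exact counts vary by tokenizer and text:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "The weather is nice today.",
    "German": "Das Wetter ist heute schön.",
    "Japanese": "今日はいい天気ですね。",
}
for lang, text in samples.items():
    # Tokens per character is a rough proxy for how efficiently the
    # tokenizer handles a language; lower is better.
    n = len(enc.encode(text))
    print(f"{lang}: {n} tokens for {len(text)} characters")
```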
Q: How do I count tokens before sending to an API?
A: Use the model’s tokenizer library (e.g., tiktoken for OpenAI models), as in the sketch below. As a rough estimate, 1 token ≈ 4 characters of English text. Most providers also offer token-counting tools.
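A minimal counting sketch with tiktoken; the model name is only an example, and model-to-tokenizer mappings change over time:

```python
import tiktoken

# encoding_for_model looks up the tokenizer for a given OpenAI model name.
enc = tiktoken.encoding_for_model("gpt-4")

prompt = "Summarize the following report in three bullet points: ..."
n_tokens = len(enc.encode(prompt))
print(n_tokens)  # compare against the model's context window and pricing
```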
Q: Can tokenization cause model errors?
A: Yes. Unusual tokenization of numbers, code, or special characters can confuse models. Understanding tokenization helps debug unexpected behaviors.
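A quick way to inspect suspicious splits, assuming tiktoken; the sample strings are arbitrary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Numbers are usually split into short digit chunks rather than one token
# per value, which helps explain arithmetic and formatting mistakes.
for s in ["12345678", "3.14159", "x==1 and y!=2"]:
    print(repr(s), "->", [enc.decode([i]) for i in enc.encode(s)])
```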
Related terms
- LLM — models that use tokenization for input
- Context Window — measured in tokens
- Embeddings — vector representations of tokens
- BPE — common tokenization algorithm
References
Sennrich et al. (2016), “Neural Machine Translation of Rare Words with Subword Units”, ACL. [11,000+ citations]
Kudo & Richardson (2018), “SentencePiece: A simple and language independent subword tokenizer and detokenizer”, EMNLP. [3,000+ citations]
Radford et al. (2019), “Language Models are Unsupervised Multitask Learners”, OpenAI. [GPT-2 paper, 15,000+ citations]
Rust et al. (2021), “How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models”, ACL. [400+ citations]