Definition
The Transformer is a neural network architecture introduced in 2017 that revolutionized natural language processing. Unlike previous sequential models (RNNs, LSTMs), Transformers process all input tokens simultaneously using self-attention mechanisms, enabling massive parallelization and capturing long-range dependencies in text.
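The heart of that mechanism, scaled dot-product self-attention, fits in a few lines. The sketch below uses NumPy, with random projection matrices standing in for learned weights; it is illustrative only, not a production implementation.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention over X of shape (seq_len, d_model)."""
    d = X.shape[-1]
    rng = np.random.default_rng(0)
    # Random projections stand in for the learned W_q, W_k, W_v matrices.
    W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)                 # every token scores every token at once
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # row-wise softmax
    return weights @ V                            # contextualized token vectors

X = np.random.default_rng(1).standard_normal((5, 16))  # 5 tokens, d_model = 16
print(self_attention(X).shape)                         # (5, 16)
```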
Why it matters
The Transformer architecture is the foundation of virtually all modern large language models, including GPT, BERT, Claude, and PaLM. Three abilities set it apart:
- Scale efficiently — parallel processing makes it practical to train models with billions of parameters
- Capture context — attention mechanisms relate any word to any other word regardless of distance
- Transfer knowledge — pre-trained Transformers can be fine-tuned for countless downstream tasks
Together, these make it essential for building AI systems that understand and generate natural language.
How it works
┌─────────────────────────────────────────────────────────┐
│                       TRANSFORMER                       │
├─────────────────────────────────────────────────────────┤
│  Input → Embedding + Position → [ENCODER] → [DECODER]   │
│                                     │           │       │
│                                     ▼           ▼       │
│                              Self-Attention  Cross-Attn │
│                                     │           │       │
│                               Feed-Forward    Output    │
└─────────────────────────────────────────────────────────┘
- Input embedding — tokens converted to dense vectors
- Positional encoding — position information added, since parallel processing would otherwise discard token order (see the first sketch after this list)
- Self-attention layers — each token attends to all others to build contextual representations
- Feed-forward networks — a position-wise MLP transforms each token's attention output (see the second sketch after this list)
- Output generation — decoder produces final sequence
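To make the positional-encoding step concrete, here is a minimal sketch of the fixed sinusoidal encodings from the original paper. It assumes an even d_model and toy embeddings; learned positional embeddings are a common alternative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings as in Vaswani et al. (2017); d_model must be even."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions get cosine
    return pe

embeddings = np.random.default_rng(0).standard_normal((10, 64))  # toy token embeddings
inputs = embeddings + sinusoidal_positional_encoding(10, 64)     # position-aware inputs
```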
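And a sketch of the position-wise feed-forward network: the same two-layer MLP, with a ReLU in between, applied to every token independently. Random weights again stand in for learned parameters.

```python
import numpy as np

def feed_forward(X, d_ff=256):
    """Position-wise FFN: FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied per token."""
    d = X.shape[-1]
    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((d, d_ff)) * 0.02, np.zeros(d_ff)
    W2, b2 = rng.standard_normal((d_ff, d)) * 0.02, np.zeros(d)
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2   # ReLU between the two projections

out = feed_forward(np.random.default_rng(1).standard_normal((5, 64)))
print(out.shape)  # (5, 64)
```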
Common questions
Q: Why did Transformers replace RNNs and LSTMs?
A: RNNs process tokens sequentially, creating bottlenecks for long sequences and preventing parallelization across the sequence. Transformers process all tokens at once, enabling faster training and better long-range dependency modeling.
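A toy contrast makes the difference visible; the shapes and weights below are arbitrary, chosen only to show where the sequential bottleneck sits.

```python
import numpy as np

X = np.random.default_rng(0).standard_normal((512, 64))   # 512 tokens, d_model = 64

# RNN-style: each hidden state depends on the previous one, so the
# 512 update steps must run strictly in order.
W = np.random.default_rng(1).standard_normal((64, 64)) * 0.01
h = np.zeros(64)
for x in X:
    h = np.tanh(W @ h + x)

# Transformer-style: all pairwise token interactions fall out of a
# single matrix product, which parallel hardware computes at once.
scores = X @ X.T / np.sqrt(64)                             # (512, 512) in one shot
```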
Q: What are encoder-only vs decoder-only Transformers?
A: Encoder-only models (like BERT) are optimized for understanding tasks (classification, NER). Decoder-only models (like GPT) are optimized for generation. The original Transformer used both.
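In practice the difference largely comes down to the attention mask. A minimal sketch, with hypothetical scores, of the causal mask that decoder-only models apply before the softmax:

```python
import numpy as np

seq_len = 6
scores = np.random.default_rng(0).standard_normal((seq_len, seq_len))

# Decoder-only (GPT-style): a causal mask hides future positions, so
# token i can only attend to tokens 0..i — required for generation.
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
decoder_scores = np.where(causal_mask, -np.inf, scores)

# Encoder-only (BERT-style): no mask; every token sees the whole
# sequence in both directions, which suits understanding tasks.
encoder_scores = scores
```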
Q: How do Transformers handle sequence order without recurrence?
A: Positional encodings are added to input embeddings, providing position information that the model learns to use during attention.
Related terms
- LLM — large language models built on Transformer architecture
- Attention Mechanism — the core innovation enabling Transformers
- Self-Attention — mechanism allowing tokens to attend to each other
- Multi-Head Attention — parallel attention for richer representations
References
Vaswani et al. (2017), “Attention Is All You Need”, NeurIPS. [130,000+ citations]
Devlin et al. (2019), “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL. [90,000+ citations]
Lin et al. (2022), “A Survey of Transformers”, AI Open. [2,500+ citations]
Wolf et al. (2020), “Transformers: State-of-the-Art Natural Language Processing”, EMNLP. [7,500+ citations]