
Transformer Architecture

A neural network architecture using self-attention to process sequential data in parallel, powering modern LLMs.

Also known as: Transformer, Transformer model, Transformer neural network

Definition

The Transformer is a neural network architecture introduced in the 2017 paper “Attention Is All You Need” (Vaswani et al.) that revolutionized natural language processing. Unlike earlier sequential models (RNNs, LSTMs), Transformers process all input tokens simultaneously using self-attention mechanisms, enabling massive parallelization and capturing long-range dependencies in text.
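
To make the self-attention step concrete, here is a minimal NumPy sketch of scaled dot-product attention over a short sequence. The single-head setup, random weights, and tiny dimensions are illustrative simplifications, not values from the original paper.

    import numpy as np

    def self_attention(X, W_q, W_k, W_v):
        """Scaled dot-product self-attention over a whole sequence at once.

        X: (seq_len, d_model) token representations
        W_q, W_k, W_v: (d_model, d_k) projection matrices
        """
        Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project every token in parallel
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # similarity of each token to every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
        return weights @ V                               # context-weighted mix of value vectors

    # Toy usage: 4 tokens, model and head dimension 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)        # (4, 8)

Every row of the output is a contextual representation built from all positions at once, which is what lets the model capture long-range dependencies without recurrence.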

Why it matters

The Transformer architecture is the foundation of virtually all modern large language models, including GPT, BERT, Claude, and PaLM. Its key strengths are the ability to:

  • Scale efficiently — parallel processing enables training on billions of parameters
  • Capture context — attention mechanisms relate any word to any other word regardless of distance
  • Transfer knowledge — pre-trained Transformers can be fine-tuned for countless downstream tasks

These capabilities make it essential for building AI systems that understand and generate natural language.

How it works

┌─────────────────────────────────────────────────────────┐
│                    TRANSFORMER                          │
├─────────────────────────────────────────────────────────┤
│  Input → Embedding + Position → [ENCODER] → [DECODER]  │
│                                      │          │       │
│                                      ▼          ▼       │
│                              Self-Attention  Cross-Attn │
│                                      │          │       │
│                                 Feed-Forward  Output    │
└─────────────────────────────────────────────────────────┘
  1. Input embedding — tokens converted to dense vectors
  2. Positional encoding — position information added (since processing is parallel; see the sketch after this list)
  3. Self-attention layers — each token attends to all others to build contextual representations
  4. Feed-forward networks — transform attention outputs
  5. Output generation — decoder produces final sequence
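
Putting steps 1 through 4 together, the following NumPy sketch builds one simplified encoder-style pass: sinusoidal positional encoding (the scheme used in the original paper), single-head self-attention, and a position-wise feed-forward network. It reuses the self_attention helper from the sketch in the Definition section; layer normalization, multiple heads, and learned token embeddings are omitted for brevity, and all sizes are illustrative.

    import numpy as np

    def positional_encoding(seq_len, d_model):
        """Sinusoidal positional encoding: sin on even dimensions, cos on odd ones."""
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

    def feed_forward(x, W1, b1, W2, b2):
        """Position-wise feed-forward network: two linear layers with a ReLU in between."""
        return np.maximum(0, x @ W1 + b1) @ W2 + b2

    seq_len, d_model, d_ff = 6, 16, 32
    rng = np.random.default_rng(1)

    embeddings = rng.normal(size=(seq_len, d_model))          # step 1: stand-in token embeddings
    x = embeddings + positional_encoding(seq_len, d_model)    # step 2: inject order information
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    # self_attention is the helper defined in the sketch under "Definition"
    x = x + self_attention(x, W_q, W_k, W_v)                  # step 3: residual + self-attention
    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
    x = x + feed_forward(x, W1, b1, W2, b2)                   # step 4: residual + feed-forward
    print(x.shape)                                            # (6, 16)

Step 5 (output generation) would run a stack of decoder blocks that add cross-attention over these encoder outputs to produce the final sequence.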

Common questions

Q: Why did Transformers replace RNNs and LSTMs?

A: RNNs process tokens sequentially, creating bottlenecks for long sequences and making parallelization impossible. Transformers process all tokens at once, enabling faster training and better long-range dependency modeling.
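
As an illustrative contrast (not code from the cited papers): an RNN has to loop over positions because each hidden state depends on the previous one, whereas the self-attention shown earlier is a single batch of matrix multiplications over the whole sequence.

    import numpy as np

    def rnn_forward(X, W_x, W_h):
        """Sequential recurrence: step t cannot begin until step t-1 has finished."""
        h = np.zeros(W_h.shape[0])
        states = []
        for x_t in X:                          # unavoidable loop over time steps
            h = np.tanh(x_t @ W_x + h @ W_h)   # each state depends on the previous one
            states.append(h)
        return np.stack(states)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(5, 8))
    print(rnn_forward(X, rng.normal(size=(8, 8)), rng.normal(size=(8, 8))).shape)  # (5, 8)

    # By contrast, self_attention(X, W_q, W_k, W_v) from the earlier sketch touches
    # every position in one set of matrix multiplications, so the whole sequence can
    # be processed in parallel on modern hardware.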

Q: What are encoder-only vs decoder-only Transformers?

A: Encoder-only models (like BERT) are optimized for understanding tasks (classification, NER). Decoder-only models (like GPT) are optimized for generation. The original Transformer used both.
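
One concrete way to see the difference is the attention mask (an illustrative sketch, not any particular library's API): decoder-only models apply a causal mask so each token attends only to itself and earlier positions, while encoder-only models leave attention unrestricted in both directions.

    import numpy as np

    seq_len = 5

    # Encoder-only (BERT-style): no restriction, every token can attend to every other.
    bidirectional_mask = np.zeros((seq_len, seq_len))

    # Decoder-only (GPT-style): a causal mask puts -inf on future positions so that,
    # when added to the attention scores before the softmax, those positions receive
    # zero attention weight.
    causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

    print(causal_mask)   # row t masks columns t+1.. with -inf, keeps columns 0..t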

Q: How do Transformers handle sequence order without recurrence?

A: Positional encodings are added to input embeddings, providing position information that the model learns to use during attention.


References

Vaswani et al. (2017), “Attention Is All You Need”, NeurIPS. [130,000+ citations]

Devlin et al. (2019), “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL. [90,000+ citations]

Lin et al. (2022), “A Survey of Transformers”, AI Open. [2,500+ citations]

Wolf et al. (2020), “Transformers: State-of-the-Art Natural Language Processing”, EMNLP. [7,500+ citations]