Definition
An attention mechanism is a neural-network component that lets a model dynamically focus on the most relevant parts of the input sequence when producing each part of the output. Rather than compressing the entire input into a single fixed-size representation, attention allows the model to selectively “attend to” different input elements, weighting them by their relevance to the element currently being generated.
Why it matters
Attention mechanisms solve fundamental limitations of earlier sequence models:
- Handling long sequences — attention directly connects distant elements without information loss through many steps
- Interpretability — attention weights reveal which inputs influenced each output, aiding debugging and trust
- Parallelization — attention computations can run simultaneously, unlike sequential RNNs
- Dynamic context — the model learns what to focus on rather than using fixed patterns
Attention is the core innovation behind Transformers and modern language models.
How it works
┌──────────────────────────────────────────────────────┐
│                 ATTENTION MECHANISM                  │
├──────────────────────────────────────────────────────┤
│                                                      │
│  Query (Q) ────┐                                     │
│                ├──→ Score ──→ Softmax ──→ Weights    │
│  Key (K) ──────┘                             │       │
│                                              ▼       │
│  Value (V) ──────────────────────────→ Weighted Sum  │
│                                              │       │
│                                              ▼       │
│                                           Output     │
└──────────────────────────────────────────────────────┘
- Query, Key, Value — input is transformed into three representations
- Scoring — queries are compared against keys to compute relevance scores
- Softmax — scores are normalized to sum to 1 (attention weights)
- Weighted combination — values are combined using attention weights
- Output — a contextually informed representation for each position
The formula: Attention(Q,K,V) = softmax(QK^T / √d_k) × V
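
A minimal NumPy sketch of this formula (the shapes and random toy inputs below are illustrative assumptions, not taken from any particular library or model):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays.

        Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
        Returns the attended output (n_queries, d_v) and the attention weights.
        """
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # relevance of each key to each query
        scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax numerically
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
        return weights @ V, weights                     # weighted sum of values

    # Toy example: 3 queries attending over 4 key/value pairs.
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 8))
    K = rng.normal(size=(4, 8))
    V = rng.normal(size=(4, 16))
    output, weights = scaled_dot_product_attention(Q, K, V)
    print(output.shape)          # (3, 16)
    print(weights.sum(axis=-1))  # [1. 1. 1.]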
Common questions
Q: What is the difference between attention and self-attention?
A: Standard (cross-)attention computes relevance between two different sequences (e.g., decoder states attending over encoder outputs). Self-attention computes relevance within a single sequence—each element attends to all other elements in the same sequence.
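
A rough sketch of the difference, assuming small random projection matrices purely for illustration (in a real model W_q, W_k, W_v are learned):

    import numpy as np

    def attend(Q, K, V):
        # Same scaled dot-product attention as in the sketch above.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(1)
    d_model, d_k = 16, 8
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

    x = rng.normal(size=(5, d_model))       # one sequence of 5 token embeddings
    memory = rng.normal(size=(7, d_model))  # a second sequence (e.g., encoder outputs)

    # Self-attention: queries, keys, and values all come from the same sequence x.
    self_out = attend(x @ W_q, x @ W_k, x @ W_v)

    # Cross-attention: queries come from x, keys/values from the other sequence.
    cross_out = attend(x @ W_q, memory @ W_k, memory @ W_v)

    print(self_out.shape, cross_out.shape)  # (5, 8) (5, 8)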
Q: Why divide by √d_k in the attention formula?
A: This “scaled dot-product attention” prevents dot products from growing too large in high dimensions, which would push softmax into regions with vanishing gradients.
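
A quick numerical illustration, with random unit-variance vectors standing in for queries and keys purely as an assumption for the demo:

    import numpy as np

    # For unit-variance vectors, q·k has variance d_k, so its typical magnitude
    # grows like sqrt(d_k); dividing by sqrt(d_k) keeps scores near unit scale.
    rng = np.random.default_rng(2)
    for d_k in (16, 256, 4096):
        q = rng.normal(size=(2_000, d_k))
        k = rng.normal(size=(2_000, d_k))
        raw = (q * k).sum(axis=-1)           # unscaled dot products
        print(d_k, f"{raw.std():.1f}", f"{(raw / np.sqrt(d_k)).std():.2f}")
    # The raw std grows roughly as sqrt(d_k) (about 4, 16, 64); the scaled std stays near 1.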
Q: Can attention be visualized?
A: Yes, attention weights can be plotted as heatmaps showing which input tokens each output token attended to, providing interpretability.
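
A minimal matplotlib sketch of such a heatmap; the weights and token labels below are made up for illustration (in practice they would be read out of a trained model):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical attention weights: 4 output tokens attending over 5 input tokens.
    weights = np.array([
        [0.70, 0.10, 0.10, 0.05, 0.05],
        [0.10, 0.60, 0.20, 0.05, 0.05],
        [0.05, 0.10, 0.70, 0.10, 0.05],
        [0.05, 0.05, 0.10, 0.20, 0.60],
    ])
    inputs = ["the", "cat", "sat", "on", "mat"]  # illustrative source tokens
    outputs = ["le", "chat", "s'est", "assis"]   # illustrative target tokens

    fig, ax = plt.subplots()
    im = ax.imshow(weights, cmap="viridis")      # rows: queries, columns: keys
    ax.set_xticks(range(len(inputs)))
    ax.set_xticklabels(inputs)
    ax.set_yticks(range(len(outputs)))
    ax.set_yticklabels(outputs)
    ax.set_xlabel("input tokens (keys)")
    ax.set_ylabel("output tokens (queries)")
    fig.colorbar(im, ax=ax, label="attention weight")
    plt.show()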
Related terms
- Transformer Architecture — built entirely on attention mechanisms
- Self-Attention — attention within a single sequence
- Multi-Head Attention — parallel attention for diverse representations
- LLM — language models powered by attention
References
Bahdanau et al. (2015), “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR. [35,000+ citations]
Vaswani et al. (2017), “Attention Is All You Need”, NeurIPS. [130,000+ citations]
Luong et al. (2015), “Effective Approaches to Attention-based Neural Machine Translation”, EMNLP. [12,000+ citations]
Galassi et al. (2020), “Attention in Natural Language Processing”, IEEE TNNLS. [1,000+ citations]