Definition
An attention mechanism is a neural-network component that lets a model dynamically focus on the most relevant parts of the input sequence when producing each part of the output. Rather than compressing the entire input into a single fixed-size representation, attention allows the model to selectively “attend to” different input elements, weighting them by their relevance to the element currently being generated.
Why it matters
Attention mechanisms solve fundamental limitations of earlier sequence models:
- Handling long sequences — attention directly connects distant elements without information loss through many steps
- Interpretability — attention weights reveal which inputs influenced each output, aiding debugging and trust
- Parallelization — attention computations can run simultaneously, unlike sequential RNNs
- Dynamic context — the model learns what to focus on rather than using fixed patterns
Attention is the core innovation behind Transformers and modern language models.
How it works
┌──────────────────────────────────────────────────────┐
│                 ATTENTION MECHANISM                  │
├──────────────────────────────────────────────────────┤
│                                                      │
│  Query (Q) ────┐                                     │
│                ├──→ Score ──→ Softmax ──→ Weights    │
│  Key (K) ──────┘                             │       │
│                                              ▼       │
│  Value (V) ──────────────────────────→ Weighted Sum  │
│                                              │       │
│                                              ▼       │
│                                           Output     │
└──────────────────────────────────────────────────────┘
- Query, Key, Value — input is transformed into three representations
- Scoring — queries are compared against keys to compute relevance scores
- Softmax — scores are normalized to sum to 1 (attention weights)
- Weighted combination — values are combined using attention weights
- Output — a contextually informed representation for each position
The formula: Attention(Q,K,V) = softmax(QK^T / √d_k) × V
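
A minimal NumPy sketch of this formula (the shapes and random toy inputs below are illustrative assumptions, not taken from any particular library or model):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays.

        Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
        Returns the attended output (n_queries, d_v) and the attention weights.
        """
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # relevance of each key to each query
        scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax numerically
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
        return weights @ V, weights                     # weighted sum of values

    # Toy example: 3 queries attending over 4 key/value pairs.
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 8))
    K = rng.normal(size=(4, 8))
    V = rng.normal(size=(4, 16))
    output, weights = scaled_dot_product_attention(Q, K, V)
    print(output.shape)          # (3, 16)
    print(weights.sum(axis=-1))  # [1. 1. 1.]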
Common questions
Q: What is the difference between attention and self-attention?
A: Standard (cross-)attention computes relevance between two different sequences (e.g., decoder states attending over encoder outputs). Self-attention computes relevance within a single sequence—each element attends to all other elements in the same sequence.
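
A rough sketch of the difference, assuming small random projection matrices purely for illustration (in a real model W_q, W_k, W_v are learned):

    import numpy as np

    def attend(Q, K, V):
        # Same scaled dot-product attention as in the sketch above.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(1)
    d_model, d_k = 16, 8
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

    x = rng.normal(size=(5, d_model))       # one sequence of 5 token embeddings
    memory = rng.normal(size=(7, d_model))  # a second sequence (e.g., encoder outputs)

    # Self-attention: queries, keys, and values all come from the same sequence x.
    self_out = attend(x @ W_q, x @ W_k, x @ W_v)

    # Cross-attention: queries come from x, keys/values from the other sequence.
    cross_out = attend(x @ W_q, memory @ W_k, memory @ W_v)

    print(self_out.shape, cross_out.shape)  # (5, 8) (5, 8)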
Q: Why divide by √d_k in the attention formula?
A: This “scaled dot-product attention” prevents dot products from growing too large in high dimensions, which would push softmax into regions with vanishing gradients.
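
A quick numerical illustration, with random unit-variance vectors standing in for queries and keys purely as an assumption for the demo:

    import numpy as np

    # For unit-variance vectors, q·k has variance d_k, so its typical magnitude
    # grows like sqrt(d_k); dividing by sqrt(d_k) keeps scores near unit scale.
    rng = np.random.default_rng(2)
    for d_k in (16, 256, 4096):
        q = rng.normal(size=(2_000, d_k))
        k = rng.normal(size=(2_000, d_k))
        raw = (q * k).sum(axis=-1)           # unscaled dot products
        print(d_k, f"{raw.std():.1f}", f"{(raw / np.sqrt(d_k)).std():.2f}")
    # The raw std grows roughly as sqrt(d_k) (about 4, 16, 64); the scaled std stays near 1.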
Q: Can attention be visualized?
A: Yes, attention weights can be plotted as heatmaps showing which input tokens each output token attended to, providing interpretability.
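
A minimal matplotlib sketch of such a heatmap; the weights and token labels below are made up for illustration (in practice they would be read out of a trained model):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical attention weights: 4 output tokens attending over 5 input tokens.
    weights = np.array([
        [0.70, 0.10, 0.10, 0.05, 0.05],
        [0.10, 0.60, 0.20, 0.05, 0.05],
        [0.05, 0.10, 0.70, 0.10, 0.05],
        [0.05, 0.05, 0.10, 0.20, 0.60],
    ])
    inputs = ["the", "cat", "sat", "on", "mat"]  # illustrative source tokens
    outputs = ["le", "chat", "s'est", "assis"]   # illustrative target tokens

    fig, ax = plt.subplots()
    im = ax.imshow(weights, cmap="viridis")      # rows: queries, columns: keys
    ax.set_xticks(range(len(inputs)))
    ax.set_xticklabels(inputs)
    ax.set_yticks(range(len(outputs)))
    ax.set_yticklabels(outputs)
    ax.set_xlabel("input tokens (keys)")
    ax.set_ylabel("output tokens (queries)")
    fig.colorbar(im, ax=ax, label="attention weight")
    plt.show()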
Related terms
- Transformer Architecture — built entirely on attention mechanisms
- Self-Attention — attention within a single sequence
- Multi-Head Attention — parallel attention for diverse representations
- LLM — language models powered by attention
References
Bahdanau et al. (2015), “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR. [35,000+ citations]
Vaswani et al. (2017), “Attention Is All You Need”, NeurIPS. [130,000+ citations]
Luong et al. (2015), “Effective Approaches to Attention-based Neural Machine Translation”, EMNLP. [12,000+ citations]
Galassi et al. (2020), “Attention in Natural Language Processing”, IEEE TNNLS. [1,000+ citations]