Definition
Self-attention (also called intra-attention) is a mechanism where each position in a sequence attends to all positions within the same sequence to compute a representation. Unlike cross-attention, which relates two different sequences, self-attention captures relationships and dependencies between different parts of a single input, enabling the model to understand how words relate to each other within a sentence.
Why it matters
Self-attention is the foundational mechanism of Transformer architectures:
- Contextual understanding — each word’s representation incorporates information from all other words in the sequence
- Long-range dependencies — captures relationships between distant words without information degradation
- Bidirectional context — in encoder models, each word sees both preceding and following context
- Parallelizable — all attention computations can run simultaneously, unlike recurrent approaches
This enables language models to understand meaning in context rather than treating words in isolation.
How it works
┌──────────────────────────────────────────────────────────┐
│                      SELF-ATTENTION                      │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Sequence:  [The] [cat] [sat] [on] [the] [mat]           │
│               │     │     │    │     │     │             │
│               ▼     ▼     ▼    ▼     ▼     ▼             │
│  Each token:  Q ────────────────────────────┐            │
│               K ◄───────────────────────────┤            │
│               V ◄───────────────────────────┘            │
│                                                          │
│  "cat" attends to: The(0.1) cat(0.3) sat(0.4) on(0.1)... │
│                                                          │
│  Output: contextualized representation for each token    │
└──────────────────────────────────────────────────────────┘
- Project to Q, K, V — each token is linearly projected into Query, Key, and Value vectors using learned weight matrices
- Compute scores — each Query is compared against every Key via scaled dot products
- Apply softmax — the scores are normalized into attention weights that sum to 1
- Aggregate values — weighted sum of all Values gives contextualized output
- Result — each position’s representation now incorporates global context
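These steps can be made concrete in a short NumPy sketch of a single attention head; the function name, toy dimensions, and randomly initialized weight matrices below are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence X of shape (n, d_model)."""
    Q = X @ W_q                                      # queries  (n, d_k)
    K = X @ W_k                                      # keys     (n, d_k)
    V = X @ W_v                                      # values   (n, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n): every query scored against every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax -> attention weights
    return weights @ V                               # (n, d_v): one contextualized vector per token

# Toy run: 6 tokens ("The cat sat on the mat"), model width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                          # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (6, 8)
```

Each row of `weights` is that token's attention distribution over the whole sequence, analogous to the "cat" row shown in the diagram above.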
Common questions
Q: How is self-attention different from cross-attention?
A: Self-attention computes relationships within one sequence (Q, K, V all come from the same input). Cross-attention relates two sequences—typically decoder queries attending to encoder outputs.
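A rough sketch of the distinction, using a generic scaled dot-product attention function (the variable names X and E are illustrative, standing for a decoder-side sequence and encoder outputs):

```python
import numpy as np

def attention(Q, K, V):
    # Generic scaled dot-product attention, shared by both variants below.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))    # one sequence (e.g. decoder states)
E = rng.normal(size=(10, 8))   # a second, different sequence (e.g. encoder outputs)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

self_out  = attention(X @ W_q, X @ W_k, X @ W_v)   # self-attention: Q, K, V all from X
cross_out = attention(X @ W_q, E @ W_k, E @ W_v)   # cross-attention: Q from X, K and V from E
```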
Q: What is causal/masked self-attention?
A: In decoder models (like GPT), tokens can only attend to previous tokens, not future ones. This is enforced by masking future positions, enabling autoregressive generation.
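A minimal sketch of how such a mask is typically applied, assuming the pre-softmax scores for an n-token sequence have already been computed:

```python
import numpy as np

n = 6
scores = np.random.default_rng(0).normal(size=(n, n))  # pre-softmax attention scores

causal_mask = np.tril(np.ones((n, n), dtype=bool))     # position i may see positions j <= i
scores = np.where(causal_mask, scores, -np.inf)        # block attention to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle is all zeros: no token attends to its future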
Q: Does self-attention scale quadratically?
A: Yes, complexity is O(n²) where n is sequence length, since each token attends to all others. This limits practical context window sizes and has driven research into efficient attention variants.
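To make the quadratic growth concrete (counting score entries only, ignoring heads, layers, and numeric precision):

```python
# One attention score per (query, key) pair: n * n entries for sequence length n.
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,} tokens -> {n * n:>18,} scores per head per layer")
```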
Related terms
- Attention Mechanism — the general technique self-attention builds on
- Transformer Architecture — uses self-attention as its core component
- Multi-Head Attention — runs multiple self-attention operations in parallel
- LLM — large language models built on stacked self-attention layers
References
Vaswani et al. (2017), “Attention Is All You Need”, NeurIPS. [130,000+ citations]
Cheng et al. (2016), “Long Short-Term Memory-Networks for Machine Reading”, EMNLP. [1,800+ citations]
Lin et al. (2017), “A Structured Self-Attentive Sentence Embedding”, ICLR. [3,200+ citations]