Definition
Self-attention (also called intra-attention) is a mechanism where each position in a sequence attends to all positions within the same sequence to compute a representation. Unlike cross-attention, which relates two different sequences, self-attention captures relationships and dependencies between different parts of a single input, enabling the model to understand how words relate to each other within a sentence.
Why it matters
Self-attention is the foundational mechanism of Transformer architectures:
- Contextual understanding — each word’s representation incorporates information from all other words in the sequence
- Long-range dependencies — captures relationships between distant words without information degradation
- Bidirectional context — in encoder models, each word sees both preceding and following context
- Parallelizable — all attention computations can run simultaneously, unlike recurrent approaches
This enables language models to understand meaning in context rather than treating words in isolation.
How it works
┌──────────────────────────────────────────────────────────┐
│                      SELF-ATTENTION                      │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Sequence:  [The] [cat] [sat] [on] [the] [mat]           │
│               │     │     │    │     │     │             │
│               ▼     ▼     ▼    ▼     ▼     ▼             │
│  Each token:  Q ────────────────────────────┐            │
│               K ◄───────────────────────────┤            │
│               V ◄───────────────────────────┘            │
│                                                          │
│  "cat" attends to: The(0.1) cat(0.3) sat(0.4) on(0.1)... │
│                                                          │
│  Output: contextualized representation for each token    │
└──────────────────────────────────────────────────────────┘
- Project to Q, K, V — each token is linearly projected into Query, Key, and Value vectors using learned weight matrices
- Compute scores — each Query is compared against every Key via scaled dot products
- Apply softmax — the scores are normalized into attention weights that sum to 1
- Aggregate values — weighted sum of all Values gives contextualized output
- Result — each position’s representation now incorporates global context
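These steps can be made concrete in a short NumPy sketch of a single attention head; the function name, toy dimensions, and randomly initialized weight matrices below are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence X of shape (n, d_model)."""
    Q = X @ W_q                                      # queries  (n, d_k)
    K = X @ W_k                                      # keys     (n, d_k)
    V = X @ W_v                                      # values   (n, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n): every query scored against every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax -> attention weights
    return weights @ V                               # (n, d_v): one contextualized vector per token

# Toy run: 6 tokens ("The cat sat on the mat"), model width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                          # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (6, 8)
```

Each row of `weights` is that token's attention distribution over the whole sequence, analogous to the "cat" row shown in the diagram above.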
Common questions
Q: How is self-attention different from cross-attention?
A: Self-attention computes relationships within one sequence (Q, K, V all come from the same input). Cross-attention relates two sequences—typically decoder queries attending to encoder outputs.
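A rough sketch of the distinction, using a generic scaled dot-product attention function (the variable names X and E are illustrative, standing for a decoder-side sequence and encoder outputs):

```python
import numpy as np

def attention(Q, K, V):
    # Generic scaled dot-product attention, shared by both variants below.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))    # one sequence (e.g. decoder states)
E = rng.normal(size=(10, 8))   # a second, different sequence (e.g. encoder outputs)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

self_out  = attention(X @ W_q, X @ W_k, X @ W_v)   # self-attention: Q, K, V all from X
cross_out = attention(X @ W_q, E @ W_k, E @ W_v)   # cross-attention: Q from X, K and V from E
```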
Q: What is causal/masked self-attention?
A: In decoder models (like GPT), tokens can only attend to previous tokens, not future ones. This is enforced by masking future positions, enabling autoregressive generation.
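A minimal sketch of how such a mask is typically applied, assuming the pre-softmax scores for an n-token sequence have already been computed:

```python
import numpy as np

n = 6
scores = np.random.default_rng(0).normal(size=(n, n))  # pre-softmax attention scores

causal_mask = np.tril(np.ones((n, n), dtype=bool))     # position i may see positions j <= i
scores = np.where(causal_mask, scores, -np.inf)        # block attention to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle is all zeros: no token attends to its future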
Q: Does self-attention scale quadratically?
A: Yes, complexity is O(n²) where n is sequence length, since each token attends to all others. This limits practical context window sizes and has driven research into efficient attention variants.
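To make the quadratic growth concrete (counting score entries only, ignoring heads, layers, and numeric precision):

```python
# One attention score per (query, key) pair: n * n entries for sequence length n.
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,} tokens -> {n * n:>18,} scores per head per layer")
```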
Related terms
- Attention Mechanism — the general technique self-attention builds on
- Transformer Architecture — uses self-attention as its core component
- Multi-Head Attention — runs multiple self-attention operations in parallel
- LLM — large language models built on stacked self-attention layers
References
Vaswani et al. (2017), “Attention Is All You Need”, NeurIPS. [130,000+ citations]
Cheng et al. (2016), “Long Short-Term Memory-Networks for Machine Reading”, EMNLP. [1,800+ citations]
Lin et al. (2017), “A Structured Self-Attentive Sentence Embedding”, ICLR. [3,200+ citations]