Definition
Multi-Head Attention is a mechanism that performs multiple attention operations in parallel, each with different learned projections. Instead of computing a single attention function, the model runs several “heads” simultaneously, each capturing different aspects of relationships in the data. The outputs are then concatenated and projected to produce the final result.
Why it matters
Multi-Head Attention addresses limitations of single-head attention:
- Diverse representations — different heads can learn different relationship types (syntactic, semantic, positional)
- Richer expressiveness — capturing multiple patterns simultaneously improves model capacity
- Stable training — multiple heads provide redundancy, so optimization is less sensitive to any single head learning a poor attention pattern
- Interpretability — individual heads often specialize in identifiable linguistic patterns
This is why Transformers use multi-head rather than single-head attention: it captures a richer set of relationships at essentially the same computational cost.
How it works
┌────────────────────────────────────────────────────────────────┐
│                      MULTI-HEAD ATTENTION                      │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Input ──┬──→ Head 1 (W_Q1, W_K1, W_V1) ──→ Attention 1        │
│          │                                                     │
│          ├──→ Head 2 (W_Q2, W_K2, W_V2) ──→ Attention 2        │
│          │                                                     │
│          ├──→ Head 3 (W_Q3, W_K3, W_V3) ──→ Attention 3        │
│          │                                       │             │
│          └──→ ...h heads...                      ▼             │
│                                            [Concatenate]       │
│                                                  │             │
│                               Linear Projection (W_O)          │
│                                                  │             │
│                                                  ▼             │
│                                               Output           │
└────────────────────────────────────────────────────────────────┘
- Project inputs — Q, K, V are linearly projected h times with different learned weights
- Parallel attention — each head computes attention independently
- Concatenate — head outputs are concatenated along the feature dimension
- Final projection — concatenated output is linearly projected back to model dimension
Formula: MultiHead(Q,K,V) = Concat(head_1,...,head_h) × W_O
Where each head_i = Attention(Q×W_Qi, K×W_Ki, V×W_Vi)
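As a concrete illustration, here is a minimal NumPy sketch of the formula above. Names and shapes are illustrative, not from any particular library; the h per-head projections W_Qi, W_Ki, W_Vi are stored as a single d_model × d_model matrix and split by reshaping, which is the usual implementation trick.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (seq_len, d_model); W_Q/W_K/W_V/W_O: (d_model, d_model); h: number of heads."""
    seq_len, d_model = X.shape
    d_head = d_model // h

    # Project inputs, then split the feature dimension into h heads
    Q = (X @ W_Q).reshape(seq_len, h, d_head).transpose(1, 0, 2)  # (h, seq, d_head)
    K = (X @ W_K).reshape(seq_len, h, d_head).transpose(1, 0, 2)
    V = (X @ W_V).reshape(seq_len, h, d_head).transpose(1, 0, 2)

    # Each head computes scaled dot-product attention independently
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)           # (h, seq, seq)
    heads = softmax(scores) @ V                                   # (h, seq, d_head)

    # Concatenate heads along the feature dimension, then apply the final projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)   # (seq, d_model)
    return concat @ W_O

# Example: 8 heads over a 512-dimensional model
rng = np.random.default_rng(0)
d_model, h, seq_len = 512, 8, 10
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (10, 512)
```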
Common questions
Q: How many heads are typically used?
A: Common configurations use 8, 12, or 16 heads. GPT-3 uses 96 heads with a hidden dimension of 12,288. The head count is usually chosen so that each head's dimension (d_model / num_heads) is a convenient size such as 64 or 128.
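For example, assuming the published hidden sizes and head counts for a few well-known models, the per-head dimension is simply d_model divided by the number of heads:

```python
# Per-head dimension for some common configurations: (d_model, num_heads)
configs = {"BERT-base": (768, 12), "GPT-2 small": (768, 12), "GPT-3 175B": (12288, 96)}
for name, (d_model, num_heads) in configs.items():
    print(f"{name}: head_dim = {d_model // num_heads}")
# BERT-base: head_dim = 64
# GPT-2 small: head_dim = 64
# GPT-3 175B: head_dim = 128
```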
Q: Do different heads learn different things?
A: Yes, research shows heads often specialize. Some attend to adjacent words, others to syntactic dependencies, specific positions, or rare tokens. Not all heads are equally important—some can be pruned with minimal performance loss.
Q: Why not just use wider single-head attention?
A: A single head operating on the full model dimension has the same parameter count as h heads of dimension d_model / h (see the sketch below), but it computes only one attention distribution per position, so it offers less representational diversity. Multi-head attention's parallel subspaces capture richer, more varied patterns.
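To make the parameter-count claim concrete, a small sketch (ignoring biases) that counts the per-head projections plus the output projection:

```python
def attention_params(d_model, num_heads):
    """Projection parameters of one attention block, ignoring biases."""
    d_head = d_model // num_heads
    # Per head: W_Qi, W_Ki, W_Vi each map d_model -> d_head
    per_head = 3 * d_model * d_head
    # Concatenated heads (back to d_model) go through W_O (d_model x d_model)
    return num_heads * per_head + d_model * d_model

print(attention_params(512, 1))  # 1048576 -- one wide head
print(attention_params(512, 8))  # 1048576 -- same total, eight subspaces
```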
Q: What is Grouped-Query Attention (GQA)?
A: GQA is an efficiency-oriented variant in which groups of query heads share a smaller number of Key-Value heads, shrinking the KV cache and memory traffic at inference while maintaining quality. It is used in models such as Llama 2 (70B).
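A rough, shapes-only sketch of the grouping idea (the function and names are illustrative, not Llama 2's actual implementation): each Key head is shared by a contiguous group of query heads, and Value heads are shared the same way.

```python
import numpy as np

def gqa_scores(Q, K):
    """Q: (n_query_heads, seq, d_head); K: (n_kv_heads, seq, d_head)."""
    n_query_heads, n_kv_heads = Q.shape[0], K.shape[0]
    group_size = n_query_heads // n_kv_heads
    # Repeat each shared K head so every query head in its group can use it
    K_shared = np.repeat(K, group_size, axis=0)   # (n_query_heads, seq, d_head)
    d_head = Q.shape[-1]
    return Q @ K_shared.transpose(0, 2, 1) / np.sqrt(d_head)

# 8 query heads sharing 2 K heads (groups of 4)
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 10, 64))
K = rng.standard_normal((2, 10, 64))
print(gqa_scores(Q, K).shape)  # (8, 10, 10)
```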
Related terms
- Attention Mechanism — the foundational technique
- Self-Attention — each head performs self-attention
- Transformer Architecture — uses multi-head attention throughout
- LLM — modern language models rely on multi-head attention
References
Vaswani et al. (2017), “Attention Is All You Need”, NeurIPS. [130,000+ citations]
Michel et al. (2019), “Are Sixteen Heads Really Better than One?”, NeurIPS. [1,200+ citations]
Voita et al. (2019), “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, ACL. [800+ citations]
Ainslie et al. (2023), “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”, EMNLP. [400+ citations]