Definition
A context window is the maximum number of tokens that a large language model can consider at once during inference—including both the input prompt and the generated output. It represents the model’s “working memory”: everything outside this window is invisible to the model for that interaction. Context windows have grown from 2,048 tokens in early models to 128,000 and even 1,000,000+ tokens in modern systems.
Why it matters
Context window size directly impacts what AI systems can accomplish:
- Document analysis — larger windows enable processing entire documents without chunking
- RAG relevance — more context allows retrieving and incorporating more source material
- Conversational memory — longer contexts maintain coherent multi-turn dialogues
- Complex reasoning — more space for chain-of-thought and examples
However, the computational cost of self-attention grows quadratically with context length, so larger windows are increasingly expensive—a pressure that has driven innovation in efficient attention architectures.
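The quadratic growth can be illustrated with a toy calculation (a sketch for intuition, not any model’s actual implementation): self-attention computes one score per token pair, so the score matrix has n² entries per head.

```python
# Illustrative only: self-attention scores one token pair at a time,
# so the per-layer score matrix grows quadratically with context length.
def attention_matrix_entries(n_tokens: int, n_heads: int = 1) -> int:
    """Number of attention scores per layer: heads * n_tokens^2."""
    return n_heads * n_tokens * n_tokens

# Doubling the context quadruples the score matrix:
print(attention_matrix_entries(4_096))   # 16_777_216
print(attention_matrix_entries(8_192))   # 67_108_864
```

This is why going from a 4K to a 128K window multiplies attention work by roughly 1,024×, not 32×.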
How it works
┌────────────────────────────────────────────────────────────┐
│ CONTEXT WINDOW │
├────────────────────────────────────────────────────────────┤
│ │
│ ◄──────────────── 128K tokens ──────────────────────────► │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ System │ Retrieved │ User │ Generated │ │
│ │ Prompt │ Context │ Query │ Response │ │
│ │ (500) │ (50,000) │ (500) │ (4,000) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ All tokens compete for attention ←→ │
│ Tokens outside window: invisible │
│ │
│ Model context windows: │
│ GPT-3.5: 4K / 16K │
│ GPT-4: 8K / 32K / 128K │
│ Claude: 100K / 200K │
│ Gemini: 32K / 1M+ │
└────────────────────────────────────────────────────────────┘
- Token counting — all inputs (system prompt, user input, retrieved docs) consume tokens
- Allocation — remaining tokens available for model output
- Attention — each token can attend to all others within the window
- Truncation — content exceeding the window is cut off (typically from the start)
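The counting and allocation steps above can be sketched as a small budgeting helper (function and parameter names here are hypothetical; real applications would count tokens with the model’s actual tokenizer):

```python
# A minimal sketch of budgeting a fixed context window, assuming token
# counts are already known. Names are illustrative, not a real API.
def fit_to_window(system_tokens: int, context_tokens: int, query_tokens: int,
                  window: int = 128_000, reserve_for_output: int = 4_000) -> int:
    """Return how many retrieved-context tokens fit after reserving
    space for the fixed parts and the model's output."""
    budget = window - reserve_for_output - system_tokens - query_tokens
    if budget < 0:
        raise ValueError("system prompt and query alone exceed the window")
    # Truncate the retrieved context if it would overflow the budget.
    return min(context_tokens, budget)

print(fit_to_window(500, 200_000, 500))  # 123_000: context gets truncated
```

The same arithmetic explains the diagram above: 500 + 50,000 + 500 + 4,000 tokens fit comfortably inside a 128K window.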
Common questions
Q: What happens when input exceeds the context window?
A: Most systems truncate—removing oldest content to fit. Well-designed applications prevent this via chunking, summarization, or RAG strategies that select only relevant passages.
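One of those prevention strategies, chunking, can be sketched as follows (character-based for simplicity and with hypothetical names; production code would split on token boundaries using the model’s tokenizer):

```python
# A minimal character-based chunker with overlap, so no single piece
# exceeds the window and adjacent chunks share some shared context.
def chunk_text(text: str, chunk_size: int = 2_000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk can then be summarized or embedded for retrieval, so only the relevant pieces ever enter the context window.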
Q: Does using the full context window affect quality?
A: Research shows “lost in the middle” effects—models attend better to content at the start and end of long contexts. Strategic placement of key information matters.
Q: How are tokens counted?
A: Depends on the tokenizer. Roughly: 1 token ≈ 4 characters or ≈ 0.75 words in English. Non-English text and code may tokenize differently.
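That rule of thumb can be turned into a quick estimator (a rough approximation only; exact counts require the specific model’s tokenizer):

```python
# Rough heuristic for English text: about 4 characters per token.
# Code and non-English text can deviate substantially from this ratio.
def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11
```

For anything where the budget matters, count with the real tokenizer rather than this estimate.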
Q: Is a larger context window always better?
A: Not necessarily. Larger windows cost more, process slower, and may dilute attention. RAG systems often outperform raw context stuffing by retrieving only relevant content.
Related terms
- LLM — the models that have context windows
- Tokenization — how text is converted to tokens
- RAG — technique for managing limited context
- Transformer Architecture — why context windows exist
References
Vaswani et al. (2017), “Attention Is All You Need”, NeurIPS. [130,000+ citations]
Liu et al. (2023), “Lost in the Middle: How Language Models Use Long Contexts”, arXiv. [600+ citations]
Anthropic (2024), “Claude’s Context Window”, Anthropic Documentation.
Beltagy et al. (2020), “Longformer: The Long-Document Transformer”, arXiv. [3,500+ citations]