Context Length and Context Windows

What is Context Length?
Context length (also called the context window) is the maximum number of tokens an LLM can process in a single request. The limit covers the input prompt and the generated output combined, so a longer prompt leaves less room for the response.
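
Because prompt and output share one budget, a request can be checked before it is sent. The following is a minimal sketch of that arithmetic; the function names are illustrative, and the token counts are assumed to come from whatever tokenizer the target model uses.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_length: int) -> bool:
    """Return True if the prompt plus the reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_length


def max_prompt_tokens(context_length: int, max_output_tokens: int) -> int:
    """Largest prompt that still leaves room for the requested output."""
    return max(context_length - max_output_tokens, 0)


if __name__ == "__main__":
    # A 128,000-token window with 4,000 tokens reserved for the answer.
    print(fits_in_context(126_000, 4_000, 128_000))
    print(max_prompt_tokens(128_000, 4_000))
```

Reserving the output budget first and deriving the prompt budget from it avoids responses that are cut off because the prompt consumed the whole window.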

Context Lengths by Model
| Model | Max Context (tokens) | Notes |
|-------|----------------------|-------|
| GPT-4 Turbo | 128,000 | ~300 pages of text |
| GPT-4o | 128,000 | Same window as GPT-4 Turbo |
| Claude 3.5 Sonnet | 200,000 | Larger window than the GPT-4 models listed |
| Gemini 1.5 Pro | 1,000,000 | Largest in this table |
| Llama 3 70B | 8,192 | Base window; extendable with RoPE scaling |
| Mistral Large | 32,000 | Original release; Mistral Large 2 lists 128,000 |

Why Context Length Matters
1. Document processing: Longer context = more pages per request
2. Conversation history: More turns remembered
3. Few-shot learning: More examples in prompt
4. RAG applications: More retrieved chunks
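
For conversation history in particular, the window decides how many past turns can be sent. A rough sketch of keeping only the most recent turns that fit a token budget follows. `rough_token_count` is a hypothetical stand-in that counts whitespace-separated words; a real application would pass in the model's own tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    # Crude placeholder: one "token" per whitespace-separated word.
    # This is an assumption for illustration, not a real tokenizer.
    return len(text.split())


def trim_history(
    turns: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep the newest turns whose combined token count fits the budget."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = ["hello there friend", "how are", "tell me about context windows"]
    print(trim_history(history, budget=7))
```

Dropping whole turns from the oldest end keeps the remaining dialogue coherent; a larger window simply moves the cut-off point further back.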

Trade-offs of Long Context
| Effect of Longer Context | Implication |
|--------------------------|-------------|
| ✅ More information | Can include full documents |
| ❌ Higher cost | More tokens = higher API bills |
| ❌ Slower | More computation required; standard self-attention scales quadratically with sequence length |
| ❌ Lost in the middle | Models may miss information in the middle of long contexts |
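
The cost and speed rows can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes per-million-token pricing with made-up prices, and it assumes plain full self-attention, whose cost grows with the square of the sequence length; production systems often use optimizations that change the picture.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """API cost of one request under per-million-token pricing."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


def relative_attention_cost(n_tokens: int, baseline_tokens: int) -> float:
    """How much more attention compute n_tokens needs than the baseline,
    assuming unoptimized full self-attention (quadratic in length)."""
    return (n_tokens / baseline_tokens) ** 2


if __name__ == "__main__":
    # Hypothetical prices: 10.0 per million input tokens, 30.0 per million output tokens.
    print(request_cost(100_000, 1_000, 10.0, 30.0))
    # Going from an 8,000-token prompt to a 128,000-token prompt.
    print(relative_attention_cost(128_000, 8_000))
```

Cost grows linearly with tokens while attention compute grows much faster, which is why filling the window on every request is rarely the economical default.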

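Extending or Working Around Context Limits
- RoPE scaling: Rescale rotary position embeddings so the model can handle positions beyond its training length (e.g. YaRN, NTK-aware scaling)
- RAG: Retrieve only relevant chunks instead of full documents
- Summarization: Compress earlier context into a shorter summary
- Sliding window: Process documents in chunks with overlap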
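
The sliding-window approach can be sketched as follows. This is a minimal illustration that assumes the document has already been tokenized into a list; the function name and the `window` and `overlap` parameters are illustrative rather than taken from any particular library.

```python
from typing import List, Sequence


def sliding_window_chunks(tokens: Sequence[str], window: int, overlap: int) -> List[List[str]]:
    """Split tokens into chunks of at most `window` items, where each chunk
    repeats the last `overlap` items of the previous one."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + window]))
        # Stop once a chunk has reached the end of the input.
        if start + window >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    for chunk in sliding_window_chunks(list("abcdefghij"), window=4, overlap=1):
        print(chunk)
```

The overlap means a sentence that straddles a chunk boundary appears intact in at least one chunk, at the price of processing some tokens twice.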

Best Practices
- Use RAG for large document sets instead of full context
- Place important information at start and end of prompts
- Monitor for "lost in the middle" effects on long contexts, for example by checking that answers still use facts placed mid-prompt
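
One way to act on the placement advice is to reorder retrieved chunks so the most relevant ones sit at the edges of the prompt. The sketch below assumes the input list is already sorted from most to least relevant; `edges_first` is a hypothetical helper name, not a library function.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def edges_first(ranked: Sequence[T]) -> List[T]:
    """Reorder items ranked best-first so the best land at the start and end,
    and the weakest end up in the middle."""
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item is the very last one.
    return front + back[::-1]


if __name__ == "__main__":
    # "A" is the most relevant chunk, "E" the least.
    print(edges_first(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

With this ordering the two highest-ranked chunks occupy the first and last positions, where models are reported to attend most reliably.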
