You've crafted the perfect prompt and have a 50-page document to analyze. You hit send, and... error: "This model's maximum context length is 8192 tokens." Welcome to one of the most common challenges when working with LLMs: context window limits.
Understanding context windows—and how to work around their limitations—is essential for building practical AI applications. This guide explains what context windows are, compares limits across models, and provides strategies for processing documents that exceed those limits.
Understanding Context Windows
What Is a Context Window?
The context window is the total number of tokens an LLM can process in a single request. This includes:
- System prompt (persistent instructions)
- Conversation history (previous messages)
- User input (current prompt/question)
- Document content (files, retrieved context)
- Model output (the response being generated)
Context Window = System + History + Input + Documents + Output
If any combination of these exceeds the context window, the request fails or the model truncates content.
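To see how these pieces add up, it helps to budget tokens before sending a request. A minimal sketch, assuming the tiktoken library for counting and illustrative limits (swap in your model's real numbers):

import tiktoken

CONTEXT_WINDOW = 128_000   # illustrative total limit for the model
RESERVED_OUTPUT = 4_096    # room reserved for the model's response

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(system: str, history: str, user_input: str, documents: str) -> bool:
    # Sum the tokens of every component that shares the window
    used = sum(len(enc.encode(part)) for part in (system, history, user_input, documents))
    return used + RESERVED_OUTPUT <= CONTEXT_WINDOW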
Current Context Window Sizes
| Model | Context Window | Approximate Capacity |
|---|---|---|
| Gemini 1.5 Pro | 1,000,000 tokens | ~700K words, 1,500 pages |
| Claude 3 (all variants) | 200,000 tokens | ~150K words, 300 pages |
| GPT-4 Turbo | 128,000 tokens | ~100K words, 200 pages |
| GPT-4o | 128,000 tokens | ~100K words, 200 pages |
| Llama 3 70B | 8,192-128,000 tokens | Varies by provider |
| Mistral Large | 32,000 tokens | ~25K words, 50 pages |
| GPT-3.5 Turbo | 16,384 tokens | ~12K words, 25 pages |
Context Window vs. Output Limit
Don't confuse context window with maximum output:
| Model | Context Window | Max Output |
|---|---|---|
| GPT-4 Turbo | 128K | 4,096 tokens |
| Claude 3 Opus | 200K | 4,096 tokens |
| Gemini 1.5 Pro | 1M | 8,192 tokens |
Even with 128K input capacity, you can't generate 128K tokens of output—the output limit is separate and much smaller.
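In practice, you reserve output space explicitly when making the call. A minimal sketch assuming the OpenAI Python SDK, with illustrative parameter values:

from openai import OpenAI

client = OpenAI()
prompt = "Summarize the following report ..."  # placeholder; the input counts against the 128K window

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096,  # caps only the generated output, independent of input size
)
print(response.choices[0].message.content)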
The "Lost in the Middle" Problem
Research from Stanford and UC Berkeley demonstrated that LLMs don't process long contexts uniformly. Information retrieval accuracy follows a U-shaped pattern:
Accuracy
   ^
100% |  *                                 *
     |     *                           *
 75% |        *                     *
     |           *               *
 50% |              *  *  *  *
     +---------------------------------------->
        Start           Middle           End
                  Position in Context
Key findings:
- Information at the beginning: ~90% recall
- Information in the middle: ~50-70% recall
- Information at the end: ~85% recall
This has practical implications: don't bury critical information in the middle of long contexts.
Mitigation Strategies
1. Strategic placement: Put the most important information at the beginning or end of your context (see the sketch after this list).
2. Repeat key information: Include critical details in both the context and the prompt.
3. Use retrieval: Instead of dumping entire documents, retrieve only relevant sections.
4. Structured formatting: Use clear headers and sections to help the model navigate long contexts.
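As a concrete illustration of strategies 1 and 2, a prompt template (a hypothetical helper, not a library API) can lead with the critical material, keep bulky background in the middle, and repeat the key details at the end:

def build_prompt(question: str, key_facts: str, background: str) -> str:
    # Critical information first and last; long background material in the middle
    return (
        f"Key facts:\n{key_facts}\n\n"
        f"Background material:\n{background}\n\n"
        f"Reminder of the key facts:\n{key_facts}\n\n"
        f"Question: {question}"
    )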
Strategy 1: Chunking
When documents exceed context limits, break them into processable pieces.
Fixed-Size Chunking
The simplest approach—split by token count:
import tiktoken

# Tokenizer assumed here: tiktoken's cl100k_base encoding
tokenizer = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    tokens = tokenizer.encode(text)
    chunks = []
    # Step forward by chunk_size minus overlap so consecutive chunks share tokens
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))
    return chunks
Overlap ensures context continuity—information at chunk boundaries isn't lost.
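A quick usage sketch for the function above, reusing the tokenizer defined with it (the file name is a placeholder):

with open("report.txt") as f:  # placeholder input document
    document = f.read()

chunks = chunk_by_tokens(document, chunk_size=4000, overlap=200)
largest = max(len(tokenizer.encode(c)) for c in chunks)
print(f"{len(chunks)} chunks, largest chunk is {largest} tokens")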
Semantic Chunking
More sophisticated—split at natural boundaries:
import re

def chunk_by_sections(text: str) -> list[str]:
    # Split on markdown headers (#, ##, ###); paragraphs or other semantic boundaries work too
    sections = re.split(r'\n#{1,3}\s', text)
    # Drop fragments too short to be meaningful sections
    return [s for s in sections if len(s.strip()) > 100]
Recursive Chunking
LangChain's approach—try larger boundaries first, fall back to smaller:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # Tried in order, largest boundary first
)
chunks = splitter.split_text(document)
Strategy 2: Map-Reduce Processing
Process chunks independently, then synthesize results.
The Map Phase
Apply the same operation to each chunk:
def map_summarize(chunks: list[str]) -> list[str]:
    summaries = []
    for chunk in chunks:
        prompt = f"Summarize this section:\n\n{chunk}"
        summary = llm.complete(prompt)  # llm.complete is a generic stand-in for your model client
        summaries.append(summary)
    return summaries
The Reduce Phase
Combine chunk results into a final answer:
def reduce_summaries(summaries: list[str], question: str) -> str:
    combined = "\n\n".join(summaries)
    prompt = f"""Based on these section summaries:
{combined}
Answer: {question}"""
    return llm.complete(prompt)
Full Map-Reduce Pipeline
def answer_from_long_document(document: str, question: str) -> str:
    # 1. Chunk the document
    chunks = chunk_by_tokens(document, chunk_size=4000)

    # 2. Map: Extract relevant info from each chunk
    extractions = []
    for chunk in chunks:
        prompt = f"Extract information relevant to: {question}\n\nText: {chunk}"
        extraction = llm.complete(prompt)
        extractions.append(extraction)

    # 3. Reduce: Synthesize final answer
    combined = "\n\n".join([e for e in extractions if e.strip()])
    final_prompt = f"Based on this information:\n{combined}\n\nAnswer: {question}"
    return llm.complete(final_prompt)
Strategy 3: Retrieval-Augmented Generation (RAG)
Instead of processing entire documents, retrieve only relevant sections.
Basic RAG Flow
Document → Chunk → Embed → Vector Store
                                ↓
Query → Embed → Retrieve Top-K → Generate Answer
Implementation Example
from sentence_transformers import SentenceTransformer
import chromadb

# Setup
embedder = SentenceTransformer('all-MiniLM-L6-v2')
db = chromadb.Client()
collection = db.create_collection("documents")

# Index documents
def index_document(doc_id: str, text: str):
    chunks = chunk_by_tokens(text, chunk_size=500)
    embeddings = embedder.encode(chunks)
    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        embeddings=embeddings.tolist(),
        documents=chunks
    )

# Query
def query_documents(question: str, top_k: int = 5) -> str:
    query_embedding = embedder.encode([question])[0]
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )
    context = "\n\n".join(results['documents'][0])
    prompt = f"""Context: {context}
Question: {question}
Answer based on the context above:"""
    return llm.complete(prompt)
When to Use RAG vs. Full Context
| Use RAG When | Use Full Context When |
|---|---|
| Document is very large (100K+ tokens) | Document fits in context |
| Only need specific information | Need holistic understanding |
| Multiple documents to search | Single focused document |
| Questions are specific | Questions require full context |
| Cost is a concern | Quality is paramount |
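One way to act on this table is a small router that measures the document before deciding. The function name and thresholds below are illustrative, not prescriptive:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def choose_strategy(document: str, context_window: int = 128_000, reserved: int = 8_000) -> str:
    doc_tokens = len(enc.encode(document))
    # Use the full document when it fits alongside the prompt and output budget
    if doc_tokens + reserved <= context_window:
        return "full_context"
    return "rag"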
Strategy 4: Hierarchical Summarization
Create summaries at multiple levels for efficient navigation.
Building a Summary Hierarchy
def build_summary_tree(document: str) -> dict:
    # Level 3: Paragraph summaries
    paragraphs = document.split('\n\n')
    para_summaries = [summarize(p, max_tokens=50) for p in paragraphs]

    # Level 2: Section summaries (groups of paragraphs)
    sections = chunk_list(para_summaries, chunk_size=10)
    section_summaries = [summarize('\n'.join(s), max_tokens=100) for s in sections]

    # Level 1: Document summary
    doc_summary = summarize('\n'.join(section_summaries), max_tokens=200)

    return {
        "document_summary": doc_summary,
        "section_summaries": section_summaries,
        "paragraph_summaries": para_summaries,
        "full_text": document
    }
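The sketch above leans on two helpers, summarize and chunk_list, that aren't shown. One minimal interpretation, assuming the same generic llm.complete client used elsewhere in this guide:

def summarize(text: str, max_tokens: int) -> str:
    # Ask the model for a summary within a rough length budget
    return llm.complete(f"Summarize in at most {max_tokens} tokens:\n\n{text}")

def chunk_list(items: list, chunk_size: int) -> list[list]:
    # Group a flat list into consecutive groups of chunk_size items
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]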
Querying the Hierarchy
def hierarchical_query(tree: dict, question: str) -> str:
    # Start with document summary to identify relevant sections
    relevant_sections = identify_relevant_sections(
        tree["document_summary"],
        tree["section_summaries"],
        question
    )

    # Get detailed content from relevant sections only
    detailed_context = get_section_content(tree, relevant_sections)

    # Answer from focused context
    return llm.complete(f"Context: {detailed_context}\n\nQuestion: {question}")
Strategy 5: Conversation Management
In chat applications, context accumulates with every turn.
The Problem
Turn 1: System (500) + User (100) + Assistant (200) = 800 tokens
Turn 2: System (500) + History (800) + User (150) + Assistant (250) = 1,700 tokens
Turn 3: System (500) + History (1,700) + User (200) + Assistant (300) = 2,700 tokens
...
Turn 20: Context overflow!
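To catch this before it becomes an error, count conversation tokens on every turn. A minimal sketch of the count_tokens helper used by the solutions below, assuming tiktoken and ignoring per-message formatting overhead:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages: list) -> int:
    # Rough count: content tokens only, ignoring role markers and message framing
    return sum(len(enc.encode(m["content"])) for m in messages)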
Solution 1: Sliding Window
Keep only the most recent N turns:
def sliding_window_context(messages: list, max_turns: int = 10) -> list:
    system_messages = [m for m in messages if m['role'] == 'system']
    conversation = [m for m in messages if m['role'] != 'system']

    # Keep only recent turns (*2 because each turn is a user + assistant pair)
    recent = conversation[-max_turns * 2:]
    return system_messages + recent
Solution 2: Summarize Old Context
def summarize_history(messages: list, threshold: int = 50000) -> list:
    current_tokens = count_tokens(messages)
    if current_tokens < threshold:
        return messages

    # Summarize older messages
    system = messages[0]             # Keep system prompt
    old_messages = messages[1:-4]    # All but the last 2 turns
    recent_messages = messages[-4:]  # Keep last 2 turns (4 messages)

    summary = llm.complete(f"Summarize this conversation:\n{format_messages(old_messages)}")

    return [
        system,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages
    ]
Solution 3: Hybrid Approach
def manage_context(messages: list, max_tokens: int = 100000) -> list:
    current = count_tokens(messages)
    if current <= max_tokens:
        return messages

    # Try sliding window first
    windowed = sliding_window_context(messages, max_turns=20)
    if count_tokens(windowed) <= max_tokens:
        return windowed

    # Fall back to summarization
    return summarize_history(messages, threshold=int(max_tokens * 0.8))
Choosing the Right Long-Context Model
Decision Framework
Document size < 16K tokens?
├─► Yes → Most models work (GPT-3.5, Mistral, Llama)
└─► No
    Document size < 128K tokens?
    ├─► Yes → GPT-4 Turbo, GPT-4o, Llama 3 (via some providers)
    └─► No
        Document size < 200K tokens?
        ├─► Yes → Claude 3 (any variant)
        └─► No
            Document size < 1M tokens?
            ├─► Yes → Gemini 1.5 Pro
            └─► No → Must use chunking/RAG
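The same decision can be encoded directly. A sketch with a hypothetical pick_model helper; the thresholds mirror the tree above and the token estimate reuses tiktoken:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pick_model(document: str) -> str:
    tokens = len(enc.encode(document))
    if tokens < 16_000:
        return "most models work (e.g. GPT-3.5, Mistral, Llama)"
    if tokens < 128_000:
        return "GPT-4 Turbo / GPT-4o"
    if tokens < 200_000:
        return "Claude 3"
    if tokens < 1_000_000:
        return "Gemini 1.5 Pro"
    return "chunking or RAG required"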
Cost Considerations
Larger context doesn't mean you should use it all:
| Scenario | Model | Context Used | Cost (per request) |
|---|---|---|---|
| Short query | Claude 3 Haiku | 5K tokens | $0.001 |
| Full document | Claude 3 Haiku | 100K tokens | $0.025 |
| Full document | Claude 3 Sonnet | 100K tokens | $0.30 |
Using maximum context is 25-300x more expensive than minimal context.
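The arithmetic behind those figures is simple to reproduce. The per-token rates below (Haiku at $0.25 and Sonnet at $3.00 per million input tokens) are the published prices the table assumes; they change over time, so treat them as illustrative:

PRICE_PER_MTOK = {"claude-3-haiku": 0.25, "claude-3-sonnet": 3.00}  # USD per 1M input tokens

def input_cost(model: str, tokens: int) -> float:
    return PRICE_PER_MTOK[model] / 1_000_000 * tokens

print(input_cost("claude-3-haiku", 5_000))     # ~$0.001
print(input_cost("claude-3-haiku", 100_000))   # $0.025
print(input_cost("claude-3-sonnet", 100_000))  # $0.30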
Best Practices Summary
Do:
- Estimate token counts before sending requests
- Use retrieval for targeted information extraction
- Place critical information at start/end of context
- Implement conversation management for chat apps
- Monitor for "lost in the middle" issues
Don't:
- Dump entire documents when only sections matter
- Ignore output token limits when planning context
- Trust that models process all context equally
- Exceed context limits without error handling
- Pay for 200K context when 5K would suffice
Conclusion
Context windows define what's possible with a single LLM call. Understanding these limits—and the strategies to work around them—is fundamental to building effective AI applications.
Key takeaways:
- Know your limits: Different models have vastly different capacities
- Less is often more: Focused context often outperforms full-document dumps
- Use the right strategy: Chunking, RAG, and summarization each have their place
- Mind the middle: Information placement affects recall accuracy
- Manage conversations: Chat histories grow fast; plan for it
Use our LLM Token Counter to check whether your documents fit within context limits and estimate costs before processing.