You've crafted the perfect prompt and have a 50-page document to analyze. You hit send, and... error: "This model's maximum context length is 8192 tokens." Welcome to one of the most common challenges when working with LLMs: context window limits.
Understanding context windows—and how to work around their limitations—is essential for building practical AI applications. This guide explains what context windows are, compares limits across models, and provides strategies for processing documents that exceed those limits.
Understanding Context Windows
What Is a Context Window?
The context window is the total number of tokens an LLM can process in a single request. This includes:
- System prompt (persistent instructions)
- Conversation history (previous messages)
- User input (current prompt/question)
- Document content (files, retrieved context)
- Model output (the response being generated)
Context Window = System + History + Input + Documents + Output
If any combination of these exceeds the context window, the request fails or the model truncates content.
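To see how these pieces add up, it helps to budget tokens before sending a request. A minimal sketch, assuming the tiktoken library for counting and illustrative limits (swap in your model's real numbers):

import tiktoken

CONTEXT_WINDOW = 128_000   # illustrative total limit for the model
RESERVED_OUTPUT = 4_096    # room reserved for the model's response

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(system: str, history: str, user_input: str, documents: str) -> bool:
    # Sum the tokens of every component that shares the window
    used = sum(len(enc.encode(part)) for part in (system, history, user_input, documents))
    return used + RESERVED_OUTPUT <= CONTEXT_WINDOW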
Current Context Window Sizes
| Model | Context Window | Approximate Capacity |
|---|---|---|
| Gemini 1.5 Pro | 1,000,000 tokens | ~700K words, 1,500 pages |
| Claude 3 (all variants) | 200,000 tokens | ~150K words, 300 pages |
| GPT-4 Turbo | 128,000 tokens | ~100K words, 200 pages |
| GPT-4o | 128,000 tokens | ~100K words, 200 pages |
| Llama 3 70B | 8,192-128,000 tokens | Varies by provider |
| Mistral Large | 32,000 tokens | ~25K words, 50 pages |
| GPT-3.5 Turbo | 16,384 tokens | ~12K words, 25 pages |
Context Window vs. Output Limit
Don't confuse context window with maximum output:
| Model | Context Window | Max Output |
|---|---|---|
| GPT-4 Turbo | 128K | 4,096 tokens |
| Claude 3 Opus | 200K | 4,096 tokens |
| Gemini 1.5 Pro | 1M | 8,192 tokens |
Even with 128K input capacity, you can't generate 128K tokens of output—the output limit is separate and much smaller.
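In practice, you reserve output space explicitly when making the call. A minimal sketch assuming the OpenAI Python SDK, with illustrative parameter values:

from openai import OpenAI

client = OpenAI()
prompt = "Summarize the following report ..."  # placeholder; the input counts against the 128K window

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096,  # caps only the generated output, independent of input size
)
print(response.choices[0].message.content)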
The "Lost in the Middle" Problem
Research from Stanford and UC Berkeley demonstrated that LLMs don't process long contexts uniformly. Information retrieval accuracy follows a U-shaped pattern:
Accuracy
   ^
100% |  *                                 *
     |     *                           *
 75% |        *                     *
     |           *               *
 50% |              *  *  *  *
     +---------------------------------------->
        Start           Middle           End
                  Position in Context
Key findings:
- Information at the beginning: ~90% recall
- Information in the middle: ~50-70% recall
- Information at the end: ~85% recall
This has practical implications: don't bury critical information in the middle of long contexts.
Mitigation Strategies
1. Strategic placement: Put the most important information at the beginning or end of your context (see the sketch after this list).
2. Repeat key information: Include critical details in both the context and the prompt.
3. Use retrieval: Instead of dumping entire documents, retrieve only relevant sections.
4. Structured formatting: Use clear headers and sections to help the model navigate long contexts.
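As a concrete illustration of strategies 1 and 2, a prompt template (a hypothetical helper, not a library API) can lead with the critical material, keep bulky background in the middle, and repeat the key details at the end:

def build_prompt(question: str, key_facts: str, background: str) -> str:
    # Critical information first and last; long background material in the middle
    return (
        f"Key facts:\n{key_facts}\n\n"
        f"Background material:\n{background}\n\n"
        f"Reminder of the key facts:\n{key_facts}\n\n"
        f"Question: {question}"
    )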
Strategy 1: Chunking
When documents exceed context limits, break them into processable pieces.
Fixed-Size Chunking
The simplest approach—split by token count:
import tiktoken

# Tokenizer assumed here: tiktoken's cl100k_base encoding
tokenizer = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    tokens = tokenizer.encode(text)
    chunks = []
    # Step forward by chunk_size minus overlap so consecutive chunks share tokens
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))
    return chunks
Overlap ensures context continuity—information at chunk boundaries isn't lost.
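A quick usage sketch for the function above, reusing the tokenizer defined with it (the file name is a placeholder):

with open("report.txt") as f:  # placeholder input document
    document = f.read()

chunks = chunk_by_tokens(document, chunk_size=4000, overlap=200)
largest = max(len(tokenizer.encode(c)) for c in chunks)
print(f"{len(chunks)} chunks, largest chunk is {largest} tokens")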
Semantic Chunking
More sophisticated—split at natural boundaries:
import re

def chunk_by_sections(text: str) -> list[str]:
    # Split on markdown headers (#, ##, ###); paragraphs or other semantic boundaries work too
    sections = re.split(r'\n#{1,3}\s', text)
    # Drop fragments too short to be meaningful sections
    return [s for s in sections if len(s.strip()) > 100]
Recursive Chunking
LangChain's approach—try larger boundaries first, fall back to smaller:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # Tried in order, largest boundary first
)
chunks = splitter.split_text(document)
Strategy 2: Map-Reduce Processing
Process chunks independently, then synthesize results.
The Map Phase
Apply the same operation to each chunk:
def map_summarize(chunks: list[str]) -> list[str]:
    summaries = []
    for chunk in chunks:
        prompt = f"Summarize this section:\n\n{chunk}"
        summary = llm.complete(prompt)  # llm.complete is a generic stand-in for your model client
        summaries.append(summary)
    return summaries
The Reduce Phase
Combine chunk results into a final answer:
def reduce_summaries(summaries: list[str], question: str) -> str:
    combined = "\n\n".join(summaries)
    prompt = f"""Based on these section summaries:
{combined}
Answer: {question}"""
    return llm.complete(prompt)
Full Map-Reduce Pipeline
def answer_from_long_document(document: str, question: str) -> str:
    # 1. Chunk the document
    chunks = chunk_by_tokens(document, chunk_size=4000)

    # 2. Map: Extract relevant info from each chunk
    extractions = []
    for chunk in chunks:
        prompt = f"Extract information relevant to: {question}\n\nText: {chunk}"
        extraction = llm.complete(prompt)
        extractions.append(extraction)

    # 3. Reduce: Synthesize final answer
    combined = "\n\n".join([e for e in extractions if e.strip()])
    final_prompt = f"Based on this information:\n{combined}\n\nAnswer: {question}"
    return llm.complete(final_prompt)
Strategy 3: Retrieval-Augmented Generation (RAG)
Instead of processing entire documents, retrieve only relevant sections.
Basic RAG Flow
Document → Chunk → Embed → Vector Store
                                ↓
Query → Embed → Retrieve Top-K → Generate Answer
Implementation Example
from sentence_transformers import SentenceTransformer
import chromadb

# Setup
embedder = SentenceTransformer('all-MiniLM-L6-v2')
db = chromadb.Client()
collection = db.create_collection("documents")

# Index documents
def index_document(doc_id: str, text: str):
    chunks = chunk_by_tokens(text, chunk_size=500)
    embeddings = embedder.encode(chunks)
    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        embeddings=embeddings.tolist(),
        documents=chunks
    )

# Query
def query_documents(question: str, top_k: int = 5) -> str:
    query_embedding = embedder.encode([question])[0]
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )
    context = "\n\n".join(results['documents'][0])
    prompt = f"""Context: {context}
Question: {question}
Answer based on the context above:"""
    return llm.complete(prompt)
When to Use RAG vs. Full Context
| Use RAG When | Use Full Context When |
|---|---|
| Document is very large (100K+ tokens) | Document fits in context |
| Only need specific information | Need holistic understanding |
| Multiple documents to search | Single focused document |
| Questions are specific | Questions require full context |
| Cost is a concern | Quality is paramount |
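One way to act on this table is a small router that measures the document before deciding. The function name and thresholds below are illustrative, not prescriptive:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def choose_strategy(document: str, context_window: int = 128_000, reserved: int = 8_000) -> str:
    doc_tokens = len(enc.encode(document))
    # Use the full document when it fits alongside the prompt and output budget
    if doc_tokens + reserved <= context_window:
        return "full_context"
    return "rag"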
Strategy 4: Hierarchical Summarization
Create summaries at multiple levels for efficient navigation.
Building a Summary Hierarchy
def build_summary_tree(document: str) -> dict:
    # Level 3: Paragraph summaries
    paragraphs = document.split('\n\n')
    para_summaries = [summarize(p, max_tokens=50) for p in paragraphs]

    # Level 2: Section summaries (groups of paragraphs)
    sections = chunk_list(para_summaries, chunk_size=10)
    section_summaries = [summarize('\n'.join(s), max_tokens=100) for s in sections]

    # Level 1: Document summary
    doc_summary = summarize('\n'.join(section_summaries), max_tokens=200)

    return {
        "document_summary": doc_summary,
        "section_summaries": section_summaries,
        "paragraph_summaries": para_summaries,
        "full_text": document
    }
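The sketch above leans on two helpers, summarize and chunk_list, that aren't shown. One minimal interpretation, assuming the same generic llm.complete client used elsewhere in this guide:

def summarize(text: str, max_tokens: int) -> str:
    # Ask the model for a summary within a rough length budget
    return llm.complete(f"Summarize in at most {max_tokens} tokens:\n\n{text}")

def chunk_list(items: list, chunk_size: int) -> list[list]:
    # Group a flat list into consecutive groups of chunk_size items
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]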
Querying the Hierarchy
def hierarchical_query(tree: dict, question: str) -> str:
    # Start with document summary to identify relevant sections
    relevant_sections = identify_relevant_sections(
        tree["document_summary"],
        tree["section_summaries"],
        question
    )

    # Get detailed content from relevant sections only
    detailed_context = get_section_content(tree, relevant_sections)

    # Answer from focused context
    return llm.complete(f"Context: {detailed_context}\n\nQuestion: {question}")
Strategy 5: Conversation Management
In chat applications, context accumulates with every turn.
The Problem
Turn 1: System (500) + User (100) + Assistant (200) = 800 tokens
Turn 2: System (500) + History (800) + User (150) + Assistant (250) = 1,700 tokens
Turn 3: System (500) + History (1,700) + User (200) + Assistant (300) = 2,700 tokens
...
Turn 20: Context overflow!
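To catch this before it becomes an error, count conversation tokens on every turn. A minimal sketch of the count_tokens helper used by the solutions below, assuming tiktoken and ignoring per-message formatting overhead:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages: list) -> int:
    # Rough count: content tokens only, ignoring role markers and message framing
    return sum(len(enc.encode(m["content"])) for m in messages)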
Solution 1: Sliding Window
Keep only the most recent N turns:
def sliding_window_context(messages: list, max_turns: int = 10) -> list:
    system_messages = [m for m in messages if m['role'] == 'system']
    conversation = [m for m in messages if m['role'] != 'system']

    # Keep only recent turns (*2 because each turn is a user + assistant pair)
    recent = conversation[-max_turns * 2:]
    return system_messages + recent
Solution 2: Summarize Old Context
def summarize_history(messages: list, threshold: int = 50000) -> list:
    current_tokens = count_tokens(messages)
    if current_tokens < threshold:
        return messages

    # Summarize older messages
    system = messages[0]             # Keep system prompt
    old_messages = messages[1:-4]    # All but the last 2 turns
    recent_messages = messages[-4:]  # Keep last 2 turns (4 messages)

    summary = llm.complete(f"Summarize this conversation:\n{format_messages(old_messages)}")

    return [
        system,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages
    ]
Solution 3: Hybrid Approach
def manage_context(messages: list, max_tokens: int = 100000) -> list:
    current = count_tokens(messages)
    if current <= max_tokens:
        return messages

    # Try sliding window first
    windowed = sliding_window_context(messages, max_turns=20)
    if count_tokens(windowed) <= max_tokens:
        return windowed

    # Fall back to summarization
    return summarize_history(messages, threshold=int(max_tokens * 0.8))
Choosing the Right Long-Context Model
Decision Framework
Document size < 16K tokens?
├─► Yes → Most models work (GPT-3.5, Mistral, Llama)
└─► No
    Document size < 128K tokens?
    ├─► Yes → GPT-4 Turbo, GPT-4o, Llama 3 (via some providers)
    └─► No
        Document size < 200K tokens?
        ├─► Yes → Claude 3 (any variant)
        └─► No
            Document size < 1M tokens?
            ├─► Yes → Gemini 1.5 Pro
            └─► No → Must use chunking/RAG
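The same decision can be encoded directly. A sketch with a hypothetical pick_model helper; the thresholds mirror the tree above and the token estimate reuses tiktoken:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pick_model(document: str) -> str:
    tokens = len(enc.encode(document))
    if tokens < 16_000:
        return "most models work (e.g. GPT-3.5, Mistral, Llama)"
    if tokens < 128_000:
        return "GPT-4 Turbo / GPT-4o"
    if tokens < 200_000:
        return "Claude 3"
    if tokens < 1_000_000:
        return "Gemini 1.5 Pro"
    return "chunking or RAG required"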
Cost Considerations
Larger context doesn't mean you should use it all:
| Scenario | Model | Context Used | Cost (per request) |
|---|---|---|---|
| Short query | Claude 3 Haiku | 5K tokens | $0.001 |
| Full document | Claude 3 Haiku | 100K tokens | $0.025 |
| Full document | Claude 3 Sonnet | 100K tokens | $0.30 |
Using maximum context is 25-300x more expensive than minimal context.
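The arithmetic behind those figures is simple to reproduce. The per-token rates below (Haiku at $0.25 and Sonnet at $3.00 per million input tokens) are the published prices the table assumes; they change over time, so treat them as illustrative:

PRICE_PER_MTOK = {"claude-3-haiku": 0.25, "claude-3-sonnet": 3.00}  # USD per 1M input tokens

def input_cost(model: str, tokens: int) -> float:
    return PRICE_PER_MTOK[model] / 1_000_000 * tokens

print(input_cost("claude-3-haiku", 5_000))     # ~$0.001
print(input_cost("claude-3-haiku", 100_000))   # $0.025
print(input_cost("claude-3-sonnet", 100_000))  # $0.30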
Best Practices Summary
Do:
- Estimate token counts before sending requests
- Use retrieval for targeted information extraction
- Place critical information at start/end of context
- Implement conversation management for chat apps
- Monitor for "lost in the middle" issues
Don't:
- Dump entire documents when only sections matter
- Ignore output token limits when planning context
- Trust that models process all context equally
- Exceed context limits without error handling
- Pay for 200K context when 5K would suffice
Conclusion
Context windows define what's possible with a single LLM call. Understanding these limits—and the strategies to work around them—is fundamental to building effective AI applications.
Key takeaways:
- Know your limits: Different models have vastly different capacities
- Less is often more: Focused context often outperforms full-document dumps
- Use the right strategy: Chunking, RAG, and summarization each have their place
- Mind the middle: Information placement affects recall accuracy
- Manage conversations: Chat histories grow fast; plan for it
Use our LLM Token Counter to check whether your documents fit within context limits and estimate costs before processing.