
Context Window Limits: Managing Long Documents in LLMs

Learn how to work within LLM context window limits, process documents longer than the model supports, and choose the right long-context model for your needs.

By Inventive HQ Team

You've crafted the perfect prompt and have a 50-page document to analyze. You hit send, and... error: "This model's maximum context length is 8192 tokens." Welcome to one of the most common challenges when working with LLMs: context window limits.

Understanding context windows—and how to work around their limitations—is essential for building practical AI applications. This guide explains what context windows are, compares limits across models, and provides strategies for processing documents that exceed those limits.

Understanding Context Windows

What Is a Context Window?

The context window is the total number of tokens an LLM can process in a single request. This includes:

  • System prompt (persistent instructions)
  • Conversation history (previous messages)
  • User input (current prompt/question)
  • Document content (files, retrieved context)
  • Model output (the response being generated)

Context Window = System + History + Input + Documents + Output

If any combination of these exceeds the context window, the request fails or the model truncates content.
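
A quick sanity check before sending a request is to count the tokens in each component and compare the total against the window. A minimal sketch using the tiktoken tokenizer (the tokenizer choice and the 128K default are assumptions; other model families count tokens differently):

import tiktoken

# cl100k_base approximates recent OpenAI tokenizers; treat the counts as estimates
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(system: str, history: str, user_input: str, documents: str,
                    max_output: int, context_window: int = 128_000) -> bool:
    # Sum tokens for every input component, then reserve room for the output
    input_tokens = sum(len(enc.encode(part))
                       for part in (system, history, user_input, documents))
    return input_tokens + max_output <= context_window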

Current Context Window Sizes

Model                   | Context Window        | Approximate Capacity
------------------------|-----------------------|--------------------------
Gemini 1.5 Pro          | 1,000,000 tokens      | ~700K words, 1,500 pages
Claude 3 (all variants) | 200,000 tokens        | ~150K words, 300 pages
GPT-4 Turbo             | 128,000 tokens        | ~100K words, 200 pages
GPT-4o                  | 128,000 tokens        | ~100K words, 200 pages
Llama 3 70B             | 8,192-128,000 tokens  | Varies by provider
Mistral Large           | 32,000 tokens         | ~25K words, 50 pages
GPT-3.5 Turbo           | 16,384 tokens         | ~12K words, 25 pages

Context Window vs. Output Limit

Don't confuse context window with maximum output:

Model          | Context Window | Max Output
---------------|----------------|--------------
GPT-4 Turbo    | 128K           | 4,096 tokens
Claude 3 Opus  | 200K           | 4,096 tokens
Gemini 1.5 Pro | 1M             | 8,192 tokens

Even with 128K input capacity, you can't generate 128K tokens of output—the output limit is separate and much smaller.
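
In practice that means reserving output room explicitly when you build a request. A minimal sketch using the OpenAI Python SDK (the model name and document placeholder are illustrative):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
document_text = "..."  # your long report text goes here

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user",
               "content": f"Summarize the report below in ten bullet points:\n\n{document_text}"}],
    max_tokens=4096,  # caps the completion only; the 128K window covers prompt + completion together
)
print(response.choices[0].message.content)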

The "Lost in the Middle" Problem

Research from Stanford and UC Berkeley demonstrated that LLMs don't process long contexts uniformly. Information retrieval accuracy follows a U-shaped pattern:

Accuracy
    ^
100%|  *                                    *
    |    *                                *
 75%|      *                            *
    |        *                        *
 50%|          *    *    *    *    *
    +---------------------------------------->
      Start        Middle          End
                Position in Context

Key findings:

  • Information at the beginning: ~90% recall
  • Information in the middle: ~50-70% recall
  • Information at the end: ~85% recall

This has practical implications: don't bury critical information in the middle of long contexts.

Mitigation Strategies

1. Strategic placement: Put the most important information at the beginning or end of your context (see the sketch after these strategies).

2. Repeat key information: Include critical details in both the context and the prompt.

3. Use retrieval: Instead of dumping entire documents, retrieve only relevant sections.

4. Structured formatting: Use clear headers and sections to help the model navigate long contexts.
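
A sketch of the strategic-placement idea: the task and key details bracket the bulky document rather than being buried inside it (the layout below is one reasonable pattern, not a fixed rule):

def build_prompt(question: str, document: str, key_facts: str) -> str:
    # Question and critical details go at the start and end, where recall is highest;
    # the long document sits in the middle.
    return (
        f"Task: {question}\n"
        f"Key details to keep in mind: {key_facts}\n\n"
        f"--- Document ---\n{document}\n--- End document ---\n\n"
        f"Reminder: {question}\n"
        "Answer using only the document above."
    )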

Strategy 1: Chunking

When documents exceed context limits, break them into processable pieces.

Fixed-Size Chunking

The simplest approach—split by token count:

import tiktoken

# Tokenizer for OpenAI-style models; other model families tokenize differently
tokenizer = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    # Encode once, then slice into overlapping windows of chunk_size tokens
    tokens = tokenizer.encode(text)
    chunks = []

    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))

    return chunks

Overlap ensures context continuity—information at chunk boundaries isn't lost.
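
For example, running the function over a long report (the file name and counts are illustrative):

report = open("annual_report.txt").read()  # hypothetical input file
chunks = chunk_by_tokens(report, chunk_size=4000, overlap=200)
print(f"{len(chunks)} chunks of up to 4,000 tokens, overlapping by 200")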

Semantic Chunking

More sophisticated—split at natural boundaries:

import re

def chunk_by_sections(text: str) -> list[str]:
    # Split on markdown headers (#, ##, ###) and drop near-empty fragments
    sections = re.split(r'\n#{1,3}\s', text)
    return [s for s in sections if len(s.strip()) > 100]

Recursive Chunking

LangChain's approach—try larger boundaries first, fall back to smaller:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
)

chunks = splitter.split_text(document)

Strategy 2: Map-Reduce Processing

Process chunks independently, then synthesize results.

The Map Phase

Apply the same operation to each chunk:

def map_summarize(chunks: list[str]) -> list[str]:
    summaries = []
    for chunk in chunks:
        prompt = f"Summarize this section:\n\n{chunk}"
        summary = llm.complete(prompt)
        summaries.append(summary)
    return summaries
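
The llm.complete calls throughout these examples stand in for whatever client you use. A minimal sketch of such a wrapper around the OpenAI Python SDK (the class name, model, and defaults are assumptions, not part of any library):

from openai import OpenAI

_client = OpenAI()

class llm:
    @staticmethod
    def complete(prompt: str, model: str = "gpt-4o-mini", max_tokens: int = 1024) -> str:
        # Thin wrapper so the examples can call llm.complete(prompt)
        response = _client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content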

The Reduce Phase

Combine chunk results into a final answer:

def reduce_summaries(summaries: list[str], question: str) -> str:
    combined = "\n\n".join(summaries)
    prompt = f"""Based on these section summaries:

{combined}

Answer: {question}"""
    return llm.complete(prompt)

Full Map-Reduce Pipeline

def answer_from_long_document(document: str, question: str) -> str:
    # 1. Chunk the document
    chunks = chunk_by_tokens(document, chunk_size=4000)

    # 2. Map: Extract relevant info from each chunk
    extractions = []
    for chunk in chunks:
        prompt = f"Extract information relevant to: {question}\n\nText: {chunk}"
        extraction = llm.complete(prompt)
        extractions.append(extraction)

    # 3. Reduce: Synthesize final answer
    combined = "\n\n".join([e for e in extractions if e.strip()])
    final_prompt = f"Based on this information:\n{combined}\n\nAnswer: {question}"

    return llm.complete(final_prompt)
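
Because the map step treats chunks independently, the per-chunk calls can run concurrently. A sketch using a thread pool (the worker count is an assumption; provider rate limits still apply):

from concurrent.futures import ThreadPoolExecutor

def map_in_parallel(chunks: list[str], question: str, max_workers: int = 8) -> list[str]:
    # Run the per-chunk extraction calls concurrently to cut wall-clock time
    def extract(chunk: str) -> str:
        prompt = f"Extract information relevant to: {question}\n\nText: {chunk}"
        return llm.complete(prompt)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, chunks))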

Strategy 3: Retrieval-Augmented Generation (RAG)

Instead of processing entire documents, retrieve only relevant sections.

Basic RAG Flow

Document → Chunk → Embed → Vector Store
                              ↓
Query → Embed → Retrieve Top-K → Generate Answer

Implementation Example

from sentence_transformers import SentenceTransformer
import chromadb

# Setup
embedder = SentenceTransformer('all-MiniLM-L6-v2')
db = chromadb.Client()
collection = db.create_collection("documents")

# Index documents
def index_document(doc_id: str, text: str):
    chunks = chunk_by_tokens(text, chunk_size=500)
    embeddings = embedder.encode(chunks)

    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        embeddings=embeddings.tolist(),
        documents=chunks
    )

# Query
def query_documents(question: str, top_k: int = 5) -> str:
    query_embedding = embedder.encode([question])[0]

    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )

    context = "\n\n".join(results['documents'][0])

    prompt = f"""Context: {context}

Question: {question}

Answer based on the context above:"""

    return llm.complete(prompt)
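
Putting the two pieces together (the file name and question are illustrative):

# Index once, then answer ad-hoc questions against the stored chunks
index_document("handbook", open("employee_handbook.txt").read())
answer = query_documents("What is the parental leave policy?", top_k=5)
print(answer)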

When to Use RAG vs. Full Context

Use RAG When                          | Use Full Context When
--------------------------------------|--------------------------------
Document is very large (100K+ tokens) | Document fits in context
Only need specific information        | Need holistic understanding
Multiple documents to search          | Single focused document
Questions are specific                | Questions require full context
Cost is a concern                     | Quality is paramount

Strategy 4: Hierarchical Summarization

Create summaries at multiple levels for efficient navigation.

Building a Summary Hierarchy

def build_summary_tree(document: str) -> dict:
    # Level 3: Paragraph summaries
    paragraphs = document.split('\n\n')
    para_summaries = [summarize(p, max_tokens=50) for p in paragraphs]

    # Level 2: Section summaries (groups of paragraphs)
    sections = chunk_list(para_summaries, chunk_size=10)
    section_summaries = [summarize('\n'.join(s), max_tokens=100) for s in sections]

    # Level 1: Document summary
    doc_summary = summarize('\n'.join(section_summaries), max_tokens=200)

    return {
        "document_summary": doc_summary,
        "section_summaries": section_summaries,
        "paragraph_summaries": para_summaries,
        "full_text": document
    }
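
The summarize and chunk_list helpers above are left abstract; one way to sketch them (the prompt wording and grouping are assumptions):

def summarize(text: str, max_tokens: int = 100) -> str:
    # Single LLM call that compresses a piece of text
    return llm.complete(
        f"Summarize the following in at most {max_tokens} tokens:\n\n{text}",
        max_tokens=max_tokens,
    )

def chunk_list(items: list[str], chunk_size: int) -> list[list[str]]:
    # Group a flat list into consecutive groups of chunk_size items
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]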

Querying the Hierarchy

def hierarchical_query(tree: dict, question: str) -> str:
    # Start with document summary to identify relevant sections
    relevant_sections = identify_relevant_sections(
        tree["document_summary"],
        tree["section_summaries"],
        question
    )

    # Get detailed content from relevant sections only
    detailed_context = get_section_content(tree, relevant_sections)

    # Answer from focused context
    return llm.complete(f"Context: {detailed_context}\n\nQuestion: {question}")

Strategy 5: Conversation Management

In chat applications, context accumulates with every turn.

The Problem

Turn 1: System (500) + User (100) + Assistant (200) = 800 tokens
Turn 2: System (500) + History (800) + User (150) + Assistant (250) = 1,700 tokens
Turn 3: System (500) + History (1,700) + User (200) + Assistant (300) = 2,700 tokens
...
Turn 20: Context overflow!

Solution 1: Sliding Window

Keep only the most recent N turns:

def sliding_window_context(messages: list, max_turns: int = 10) -> list:
    system_messages = [m for m in messages if m['role'] == 'system']
    conversation = [m for m in messages if m['role'] != 'system']

    # Keep only recent turns
    recent = conversation[-max_turns * 2:]  # *2 for user+assistant pairs

    return system_messages + recent

Solution 2: Summarize Old Context

def summarize_history(messages: list, threshold: int = 50000) -> list:
    current_tokens = count_tokens(messages)

    if current_tokens < threshold:
        return messages

    # Summarize older messages
    system = messages[0]  # Keep system prompt
    old_messages = messages[1:-4]   # Everything between the system prompt and the last 2 turns
    recent_messages = messages[-4:]  # Keep the last 2 turns (4 messages)

    summary = llm.complete(f"Summarize this conversation:\n{format_messages(old_messages)}")

    return [
        system,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages
    ]

Solution 3: Hybrid Approach

def manage_context(messages: list, max_tokens: int = 100000) -> list:
    current = count_tokens(messages)

    if current <= max_tokens:
        return messages

    # Try sliding window first
    windowed = sliding_window_context(messages, max_turns=20)
    if count_tokens(windowed) <= max_tokens:
        return windowed

    # Fall back to summarization
    return summarize_history(messages, threshold=max_tokens * 0.8)
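
The count_tokens helper used in these snippets can be a thin wrapper over a tokenizer. A sketch assuming tiktoken and OpenAI-style message dicts (it ignores the small per-message overhead, so treat the result as an estimate):

import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages: list[dict]) -> int:
    # Rough total across all message contents
    return sum(len(_enc.encode(m["content"])) for m in messages)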

Choosing the Right Long-Context Model

Decision Framework

Document size < 16K tokens?
├─► Yes → Most models work (GPT-3.5, Mistral, Llama)
└─► No
    Document size < 128K tokens?
    ├─► Yes → GPT-4 Turbo, GPT-4o, Llama 3 (via some providers)
    └─► No
        Document size < 200K tokens?
        ├─► Yes → Claude 3 (any variant)
        └─► No
            Document size < 1M tokens?
            ├─► Yes → Gemini 1.5 Pro
            └─► No → Must use chunking/RAG

Cost Considerations

Larger context doesn't mean you should use it all:

Scenario      | Model           | Context Used | Cost (per request)
--------------|-----------------|--------------|-------------------
Short query   | Claude 3 Haiku  | 5K tokens    | $0.001
Full document | Claude 3 Haiku  | 100K tokens  | $0.025
Full document | Claude 3 Sonnet | 100K tokens  | $0.30
Using maximum context is 25-300x more expensive than minimal context.
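
A small helper makes this easy to check before sending a request (the prices below mirror the illustrative table above and will change over time):

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    # Cost in dollars from token counts and per-million-token prices
    return (input_tokens * input_price_per_m +
            output_tokens * output_price_per_m) / 1_000_000

# e.g. a full 100K-token document with a short answer, at Haiku-class pricing
print(estimate_cost(100_000, 500, 0.25, 1.25))  # ≈ $0.026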

Best Practices Summary

Do:

  • Estimate token counts before sending requests
  • Use retrieval for targeted information extraction
  • Place critical information at start/end of context
  • Implement conversation management for chat apps
  • Monitor for "lost in the middle" issues

Don't:

  • Dump entire documents when only sections matter
  • Ignore output token limits when planning context
  • Trust that models process all context equally
  • Exceed context limits without error handling
  • Pay for 200K context when 5K would suffice

Conclusion

Context windows define what's possible with a single LLM call. Understanding these limits—and the strategies to work around them—is fundamental to building effective AI applications.

Key takeaways:

  1. Know your limits: Different models have vastly different capacities
  2. Less is often more: Focused context often outperforms full-document dumps
  3. Use the right strategy: Chunking, RAG, and summarization each have their place
  4. Mind the middle: Information placement affects recall accuracy
  5. Manage conversations: Chat histories grow fast; plan for it

Use our LLM Token Counter to check whether your documents fit within context limits and estimate costs before processing.

Frequently Asked Questions

What is a context window?

A context window is the maximum number of tokens an LLM can process in a single request, including both the input (your prompt, documents, conversation history) and the output (the model's response). Think of it as the model's working memory—anything beyond this limit simply cannot be seen or processed by the model.
