Open any local-LLM tutorial and you'll see the same flag copied from machine to machine: -c 32768. Crank the context window to the maximum "just in case" — in case you paste a big file, in case the conversation runs long, in case you need the headroom someday. It feels free. You're not using 32K tokens, you're just allowing them.
It is not free. But the bill doesn't land where most people assume. To find out exactly what you pay, we fixed one model and one quant — Qwen2.5-Coder-7B-Instruct at Q4_K_M — and swept only the context-window setting across five sizes (2K, 4K, 8K, 16K, 32K) on a single RTX 5060 Ti running the llama.cpp CUDA build. For each, we launched a fresh llama-server, let it fully load, and measured generation speed, VRAM footprint, and prompt-evaluation latency. Everything below is reproducible on a stock release, and the raw data is open source.
The results, up front
Going from 2K to 32K context cost essentially zero generation speed — but +1.7 GB of VRAM, reserved the moment the server starts. The context-length tax is paid in memory, not throughput.
| Context | tok/s | VRAM (MiB) | TTFT proxy (ms) |
|---|---|---|---|
| 2K | 80.2 | 4,559 | 85.8 |
| 4K | 81.3 | 4,673 | 86.0 |
| 8K | 77.8 | 4,902 | 92.5 |
| 16K | 80.6 | 5,125 | 86.1 |
| 32K | 80.3 | 6,269 | 88.8 |
The speed column is flat. The VRAM column climbs at every step. That single contrast is the entire story, and it has direct consequences for how you should size a local model.
Why context length costs anything at all
When you launch llama-server with -c 32768, the runtime allocates a KV-cache sized for the full 32,768 tokens — right then, before your first prompt arrives. The KV-cache is where the model stores the keys and values for every token it has already seen, so it doesn't have to recompute attention over the whole history on each new token. (If the term is fuzzy, our explainer on what an LLM's KV-cache is and why it eats VRAM walks through the mechanism.)
The catch is that the model has no idea how long your actual inputs will be. It reserves the worst case up front. The cache is not grown lazily as your conversation expands — it is sized at startup for the declared maximum and held for the life of the process. You pay for 32K tokens of cache from minute one, whether you end up using 200 tokens or 32,000.
That's the source of the tax. The question this experiment answers is: what does that reservation actually cost you — in speed, in memory, in latency?
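As a rough cross-check on the size of that reservation, here is a back-of-the-envelope sketch rather than anything taken from the benchmark harness. It assumes the published Qwen2.5-7B architecture (28 layers, 4 key/value heads, head dimension 128) and a 16-bit KV-cache; none of those values are stated in this writeup, so verify them against your own model's config before relying on the output.

```python
# Rough KV-cache size estimate for a grouped-query-attention model.
# ASSUMPTION: the architecture values below come from the published
# Qwen2.5-7B config, and the cache is stored as 16-bit floats.
# They are not taken from this article's harness.
N_LAYERS = 28
N_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2  # f16


def kv_cache_mib(n_ctx: int) -> float:
    """MiB needed to hold keys and values for n_ctx token slots."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return n_ctx * bytes_per_token / (1024 ** 2)


for n_ctx in (2048, 4096, 8192, 16384, 32768):
    print(f"-c {n_ctx:>5}: ~{kv_cache_mib(n_ctx):,.0f} MiB of KV-cache")

gap = kv_cache_mib(32768) - kv_cache_mib(2048)
print(f"32K minus 2K: ~{gap:,.0f} MiB")
```

Under those assumptions the estimated gap between 2K and 32K is about 1,680 MiB, which sits close to the +1,710 MiB we measured. The estimate covers only the cache itself, so any other context-dependent buffers the runtime allocates would come on top of it.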
How it works
Figure: context window grows (left), generation speed is unaffected (middle), VRAM reservation climbs +1.7 GB (right). The "tax" is entirely memory, not throughput.
The setup
We kept the methodology deliberately boring so the numbers are comparable. One model (Qwen2.5-Coder-7B-Instruct Q4_K_M), one backend (CUDA), five context sizes: 2K, 4K, 8K, 16K, 32K. For each we launch a fresh llama-server, let it fully load, then measure three things:
- Generation tok/s: the shared 12-prompt suite, averaged. This is pure generation rate, greedy decoding (temperature 0), 256 output tokens per prompt.
- VRAM used: the nvidia-smi delta between an idle GPU and the loaded server. The KV-cache reservation is the dominant contributor, and it scales with the declared context.
- TTFT proxy: timings.prompt_ms from a fixed ~200-token prompt. This is prompt-evaluation latency, which can rise with the declared context as the attention mask grows.
All of it ran on one machine: an RTX 5060 Ti (16 GB, Blackwell) on a CUDA-12 llama.cpp build, on an otherwise-idle GPU so the VRAM delta is clean. The point of holding everything else fixed is that any change you see in the numbers is attributable to one variable — the -c value.
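The two non-throughput measurements can be sketched in a few lines of Python. This is an illustration, not the code in scripts/bench.py: the function names are hypothetical, and it assumes nvidia-smi is on the PATH and that you already have the JSON body of a llama-server completion response in hand.

```python
import json
import subprocess


def parse_memory_used(csv_text: str) -> list[int]:
    """Parse nvidia-smi CSV output: one memory.used value (MiB) per GPU line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def read_vram_used_mib(gpu_index: int = 0) -> int:
    """Query current VRAM usage for one GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_used(out)[gpu_index]


def prompt_ms(response_body: str) -> float:
    """Extract timings.prompt_ms from a llama-server completion response body."""
    return float(json.loads(response_body)["timings"]["prompt_ms"])


# Intended use (hypothetical flow): read VRAM on the idle GPU, start the
# server with the -c value under test, wait for the load to finish, read
# VRAM again, and report the difference as that run's VRAM figure.
```

Taking the idle reading immediately before each launch keeps the delta honest if anything else touches the GPU between runs.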
Results
The headline across all five runs: generation tok/s is flat at ~80 regardless of context size, ranging from 77.8 to 81.3 tok/s — well within measurement noise. VRAM, by contrast, climbs steadily from 4,559 MiB at 2K to 6,269 MiB at 32K, a clean +1,710 MiB (+1.7 GB). Time-to-first-token barely moves: 85.8 ms at 2K versus 88.8 ms at 32K, a +3 ms difference. The 8K point showed a one-off spike to 92.5 ms, almost certainly single-run noise rather than a real trend.
Three observations, each pulled straight from the table.
Generation speed is bound by model weights, not unfilled context
tok/s held at ~80 across every context size (2K: 80.2, 4K: 81.3, 8K: 77.8, 16K: 80.6, 32K: 80.3). Generation is memory-bandwidth-bound — the bottleneck on each forward pass is reading the model's weight matrices through the GPU, and those weights don't change when you declare a bigger context. The unfilled KV-cache slots simply don't factor into the per-token cost until you actually fill them with real tokens. Declaring 32K instead of 2K does not slow your output stream at all. (This is the same bandwidth ceiling that governs raw tokens-per-second on local hardware generally — the weights, not the context setting, set the pace.)
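To put a number on "flat", the spread can be computed directly from the table. This is a small sketch using only the five figures reported above. A single run per size cannot establish a noise floor on its own, so treat the percentage as descriptive.

```python
from statistics import mean

# Generation speed per context size, copied from the results table.
tok_s = {"2K": 80.2, "4K": 81.3, "8K": 77.8, "16K": 80.6, "32K": 80.3}

avg = mean(tok_s.values())
spread = max(tok_s.values()) - min(tok_s.values())
slowest = min(tok_s, key=tok_s.get)

print(f"mean {avg:.1f} tok/s, spread {spread:.1f} tok/s ({spread / avg:.1%} of mean)")
print(f"slowest run: {slowest}")
```

The slowest run is the 8K one, not the 32K one, and 32K lands within 0.1 tok/s of 2K, so there is no downward trend with context in this data.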
VRAM cost is real, linear, and charged immediately
VRAM went from 4,559 MiB at 2K to 6,269 MiB at 32K, and that +1,710 MiB is allocated the moment llama-server starts, before a single token is generated. Averaged over the sweep, that works out to about 57 MiB per 1K tokens of declared context, but the individual steps were uneven: the four doublings added 114, 229, 223, and 1,144 MiB in turn, so the final jump from 16K to 32K accounts for about two-thirds of the total. On a VRAM-tight card, that 1.7 GB is not a rounding error. It can be the difference between fitting a second model, enabling a quantized KV-cache, or running out of memory entirely.
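The per-step increments can be read straight off the table. The sketch below does only that arithmetic, with context sizes expressed in units of 1K tokens.

```python
# (context in K tokens, VRAM in MiB), copied from the results table.
vram_mib = [(2, 4559), (4, 4673), (8, 4902), (16, 5125), (32, 6269)]

steps = []
for (k0, v0), (k1, v1) in zip(vram_mib, vram_mib[1:]):
    added = v1 - v0
    steps.append(added)
    print(f"{k0}K -> {k1}K: +{added} MiB ({added / (k1 - k0):.1f} MiB per 1K tokens)")

total = vram_mib[-1][1] - vram_mib[0][1]
span_k = vram_mib[-1][0] - vram_mib[0][0]
print(f"total: +{total} MiB, {total / span_k:.1f} MiB per 1K tokens on average")
```

The per-1K rate is what a strictly linear cost would hold constant. Here it dips on the 8K-to-16K step and rises on the last one, which is a good reason to measure your own model and quant rather than extrapolate from a single step.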
TTFT impact is modest — the real cost is memory, not latency
Time-to-first-token increased by only +3 ms from 2K to 32K (85.8 → 88.8 ms). Even the noisy 8K spike at 92.5 ms is only +6.7 ms over the 2K baseline. For practically any interactive workload this is imperceptible. If you're optimizing for latency, the context-window setting is not the knob to reach for — the story here is VRAM, full stop.
So what should you actually set?
The instinct to set context high "just in case" treats the setting as free optionality. It isn't — it's a standing memory reservation you pay for the whole time the server is up, used or not. Here's how to spend it deliberately.
Match -c to your real workload, not the worst case you can imagine. If you mostly run short chats, single-file code edits, or quick Q&A, your prompts rarely exceed a few thousand tokens. A 4K or 8K context covers that comfortably and hands you back 1+ GB of VRAM. Reserve big context for the sessions that genuinely need it — long documents pasted in, multi-turn agent loops, large codebases in the prompt.
Spend the freed VRAM where it moves the needle. That ~1.7 GB you'd burn on unused 32K headroom can instead buy you a meaningfully larger model (a bigger model at a lower context often beats a smaller one at huge context), a second model loaded alongside the first, or room for KV-cache quantization. If your card is the limiting factor, see our guide on how much VRAM you actually need to run an LLM to plan the budget.
Remember what a "token" of context really is. Thirty-two thousand tokens is a lot of headroom — roughly a small book. Most interactive prompts are a tiny fraction of that. If you're unsure how your inputs map to token counts, tokens explained covers the conversion. Sizing context well starts with a realistic estimate of how long your inputs run.
If you regularly paste large inputs, pay the cost on purpose. None of this means small context is always right. It means the choice should be a decision, not a default. When you know you'll feed the model 20K-token documents, set the context for it and budget the memory. The mistake is reserving the maximum reflexively on a workload that never approaches it.
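One way to turn that advice into a decision is sketched below. The helper names are hypothetical, and the VRAM figures are the ones measured in this sweep, so they apply only to this model, quant, and card. The sketch picks the smallest swept context size that holds the longest prompt you expect plus the tokens you plan to generate, since both share the same window.

```python
# VRAM measured in this sweep (MiB), keyed by -c value. Specific to
# Qwen2.5-Coder-7B-Instruct Q4_K_M on the test card.
MEASURED_VRAM_MIB = {2048: 4559, 4096: 4673, 8192: 4902, 16384: 5125, 32768: 6269}


def pick_context(max_prompt_tokens: int, max_output_tokens: int) -> int:
    """Smallest swept -c value that fits the prompt plus the generated tokens."""
    needed = max_prompt_tokens + max_output_tokens
    for n_ctx in sorted(MEASURED_VRAM_MIB):
        if n_ctx >= needed:
            return n_ctx
    raise ValueError(f"{needed} tokens exceeds the largest swept size")


def vram_saved_vs_32k(n_ctx: int) -> int:
    """MiB freed by choosing n_ctx instead of 32768, per the measured table."""
    return MEASURED_VRAM_MIB[32768] - MEASURED_VRAM_MIB[n_ctx]


chat = pick_context(max_prompt_tokens=3000, max_output_tokens=256)
docs = pick_context(max_prompt_tokens=20000, max_output_tokens=1024)
print(f"short chats: -c {chat}, frees {vram_saved_vs_32k(chat)} MiB")
print(f"20K-token documents: -c {docs}, frees {vram_saved_vs_32k(docs)} MiB")
```

For the short-chat case this lands on 4K and frees roughly 1.6 GB. For the 20K-token document case it lands on 32K, which is the point: the large setting is correct when the workload calls for it.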
Practical takeaway: Set -c to what you actually need, not "just in case." The tax for unused context is almost entirely VRAM, not generation speed. On a 16 GB card, going from 2K to 32K costs ~1.7 GB of headroom you might want for a bigger model, a second model, or KV-cache quantization.
Reproduce it — and send us your numbers
This is experiment #6 in our open-source local-LLM benchmark series, and the whole point is to build a community map of how these costs scale across real hardware. Our sweep is one 7B model at one quant on one card — your model, quant, and GPU will land at different absolute numbers, and that's exactly the data worth collecting.
Everything runs on a stock llama.cpp release with the public Qwen2.5-Coder-7B-Instruct Q4_K_M GGUF. Drop the model in your models directory, set SPECBENCH_MODELS_DIR, and run:
python scripts/bench.py --backend cuda --gpu 0
python scripts/aggregate.py # -> results/data.json
python scripts/inject.py # bakes results into the writeup
The full sweep, harness, and our raw data.json live in the context-length-tax experiment. Run on an otherwise-idle GPU so the VRAM delta is clean, then open a pull request with your hardware's numbers — a bigger model, a quantized KV-cache, an Apple Silicon Mac, an old Pascal card. Every data point sharpens the picture of what context length really costs.
New to running models on your own hardware? Start with our complete guide to running local AI, then dig into what an LLM's KV-cache costs in VRAM.
