On a hosted API, context is a token-cost and recall problem (see managing long documents and context windows for AI coding). On hardware you own, it's a memory problem. This post covers the memory side.
The KV cache: what it is and why it exists
A decoder-only transformer generates one token at a time. To produce the next token, the attention mechanism needs the key and value vectors for every token that came before it. The naive approach would recompute those keys and values for the entire sequence on every single step — quadratic work that gets slower as the text grows.
The KV cache is the optimization that avoids it. When the model processes a token, it computes that token's key and value vectors once and stores them in GPU memory. Every subsequent step reuses them. Each step still attends over the whole cache, but it only projects the newest token, so the key/value work becomes linear in the number of tokens instead of quadratic. That is the difference between "usable" and "unusable" for any real prompt.
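To make the saving concrete, here is a minimal sketch that only counts key/value projection calls during token-by-token generation. It is a counting model, not an attention implementation: the function name `kv_projection_calls` is made up for illustration, and it deliberately ignores the cost of reading the cache during attention.

```python
def kv_projection_calls(n_tokens: int, use_cache: bool) -> int:
    """Count K/V projections needed to process n_tokens one step at a time.

    Illustrative counting model only; it does not run any attention math.
    """
    calls = 0
    for step in range(1, n_tokens + 1):
        if use_cache:
            calls += 1      # only the newest token is projected, then stored
        else:
            calls += step   # every token seen so far is projected again
    return calls


print(kv_projection_calls(1000, use_cache=False))  # 500500
print(kv_projection_calls(1000, use_cache=True))   # 1000
```

Without the cache the count grows as n(n+1)/2; with it, the count is just n.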
The catch is that this cache is not free. It is a per-token, per-layer block of memory that lives in VRAM for the entire life of the request, and it only ever grows. On a hosted API you never see it — the provider eats the memory and charges you per token. On your own GPU, that memory comes straight out of the VRAM budget you were hoping to spend on a bigger model.
The memory formula
The size of the KV cache follows directly from the shape of the attention layers:
kv_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_elem
Each term:
- 2: separate tensors for keys and values.
- n_layers: every transformer layer keeps its own cache.
- n_kv_heads × head_dim: the KV width per token, per layer. Under plain multi-head attention (MHA) this equals the full attention width. Under grouped-query attention (GQA) it is much smaller (more on that below).
- seq_len: current tokens in context (prompt + everything generated so far). This is the term that moves.
- bytes_per_elem: 2 for FP16/BF16, 1 for FP8/INT8 KV, ~0.5 for Q4 KV.
Worked example, a Llama-3 70B-class model: n_layers = 80, n_kv_heads = 8 (GQA), head_dim = 128, FP16 KV. The per-token cost is:
2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes ≈ 0.3125 MB per token
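The same arithmetic as a small function, in case you want to plug in another model. This is a sketch of the formula above and nothing more; the name `kv_cache_bytes` is an illustrative choice, and the conversion assumes binary units (1 MB = 2^20 bytes), which is what makes 327,680 bytes come out at exactly 0.3125.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: float = 2.0) -> float:
    """KV-cache size in bytes for a single sequence."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Per-token cost of the 70B-class example (seq_len = 1, FP16 KV).
per_token = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1)
print(per_token)           # 327680.0
print(per_token / 2**20)   # 0.3125
```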
So every token of context costs 0.3125 MB of VRAM. That sounds trivial until you multiply by a real context length. For comparison, the weights of that same 70B model at Q4_K_M are roughly 40 GB (see how much VRAM you need to run an LLM). As the table below shows, the FP16 KV cache catches up with the weights at 128K tokens and overtakes them beyond that.
| Context length | KV cache (FP16) | KV cache (FP8/INT8) |
|---|---|---|
| 4K | ~1.25 GB | ~0.625 GB |
| 8K | ~2.5 GB | ~1.25 GB |
| 32K | ~10 GB | ~5 GB |
| 64K | ~20 GB | ~10 GB |
| 128K | ~40 GB | ~20 GB |
| 256K | ~80 GB | ~40 GB |
At 128K tokens the FP16 cache (~40 GB) is already as large as the entire quantized model. At 256K it is double the weights. By then the context window has stopped being a recall feature and become the dominant line item in your VRAM budget.
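The table can be regenerated with a few lines, which is handy when you want the same rows for a different model. The sketch below hard-codes the 70B-class shape from the worked example and assumes "K" means 1,024 tokens and "GB" means 2^30 bytes, since those are the units under which the table's figures come out exactly. The helper names are illustrative.

```python
GIB = 2**30

# Shape of the 70B-class example: 80 layers, 8 KV heads, head_dim 128.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128


def kv_gib(seq_len: int, bytes_per_elem: int) -> float:
    """KV-cache size in GiB for one sequence of seq_len tokens."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len * bytes_per_elem / GIB


def kv_table(context_lengths_k):
    """Return (label, FP16 GiB, FP8/INT8 GiB) for each context length in K tokens."""
    rows = []
    for k in context_lengths_k:
        tokens = k * 1024
        rows.append((f"{k}K", kv_gib(tokens, 2), kv_gib(tokens, 1)))
    return rows


for label, fp16, fp8 in kv_table([4, 8, 32, 64, 128, 256]):
    print(f"{label:>5}  FP16 {fp16:7.3f}  FP8/INT8 {fp8:7.3f}")
```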
Doubling context can double your VRAM
Once you have picked a model and a KV precision, the formula has one variable left to move at runtime, seq_len, and every other term is a constant. That means KV-cache memory is strictly linear in context length. Go from 32K to 64K and you add another ~10 GB. Go from 64K to 128K and you add ~20 GB more. There is no economy of scale; the curve is a straight line through the origin.
The chart makes the trap obvious. The solid line is the FP16 cache; the dashed gray line is the model's own weights. They cross at 128K tokens, and from there the cache pulls away. The lower blue line shows what an FP8 KV cache buys you — it halves the slope, pushing the crossover with the weights out to 256K. Raising a local context limit is never "just a config flag": each increment is a fixed, predictable VRAM tax, and on a 24 GB or 32 GB consumer card you run out of headroom long before the line reaches the right edge of this graph.
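The crossover points follow from dividing the weight size by the per-token cache cost. A minimal sketch, assuming the ~40 GB weight figure used in this post and binary units; `crossover_tokens` is an illustrative name:

```python
GIB = 2**30


def crossover_tokens(weights_bytes: int, per_token_bytes: int) -> int:
    """Context length at which the KV cache equals the weights in size."""
    return weights_bytes // per_token_bytes


weights = 40 * GIB                      # ~40 GB of Q4_K_M weights (from the text)
per_token_fp16 = 2 * 80 * 8 * 128 * 2   # 327,680 bytes per token
per_token_fp8 = 2 * 80 * 8 * 128 * 1    # 163,840 bytes per token

print(crossover_tokens(weights, per_token_fp16))  # 131072  (128K)
print(crossover_tokens(weights, per_token_fp8))   # 262144  (256K)
```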
Taming the KV cache
Four levers actually move the number. The first is architectural (fixed when you pick a model); the other three are decisions you make when you configure and run the serving engine.
| Lever | Mechanism | KV memory effect |
|---|---|---|
| GQA / MQA | Fewer KV heads shared across query heads | Typically 4–8x smaller (GQA); smaller by a factor of the query-head count (MQA) |
| Quantized KV cache | Store K/V at FP8/INT8 or Q4 instead of FP16 | 2x (FP8/INT8) to ~4x (Q4) smaller |
| PagedAttention (vLLM) | Non-contiguous fixed-size pages instead of one buffer per sequence | Reclaims wasted slack; enables high-batch concurrency |
| Context length / batch size | Use the seq_len and concurrency you actually need | Linear — the most direct control |
GQA and MQA attack the n_kv_heads term. Multi-head attention gives every query head its own KV head. Grouped-query attention shares one KV head across a group of query heads — the 70B example above has 64 query heads but only 8 KV heads, an 8x reduction in cache versus full MHA. Multi-query attention is the extreme case of one KV head for the whole layer. This is why the same model without GQA would need 8x the numbers in our table — 128K tokens would demand ~320 GB of FP16 KV instead of 40 GB. GQA is not a tuning option; it is the reason long context is feasible at all.
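To see the n_kv_heads term in isolation, the sketch below holds everything else at the 70B example's values and varies only the number of KV heads. The MHA (64 KV heads) and MQA (1 KV head) rows are hypothetical variants of the same shape for comparison, not real released models.

```python
GIB = 2**30


def kv_gib_at(seq_len: int, n_kv_heads: int, n_layers: int = 80,
              head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 KV-cache size in GiB for a given number of KV heads."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / GIB


seq_len = 128 * 1024
# Only the GQA row matches the example model; MHA and MQA are hypothetical.
for name, heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    print(f"{name}: {heads:>2} KV heads -> {kv_gib_at(seq_len, heads):.0f} GiB")
# MHA: 64 KV heads -> 320 GiB
# GQA:  8 KV heads -> 40 GiB
# MQA:  1 KV heads -> 5 GiB
```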
Quantized KV cache attacks bytes_per_elem. Dropping K and V from FP16 (2 bytes) to FP8 or INT8 (1 byte) halves the cache for a small, usually negligible, quality hit — that is the entire right-hand column of the table above. Q4 KV (~0.5 bytes) quarters it, with a larger but often acceptable accuracy cost at long context. For a memory-bound setup this is frequently the single highest-leverage change.
PagedAttention, the core trick in vLLM, doesn't shrink the per-token cost — it stops you from wasting memory. Naive engines allocate one contiguous buffer per sequence sized for the worst case, leaving large gaps unused. PagedAttention stores the KV in fixed-size, non-contiguous "pages" (like virtual memory), giving near-zero fragmentation, prefix sharing across requests, and far higher batch concurrency on the same card. If you're serving multiple users, this is what keeps the GPU full instead of stranded.
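A toy accounting model shows where the saving comes from. This is not vLLM's allocator, only a comparison of how many token slots each strategy reserves; the sequence lengths, the 4,096-token worst case, and the 16-token page size are all assumptions chosen for illustration.

```python
import math


def contiguous_reserved(seq_lens, max_len: int) -> int:
    """Token slots reserved when every sequence gets a worst-case buffer."""
    return len(seq_lens) * max_len


def paged_reserved(seq_lens, page_size: int = 16) -> int:
    """Token slots reserved when each sequence only holds the pages it needs."""
    return sum(math.ceil(n / page_size) * page_size for n in seq_lens)


seq_lens = [100, 900, 2500, 40]   # hypothetical in-flight requests
used = sum(seq_lens)              # 3540 token slots actually filled

print(contiguous_reserved(seq_lens, max_len=4096))  # 16384
print(paged_reserved(seq_lens))                     # 3584
```

In this made-up batch the contiguous scheme reserves more than four times what is in use, while paging wastes at most one partial page per sequence.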
Finally, context length and batch size are the levers you control per request. Concurrent requests each carry their own KV cache, so batch size multiplies the cost. The cheapest optimization is honesty about how much context the workload genuinely needs.
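Because each concurrent request carries its own cache, a fixed KV budget translates directly into a concurrency limit. A rough sketch using the 70B example's per-token cost; the 40 GiB budget is a hypothetical figure, and the model ignores paging slack and prefix sharing.

```python
GIB = 2**30
PER_TOKEN_BYTES = 2 * 80 * 8 * 128 * 2   # 70B example, FP16 KV


def max_concurrent_requests(kv_budget_gib: int, context_tokens: int,
                            per_token_bytes: int = PER_TOKEN_BYTES) -> int:
    """How many full-length requests fit in a given KV-cache budget."""
    per_request_bytes = context_tokens * per_token_bytes
    return (kv_budget_gib * GIB) // per_request_bytes


# Hypothetical 40 GiB set aside for KV cache.
print(max_concurrent_requests(40, 32 * 1024))  # 4
print(max_concurrent_requests(40, 8 * 1024))   # 16
```

Cutting the context from 32K to 8K quadruples the number of requests the same memory can serve.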
Why "1M context" is impractical on consumer GPUs
Run the formula to its conclusion. At 1M tokens, our 70B GQA model needs:
1,048,576 tokens × 0.3125 MB/token ≈ 320 GB of FP16 KV cache
That is the cache alone, before the ~40 GB of weights, before activation buffers, before the 1–2 GB of CUDA runtime overhead. 320 GB is four 80 GB data-center GPUs wired together just to hold the keys and values. Quantize the KV to FP8 and you halve it to ~160 GB, which is still two such cards. A 24 GB RTX 4090 or 32 GB RTX 5090 falls short of the FP16 figure by a factor of ten or more, and of the FP8 figure by a factor of five or more.
Even a comparatively lean 8B GQA model (32 layers, 8 KV heads, head_dim 128) costs ~0.125 MB/token, which is ~16 GB of FP16 KV at 128K. That is already most of a single 24 GB card before you add the ~5 GB of quantized weights. The headline "1M token context" you see on hosted GPT-5.5, Claude Opus, and Gemini Pro is a property of clustered data-center accelerators and serving stacks built for exactly this problem, not something a desktop GPU was ever going to reach. The hosted long-context model exists precisely because the memory math doesn't close on consumer hardware.
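Turning the question around, you can ask how much context a given card can hold once weights and overhead are paid for. The sketch below is a rough upper bound under stated assumptions: binary units, a flat 2 GiB overhead (the top of the 1–2 GB range mentioned above), and no allowance for activation buffers. The function name is illustrative.

```python
GIB = 2**30


def max_context_tokens(vram_gib: int, weights_gib: int, overhead_gib: int,
                       per_token_bytes: int) -> int:
    """Upper bound on context length for a single request on one card."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    if free_bytes <= 0:
        return 0   # the weights alone do not fit
    return free_bytes // per_token_bytes


per_token_8b = 2 * 32 * 8 * 128 * 2    # 131,072 bytes, FP16 KV
per_token_70b = 2 * 80 * 8 * 128 * 2   # 327,680 bytes, FP16 KV

# 8B model, ~5 GiB of weights, on a 24 GiB card.
print(max_context_tokens(24, 5, 2, per_token_8b))    # 139264
# 70B model, ~40 GiB of weights, on the same card.
print(max_context_tokens(24, 40, 2, per_token_70b))  # 0
```

Even in the best case the 8B model tops out around 136K tokens on a 24 GB card, nowhere near 1M.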
Practical guidance
The actionable version of all this is short:
- Pick the context length you actually use, not the maximum the model allows. Most coding and document tasks live comfortably under 32K. Setting a 200K window "just in case" reserves VRAM you'll never fill.
- Prefer GQA/MQA models for any local long-context work. It's the difference between 40 GB and 320 GB at 128K.
- Turn on FP8/INT8 KV quantization before you reach for a bigger GPU. It halves the most expensive term for a small, usually negligible, quality cost.
- Model the full budget — weights + KV + overhead — before you buy hardware. Plug your model, quant, and target context into the LLM VRAM calculator; it takes context length as an input and shows the KV cache explicitly. If you're not sure what your current card can hold, check which LLMs you can run.
- Compare owning the hardware against paying per token. A long-context workload that needs multiple 80 GB GPUs may be cheaper to rent by the token; run the break-even in the self-hosted LLM cost calculator.
Conclusion
On a hosted API, context length is a number on an invoice. On your own GPU it is a line item in VRAM that grows linearly with every token and, past about 128K on a 70B model with FP16 KV, becomes larger than the quantized weights themselves. The KV-cache formula (2 × layers × kv_heads × head_dim × seq_len × bytes) is the whole story: GQA shrinks the heads, quantization shrinks the bytes, PagedAttention stops you from wasting the rest, and seq_len is the dial you actually choose. Budget for the cache before you commit to the context window, and the 1M-token fantasy resolves into a concrete, answerable hardware question.