What is the KV cache in an LLM?

It's the stored attention keys and values for every token the model has already processed. Each new token attends to all previous tokens, so instead of recomputing their keys and values at every step, the model caches them in GPU memory. That cache makes generation fast, but it grows linearly with sequence length.

Why does long context use so much VRAM locally?

KV-cache size scales linearly with context length, number of layers, number of KV heads, head dimension, and bytes per element. Doubling the context doubles the cache. At long context the cache can match or exceed the size of the model weights themselves: a 70B GQA model at 128K tokens needs about 40 GB of FP16 KV cache, roughly equal to its ~40 GB of Q4 weights, and about 80 GB, double the weights, at 256K.

How do I reduce KV-cache memory?

Use a model with grouped-query attention (GQA) or multi-query attention (MQA), which cut KV heads by 8x or more; quantize the KV cache to FP8/INT8 (half the memory) or Q4 (quarter); use only the context length you actually need; and serve with PagedAttention (vLLM) to eliminate the wasted slack of naive contiguous allocation. Smaller batch sizes also reduce the total, because every concurrent request carries its own cache.

Why can't I just run a 1M-token context locally?

The KV cache at 1M tokens is enormous. A 70B GQA model would need about 320 GB of FP16 KV cache — roughly four 80 GB data-center GPUs just for the cache, before weights. Even with FP8 KV that's ~160 GB. This is why frontier 1M-context models run on clustered server GPUs, not consumer cards.

The Hidden Memory Cost of Long Context: KV Cache and VRAM Explained

On a hosted API, context is a token-cost and recall problem (see managing long documents and context windows for AI coding). On hardware you own, it's a memory problem. This post covers the memory side.

The KV cache: what it is and why it exists

A decoder-only transformer generates one token at a time. To produce the next token, the attention mechanism needs the key and value vectors for every token that came before it. The naive approach would recompute those keys and values for the entire sequence on every single step — quadratic work that gets slower as the text grows.

The KV cache is the optimization that avoids it. When the model processes a token, it computes that token's key and value vectors once and stores them in GPU memory. Every subsequent step reuses them, so each step computes keys and values only for the one new token instead of for the whole sequence again. Attention still reads every cached entry, but the redundant recomputation is gone, which is the difference between "usable" and "unusable" for any real prompt.

The catch is that this cache is not free. It is a per-token, per-layer block of memory that lives in VRAM for the entire life of the request, and it only ever grows. On a hosted API you never see it — the provider eats the memory and charges you per token. On your own GPU, that memory comes straight out of the VRAM budget you were hoping to spend on a bigger model.

The memory formula

The size of the KV cache follows directly from the shape of the attention layers:

kv_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_elem

Each term:

2 — separate tensors for Keys and Values.
n_layers — every transformer layer keeps its own cache.
n_kv_heads × head_dim — the KV width per token, per layer. Under plain multi-head attention (MHA) this equals the full attention width. Under grouped-query attention (GQA) it is much smaller (more on that below).
seq_len — current tokens in context (prompt + everything generated so far). This is the term that moves.
bytes_per_elem — 2 for FP16/BF16, 1 for FP8/INT8 KV, ~0.5 for Q4 KV.

Worked example, a Llama-3 70B-class model: n_layers = 80, n_kv_heads = 8 (GQA), head_dim = 128, FP16 KV. The per-token cost is:

2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes = 0.3125 MB per token (binary units: 1 MB = 1,048,576 bytes)
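The formula and the worked example translate directly into a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` and the constant `MIB` are illustrative choices, not part of any library, and the model shape is the 70B-class example used in this post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size in bytes for one sequence, following the formula above."""
    # The leading 2 accounts for the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


MIB = 1024 ** 2  # binary megabyte, the unit used for the per-token figure

# 70B-class example: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
per_token = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1)

print(per_token)        # 327680
print(per_token / MIB)  # 0.3125
```

Passing a real `seq_len` instead of 1 gives the total cache for one request at that context length.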

So every token of context costs 0.3125 MB of VRAM. That sounds trivial until you multiply by a real context length. For comparison, the weights of that same 70B model at Q4_K_M are roughly 40 GB (see how much VRAM you need to run an LLM). As the table below shows, the KV cache catches up with — and then overtakes — the weights well before you reach the model's advertised context window.

Context length	KV cache (FP16)	KV cache (FP8/INT8)
4K	~1.25 GB	~0.625 GB
8K	~2.5 GB	~1.25 GB
32K	~10 GB	~5 GB
64K	~20 GB	~10 GB
128K	~40 GB	~20 GB
256K	~80 GB	~40 GB

At 128K tokens the FP16 cache (~40 GB) is already as large as the entire quantized model. At 256K it is double the weights. The context window stopped being a recall feature and became the dominant line item in your VRAM budget.
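The table can be reproduced from the per-token cost. The sketch below assumes the same 70B-class shape and binary units (1 GB = 1024³ bytes); `cache_gib` and `rows` are names made up for this example.

```python
GIB = 1024 ** 3

# Bytes per token for the 70B-class example at FP16 (2 bytes per element).
PER_TOKEN_FP16 = 2 * 80 * 8 * 128 * 2


def cache_gib(context_tokens, bytes_per_elem=2):
    """KV-cache size in GB (binary) for one sequence at the given precision."""
    # Scale the FP16 figure by the ratio of element sizes.
    return PER_TOKEN_FP16 * context_tokens * bytes_per_elem / 2 / GIB


rows = [
    (k, cache_gib(k * 1024), cache_gib(k * 1024, bytes_per_elem=1))
    for k in (4, 8, 32, 64, 128, 256)
]

for k, fp16, fp8 in rows:
    print(f"{k}K\t{fp16:g} GB\t{fp8:g} GB")
```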

Doubling context can double your VRAM

The formula has one term that moves during a request, seq_len. The layer count, KV-head count, and head dimension are fixed by the architecture, and bytes_per_elem is fixed once you choose a cache precision. That means KV-cache memory is strictly linear in context length. Go from 32K to 64K and you add another ~10 GB. Go from 64K to 128K and you add ~20 GB more. There is no economy of scale; the curve is a straight line through the origin.

The chart makes the trap obvious. The solid line is the FP16 cache; the dashed gray line is the model's own weights. They cross at 128K tokens, and from there the cache pulls away. The lower blue line shows what an FP8 KV cache buys you — it halves the slope, pushing the crossover with the weights out to 256K. Raising a local context limit is never "just a config flag": each increment is a fixed, predictable VRAM tax, and on a 24 GB or 32 GB consumer card you run out of headroom long before the line reaches the right edge of this graph.
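The crossover points can also be computed rather than read off a graph. A small sketch, assuming the ~40 GB weight figure quoted earlier is treated as exactly 40 binary GB; `crossover_tokens` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def crossover_tokens(weights_gib, per_token_bytes):
    """Context length at which the KV cache equals the size of the weights."""
    return weights_gib * GIB // per_token_bytes


per_token_fp16 = 2 * 80 * 8 * 128 * 2  # 327,680 bytes
per_token_fp8 = 2 * 80 * 8 * 128 * 1   # 163,840 bytes

fp16_cross = crossover_tokens(40, per_token_fp16)
fp8_cross = crossover_tokens(40, per_token_fp8)

print(fp16_cross)  # 131072, i.e. 128K tokens
print(fp8_cross)   # 262144, i.e. 256K tokens
```

Because the relationship is linear, halving the bytes per element exactly doubles the context length at which the cache overtakes the weights.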

Taming the KV cache

Four levers actually move the number. The first is architectural (chosen when you pick a model); the other three are runtime or serving decisions.

Lever	Mechanism	KV memory effect
GQA / MQA	Fewer KV heads shared across query heads	Typically 4–8x smaller (GQA); smaller by the full query-head count, e.g. 64x for 64 heads (MQA)
Quantized KV cache	Store K/V at FP8/INT8 or Q4 instead of FP16	2x (FP8/INT8) to ~4x (Q4) smaller
PagedAttention (vLLM)	Non-contiguous fixed-size pages instead of one buffer per sequence	Reclaims wasted slack; enables high-batch concurrency
Context length / batch size	Use the `seq_len` and concurrency you actually need	Linear — the most direct control

GQA and MQA attack the n_kv_heads term. Multi-head attention gives every query head its own KV head. Grouped-query attention shares one KV head across a group of query heads — the 70B example above has 64 query heads but only 8 KV heads, an 8x reduction in cache versus full MHA. Multi-query attention is the extreme case of one KV head for the whole layer. This is why the same model without GQA would need 8x the numbers in our table — 128K tokens would demand ~320 GB of FP16 KV instead of 40 GB. GQA is not a tuning option; it is the reason long context is feasible at all.
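To see the effect of the n_kv_heads term in isolation, the sketch below holds the 70B-class shape fixed and varies only the number of KV heads. The MHA and MQA rows are hypothetical variants of that model, used purely for comparison.

```python
GIB = 1024 ** 3
CONTEXT = 128 * 1024  # 128K tokens


def cache_gib(n_kv_heads, n_layers=80, head_dim=128, bytes_per_elem=2):
    """FP16 KV cache in GB (binary) at 128K tokens for a given KV-head count."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * CONTEXT / GIB


# 64 query heads in every case; only the number of KV heads differs.
variants = {"MHA": 64, "GQA": 8, "MQA": 1}
sizes = {name: cache_gib(heads) for name, heads in variants.items()}

for name, size in sizes.items():
    print(f"{name}: {size:g} GB")
```

The GQA figure is the 40 GB from the table; the MHA figure is the 320 GB the same model would need without head sharing.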

Quantized KV cache attacks bytes_per_elem. Dropping K and V from FP16 (2 bytes) to FP8 or INT8 (1 byte) halves the cache for a small, usually negligible, quality hit — that is the entire right-hand column of the table above. Q4 KV (~0.5 bytes) quarters it, with a larger but often acceptable accuracy cost at long context. For a memory-bound setup this is frequently the single highest-leverage change.
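As a rough illustration of where the "small quality hit" comes from, here is a toy symmetric INT8 quantizer in plain Python. It is a sketch under simplifying assumptions: one scale per vector and made-up sample values. Real inference engines use their own schemes (per-block or per-channel scales, FP8 formats), so treat this as the general idea rather than any engine's implementation.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization with a single scale for the whole vector."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the stored integer codes."""
    return [c * scale for c in codes]


# Hypothetical slice of one key vector.
key_vector = [0.12, -0.98, 0.45, 0.03, -0.33, 0.77, -0.05, 0.61]

codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print(codes)
print(f"max round-trip error: {max_error:.5f} (bound: {scale / 2:.5f})")
```

Each element now needs one byte instead of two, and the round-trip error per element is bounded by half the scale. That bounded rounding error is the accuracy cost being traded for memory.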

PagedAttention, the core trick in vLLM, doesn't shrink the per-token cost — it stops you from wasting memory. Naive engines allocate one contiguous buffer per sequence sized for the worst case, leaving large gaps unused. PagedAttention stores the KV in fixed-size, non-contiguous "pages" (like virtual memory), giving near-zero fragmentation, prefix sharing across requests, and far higher batch concurrency on the same card. If you're serving multiple users, this is what keeps the GPU full instead of stranded.
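The difference between the two allocation strategies shows up in a simple slot count. This is a toy model, not vLLM's allocator: the sequence lengths are invented, the 4096-token reservation and 16-token block size are assumptions for illustration, and the function names are hypothetical.

```python
import math


def contiguous_slots(seq_lens, max_len):
    """Naive allocation: every sequence reserves max_len token slots up front."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size=16):
    """Paged allocation: each sequence holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [120, 900, 37, 2048, 512]  # tokens currently held by each request
max_len = 4096                        # worst-case length the naive engine reserves

used = sum(seq_lens)
naive = contiguous_slots(seq_lens, max_len)
paged = paged_slots(seq_lens)

print(f"tokens in use:      {used}")
print(f"naive reservation:  {naive} slots ({naive - used} wasted)")
print(f"paged reservation:  {paged} slots ({paged - used} wasted)")
```

With paging, the only waste is the unfilled tail of each sequence's last block, which is why the same card can hold many more concurrent requests.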

Finally, context length and batch size are the levers you control per request. Concurrent requests each carry their own KV cache, so batch size multiplies the cost. The cheapest optimization is honesty about how much context the workload genuinely needs.
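The multiplication by batch size is easy to underestimate, so here is a short sketch. It assumes every concurrent request fills the same context length and that no prefixes are shared between requests; `total_kv_gib` is an illustrative name.

```python
GIB = 1024 ** 3


def total_kv_gib(per_token_bytes, context_tokens, concurrent_requests):
    """Total KV cache in GB (binary) when each request holds its own cache."""
    return per_token_bytes * context_tokens * concurrent_requests / GIB


per_token_fp16 = 2 * 80 * 8 * 128 * 2  # 70B-class example, FP16

single = total_kv_gib(per_token_fp16, 32 * 1024, concurrent_requests=1)
batch_of_four = total_kv_gib(per_token_fp16, 32 * 1024, concurrent_requests=4)

print(single)         # 10.0
print(batch_of_four)  # 40.0
```

Four users at 32K each cost as much cache as one user at 128K.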

Why "1M context" is impractical on consumer GPUs

Run the formula to its conclusion. At 1M tokens, our 70B GQA model needs:

1,048,576 tokens × 0.3125 MB/token ≈ 320 GB of FP16 KV cache

That is the cache alone, before the ~40 GB of weights, before activation buffers, before the 1–2 GB of CUDA runtime overhead. 320 GB is roughly four 80 GB data-center GPUs wired together just to hold the keys and values. Quantize the KV to FP8 and you halve it to ~160 GB, which still fills two such cards before any weights are loaded. A 24 GB RTX 4090 or 32 GB RTX 5090 holds a tenth of the FP16 figure at best.
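The GPU count follows from the same arithmetic. In this sketch an "80 GB" card is treated as 80 binary GB, which is a simplification, and the result covers the cache only, with no room left for weights or overhead.

```python
import math

GIB = 1024 ** 3
TOKENS_1M = 1_048_576
GPU_GIB = 80  # simplifying assumption: an "80 GB" card holds 80 binary GB


def cache_gib(bytes_per_elem):
    """KV cache in GB (binary) at 1M tokens for the 70B-class GQA example."""
    per_token = 2 * 80 * 8 * 128 * bytes_per_elem
    return per_token * TOKENS_1M / GIB


fp16_gib = cache_gib(2)
fp8_gib = cache_gib(1)

print(fp16_gib, math.ceil(fp16_gib / GPU_GIB))  # 320.0 4
print(fp8_gib, math.ceil(fp8_gib / GPU_GIB))    # 160.0 2
```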

Even a comparatively lean 8B GQA model (32 layers, 8 KV heads, head_dim 128) costs ~0.125 MB/token, which is ~16 GB of FP16 KV at 128K: already most of a single 24 GB card before you add the ~5 GB of Q4 weights. The headline "1M token context" you see on hosted GPT-5.5, Claude Opus, and Gemini Pro is a property of clustered data-center accelerators (H100/H200-class hardware, presumably with techniques like quantized KV and paged cache management), not something a desktop GPU was ever going to reach. The hosted long-context model exists precisely because the memory math doesn't close on consumer hardware.

Practical guidance

The actionable version of all this is short:

Pick the context length you actually use, not the maximum the model allows. Most coding and document tasks live comfortably under 32K. Setting a 200K window "just in case" reserves VRAM you'll never fill.
Prefer GQA/MQA models for any local long-context work. It's the difference between 40 GB and 320 GB at 128K.
Turn on FP8/INT8 KV quantization before you reach for a bigger GPU. It halves the most expensive term for a small, usually negligible, quality cost.
Model the full budget — weights + KV + overhead — before you buy hardware. Plug your model, quant, and target context into the LLM VRAM calculator; it takes context length as an input and shows the KV cache explicitly. If you're not sure what your current card can hold, check which LLMs you can run.
Compare owning the hardware against paying per token. A long-context workload that needs multiple 80 GB GPUs may be cheaper to rent by the token; run the break-even in the self-hosted LLM cost calculator.
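A back-of-the-envelope version of "model the full budget" is to turn the formula around and ask how much context fits in the VRAM left after weights and overhead. The sketch below is an estimate under stated assumptions: a single request, binary GB, the ~5 GB and ~40 GB Q4 weight figures from this post, and a 2 GB overhead allowance. `max_context_tokens` is a hypothetical helper, not the site's calculator.

```python
GIB = 1024 ** 3


def max_context_tokens(vram_gib, weights_gib, overhead_gib, per_token_bytes):
    """Largest context (in tokens) whose KV cache fits in the remaining VRAM."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    return max(0, int(free_bytes // per_token_bytes))


# 8B GQA model: 32 layers, 8 KV heads, head_dim 128.
per_token_8b_fp16 = 2 * 32 * 8 * 128 * 2
per_token_8b_fp8 = 2 * 32 * 8 * 128 * 1

fits_fp16 = max_context_tokens(24, 5, 2, per_token_8b_fp16)
fits_fp8 = max_context_tokens(24, 5, 2, per_token_8b_fp8)

# 70B-class model at Q4 on the same 24 GB card: the weights alone do not fit.
fits_70b = max_context_tokens(24, 40, 2, 2 * 80 * 8 * 128 * 2)

print(fits_fp16)  # 139264
print(fits_fp8)   # 278528
print(fits_70b)   # 0
```

On these assumptions a 24 GB card holds roughly 136K tokens of FP16 cache for the 8B model, and twice that with FP8 KV.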

Conclusion

On a hosted API, context length is a number on an invoice. On your own GPU it is a line item in VRAM that grows linearly with every token and, past about 128K on a 70B model, becomes larger than the model itself. The KV-cache formula — 2 × layers × kv_heads × head_dim × seq_len × bytes — is the whole story: GQA shrinks the heads, quantization shrinks the bytes, PagedAttention stops you from wasting the rest, and seq_len is the dial you actually choose. Budget for the cache before you commit to the context window, and the 1M-token fantasy resolves into a concrete, answerable hardware question.

Related Tools

LLM VRAM Calculator

Self-Hosted LLM Cost Calculator

What LLM Can I Run?

Related Articles

How Much VRAM Do You Need to Run an LLM? (The Memory Math, Explained)

Context Window Limits: Managing Long Documents in LLMs

Context Windows Explained: Why Size Matters for AI Coding

Running Local AI: The Complete Guide to Self-Hosting LLMs on Your Own Hardware

How to Run an LLM Locally: A Step-by-Step Guide for Beginners

What Is GGUF? Local AI Model Formats Explained (GGUF, safetensors, MLX, GPTQ, AWQ)