What is KV-cache quantization?

KV-cache quantization stores the key and value tensors that an LLM caches in VRAM at a lower bit-depth than the usual 16-bit (f16). In llama.cpp you set it with the -ctk and -ctv flags: q8_0 stores each cached value in about 8 bits (roughly half the memory), and q4_0 stores it in about 4 bits (roughly a quarter of the memory). It shrinks the cache without touching the model weights, so it's the most targeted lever you have for fitting more context into a fixed VRAM budget.

Is q8_0 KV cache safe to use in production?

In our benchmark, q8_0 KV kept output 81.6% similar to the f16 reference over a 12-prompt suite — recognizably the same text with some rephrasing and restructuring. That's a defensible trade when you need to stretch context, but "81.6% similar" is a blunt proxy. The right test is to run your own task prompts through both f16 and q8_0 KV and measure the delta on whatever actually matters to you (correctness, code that compiles, structured output), not just textual similarity.

Does KV-cache quantization slow down generation?

Essentially no. All three configs we tested ran within about 5 tok/s of each other (81.8 f16, 76.4 q8_0, 80.3 q4_0). Notably, the q4_0 config whose output diverged almost completely from the reference ran at 80.3 tok/s, almost as fast as the f16 reference itself. That is why relying on throughput monitoring alone is dangerous here: speed is not the diagnostic, output text is.

How much VRAM does KV-cache quantization actually save?

At 8K context on a 7B model, not much — q8_0 saved 208 MB and q4_0 saved 320 MB versus f16, because at short context the model weights dominate and the KV cache is a minority of total VRAM. The payoff scales sharply with context length. At 32K or 128K context the KV cache can become the majority of your VRAM budget, and quantizing it can free multiple gigabytes — the difference between fitting a long-context workload on a consumer card or not.

Should I use q4_0 KV cache at all?

Only if you've explicitly tested its output against f16 for your specific task and found it acceptable. The extra VRAM saving over q8_0 was just 112 MB at 8K context — rarely worth sacrificing coherent output for. For approximate retrieval it might be tolerable; for code generation, math, structured output, or anything where correctness matters, almost certainly not, and the failures will be silent.

Can I reproduce this benchmark?

Yes. It runs on a stock llama.cpp release with the public Qwen2.5-Coder-7B-Instruct-Q4_K_M GGUF. The benchmark harness, the 12-prompt suite, and our raw results are all in the open-source repo, and we're collecting community results from other hardware and context lengths via pull request. Running it at 32K context, where the VRAM math gets interesting, is an especially welcome contribution.

KV-Cache Quantization: The q4_0 Cliff Your Logs Won't Warn You About

Q: Why is q4_0 KV cache a problem?

At 4-bit precision the rounding in each cached key/value element is aggressive enough that attention scores diverge from f16 on every layer. Those small divergences compound across every layer of the model and over hundreds of generated tokens until the output has taken a completely different trajectory. In our test, q4_0 KV output was only 8.3% similar to the f16 baseline: near-total divergence, not mild degradation. And because it's a different answer rather than a broken one, nothing in your logs flags it.

This is experiment #4 in our open-source local-LLM benchmark series, and it's about a setting that quietly ships in a lot of local-inference setups: KV-cache quantization. The pitch is irresistible — flip two flags, cut the per-token memory cost by half or three-quarters, and fit more context in the same VRAM. The flags work exactly as advertised. The VRAM drops. The speed barely changes.

What almost nobody measures is the other half of the trade: what happens to the model's output. So we measured it. One model, one context length, three KV-cache precisions — f16, q8_0, and q4_0 — and a direct comparison of the text each one produces. The result is a clean two-act story, and the second act is a cliff.

The results, up front

Here's the whole experiment in one table — the same model at 8K context, three KV-cache types, measured against the f16 reference:

KV cache	VRAM at 8K	Generation speed	Output similarity vs f16
f16 (16-bit, reference)	4,899 MB	81.8 tok/s	100%
q8_0 (8-bit)	4,691 MB (−208 MB)	76.4 tok/s	81.6% — safe trade
q4_0 (4-bit)	4,579 MB (−320 MB)	80.3 tok/s	8.3% — quality cliff

The short version: q8_0 KV is a reasonable lever. It saves 208 MB at this context (and more as context grows), and output stays recognizable. q4_0 KV is a trap: it saves only 112 MB more than q8_0 at this context, but output similarity collapses to 8.3%. That's not "slightly worse." That's a different answer to the same prompt. And nothing about the speed (a healthy 80.3 tok/s) would ever tell you.

Why KV-cache quantization matters

Every token your model processes leaves a trace in the KV cache — the key and value tensors from each attention layer, stored in VRAM so the model doesn't have to recompute them on the next token. At short contexts this is a rounding error. At long ones it dominates: a 7B model might use ~4 GB for weights but another 2–4 GB or more for KV entries at 32K+ context. The KV cache is what makes long-context inference expensive, and quantizing it is the most targeted lever you have — you're shrinking the cache without touching model weights at all. (For the full breakdown of how the cache grows, see the KV cache's VRAM and memory cost.)

In llama.cpp this is the -ctk / -ctv flag pair. Set them to q8_0 or q4_0 and you cut the per-token KV-cache cost by roughly 50% or 75% respectively. The promise: more context fits in the same VRAM budget. The question this experiment answers: what do you actually give up in output quality, and is there a safe stopping point?
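As a concrete illustration of the flag pair, here is a minimal sketch that builds a llama-server command line for each of the three cache types. The model path is a hypothetical placeholder, and the helper name is ours, not part of the benchmark harness. Depending on your llama.cpp build, a quantized V cache may also need flash attention enabled, so check the binary's --help output before relying on this.

```python
import shlex

def llama_server_args(model_path, n_ctx, cache_type):
    """Build a llama-server argument list with K and V caches set to the same type."""
    return [
        "llama-server",
        "-m", model_path,     # path to the GGUF file (hypothetical below)
        "-c", str(n_ctx),     # context length in tokens
        "-ctk", cache_type,   # K cache type: f16, q8_0, or q4_0
        "-ctv", cache_type,   # V cache type: f16, q8_0, or q4_0
    ]

# Print one command per cache type; nothing is executed here.
for cache_type in ("f16", "q8_0", "q4_0"):
    args = llama_server_args("models/model.gguf", 8192, cache_type)
    print(shlex.join(args))
```

Keeping the K and V types identical mirrors the three configurations tested here; mixing them is possible but was not measured.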

The savings compound with context length. At 8K context, KV is a modest fraction of total VRAM because model weights dominate. At 32K or 128K context, the KV cache can become the majority of your VRAM budget — that is where quantizing it earns back multiple gigabytes and makes the difference between fitting or not on a consumer card. The numbers in this experiment are at 8K; extrapolate the delta and the gap widens substantially.
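The scaling can be sketched with back-of-envelope arithmetic. The snippet below estimates KV-cache size from context length, assuming an architecture of 28 layers, 4 KV heads, and a head dimension of 128 for Qwen2.5-7B (our assumption, not stated in the experiment), and assuming the ggml block layout in which q8_0 and q4_0 store 32 values plus one 16-bit scale per block. Treat it as an estimate, not a measurement.

```python
# Assumed architecture for Qwen2.5-7B (an assumption, not from the benchmark):
# 28 layers, 4 KV heads (grouped-query attention), head dimension 128.
N_LAYERS = 28
N_KV_HEADS = 4
HEAD_DIM = 128

# Bytes per cached value. q8_0 and q4_0 pack 32 values per block plus one
# 16-bit scale, so a block costs 34 and 18 bytes respectively.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

MIB = 1024 * 1024

def kv_cache_mib(n_ctx, cache_type):
    """Estimated size of the K and V caches together, in MiB, for n_ctx tokens."""
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # 2 = keys + values
    return n_ctx * values_per_token * BYTES_PER_VALUE[cache_type] / MIB

for n_ctx in (8192, 32768, 131072):
    f16 = kv_cache_mib(n_ctx, "f16")
    q8 = kv_cache_mib(n_ctx, "q8_0")
    q4 = kv_cache_mib(n_ctx, "q4_0")
    print(f"{n_ctx:>7} tokens: f16 {f16:7.0f} MiB | "
          f"q8_0 saves {f16 - q8:6.0f} MiB | q4_0 saves {f16 - q4:6.0f} MiB")
```

Under these assumptions the 8K estimate comes to about 210 MiB saved for q8_0 and 322 MiB for q4_0, close to the measured 208 MB and 320 MB. Because the cache grows linearly with context, the same savings are 4x larger at 32K and 16x larger at 128K. Models with more KV heads or layers scale proportionally higher.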

How KV-cache quantization works (60-second version)

The KV cache is a table of floating-point values: for every token and every layer, it holds one key vector and one value vector per KV head. Normally each value is stored in f16 (16 bits). Quantizing the cache means storing each value in fewer bits: about 8 for q8_0 and about 4 for q4_0, plus a small shared scale per block of values. Fewer bits means a smaller table and less VRAM. The catch is rounding: every stored value loses precision. At 8 bits the rounding is minor. At 4 bits the rounding is aggressive enough that the model's attention lookups read meaningfully different numbers, and the outputs diverge accordingly. (If quantization in general is new to you, our LLM quantization explainer covers the weight-side story; KV-cache quantization is the same idea applied to the cache instead of the weights.)
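To make the rounding concrete, here is a minimal sketch that quantizes random values in blocks of 32 with one scale per block, then measures the round-trip error at 8 and 4 bits. It uses a simplified symmetric scheme for illustration and is not llama.cpp's exact q8_0 or q4_0 kernel. The Gaussian test data is also an assumption; real key and value tensors have their own distributions.

```python
import random

BLOCK = 32  # values per block, as in the q8_0 and q4_0 formats

def quantize_roundtrip(values, bits):
    """Quantize then dequantize, using one scale per block (simplified symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / qmax if amax > 0 else 1.0
        for v in block:
            q = max(-qmax, min(qmax, round(v / scale)))  # integer code
            out.append(q * scale)                        # value the model reads back
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)  # fixed seed so runs are repeatable
values = [rng.gauss(0.0, 1.0) for _ in range(4096)]
err8 = mean_abs_error(values, quantize_roundtrip(values, 8))
err4 = mean_abs_error(values, quantize_roundtrip(values, 4))
print(f"8-bit mean abs error: {err8:.5f}")
print(f"4-bit mean abs error: {err4:.5f}")
print(f"4-bit error is {err4 / err8:.1f}x the 8-bit error")
```

The step between representable values is set by the scale, and a 4-bit code has far fewer steps to cover the same range, so each stored value lands further from the original on average.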

[Figure: the whole trade across the three configs, with bit-depth, VRAM, and output fidelity side by side.]

The "bits stored" bar is proportional to bit-depth (f16 = full, q8_0 = half, q4_0 = quarter). The fidelity gauge shows the mean difflib similarity of each config's greedy outputs versus f16 over the 12-prompt suite. The q4_0 bar is nearly empty: this is not quantization noise, it's a different model response.

The setup

We kept the methodology deliberately boring so the numbers are comparable: one model (Qwen2.5-Coder-7B-Instruct-Q4_K_M), run at a fixed 8K context (8192 tokens) with three KV-cache precisions: f16 (full precision, baseline), q8_0 (8-bit KV), and q4_0 (4-bit KV). We used greedy decoding (temperature 0), 256 tokens generated per prompt, and a fixed 12-prompt suite spanning code, math, and reasoning. Per cache type we record VRAM used (nvidia-smi delta on model load), generation tok/s, and output similarity: how close each config's greedy output stays to the f16 baseline (mean difflib ratio over the suite; 100% = no change).

Reference test rig. Every number on this page was measured on one machine — an Alienware Aurora R7: Intel Core i7-8700 (6 cores / 12 threads, Coffee Lake); 64 GB DDR4-2400; an NVIDIA RTX 5060 Ti 16 GB (Blackwell) running at PCIe 3.0 ×8; models on a SATA SSD; Windows Server 2025. In-VRAM throughput is GPU-memory-bandwidth bound, so bus / RAM / SSD speed barely move it. Running this yourself? scripts/make_submission.py auto-captures your rig (CPU, RAM, GPU + PCIe link, board, storage) into the submission.

A note on rigor: output similarity (difflib ratio) is a self-contained, relatable proxy metric. The gold-standard quality measure is perplexity or KL-divergence versus FP16. This experiment deliberately favors a signal you can feel — "is this still the same answer?" — over an academic one.
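For readers who want to see what a mean difflib ratio looks like in practice, here is a minimal sketch using Python's standard difflib module. Whether the harness compares characters or tokens, and which SequenceMatcher options it uses, are assumptions on our part; the example strings are made up for illustration and are not benchmark outputs.

```python
from difflib import SequenceMatcher

def similarity(reference, candidate):
    """Character-level difflib ratio in [0, 1]; 1.0 means identical text."""
    return SequenceMatcher(None, reference, candidate, autojunk=False).ratio()

def mean_similarity(reference_outputs, candidate_outputs):
    """Average the per-prompt ratios over a suite of paired outputs."""
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("output lists must be the same length")
    scores = [similarity(r, c) for r, c in zip(reference_outputs, candidate_outputs)]
    return sum(scores) / len(scores)

# Made-up outputs for two prompts, purely for illustration.
f16_outputs = ["def add(a, b):\n    return a + b", "The answer is 42."]
q8_outputs = ["def add(x, y):\n    return x + y", "The answer is 42."]
print(f"mean similarity: {mean_similarity(f16_outputs, q8_outputs):.3f}")
```

Because greedy decoding is deterministic, any score below 1.0 comes from the cache precision rather than sampling noise.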

Results

KV type	VRAM (MB)	tok/s	Similarity vs f16
f16 (reference)	4,899	81.8	100%
q8_0	4,691	76.4	81.6%
q4_0	4,579	80.3	8.3%

The results tell a two-act story. Act one, VRAM and speed, looks almost boring: q8_0 saves 208 MB (4.2%), q4_0 saves 320 MB (6.5%), and generation speed barely moves (81.8 → 76.4 → 80.3 tok/s). Throughput monitoring alone would never flag a problem.

Act two ends that complacency. The output similarity drops from 100% (f16 reference) to 81.6% at q8_0 — recognizably similar text with some rephrasing and restructuring — and then to 8.3% at q4_0. That is not quantization noise; that is a completely different response to the same prompt. Across the 12-prompt suite spanning code, math, and reasoning, the q4_0 KV cache produced outputs that bore almost no textual resemblance to the f16 baseline. The speed was fine. The output was not.

What we learned

q4_0 KV is a quality cliff, not a quality slope

The drop from q8_0 (81.6% similar) to q4_0 (8.3% similar) is not a smooth gradient; it's a cliff. At 4-bit precision, the rounding in each K/V element is aggressive enough that attention scores diverge meaningfully from f16 on every layer. Small divergences compound across every layer of the model and over hundreds of generated tokens until the output has taken a completely different trajectory. This is not a "slightly worse" mode; it is a mode where you get different answers. If your evaluation doesn't explicitly diff q4_0 output against f16, you will not know it's happening.

The VRAM savings are modest at short context — and the calculus changes at long context

At 8K context on a 7B model, the KV cache is a minority of total VRAM. The model weights (Q4_K_M, ~4.3 GB) dominate; the KV entries are a small slice. That's why f16 uses 4,899 MB and q4_0 uses 4,579 MB: a saving of 320 MB on a card with 16 GB. At 32K or 128K context the math can reverse: the KV cache can become the majority of your VRAM budget, and quantizing from f16 to q8_0 can free multiple gigabytes and make the difference between fitting and not fitting a long-context workload on a consumer card. We measured only 8K here, so treat the long-context figures as extrapolation. The lesson isn't that KV quantization is useless; it's that the payoff scales with context length. If you're sizing hardware, our guide on how much VRAM you need to run an LLM sets the baseline, and the cliff here is the quality side of that same budget.

q8_0 is the lever worth reaching for

With 81.6% output similarity and 208 MB freed at 8K (scaling to much more at 32K+), q8_0 is a defensible production choice when you need to stretch your VRAM budget. The quality cost is real — difflib similarity is a blunt instrument and the actual task-level degradation will depend on your prompts and what you're measuring — but outputs remain coherent and recognizable. The right test is to run your actual task prompts through both f16 and q8_0 KV and measure the delta for things that matter to you, not just the difflib ratio.
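One way to measure "things that matter to you" is a task-level pass rate rather than a text diff. The sketch below, using only the standard library, checks whether outputs parse as Python or as JSON and compares the pass rate between two configs. The check functions and the sample outputs are hypothetical illustrations, not part of the benchmark harness or its results.

```python
import ast
import json

def compiles_as_python(text):
    """True if the text parses as Python source (syntax check only, nothing is run)."""
    try:
        ast.parse(text)
        return True
    except (SyntaxError, ValueError):
        return False

def parses_as_json(text):
    """True if the text is a valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def pass_rate(outputs, check):
    """Fraction of outputs that satisfy the given check."""
    return sum(1 for text in outputs if check(text)) / len(outputs)

# Made-up outputs for two code prompts, purely for illustration.
f16_outputs = ["def add(a, b):\n    return a + b\n", "x = [1, 2, 3]\n"]
q8_outputs = ["def add(a, b):\n    return a + b\n", "x = [1, 2, 3\n"]

print("f16 pass rate: ", pass_rate(f16_outputs, compiles_as_python))
print("q8_0 pass rate:", pass_rate(q8_outputs, compiles_as_python))
```

A syntax check is a floor, not a correctness test. For code generation, running the suite's own unit tests against each config's output is the stronger signal.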

Speed is not the diagnostic — output text is

All three configs run within about 5 tok/s of each other (81.8 / 76.4 / 80.3 tok/s). The config whose output diverged almost completely from the reference (q4_0 at 8.3% similarity) runs at 80.3 tok/s, nearly as fast as the f16 reference itself. Throughput monitoring is not enough. If you're serving LLM output in production with KV quantization enabled, you need periodic output comparisons to a reference, not just latency dashboards. (This is the same lesson our other benchmarks keep surfacing about what to actually expect from local LLMs: the headline number is rarely the one that bites you.)
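A periodic output comparison can be as simple as a canary check: keep f16 reference outputs for a few fixed prompts, re-run them against the live config, and flag any prompt whose output drifts below a threshold. The sketch below assumes greedy decoding so outputs are repeatable. The function name, the 0.5 threshold, and the sample data are all hypothetical choices for illustration.

```python
from difflib import SequenceMatcher

def drifted_prompts(reference, candidate, threshold):
    """Return (prompt, score) pairs whose candidate output scores below threshold.

    reference and candidate map prompt text to output text. A prompt missing
    from candidate is treated as empty output.
    """
    flagged = []
    for prompt, ref_text in reference.items():
        cand_text = candidate.get(prompt, "")
        score = SequenceMatcher(None, ref_text, cand_text, autojunk=False).ratio()
        if score < threshold:
            flagged.append((prompt, score))
    return flagged

# Made-up canary data, purely for illustration.
reference = {
    "sum two numbers": "def add(a, b):\n    return a + b",
    "capital of France": "The capital of France is Paris.",
}
live = {
    "sum two numbers": "def add(a, b):\n    return a + b",
    "capital of France": "zzzz qqqq 12345",
}
for prompt, score in drifted_prompts(reference, live, threshold=0.5):
    print(f"DRIFT {score:.2f}: {prompt}")
```

The threshold needs tuning per workload: the 81.6% measured for q8_0 here suggests that a strict cutoff near 1.0 would fire constantly even on a config you consider acceptable.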

Takeaway: reach for q8_0, avoid q4_0 unless you've tested it

When you need to stretch context on a consumer GPU, -ctk q8_0 -ctv q8_0 is the flag pair to reach for. It frees meaningful VRAM at long context, and while output quality degrades measurably, the outputs remain recognizable. At 128K context a q8_0 KV cache can free gigabytes on a 16 GB card, which may decide whether the workload fits at all. Measure the quality delta for your workload, but in our test the tradeoff was well-behaved.

-ctk q4_0 -ctv q4_0 is a different story. The 8.3% output similarity measured here is not a mild degradation — it's near-total divergence from the f16 reference. The additional VRAM saving over q8_0 is 112 MB at 8K context: rarely worth sacrificing coherent output for.

The key question: is 8.3% output similarity to f16 acceptable for your use case? For some approximate retrieval tasks, perhaps. For code generation, math, structured output, or any task where correctness matters, almost certainly not — and the failures will be silent.

Reproduce it — and send us your numbers

Everything here runs on a stock llama.cpp release with the public Qwen2.5-Coder-7B-Instruct-Q4_K_M GGUF. Download it into your models dir, then:

python scripts/bench.py --backend cuda --gpu 0 --ctx 8192
python scripts/aggregate.py        # -> results/data.json
python scripts/inject.py           # bakes into site/index.html

Watch the -ctk / -ctv flags do their work, and compare each config's output against the f16 reference rather than trusting the tok/s number.

This is where you come in. The most valuable contribution is running this experiment at longer context. Our numbers are at 8K, where the VRAM savings are modest; at 32K or 128K the savings balloon and the q8_0-vs-q4_0 decision gets much higher-stakes. A different GPU, a different model family, a longer context window: every data point sharpens the picture of where the cliff actually is.

scripts/make_submission.py auto-captures your hardware (CPU, RAM, GPU + PCIe link, board, storage) into a single JSON file; open a pull request adding it and we'll fold it into a growing community comparison. Full instructions, plus the experiment itself, are in the repo: github.com/InventiveHQ/local-llm-benchmarks. Code is MIT, submitted data is CC BY 4.0.

New to fitting models into a fixed VRAM budget? Start with how much VRAM you need to run an LLM, the KV cache's VRAM and memory cost, and our complete guide to running local AI.
