Quantization is the single most effective lever for running a capable model on a card you can actually afford. Qwen2.5-Coder-7B in its native 16-bit format needs roughly 14 GB of VRAM — more than most consumer GPUs have. Quantize it and that footprint collapses, the weights fit in fast memory, and generation speeds up. The catch nobody quantifies for you: quantization is lossy, and somewhere down the bit-depth ladder, output quality falls off a cliff.
So we measured exactly where. We took one model — Qwen2.5-Coder-7B-Instruct — and ran it through every GGUF quant level from Q2_K up to Q8_0 on an RTX 5060 Ti (16 GB, Blackwell) with llama.cpp's CUDA backend, recording generation speed, VRAM, file size, and two quality signals at each step. Everything below is reproducible on a stock llama.cpp release, and the harness plus raw data are open source.
The results, up front
| Metric | What we found |
|---|---|
| Speed range | Q2_K hits 91.2 tok/s; Q8_0 bottoms at 49.6 tok/s — a 1.84× spread |
| Output fidelity vs Q8_0 | Q2_K 21.5% → Q4_K_M 56.6% → Q6_K 79.7% |
| The cliff | Below Q4_K_M, similarity lost per GB saved rises about 1.7× (≈12.7 → ≈21.1 points per GB) |
| The sweet spot | Q4–Q5 — most of the savings, tolerable quality loss |
The short version: lower bits genuinely buy you speed and VRAM, but the fidelity cost is not linear. From Q8 down to Q4 you slide gradually; below Q4 you fall off a ledge. Q4_K_M and Q5_K_M are where almost everyone should live. Here's the full picture, and how to land on the right side of that ledge.
Why quantization is the lever everyone needs to understand
Running a 7B model at full 16-bit precision means streaming ~14 GB of weights through GPU memory on every forward pass. Quantization maps each weight from a 16-bit float to a smaller integer format, trading numerical precision for a large reduction in file size and VRAM footprint. That smaller footprint is what lets a big model fit on a small card — and it's also why it runs faster.
The reason smaller means faster is that local LLM inference is almost entirely memory-bandwidth-bound: the GPU spends most of each forward pass moving weights from VRAM to its compute cores, not doing arithmetic. A model that is half the size streams in roughly half the time. Q4_K_M on this 7B model uses 4,673 MB of VRAM versus 7,669 MB at Q8_0, and generates tokens about 1.6× faster — not because it computes less, but because the data movement is cheaper. (If you're sizing hardware around this, our guide on how much VRAM you need to run an LLM goes deeper.)
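The footprint arithmetic behind these figures can be sketched in a few lines. This is a back-of-the-envelope estimate, assuming a nominal 7 billion parameters and ignoring per-block scales, mixed-precision tensors, KV cache, and runtime buffers, which is why the measured sizes in this article come out larger than the nominal values. The function name `nominal_weight_gb` is illustrative, not part of any library.

```python
def nominal_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Nominal weight storage in GB (10^9 bytes): params x bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal "7B"; the exact count is slightly higher

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{nominal_weight_gb(N_PARAMS, bits):.1f} GB")
```

Because generation is bandwidth-bound, the same ratio that shrinks the footprint is a rough ceiling on the speedup you can expect.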
The flip side is that quantization is lossy and irreversible. At high bit-depths (Q6, Q8) the rounding error is small and practically invisible. Push to Q3 or Q2 and those numerical errors compound layer after layer, across billions of weights, until the generated text diverges meaningfully from what the full-precision model would have said. For the full conceptual background, see what quantization actually does and how the GGUF format stores these weights.
How quantization works — and where fidelity breaks
Each weight in the model is a floating-point number. Quantization groups weights into blocks and maps the values in each block onto a small set of representable integers: 8 bits gives 256 levels of precision, 4 bits gives 16, 2 bits gives only 4. The fewer the levels, the coarser the rounding, and the larger the accumulated error in outputs. GGUF's K-quant family (Q3_K_M, Q4_K_M, Q5_K_M…) stores a per-block scaling factor that recovers some quality — but the fundamental limit is always the bit depth. Below Q4, there simply aren't enough levels to represent the model's weight distribution cleanly.
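To make the "fewer levels, coarser rounding" point concrete, here is a minimal sketch of block-wise quantization on synthetic weights. It uses a plain min/max scheme with one scale per block; that is an assumption chosen for readability and is not the layout GGUF K-quants actually use. The helper names, the block size of 32, and the Gaussian test weights are all illustrative.

```python
import math
import random


def quantize_block(block, bits):
    """Round a block of floats onto 2**bits evenly spaced levels and back.

    Simplified min/max scheme for illustration only (assumption: this is
    NOT llama.cpp's K-quant algorithm).
    """
    levels = 2 ** bits
    lo, hi = min(block), max(block)
    scale = (hi - lo) / (levels - 1) or 1.0  # per-block scaling factor
    codes = [round((w - lo) / scale) for w in block]  # integers in [0, levels - 1]
    return [lo + c * scale for c in codes]  # dequantized weights


def rms_error(weights, bits, block_size=32):
    """RMS rounding error when quantizing `weights` block by block."""
    sq_err = 0.0
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        restored = quantize_block(block, bits)
        sq_err += sum((w - r) ** 2 for w, r in zip(block, restored))
    return math.sqrt(sq_err / len(weights))


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights
for bits in (8, 4, 2):
    print(f"{bits}-bit ({2 ** bits:>3} levels): RMS error {rms_error(weights, bits):.6f}")
```

The error grows as the level count shrinks, which is the per-weight effect that later shows up as drift in generated text.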
Left bars: weight file size, proportional to bit-depth (all teal — size is the point). Right bars: how closely each quant's greedy output matches Q8_0 (green = close, amber = moderate drift, red = substantial drift). The red dashed line marks the quality cliff — below Q4_K_M, fidelity drops steeply.
The similarity score measures how much quantization changed the model's greedy token choices versus the Q8_0 reference — it is a drift proxy, not an absolute correctness guarantee. A 57%-similar model might still be perfectly adequate for your use case; a 90%-similar one might still get a critical answer wrong. Use it as a relative signal: a large drop tells you the quant is substantially altering outputs, not just slightly rounding them.
The setup — reproducible on any GPU
We kept the methodology deliberately boring so the numbers are comparable: one model, greedy decoding (temperature 0), 256 tokens generated per prompt, and a shared 12-prompt suite spanning code, math, reasoning, summarization, and chat. Per quant we record generation speed, VRAM footprint, file size, and two quality signals — a graded math pass-rate over 8 multi-step problems with exact integer answers, and output similarity versus Q8_0 (mean difflib sequence-match ratio over the shared suite).
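For readers who want to see what a difflib-based similarity score looks like, here is a minimal sketch using Python's standard `difflib.SequenceMatcher`. It compares completions character by character with default settings; whether the harness compares characters or tokens, and how it configures the matcher, is not stated here, so treat those details as assumptions. The sample completions are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean


def output_similarity(reference_outputs, candidate_outputs):
    """Mean SequenceMatcher ratio across paired completions (0.0 to 1.0)."""
    ratios = [
        SequenceMatcher(None, ref, cand).ratio()
        for ref, cand in zip(reference_outputs, candidate_outputs)
    ]
    return mean(ratios)


# Hypothetical completions for two prompts (not taken from the benchmark data).
q8_outputs = ["def add(a, b):\n    return a + b", "The answer is 42."]
q4_outputs = ["def add(a, b):\n    return a + b", "The answer is 41."]

print(f"similarity vs Q8_0: {output_similarity(q8_outputs, q4_outputs):.1%}")
```

Greedy decoding matters here: with temperature 0 the completions are deterministic, so any difference from the reference comes from the quantization rather than sampling.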
| Component | What we used |
|---|---|
| GPU | RTX 5060 Ti — Blackwell (sm_120), 16 GB VRAM, running at PCIe 3.0 ×8 |
| Runtime | llama.cpp, CUDA backend |
| Model | Qwen2.5-Coder-7B-Instruct (GGUF; bartowski repo) |
| Quant levels | Q2_K · Q3_K_M · Q4_K_M · Q5_K_M · Q6_K · Q8_0 |
| Reference | Q8_0 — highest-precision quant tested; the similarity baseline |
One honesty note up front: in-VRAM throughput is GPU-memory-bandwidth bound, so the test rig's bus, system RAM, and SSD speed barely move the speed numbers — they only shape offload and CPU-bound scenarios we didn't sweep here.
The numbers
Here is the full sweep, every quant level on one card:
| Quant | Size (MB) | VRAM (MB) | tok/s | Math | Similarity vs Q8_0 |
|---|---|---|---|---|---|
| Q2_K | 3,016 | 3,205 | 91.2 | 2/8 | 21.5% |
| Q3_K_M | 3,808 | 3,907 | 73.0 | 3/8 | 40.9% |
| Q4_K_M | 4,683 | 4,673 | 79.0 | 3/8 | 56.6% |
| Q5_K_M | 5,445 | 5,333 | 69.8 | 3/8 | 68.9% |
| Q6_K | 6,254 | 6,037 | 59.1 | 3/8 | 79.7% |
| Q8_0 (reference) | 8,099 | 7,669 | 49.6 | 3/8 | 100.0% |
The speed column trends faster as bits drop: Q2_K peaks at 91.2 tok/s and Q8_0 bottoms out at 49.6 tok/s, a 1.84× range. But the trend is not monotonic: Q3_K_M (73.0 tok/s) is actually slower than Q4_K_M (79.0 tok/s) despite having fewer bits. The likely cause is a K-quant structure quirk: Q3_K_M's dequantization layout can be less cache-friendly than Q4_K_M's. Read the speed trend as a general gradient, not a strict line, and don't over-read small differences between adjacent levels.
The quality columns are where the real story is. The math pass-rate is nearly flat: 3/8 from Q3_K_M through Q8_0, dropping a single point to 2/8 only at Q2_K. The output-similarity column tells a far clearer story. It falls from 100% at Q8_0 to 79.7% at Q6_K, 68.9% at Q5_K_M, and 56.6% at Q4_K_M, then drops to 40.9% at Q3_K_M and all the way to 21.5% at Q2_K. At that point only about a fifth of the output still matches what Q8_0 generated. This isn't minor rounding noise; the model is behaving like a qualitatively different one.
What we learned
The fidelity cliff is real and steep below Q4
From Q8_0 down to Q4_K_M, similarity drops from 100% to 56.6%: a 43-point slide spread over three steps and 3.4 GB of file size. Then Q4_K_M to Q3_K_M loses another 16 points in one step (56.6% → 40.9%), and Q3_K_M to Q2_K drops 19 more (40.9% → 21.5%), a further 35 points for only 1.7 GB saved. The cliff isn't a metaphor: per gigabyte saved, the loss rate below Q4 is roughly 1.7× what it is above it (about 21 versus 13 similarity points per GB). For anything that needs factual accuracy, code correctness, or coherent multi-step reasoning, treat Q2 and Q3 as genuinely lossy formats, not slightly compressed versions of the same model.
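The step sizes quoted above can be recomputed directly from the results table. The sketch below also normalizes the loss by file size saved, which is one possible way (our choice here, not necessarily the only reasonable one) to put a number on how much steeper the curve gets below Q4_K_M. It treats 1,000 MB as 1 GB.

```python
# (quant, file size in MB, similarity vs Q8_0 in %) from the results table
SWEEP = [
    ("Q8_0", 8099, 100.0),
    ("Q6_K", 6254, 79.7),
    ("Q5_K_M", 5445, 68.9),
    ("Q4_K_M", 4683, 56.6),
    ("Q3_K_M", 3808, 40.9),
    ("Q2_K", 3016, 21.5),
]


def loss_per_gb(start, end):
    """Similarity points lost per GB of file size saved between two quants."""
    by_name = {name: (size, sim) for name, size, sim in SWEEP}
    (size_a, sim_a), (size_b, sim_b) = by_name[start], by_name[end]
    return (sim_a - sim_b) / ((size_a - size_b) / 1000)


# Per-step view: points lost and MB saved for each adjacent pair.
for (a, size_a, sim_a), (b, size_b, sim_b) in zip(SWEEP, SWEEP[1:]):
    print(f"{a:>7} -> {b:<7} -{sim_a - sim_b:4.1f} pts for {size_a - size_b:4d} MB saved")

above = loss_per_gb("Q8_0", "Q4_K_M")
below = loss_per_gb("Q4_K_M", "Q2_K")
print(f"above Q4: {above:.1f} pts/GB, below Q4: {below:.1f} pts/GB ({below / above:.2f}x)")
```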
The practical knee is Q4–Q5
Q4_K_M cuts the file from 8.1 GB to 4.7 GB (42% smaller) and VRAM from 7,669 MB to 4,673 MB (39% less), while output similarity sits at 56.6%, still above the cliff. Q5_K_M pushes similarity to 68.9% at 5.4 GB and 5,333 MB; that extra ~700 MB buys a meaningful fidelity improvement. The Q5→Q6 step (68.9% → 79.7%) costs another ~0.8 GB for a solid gain, and is worth it when VRAM headroom allows. But the biggest bang-per-gigabyte is in the Q4→Q5 range. Below Q4 you're buying speed and size at a disproportionate quality cost.
Speed favors lower bits, but not in a straight line
Broadly, fewer bits stream through memory faster — the model is smaller, and inference is bandwidth-bound. The range across the full sweep is 91.2 tok/s (Q2_K) versus 49.6 tok/s (Q8_0), a genuine 1.84× at the extremes. Within the middle, don't over-read small gaps: Q3_K_M at 73.0 tok/s trails Q4_K_M at 79.0 tok/s, an artifact of K-quant dequantization overhead rather than a principle. Speed trends are real at scale; adjacent-level comparisons can reverse. (For what these tok/s figures feel like in practice, see local LLM performance — what to expect.)
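One way to see how far each quant departs from a pure bandwidth-bound picture is to multiply file size by tokens per second. The sketch below assumes every generated token reads the whole weight file once, uses file size as a stand-in for weight bytes, treats 1,000 MB as 1 GB, and ignores KV-cache traffic; all of those are simplifications, so read the result as a rough indicator rather than a measured bandwidth.

```python
# (quant, file size in MB, tok/s) from the results table
RUNS = [
    ("Q2_K", 3016, 91.2),
    ("Q3_K_M", 3808, 73.0),
    ("Q4_K_M", 4683, 79.0),
    ("Q5_K_M", 5445, 69.8),
    ("Q6_K", 6254, 59.1),
    ("Q8_0", 8099, 49.6),
]


def effective_gb_per_s(size_mb, tok_per_s):
    """Weight GB streamed per second if every token reads the whole file."""
    return size_mb * tok_per_s / 1000


for name, size_mb, tps in RUNS:
    print(f"{name:<7} {effective_gb_per_s(size_mb, tps):6.1f} GB/s")
```

If inference were limited by data movement alone, every row would land on roughly the same figure. On these numbers Q2_K and Q3_K_M come out well below the other four, which is consistent with extra per-weight work at the lowest bit depths rather than a change in the underlying principle.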
The math pass-rate is not a sensitive signal here
The 8-question graded math set is hard (multi-step word problems with exact integer extraction), and a 7B instruct model scores 3/8 at almost every quant level. Q2_K drops to 2/8: a data point, not a verdict. The flat line doesn't mean quantization leaves correctness untouched; it means this particular probe isn't fine-grained enough to separate the levels. The output-similarity slope is the primary quality signal in this experiment. To be candid, output similarity plus a small graded set are rough proxies, not gold-standard metrics. Perplexity and KL-divergence versus FP16 would be more rigorous, and those are reserved for a future pass.
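A graded set with exact integer answers can be scored with very little code, which also shows why it is such a coarse probe: each question is simply right or wrong. The sketch below takes the last integer in a completion as the model's answer. That extraction rule, the function names, and the sample outputs are assumptions for illustration; the harness may grade differently.

```python
import re


def extract_final_integer(text):
    """Return the last integer that appears in `text`, or None if there is none."""
    # (?<!\d) keeps the "-" in "10-20" from being read as a minus sign.
    matches = re.findall(r"(?<!\d)-?\d[\d,]*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def pass_rate(outputs, expected):
    """Count completions whose final integer equals the expected answer."""
    passed = sum(extract_final_integer(o) == e for o, e in zip(outputs, expected))
    return passed, len(expected)


# Hypothetical completions and answers (not the benchmark's actual problems).
outputs = ["Adding both gives 1,234.", "So she has 17 apples left.", "I am not sure."]
expected = [1234, 18, 5]
print("passed %d of %d" % pass_rate(outputs, expected))
```

With only 8 questions, one flipped answer moves the score by 12.5 points, so small differences between quant levels are hard to distinguish from chance.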
"Similarity" measures drift from Q8_0, not absolute quality
Output similarity tells you how much quantization changed the model's greedy completions relative to Q8_0 — a drift proxy, not a correctness certificate. Two caveats matter: (1) Q8_0 itself introduces some quantization error versus FP16; it's the highest-precision quant in this sweep, not ground truth. (2) A lower-similarity quant might still produce correct code on your task, and a high-similarity quant might still fail an edge case. Run your own prompts on Q4 and Q5 before committing to one for production.
What to actually pick
If you want one answer: Q4_K_M. It's the right default for consumer GPU use: a 42% smaller file and 39% less VRAM than Q8_0, about 1.6× faster generation, and 57% output similarity, which sits above the cliff and well clear of the collapse zone. It's the level you reach for first and only move off of for a specific reason.
Move up to Q5_K_M when VRAM allows and fidelity matters more than the last bit of speed — similarity climbs to 69% for a modest ~700 MB. Go to Q6_K if you've got ~6 GB to spare on the model and want 80% similarity for higher-stakes work.
Avoid Q2_K and Q3_K_M unless you've explicitly tested your target task and found the output acceptable. At 21% and 41% similarity, these are qualitatively different models, not mildly lossy compressions. The one time they earn a look is when Q4_K_M genuinely won't fit your VRAM budget — and even then, benchmark Q3_K_M on your actual prompts first. Do not assume it behaves like Q4. The whole point of this sweep is that the gap between them is a cliff, not a slope.
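The guidance above can be written down as a small selection helper. This is one reading of the recommendations, using the measured VRAM figures from this sweep. The function name, the `headroom_mb` allowance for KV cache and runtime buffers, and its 1,024 MB default are hypothetical placeholders that you should tune to your own context length.

```python
# Measured VRAM (MB) per quant from the results table, best fidelity first.
VRAM_MB = [
    ("Q6_K", 6037),
    ("Q5_K_M", 5333),
    ("Q4_K_M", 4673),
    ("Q3_K_M", 3907),
    ("Q2_K", 3205),
]


def pick_quant(free_vram_mb, headroom_mb=1024, prefer_fidelity=False):
    """Return (quant, needs_testing) for a VRAM budget.

    needs_testing is True when only a sub-Q4 quant fits (or nothing does),
    meaning you should benchmark it on your own prompts first.
    """
    budget = free_vram_mb - headroom_mb
    fits = [name for name, vram in VRAM_MB if vram <= budget]
    if not fits:
        return None, True
    if "Q4_K_M" in fits:
        # Default to Q4_K_M; move up only when fidelity is the priority.
        return (fits[0] if prefer_fidelity else "Q4_K_M"), False
    return fits[0], True  # Q3_K_M or Q2_K


print(pick_quant(16000))                        # roomy card, default choice
print(pick_quant(16000, prefer_fidelity=True))  # roomy card, fidelity first
print(pick_quant(5000))                         # tight budget
```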
Reproduce it — and send us your numbers
Everything here runs on a stock llama.cpp release with public Qwen2.5-Coder GGUF quants (bartowski's repo has Q2_K through Q8_0). Drop them in your models directory and run the sweep:
```bash
# run the quant sweep; missing quants are skipped automatically
python scripts/bench.py --backend cuda --gpu 0
python scripts/aggregate.py   # → results/data.json
python scripts/inject.py      # bakes numbers into the writeup
```
Best results come from an otherwise-idle GPU so the VRAM delta is clean. This is experiment #3 in our open-source local-LLM benchmark series, and the whole point is to build a community map of where the quantization cliff sits across real hardware and different models. We have one 7B coder model on one Blackwell card; we want yours. A different model family, a larger model, an older GPU, a CPU-only run — every data point sharpens the picture, and scripts/make_submission.py auto-captures your rig into a pull-request-ready file. If a quant level surprised you — for better or worse — that's exactly the kind of result worth sharing.
New to running models on your own hardware? Start with LLM quantization explained, the GGUF model format, and the complete guide to running local AI.
