LLM Quantization Explained: How to Shrink Models Without Wrecking Quality

Quantization is the dial that lets a 70B model fit on a consumer GPU. Here's what FP16, INT8, and 4-bit actually mean, what you lose at each level, and how to decode those cryptic Q4_K_M filenames.

By InventiveHQ Team

The problem quantization solves

A model's weights are just numbers. Train a 70-billion-parameter model and you have 70 billion of them, and by default each is stored as a 16-bit float (FP16 or BF16) — two bytes apiece. That's 140 GB of weights before you load a single token of context. No consumer GPU comes close; even an 80 GB A100 can't hold it at full precision.

Quantization is the fix. Instead of two bytes per weight, store each weight in fewer bits — 8, 5, 4, even 2 — and the file shrinks proportionally. The total VRAM you need is roughly:

total VRAM ≈ (num_params × bytes_per_weight) + KV cache + activations + ~1–2 GB overhead

The first term is the one quantization attacks, and it's usually the largest. Drop a 70B model from FP16 (2.0 bytes/weight) to Q4_K_M (~0.55 bytes/weight) and the weights fall from ~140 GB to ~40 GB — suddenly it fits across two 24 GB cards, or one 48 GB card. The catch is the quality cost, and the whole game is knowing where on the dial that cost stops being free.
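As a rough illustration of that formula, here is a minimal Python sketch. The function name `estimate_vram_gb` and the 1.5 GB default overhead (the midpoint of the ~1–2 GB range above) are assumptions made for this example, and 1 GB is taken as 10^9 bytes.

```python
def estimate_vram_gb(num_params: float,
                     bytes_per_weight: float,
                     kv_cache_gb: float = 0.0,
                     activations_gb: float = 0.0,
                     overhead_gb: float = 1.5) -> float:
    """Rough total VRAM in GB, following the formula in the text.

    overhead_gb defaults to 1.5, an assumed midpoint of the ~1-2 GB range.
    """
    weights_gb = num_params * bytes_per_weight / 1e9
    return weights_gb + kv_cache_gb + activations_gb + overhead_gb


# Weights-only comparison for a 70B model (all other terms set to zero).
fp16_weights = estimate_vram_gb(70e9, 2.0, overhead_gb=0.0)
q4_weights = estimate_vram_gb(70e9, 0.55, overhead_gb=0.0)

print(f"70B FP16 weights:   {fp16_weights:.1f} GB")   # 140.0 GB
print(f"70B Q4_K_M weights: {q4_weights:.1f} GB")     # 38.5 GB
```

The KV cache and activation terms are left as plain inputs here because they depend on the model architecture and context length rather than on the quant.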

Precision levels: FP16 → INT8 → INT4 → lower

Bytes per weight is just bits divided by eight. The wrinkle is that the better quantization methods (the "K-quants" below) store per-block scale and offset metadata alongside the weights, so the real on-disk size runs a little above the clean bits/8 figure. The "effective" column is what actually lands on your GPU.

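| Precision | Bits/weight | Bytes/weight (effective) |
|---|---|---|
| FP16 / BF16 | 16 | 2.0 |
| FP8 (E4M3) | 8 | 1.0 |
| INT8 / Q8_0 | ~8.5 | ~1.06 |
| Q6_K | ~6.6 | ~0.82 |
| Q5_K_M | ~5.7 | ~0.71 |
| Q4_K_M | ~4.8 | ~0.55–0.60 |
| Q4_0 (legacy) | ~4.5 | ~0.56 |
| Q3_K_M | ~3.9 | ~0.49 |
| Q2_K | ~3.3 | ~0.41 |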
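To see where the fractional bits come from, here is a simplified sketch of scale-only block quantization in the style of Q8_0. It assumes the legacy layout of 32 weights per block with one FP16 scale per block; the function names are made up for this example, and the real llama.cpp kernels pack and round values differently in detail.

```python
BLOCK_SIZE = 32   # assumed: weights per block in the legacy Q8_0 / Q4_0 layout
SCALE_BITS = 16   # assumed: one FP16 scale stored alongside each block


def bits_per_weight(weight_bits: int,
                    block_size: int = BLOCK_SIZE,
                    metadata_bits: int = SCALE_BITS) -> float:
    """Effective bits per weight once per-block metadata is counted."""
    return (block_size * weight_bits + metadata_bits) / block_size


def quantize_block_int8(block):
    """Symmetric round-to-nearest: one scale, integers in [-127, 127]."""
    amax = max(abs(w) for w in block)
    scale = amax / 127 if amax else 1.0
    return scale, [round(w / scale) for w in block]


def dequantize_block(scale, quants):
    """Recover approximate weights from the scale and the stored integers."""
    return [scale * q for q in quants]


print(bits_per_weight(8))       # 8.5 bits  -> Q8_0 row
print(bits_per_weight(8) / 8)   # 1.0625 bytes per weight
print(bits_per_weight(4))       # 4.5 bits  -> Q4_0 row
```

The 16 bits of scale spread over 32 weights add exactly half a bit per weight, which is why the 8-bit and legacy 4-bit rows read ~8.5 and ~4.5 rather than 8 and 4.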

Multiply by parameter count to get weight-only size. Using effective bytes/weight:

| Model | Q4_K_M (~0.55 B/w) | Q8_0 (~1.06 B/w) | FP16 (2.0 B/w) |
|---|---|---|---|
| 8B | ~4.4–4.9 GB | ~8.5 GB | ~16 GB |
| 13B | ~7.5–8.0 GB | ~13.8 GB | ~26 GB |
| 70B | ~38–43 GB | ~74 GB | ~140 GB |

These match published GGUF sizes closely — Llama-3.1-8B Q4_K_M ships at ≈4.9 GB and Q8_0 at ≈8.5 GB; Llama-3-70B Q4_K_M lands around 40–43 GB. Remember these are weights only. Add the KV cache and overhead and a 70B Q4_K_M needs roughly 48 GB total at modest context. Plug your own numbers into the LLM VRAM Calculator to get the full total including KV cache, or use What LLM Can I Run to auto-detect your GPU and see what fits.

Decoding GGUF quant names (Q4_K_M and friends)

GGUF — the quantized-model container format that originated in the llama.cpp project — encodes the whole quant recipe in the filename. Once you can read it, the model browser on Hugging Face stops being a wall of cryptic suffixes.

| Token | Meaning |
|---|---|
| Q<N> | Nominal bits per weight: Q4 = 4-bit base, Q8 = 8-bit, etc. |
| _K | K-quant: 256-weight super-blocks with hierarchical scales, mixed precision across tensor types. Far better quality-per-byte than legacy. |
| _S | Small tier: all tensors at the base bit width. Smallest file, lowest quality of the three. |
| _M | Medium tier: some sensitive tensors (attention wv, feed-forward w2, output) bumped +2 bits. The balanced default. |
| _L | Large tier: more sensitive tensors bumped +2 bits. Biggest K-quant file, best quality. |
| Q4_0 / Q4_1 | Legacy round-to-nearest block quant. _0 = scale only; _1 = scale + min (asymmetric). Superseded by K-quants; avoid unless a tool requires them. |
| IQ<N>_* | I-quants: importance-matrix / codebook quants (IQ2/IQ3/IQ4). Better quality than same-bit K-quants at low bits, but slower to decode and need an imatrix. Use to squeeze into tight VRAM. |

So Q4_K_M reads as "4-bit, K-quant method, Medium tier." Q5_K_S is "5-bit, K-quant, Small tier." IQ3_XXS is a 3-bit importance-matrix quant tuned for the absolute tightest fit. Within a bit level, the tier suffix is a pure size-versus-quality trade: _S < _M < _L in both file size and fidelity.
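The same reading can be done mechanically. Below is a small sketch of a decoder for these tokens; `decode_quant` and its return shape are hypothetical helpers written for this article, not part of llama.cpp or any GGUF tooling, and I-quant suffixes are passed through without interpretation.

```python
import re

# Tier letters as described in the table above.
TIERS = {"S": "Small", "M": "Medium", "L": "Large"}

# <family><bits>_<rest>, for example Q4_K_M, Q8_0, IQ3_XXS
QUANT_RE = re.compile(r"^(?P<family>I?Q)(?P<bits>\d+)_(?P<rest>\w+)$")


def decode_quant(name: str) -> dict:
    """Split a quant token such as 'Q4_K_M' into bits, method, and detail."""
    match = QUANT_RE.match(name.upper())
    if match is None:
        raise ValueError(f"not a recognised quant token: {name!r}")

    bits = int(match["bits"])
    rest = match["rest"]

    if match["family"] == "IQ":
        # I-quant suffixes (XXS, XS, ...) are returned as-is.
        return {"bits": bits, "method": "I-quant", "detail": rest}
    if rest == "K" or rest.startswith("K_"):
        # A bare 'K' (as in Q6_K) carries no tier letter, so detail is None.
        return {"bits": bits, "method": "K-quant", "detail": TIERS.get(rest[2:])}
    if rest in ("0", "1"):
        variant = "scale only" if rest == "0" else "scale + min"
        return {"bits": bits, "method": "legacy", "detail": variant}
    raise ValueError(f"unknown quant suffix in {name!r}")


for token in ("Q4_K_M", "Q5_K_S", "Q6_K", "Q4_0", "IQ3_XXS"):
    print(token, decode_quant(token))
```

Anything the pattern does not recognise raises a `ValueError` rather than guessing, since new quant names appear over time.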

What you actually lose

Quality is usually measured by perplexity — how surprised the model is by held-out text, where lower is better and FP16 is the baseline. The key fact is that the loss is not linear. Down to about 4 bits per weight, degradation is mild; below 4 bits the curve turns sharply upward.

[Figure: perplexity rise vs FP16 (y-axis, 0% to 40%) plotted against bits per weight (x-axis, 8 down to 2). The rise stays near zero down to about 4 bits (Q8_0 ~0%, Q4_K_M +3.3%), then climbs steeply below the ~4-bit cliff through Q3_K_M to Q2_K (degraded).]

Reading the dial from the top:

  • Q8_0 is effectively indistinguishable from FP16 for inference. If you have the VRAM, it's a "free" safety blanket — but you rarely need it.
  • Q6_K and Q5_K_M are near-lossless. Worth the extra gigabytes when you have headroom and care about subtle reasoning or code quality.
  • Q4_K_M is the sweet spot: a perplexity rise of only a few percent (Llama-3.1-8B measures ppl ~7.56 vs ~7.32 at FP16, about +3.3%). Almost nobody notices the difference on everyday tasks.
  • Q3_K_M is usable but noticeably weaker — reach for it only when Q4 won't fit.
  • Q2_K is clearly degraded. Use it only when nothing else fits, and prefer an IQ2 I-quant over Q2_K at that bit level.

The shape of that curve is the single most important thing to internalize: spending bits above ~5 buys almost nothing, and spending below ~4 costs a lot fast.
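For reference, perplexity is the exponential of the average negative log-likelihood per token, and the percentages quoted above are relative changes against the FP16 baseline. A minimal sketch follows; the helper names are illustrative, and the only measured numbers used are the two Llama-3.1-8B perplexities cited in the list.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over held-out tokens.

    token_logprobs holds the natural-log probability the model assigned
    to each true next token.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_rise(ppl_quant: float, ppl_baseline: float) -> float:
    """Fractional perplexity increase of a quant over the baseline."""
    return (ppl_quant - ppl_baseline) / ppl_baseline


# A model that gives every true token probability 0.25 has perplexity 4.
print(round(perplexity([math.log(0.25)] * 8), 6))   # 4.0

# Llama-3.1-8B figures quoted above: Q4_K_M ~7.56 vs FP16 ~7.32.
rise = relative_rise(7.56, 7.32)
print(f"{rise:+.1%}")                               # +3.3%
```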

Picking the right quant for your hardware

The heuristic is simple: fit the largest model you can at Q4_K_M, and only step up the quant if you still have spare VRAM. Parameters buy you more capability than precision does. A 70B at Q4_K_M (~40 GB) decisively beats a 13B at Q8_0 (~14 GB) on reasoning, instruction-following, and code — and it beats a 13B at FP16 too. Within a single model, going from Q4_K_M to Q8_0 roughly doubles the memory for a quality bump most users can't detect.
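That heuristic can be sketched as a two-step search: first the largest model that fits at Q4_K_M, then the highest quant of that same model that still fits. The bytes-per-weight values come from the table earlier in this article; the function name `pick_model_and_quant` and the flat 3 GB reserve for KV cache and overhead are assumptions for illustration only.

```python
# Effective bytes per weight, taken from this article's table (approximate).
BYTES_PER_WEIGHT = {"Q4_K_M": 0.55, "Q5_K_M": 0.71, "Q6_K": 0.82, "Q8_0": 1.06}
STEP_UP_ORDER = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]


def weights_gb(params_billions: float, quant: str) -> float:
    """Billions of parameters times bytes per weight gives GB directly."""
    return params_billions * BYTES_PER_WEIGHT[quant]


def pick_model_and_quant(vram_gb, model_sizes_billions, reserve_gb=3.0):
    """Largest model that fits at Q4_K_M, then the best quant that still fits.

    reserve_gb is an assumed flat allowance for KV cache and overhead.
    """
    budget = vram_gb - reserve_gb
    fitting = [p for p in model_sizes_billions
               if weights_gb(p, "Q4_K_M") <= budget]
    if not fitting:
        return None
    params = max(fitting)
    quant = "Q4_K_M"
    for candidate in STEP_UP_ORDER[1:]:
        if weights_gb(params, candidate) <= budget:
            quant = candidate
    return params, quant


sizes = [8, 13, 32, 70]
print(pick_model_and_quant(24, sizes))   # (32, 'Q4_K_M')
print(pick_model_and_quant(80, sizes))   # (70, 'Q8_0')
```

Model size is chosen before precision on purpose: the search never trades parameters away to afford a higher quant.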

Put concretely, the "biggest model that fits at Q4" rule maps cleanly to common cards:

| Your VRAM | Largest comfortable model at Q4_K_M | Notes |
|---|---|---|
| 8 GB | 7–8B | ~5 GB weights + KV + overhead |
| 12 GB | 13B | tight at long context |
| 24 GB (RTX 4090) | 32–34B | or an 8B at Q8_0 with room to spare |
| 48 GB (2× 24 GB) | 70B | ~40 GB weights + KV across two cards |
| 80 GB (A100/H100) | 70B with long context, or 100B-class | holds full KV at 32K+ |

These are weight-plus-overhead approximations. Actual fit depends on your context length, because the KV cache grows linearly with tokens and can rival the weights at long context. Don't eyeball it: the LLM VRAM Calculator computes weights + KV + overhead for any model, quant, and context, and handles multi-GPU and Apple Silicon unified memory. If you want to know how fast it'll actually run once it fits, the LLM Inference Speed Calculator estimates tokens/sec from your card's memory bandwidth. Decode speed is bandwidth-bound, since each generated token has to read roughly the full set of weights from memory, so a model that barely fits will still run, just slowly.
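To make the linear growth concrete, here is a rough KV cache estimate. It assumes a standard transformer that stores one key and one value vector per layer per token in FP16, and it plugs in a Llama-3-70B-style shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) as an illustrative assumption rather than a measured figure.

```python
def kv_cache_gb(context_tokens: int,
                n_layers: int,
                n_kv_heads: int,
                head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes).

    The leading 2 counts one key and one value vector per layer per token;
    bytes_per_elem=2 assumes an FP16 cache.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token_bytes / 1e9


# Assumed Llama-3-70B-style shape: 80 layers, 8 KV heads, head_dim 128.
for ctx in (8192, 32768, 131072):
    print(ctx, round(kv_cache_gb(ctx, 80, 8, 128), 1))
# 8192 2.7
# 32768 10.7
# 131072 42.9
```

Under these assumptions, the cache at 128K tokens is about as large as the ~40 GB of Q4_K_M weights, which is the sense in which it can rival the weights at long context.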

Other quantization schemes (GPTQ, AWQ, bitsandbytes)

GGUF K-quants belong to the local-inference world of llama.cpp, Ollama, and LM Studio, which runs on CPUs and Apple Silicon as well as GPUs. On the GPU-serving side you'll meet a different family, because engines like vLLM treat safetensors, AWQ, GPTQ, and FP8 as their first-class path and consider GGUF experimental:

  • GPTQ — post-training quant that minimizes layer-wise error using approximate second-order (Hessian) information, typically to 4-bit. Fast GPU inference, widely supported.
  • AWQ (Activation-aware Weight Quantization) — protects the small fraction of weights that matter most based on activation magnitudes, preserving quality at 4-bit. Often the preferred 4-bit path for vLLM.
  • bitsandbytes — on-the-fly 8-bit and 4-bit (NF4) quantization in the Hugging Face/PyTorch stack, popular for QLoRA fine-tuning rather than max-throughput serving.
  • FP8 — a native 8-bit float (E4M3) supported on newer GPUs; near-lossless and hardware-accelerated, increasingly the default for high-throughput serving.

The mental model: GGUF K-quants and I-quants are for running on the broadest hardware (including CPU and Macs) via llama.cpp-based tools, while GPTQ/AWQ/FP8 are for squeezing maximum throughput out of NVIDIA GPUs under vLLM/TensorRT-LLM. If you only have a GGUF file and want production GPU serving, the usual advice is to use llama.cpp's own server rather than forcing GGUF into vLLM.

Conclusion

Quantization is the size-versus-quality dial, and the curve does the talking: above ~5 bits you're paying for fidelity you can't perceive, and below ~4 bits quality falls off a cliff. For almost everyone, the right move is to default to Q4_K_M and spend your VRAM on more parameters, not more bits — a bigger model at Q4 beats a smaller model at Q8 nearly every time. Decode the filename, fit the largest model your card allows with the LLM VRAM Calculator, and step up to Q5/Q6 only when you've got gigabytes to spare.

Frequently Asked Questions


What is LLM quantization?

Quantization stores a model's weights at lower numerical precision (for example, 4-bit integers instead of 16-bit floats) to cut memory use and speed up inference. A weight that took 2 bytes at FP16 takes roughly 0.55 bytes at Q4_K_M, so the model shrinks by about 3.6x for a small, usually acceptable quality cost. It's what lets a 70B model that needs ~140 GB at full precision run in ~40 GB on consumer hardware.

What does Q4_K_M mean?

It's a GGUF quant name. Q4 = 4-bit base precision. _K = the llama.cpp 'K-quant' method, which uses 256-weight super-blocks with hierarchical scales and bumps sensitive tensors to higher precision. _M = the Medium size/quality tier (between _S Small and _L Large). Q4_K_M is the popular default because it gives the best quality-per-gigabyte for most uses, costing only a few percent of perplexity versus FP16.

How much quality do you lose with 4-bit quantization?

For most models, 4-bit Q4_K_M is nearly indistinguishable from full precision on everyday tasks: Llama-3.1-8B measures ppl ~7.56 at Q4_K_M versus ~7.32 at FP16, about +3.3%. Quality loss stays mild down to ~4 bits per weight. Below that the curve steepens fast: Q3_K_M is usable but noticeably weaker, and Q2_K is clearly degraded.

Which quantization should I choose?

Pick the largest model that fits your VRAM at Q4_K_M, and step up the quant only if you still have memory to spare. A bigger model at Q4 almost always beats a smaller model at Q8: a 70B at Q4_K_M (~40 GB) outperforms a 13B at Q8_0 (~14 GB). Use the LLM VRAM Calculator at /tools/developer/llm-vram-calculator to find the largest model that fits your card.
