The short answer
A model fits on your GPU if weights + KV cache + overhead is less than your VRAM. The dominant term is almost always the weights, and you control their size with quantization:
total_vram ≈ (num_params × bytes_per_weight) + kv_cache + ~1–2 GB overhead
As a back-of-the-envelope rule at the most common quant, Q4_K_M, the weights are roughly model-billions ÷ 2 in GB. An 8B is ~4.5 GB, a 13B is ~8 GB, a 70B is ~40 GB. Then you add the KV cache for your context and a gigabyte or two of runtime overhead, and round up.
Here is the lookup table most people actually want — weights-only size by model and quant, and the smallest consumer GPU class each fits on with room for a modest context:
| Model | Q4_K_M | Q8_0 | FP16 | Smallest GPU (Q4) |
|---|---|---|---|---|
| 8B | ~4.4–4.9 GB | ~8.5 GB | ~16 GB | 8 GB |
| 13B | ~7.5–8.0 GB | ~13.8 GB | ~26 GB | 12 GB |
| 70B | ~38–43 GB | ~74 GB | ~140 GB | 48 GB (or 2× 24 GB) |
For exact numbers on a specific model, quant, and context length, use the llm-vram-calculator — the rest of this article explains where every number in that table comes from so you can do the math in your head.
The VRAM formula
Three things occupy GPU memory when you run an LLM. In order of size:
1. Weights. This is the model itself, and it's deterministic:
weights_bytes = num_params × bytes_per_weight
bytes_per_weight is set entirely by the precision you load the model at. The clean version is bits ÷ 8, but quantized GGUF formats carry per-block scale/min metadata, so the effective bytes per weight run slightly above the nominal value:
| Precision | Bits/weight | Effective bytes/weight |
|---|---|---|
| FP16 / BF16 | 16 | 2.0 |
| FP8 (E4M3) | 8 | 1.0 |
| Q8_0 | ~8.5 | ~1.06 |
| Q6_K | ~6.6 | ~0.82 |
| Q5_K_M | ~5.7 | ~0.71 |
| Q4_K_M | ~4.8 | ~0.55 |
| Q3_K_M | ~3.9 | ~0.49 |
| Q2_K | ~3.3 | ~0.41 |
Q4_K_M is the default for good reason: it's the best quality-per-gigabyte for most use, costing only ~1–3% perplexity over FP16 (Llama-3.1-8B measures ppl ~7.56 vs ~7.32 at FP16, +3.3%). Quality stays mild down to ~4 bits and then falls off a cliff — Q3 is noticeably weaker, Q2_K is clearly degraded. If you need more detail on the naming, see LLM quantization explained.
2. KV cache. Attention caches a key and value vector for every token in context, and that cache grows linearly with sequence length. It's small at short context and enormous at long context — more on this below.
3. Activations + runtime overhead. Transient buffers, CUDA context, and fragmentation add roughly 1–2 GB. Serving engines like vLLM pre-reserve a large KV pool on top, so their footprint looks bigger than llama.cpp for the same model.
So the full picture is weights + kv_cache + ~1–2 GB. The weights are fixed once you pick a quant; the KV cache is the variable you most often forget.
Worked examples: 8B, 13B, 70B
Let's run the formula on three model sizes at three precisions. The bar chart below shows how dramatically quantization moves the weights-only footprint for a single 70B model — the difference between "needs a data-center GPU" and "fits on two gaming cards":
8B (e.g. Llama-3.1-8B). At Q4_K_M, 8B × 0.55 ≈ 4.4 GB of weights — published GGUFs land around 4.9 GB. Add a few GB for context and overhead and it runs comfortably on an 8 GB card. At Q8_0 it's ~8.5 GB (needs 12 GB), and at FP16 ~16 GB (needs 16–24 GB).
13B. Q4_K_M is ~7.5–8 GB, so a 12 GB card handles it with light context. Q8_0 (~13.8 GB) wants 16 GB; FP16 (~26 GB) needs a 32 GB+ card (a 24 GB card can't hold it) — quantize instead.
70B (e.g. Llama-3-70B). Q4_K_M weights are ~38–43 GB. With KV and overhead that's ~48 GB total at modest context — so a single 48 GB card or two 24 GB GPUs (an RTX 4090 pair). Q8_0 (~74 GB) needs an 80 GB A100/H100; FP16 (~140 GB) requires multiple data-center GPUs.
Putting the fit story together across the consumer GPU classes:
| Weights (Q4_K_M) | 8 GB | 12 GB | 16 GB | 24 GB | 48 GB |
|---|---|---|---|---|---|
| 8B (~4.5 GB) | ✅ | ✅ | ✅ | ✅ | ✅ |
| 13B (~8 GB) | ❌ | ✅ | ✅ | ✅ | ✅ |
| 70B (~40 GB) | ❌ | ❌ | ❌ | 2× | ✅ |
(✅ = fits with room for a working context; "2×" = needs two cards.) Not sure what's in your machine? The what-llm-can-i-run tool detects your GPU and lists the models that fit in one click.
The KV cache surprise
The table above is weights-only — and at short context that's almost the whole story. The trap is long context, because the KV cache grows linearly with sequence length and can quietly dwarf the model itself.
The formula:
kv_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_elem
The 2 is separate K and V tensors. The reason modern 70B models stay tractable is Grouped-Query Attention (GQA): Llama-3 70B has 64 query heads but only 8 KV heads, an 8× reduction in KV width. Working the numbers for that model (80 layers, 8 KV heads, head_dim 128, FP16 KV):
Per-token KV = 2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes ≈ 0.3125 MB/token.
| Context | KV cache (FP16) | KV cache (FP8/INT8) |
|---|---|---|
| 4K | ~1.25 GB | ~0.625 GB |
| 32K | ~10 GB | ~5 GB |
| 128K | ~40 GB | ~20 GB |
At 4K context the KV cache is a rounding error next to 40 GB of weights. At 128K it matches the model. That same 70B without GQA would be 8× worse — ~320 GB at 128K — which is why GQA plus KV-cache quantization (FP8/INT8 K and V) are mandatory for long context. Quantizing the KV cache halves or quarters it for a small quality cost.
This is the single most common reason a model that "fit" in the calculator suddenly OOMs in practice — see The KV cache: where your VRAM really goes for the full breakdown.
What if it doesn't fit?
Four levers, in roughly increasing order of pain:
- Quantize harder. Dropping from Q8 to Q4_K_M nearly halves the weights for ~1–3% perplexity. Below Q4 the quality cliff steepens fast — prefer I-quants (IQ3/IQ2) over legacy Q3/Q2 when you're squeezing into tight VRAM, and quantize the KV cache before you quantize the weights into the ground.
- CPU offload. llama.cpp and Ollama can keep some layers in system RAM and run them on the CPU. It works and lets a 70B run on a 24 GB card — but decode slows sharply because those layers no longer get GPU bandwidth. Fine for occasional use, painful for interactive work.
- Split across GPUs. Two 24 GB cards hold a 70B Q4 that neither can alone. Text transformers tensor- and pipeline-parallelize cleanly because every decoder layer is the same shape. This is the right answer when you have the slots and power.
- Pick a smaller model. An 8B at Q8 often beats a 70B at Q2_K, both in quality and speed. Don't quantize a big model into mush when a smaller one fits natively.
If you're weighing hardware spend against just paying a cloud API, the self-hosted LLM cost calculator finds the break-even. And if your goal is fine-tuning rather than inference, training needs far more memory for optimizer states and gradients — size it with the fine-tuning-vram-calculator.
Apple Silicon: unified memory changes the math
On a discrete GPU, "VRAM" is a hard wall. Apple Silicon has unified memory — CPU and GPU share one pool with zero-copy access — so a 64 GB Mac can devote most of that 64 GB to model weights and KV cache. That's why Macs punch above their weight for large models: a 70B Q4 that needs two 24 GB PCIe cards fits comfortably in a single 64 GB Mac's unified memory.
The catch is bandwidth, which sets decode speed. Memory-bound decode runs at roughly bandwidth ÷ active_bytes_per_token, and Apple's bandwidth trails data-center GPUs:
| Chip | Memory | Bandwidth | 70B Q4 decode (approx) |
|---|---|---|---|
| Apple M4 Max | up to 128 GB unified | ~546 GB/s (full) | ~8–13 tok/s |
| Apple M3 Max | up to 128 GB unified | ~400 GB/s (full) | ~6–10 tok/s |
| RTX 4090 (2×) | 2× 24 GB | ~1008 GB/s each | ~18–25 tok/s |
| H100 80GB | 80 GB | ~3350 GB/s | ~40–60 tok/s |
(tok/s are approximate, single-stream ballparks.) So a Mac fits the big model where a single consumer GPU can't, but generates more slowly than a multi-GPU PC. macOS tooling has caught up: Ollama 0.19+ and LM Studio both run an Apple MLX backend that's meaningfully faster than the old Metal path on Apple Silicon with 32 GB+ of unified memory. To estimate decode speed for a specific chip and quant, use the llm-inference-speed-calculator.
Conclusion
The whole question reduces to one inequality: weights + KV cache + ~1–2 GB overhead ≤ your VRAM. Weights are params × bytes_per_weight and you control them with quantization (Q4_K_M ≈ model-billions ÷ 2 in GB). The KV cache is the term people forget — it's negligible at 4K but can exceed the model at 128K, so always size up for the context you actually use.
Estimate it by hand from the formulas here, then confirm with the llm-vram-calculator for your exact model, quant, and context — and round up before you buy.