How do I calculate the VRAM a model needs?

Total VRAM ≈ (parameters × bytes-per-weight for your quant) + KV cache (grows with context) + ~1–2 GB runtime overhead. For weights alone: an 8B model at Q4_K_M (~0.55 bytes/weight) is roughly 4.4–4.9 GB; at FP16 (2 bytes/weight) it's ~16 GB. Then add the KV cache for your context length and ~1–2 GB of CUDA/runtime overhead. The [llm-vram-calculator](/tools/developer/llm-vram-calculator) does this math for any model, quant, and context.

What's the rule of thumb for model size vs VRAM?

At Q4 the quick estimate is model-billions ÷ 2 ≈ GB of weights. An 8B is ~4.4–4.9 GB, a 13B is ~7.5–8 GB, a 70B is ~38–43 GB. That's weights only — always add the KV cache (which grows linearly with context) and ~1–2 GB overhead, then size up. At Q8 use roughly model-billions × 1 (≈1.06 bytes/weight); at FP16 use model-billions × 2.

Can I run a model that's bigger than my VRAM?

Yes, with trade-offs. You can quantize harder (Q4 → Q3/Q2/IQ-quants) to shrink the weights, offload some layers to system RAM (CPU offload — it works but decode slows sharply because those layers run on the CPU), or split the model across multiple GPUs (two 24 GB cards hold a 70B Q4 that one can't). Apple Silicon sidesteps the issue with a large unified memory pool. The right choice depends on how much you're over by — check the [what-llm-can-i-run](/tools/developer/what-llm-can-i-run) tool against your hardware.

Does context length affect VRAM?

Yes, significantly. The KV cache grows linearly with context length and is separate from the weights. A 70B-class model (80 layers, 8 KV heads, head_dim 128, GQA, FP16) uses ~0.3125 MB per token — about 1.25 GB at 4K context but ~40 GB at 128K. Quantizing the KV cache to FP8/INT8 halves that. See [The KV cache: where your VRAM really goes](/blog/llm-kv-cache-vram-memory-cost).

How Much VRAM Do You Need to Run an LLM? (The Memory Math, Explained)

The short answer

A model fits on your GPU if weights + KV cache + overhead is less than your VRAM. The dominant term is almost always the weights, and you control their size with quantization:

total_vram ≈ (num_params × bytes_per_weight) + kv_cache + ~1–2 GB overhead

As a back-of-the-envelope rule at the most common quant, Q4_K_M, the weights are roughly model-billions ÷ 2 in GB. An 8B is ~4.5 GB, a 13B is ~8 GB, a 70B is ~40 GB. Then you add the KV cache for your context and a gigabyte or two of runtime overhead, and round up.

Here is the lookup table most people actually want — weights-only size by model and quant, and the smallest consumer GPU class each fits on with room for a modest context:

Model	Q4_K_M	Q8_0	FP16	Smallest GPU (Q4)
8B	~4.4–4.9 GB	~8.5 GB	~16 GB	8 GB
13B	~7.5–8.0 GB	~13.8 GB	~26 GB	12 GB
70B	~38–43 GB	~74 GB	~140 GB	48 GB (or 2× 24 GB)

For exact numbers on a specific model, quant, and context length, use the llm-vram-calculator — the rest of this article explains where every number in that table comes from so you can do the math in your head.

The VRAM formula

Three things occupy GPU memory when you run an LLM. In order of size:

1. Weights. This is the model itself, and it's deterministic:

weights_bytes = num_params × bytes_per_weight

bytes_per_weight is set entirely by the precision you load the model at. The clean version is bits ÷ 8, but quantized GGUF formats carry per-block scale/min metadata, so the effective bytes per weight run slightly above the nominal value:

Precision	Bits/weight	Effective bytes/weight
FP16 / BF16	16	2.0
FP8 (E4M3)	8	1.0
Q8_0	~8.5	~1.06
Q6_K	~6.6	~0.82
Q5_K_M	~5.7	~0.71
Q4_K_M	~4.8	~0.55
Q3_K_M	~3.9	~0.49
Q2_K	~3.3	~0.41

Q4_K_M is the default for good reason: it's the best quality-per-gigabyte for most use, costing only ~1–3% perplexity over FP16 (Llama-3.1-8B measures ppl ~7.56 vs ~7.32 at FP16, +3.3%). Quality stays mild down to ~4 bits and then falls off a cliff — Q3 is noticeably weaker, Q2_K is clearly degraded. If you need more detail on the naming, see LLM quantization explained.

2. KV cache. Attention caches a key and value vector for every token in context, and that cache grows linearly with sequence length. It's small at short context and enormous at long context — more on this below.

3. Activations + runtime overhead. Transient buffers, CUDA context, and fragmentation add roughly 1–2 GB. Serving engines like vLLM pre-reserve a large KV pool on top, so their footprint looks bigger than llama.cpp for the same model.

So the full picture is weights + kv_cache + ~1–2 GB. The weights are fixed once you pick a quant; the KV cache is the variable you most often forget.

Worked examples: 8B, 13B, 70B

Let's run the formula on three model sizes at three precisions. The bar chart below shows how dramatically quantization moves the weights-only footprint for a single 70B model — the difference between "needs a data-center GPU" and "fits on two gaming cards":

8B (e.g. Llama-3.1-8B). At Q4_K_M, 8B × 0.55 ≈ 4.4 GB of weights — published GGUFs land around 4.9 GB. Add a few GB for context and overhead and it runs comfortably on an 8 GB card. At Q8_0 it's ~8.5 GB (needs 12 GB), and at FP16 ~16 GB (needs 16–24 GB).

13B. Q4_K_M is ~7.5–8 GB, so a 12 GB card handles it with light context. Q8_0 (~13.8 GB) wants 16 GB; FP16 (~26 GB) needs a 32 GB+ card (a 24 GB card can't hold it) — quantize instead.

70B (e.g. Llama-3-70B). Q4_K_M weights are ~38–43 GB. With KV and overhead that's ~48 GB total at modest context — so a single 48 GB card or two 24 GB GPUs (an RTX 4090 pair). Q8_0 (~74 GB) needs an 80 GB A100/H100; FP16 (~140 GB) requires multiple data-center GPUs.

Putting the fit story together across the consumer GPU classes:

Weights (Q4_K_M)	8 GB	12 GB	16 GB	24 GB	48 GB
8B (~4.5 GB)	✅	✅	✅	✅	✅
13B (~8 GB)	❌	✅	✅	✅	✅
70B (~40 GB)	❌	❌	❌	2×	✅

(✅ = fits with room for a working context; "2×" = needs two cards.) Not sure what's in your machine? The what-llm-can-i-run tool detects your GPU and lists the models that fit in one click.

The KV cache surprise

The table above is weights-only — and at short context that's almost the whole story. The trap is long context, because the KV cache grows linearly with sequence length and can quietly dwarf the model itself.

The formula:

kv_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_elem

The 2 is separate K and V tensors. The reason modern 70B models stay tractable is Grouped-Query Attention (GQA): Llama-3 70B has 64 query heads but only 8 KV heads, an 8× reduction in KV width. Working the numbers for that model (80 layers, 8 KV heads, head_dim 128, FP16 KV):

Per-token KV = 2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes ≈ 0.3125 MB/token.

Context	KV cache (FP16)	KV cache (FP8/INT8)
4K	~1.25 GB	~0.625 GB
32K	~10 GB	~5 GB
128K	~40 GB	~20 GB

At 4K context the KV cache is a rounding error next to 40 GB of weights. At 128K it matches the model. That same 70B without GQA would be 8× worse — ~320 GB at 128K — which is why GQA plus KV-cache quantization (FP8/INT8 K and V) are mandatory for long context. Quantizing the KV cache halves or quarters it for a small quality cost.

This is the single most common reason a model that "fit" in the calculator suddenly OOMs in practice — see The KV cache: where your VRAM really goes for the full breakdown.

What if it doesn't fit?

Four levers, in roughly increasing order of pain:

Quantize harder. Dropping from Q8 to Q4_K_M nearly halves the weights for ~1–3% perplexity. Below Q4 the quality cliff steepens fast — prefer I-quants (IQ3/IQ2) over legacy Q3/Q2 when you're squeezing into tight VRAM, and quantize the KV cache before you quantize the weights into the ground.
CPU offload. llama.cpp and Ollama can keep some layers in system RAM and run them on the CPU. It works and lets a 70B run on a 24 GB card — but decode slows sharply because those layers no longer get GPU bandwidth. Fine for occasional use, painful for interactive work.
Split across GPUs. Two 24 GB cards hold a 70B Q4 that neither can alone. Text transformers tensor- and pipeline-parallelize cleanly because every decoder layer is the same shape. This is the right answer when you have the slots and power.
Pick a smaller model. An 8B at Q8 often beats a 70B at Q2_K, both in quality and speed. Don't quantize a big model into mush when a smaller one fits natively.

If you're weighing hardware spend against just paying a cloud API, the self-hosted LLM cost calculator finds the break-even. And if your goal is fine-tuning rather than inference, training needs far more memory for optimizer states and gradients — size it with the fine-tuning-vram-calculator.

Apple Silicon: unified memory changes the math

On a discrete GPU, "VRAM" is a hard wall. Apple Silicon has unified memory — CPU and GPU share one pool with zero-copy access — so a 64 GB Mac can devote most of that 64 GB to model weights and KV cache. That's why Macs punch above their weight for large models: a 70B Q4 that needs two 24 GB PCIe cards fits comfortably in a single 64 GB Mac's unified memory.

The catch is bandwidth, which sets decode speed. Memory-bound decode runs at roughly bandwidth ÷ active_bytes_per_token, and Apple's bandwidth trails data-center GPUs:

Chip	Memory	Bandwidth	70B Q4 decode (approx)
Apple M4 Max	up to 128 GB unified	~546 GB/s (full)	~8–13 tok/s
Apple M3 Max	up to 128 GB unified	~400 GB/s (full)	~6–10 tok/s
RTX 4090 (2×)	2× 24 GB	~1008 GB/s each	~18–25 tok/s
H100 80GB	80 GB	~3350 GB/s	~40–60 tok/s

(tok/s are approximate, single-stream ballparks.) So a Mac fits the big model where a single consumer GPU can't, but generates more slowly than a multi-GPU PC. macOS tooling has caught up: Ollama 0.19+ and LM Studio both run an Apple MLX backend that's meaningfully faster than the old Metal path on Apple Silicon with 32 GB+ of unified memory. To estimate decode speed for a specific chip and quant, use the llm-inference-speed-calculator.

Conclusion

The whole question reduces to one inequality: weights + KV cache + ~1–2 GB overhead ≤ your VRAM. Weights are params × bytes_per_weight and you control them with quantization (Q4_K_M ≈ model-billions ÷ 2 in GB). The KV cache is the term people forget — it's negligible at 4K but can exceed the model at 128K, so always size up for the context you actually use.

Estimate it by hand from the formulas here, then confirm with the llm-vram-calculator for your exact model, quant, and context — and round up before you buy.

How Much VRAM Do You Need to Run an LLM? (The Memory Math, Explained)

The short answer

The VRAM formula

Worked examples: 8B, 13B, 70B

The KV cache surprise

What if it doesn't fit?

Apple Silicon: unified memory changes the math

Conclusion

Frequently Asked Questions

Let's turn this knowledge into action

LLM VRAM Calculator

What LLM Can I Run?

Fine-Tuning VRAM Calculator

Running Local AI: The Complete Guide to Self-Hosting LLMs on Your Own Hardware

LLM Quantization Explained: How to Shrink Models Without Wrecking Quality

The Hidden Memory Cost of Long Context: KV Cache and VRAM Explained

How to Run an LLM Locally: A Step-by-Step Guide for Beginners

What Is GGUF? Local AI Model Formats Explained (GGUF, safetensors, MLX, GPTQ, AWQ)

Running LLMs on Apple Silicon: MLX vs GGUF and Why Macs Punch Above Their Weight

How Much VRAM Do You Need to Run an LLM? (The Memory Math, Explained)

The short answer

The VRAM formula

Worked examples: 8B, 13B, 70B

The KV cache surprise

What if it doesn't fit?

Apple Silicon: unified memory changes the math

Conclusion

Frequently Asked Questions

Let's turn this knowledge into action

Related Tools

LLM VRAM Calculator

What LLM Can I Run?

Fine-Tuning VRAM Calculator

Related Articles

Running Local AI: The Complete Guide to Self-Hosting LLMs on Your Own Hardware

LLM Quantization Explained: How to Shrink Models Without Wrecking Quality

The Hidden Memory Cost of Long Context: KV Cache and VRAM Explained

How to Run an LLM Locally: A Step-by-Step Guide for Beginners

What Is GGUF? Local AI Model Formats Explained (GGUF, safetensors, MLX, GPTQ, AWQ)

Running LLMs on Apple Silicon: MLX vs GGUF and Why Macs Punch Above Their Weight