Question 1

How much VRAM do I need to run an LLM locally?

Accepted Answer

As a rule of thumb at the popular Q4_K_M quantization: **7-9B models need about 6-8 GB**, **13-14B models need about 10-12 GB**, **27-32B models need about 20-24 GB**, and **70B models need about 43-48 GB** (two 24 GB GPUs). Add more for long context windows, because the KV cache grows linearly with context. This calculator estimates the figure for any model, quantization, and context size.

Question 2

How is LLM VRAM usage calculated?

Accepted Answer

Total VRAM = **model weights** + **KV cache** + **runtime overhead**. Weights = parameter count x bytes per parameter (2 bytes at FP16, ~0.6 bytes at Q4_K_M). KV cache = 2 x layers x KV heads x head dimension x context length x bytes per element — this is what grows when you increase context. Overhead (~6%, minimum ~0.75 GB) covers CUDA/Metal buffers, activations, and memory fragmentation.
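The three terms can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name is hypothetical, the model shape is an assumed 8B-class example, and the answer does not say what the ~6% overhead is a percentage of, so the sketch assumes it applies to weights plus KV cache.

```python
def estimate_vram_gb(params_b, bytes_per_weight, layers, kv_heads, head_dim,
                     context_len, kv_bytes_per_elem=2.0,
                     overhead_frac=0.06, overhead_min_gb=0.75):
    """Return (weights, kv_cache, overhead, total) in GB, with 1 GB = 1e9 bytes."""
    # Weights: parameter count (in billions) x bytes per parameter.
    weights_gb = params_b * bytes_per_weight
    # KV cache: 2 (keys and values) x layers x KV heads x head dim x context x bytes.
    kv_gb = 2 * layers * kv_heads * head_dim * context_len * kv_bytes_per_elem / 1e9
    # ASSUMPTION: the ~6% overhead is taken over weights + KV cache, floored at 0.75 GB.
    overhead_gb = max(overhead_frac * (weights_gb + kv_gb), overhead_min_gb)
    return weights_gb, kv_gb, overhead_gb, weights_gb + kv_gb + overhead_gb


# Illustrative 8B-class shape (assumed values, not read from any config.json).
weights, kv, overhead, total = estimate_vram_gb(
    params_b=8.0, bytes_per_weight=0.61,
    layers=32, kv_heads=8, head_dim=128, context_len=8192)
print(f"weights={weights:.2f} GB  kv={kv:.2f} GB  "
      f"overhead={overhead:.2f} GB  total={total:.2f} GB")
```

With these assumed inputs the total comes to roughly 6.7 GB, and the 0.75 GB floor, not the 6% term, sets the overhead.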

Question 3

What is quantization and how much VRAM does it save?

Accepted Answer

Quantization stores model weights at lower precision to save memory. Compared to FP16 (2 bytes per weight): **Q8_0 roughly halves memory** (about 53%) with virtually no quality loss, **Q4_K_M cuts it to about 30%** with minor quality loss (the most popular choice), and **2-bit quants cut it to about 18%** with significant quality loss. A 70B model goes from 141 GB at FP16 to about 43 GB at Q4_K_M, which is the difference between needing a server rack and two consumer GPUs.
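To see where those percentages come from, here is a small sketch that multiplies a parameter count by the approximate bytes-per-weight figures this page lists for each format. The 70.6B parameter count is an assumption, chosen because it reproduces the 141 GB FP16 figure; the function name is illustrative.

```python
# Approximate bytes per weight for each format, as listed on this page.
BYTES_PER_WEIGHT = {
    "FP16": 2.0, "Q8_0": 1.06, "Q6_K": 0.82, "Q5_K_M": 0.71,
    "Q4_K_M": 0.61, "Q3_K_M": 0.50, "IQ2_M": 0.36,
}


def weight_sizes_gb(params_b):
    """Weight-only size in GB for a model with params_b billion parameters."""
    return {fmt: params_b * bpw for fmt, bpw in BYTES_PER_WEIGHT.items()}


# ASSUMPTION: a "70B" model with about 70.6B parameters.
sizes = weight_sizes_gb(70.6)
for fmt, gb in sizes.items():
    print(f"{fmt:7s} {gb:6.1f} GB  ({gb / sizes['FP16']:.0%} of FP16)")
```

These are weight-only sizes; KV cache and runtime overhead come on top.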

Question 4

What is the KV cache and why does context size matter?

Accepted Answer

The KV cache stores the attention keys and values for every token in your context window, so the model does not have to recompute them for each new token. It grows **linearly with context length** — at 8K context it is usually small, but at 128K it can exceed the size of the model weights themselves. Models with grouped-query attention (most modern LLMs), sliding-window layers (Gemma), or MLA compression (DeepSeek) need dramatically less KV cache than older architectures.
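A short sketch of that linear growth, and of how much grouped-query attention saves. The layer count, head counts, and head dimension are assumed values for an 8B-class model, and the cache is assumed to be stored at 2 bytes per element.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size in GB (1 GB = 1e9 bytes); the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9


# ASSUMED shape: 32 layers, head dimension 128.
# Full multi-head attention keeps 32 KV heads; grouped-query attention keeps 8.
for context in (8_192, 32_768, 131_072):
    mha = kv_cache_gb(32, 32, 128, context)
    gqa = kv_cache_gb(32, 8, 128, context)
    print(f"context={context:>7}  MHA={mha:6.2f} GB  GQA={gqa:6.2f} GB")
```

Under these assumptions the grouped-query cache is about 1.1 GB at 8K context and about 17 GB at 128K, which is several times the roughly 5 GB of Q4_K_M weights for a model this size.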

Question 5

Can I run a 70B model on a 24GB GPU?

Accepted Answer

Not entirely on the GPU at useful quality. A 70B model at Q4_K_M needs about 43 GB. Your options: **1)** Use two 24 GB GPUs (llama.cpp and vLLM split models across GPUs automatically). **2)** Run a smaller quant like IQ2_M (~25 GB). Quality suffers noticeably, and at ~25 GB the weights alone still slightly exceed 24 GB, so expect to offload a little or drop to an even smaller quant. **3)** Offload some layers to system RAM. This works, but generation slows to a crawl (often under 2 tokens/sec). **4)** Use a smaller model instead (Qwen3 32B and Gemma 3 27B at Q4 fit in 24 GB and are surprisingly capable).

Question 6

Can I split an LLM across multiple GPUs?

Accepted Answer

Yes — VRAM capacity stacks across GPUs. **Layer split** (llama.cpp/Ollama default) puts different layers on different GPUs; it works with mismatched cards over PCIe but does not increase speed. **Tensor parallel** (vLLM, TensorRT-LLM) splits every layer across identical GPUs and scales bandwidth almost linearly, but needs fast interconnect. Apple Silicon cannot combine memory across machines — the unified memory ceiling is the limit.
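As a rough illustration of a layer split across mismatched cards, the sketch below assigns layers in proportion to each GPU's VRAM. `split_layers` is a hypothetical helper, not how llama.cpp or Ollama actually decide the split, and the 80-layer count is an assumed value for a 70B-class model.

```python
def split_layers(n_layers, gpu_vram_gb):
    """Assign a layer count to each GPU in proportion to its VRAM (illustrative)."""
    total_vram = sum(gpu_vram_gb)
    # Start from the rounded-down proportional share for each GPU.
    counts = [int(n_layers * vram / total_vram) for vram in gpu_vram_gb]
    # Rounding down can leave a few layers unassigned; give them to the largest GPUs.
    leftover = n_layers - sum(counts)
    largest_first = sorted(range(len(gpu_vram_gb)),
                           key=lambda i: gpu_vram_gb[i], reverse=True)
    for i in largest_first[:leftover]:
        counts[i] += 1
    return counts


print(split_layers(80, [24, 24]))     # two matched 24 GB cards
print(split_layers(80, [24, 16]))     # mismatched cards
print(split_layers(80, [24, 12, 8]))  # three mismatched cards
```

A real engine also has to place the KV cache and compute buffers, so treat the proportions as a starting point, not a guarantee that each share fits.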

Question 7

Does Apple Silicon unified memory work for running LLMs?

Accepted Answer

Yes, very well. Apple Silicon shares one memory pool between CPU and GPU, so a 128 GB M4 Max can hold models that would need multiple NVIDIA GPUs. By default macOS lets the GPU use about 75% of unified memory (this calculator accounts for that). The trade-off is bandwidth: M4 Max (546 GB/s) is roughly half an RTX 4090 (1008 GB/s), so generation is slower — but for large models that do not fit in 24 GB of VRAM at all, slow beats impossible.
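The capacity and bandwidth trade-off can be put into numbers with a simple sketch. It assumes generation is purely memory-bandwidth bound, meaning each new token requires reading all model weights once, so the result is an upper bound on speed, not a benchmark. Both function names are illustrative.

```python
def usable_gpu_memory_gb(unified_gb, gpu_fraction=0.75):
    """Unified memory the GPU can use, given the default ~75% share."""
    return unified_gb * gpu_fraction


def max_tokens_per_sec(bandwidth_gb_s, weights_gb):
    """ASSUMPTION: bandwidth-bound decoding, one full pass over the weights per token."""
    return bandwidth_gb_s / weights_gb


weights_gb = 43.0  # a 70B model at Q4_K_M
print(f"128 GB M4 Max usable by GPU: {usable_gpu_memory_gb(128):.0f} GB")
print(f"M4 Max ceiling:   {max_tokens_per_sec(546, weights_gb):.1f} tokens/sec")
# The RTX 4090 figure is hypothetical: 43 GB does not fit in its 24 GB of VRAM.
print(f"RTX 4090 ceiling: {max_tokens_per_sec(1008, weights_gb):.1f} tokens/sec")
```

Real throughput lands below these ceilings once compute, KV cache reads, and engine overhead are included.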

Question 8

What happens if a model does not fit in VRAM?

Accepted Answer

Inference engines fall back to **CPU offloading** — some layers run on the GPU, the rest on the CPU from system RAM. It works, but every offloaded layer is bottlenecked by system RAM bandwidth (50-90 GB/s vs 1000+ GB/s on a GPU), so speed drops sharply. A model that is 20% offloaded can be 3-5x slower than one that fits entirely. If you see single-digit tokens/sec, this is usually why.
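The 3-5x figure follows from a simple bandwidth model. The sketch below assumes time per token is proportional to bytes read divided by the bandwidth of wherever those bytes live, and ignores compute and PCIe transfer costs. The function name and the 1000 GB/s GPU default are illustrative.

```python
def offload_slowdown(offload_frac, gpu_bw_gb_s=1000.0, cpu_bw_gb_s=70.0):
    """Slowdown versus a model held entirely in VRAM, under a bandwidth-only model."""
    # Time to read one unit of weights when everything sits in VRAM.
    t_all_gpu = 1.0 / gpu_bw_gb_s
    # Time when a fraction of the weights must be read from system RAM instead.
    t_offloaded = (1.0 - offload_frac) / gpu_bw_gb_s + offload_frac / cpu_bw_gb_s
    return t_offloaded / t_all_gpu


for cpu_bw in (50.0, 70.0, 90.0):
    slowdown = offload_slowdown(0.20, cpu_bw_gb_s=cpu_bw)
    print(f"20% offloaded, system RAM at {cpu_bw:.0f} GB/s: {slowdown:.1f}x slower")
```

Across the 50-90 GB/s range this gives roughly 3.0x to 4.8x for a 20% offload, because the small offloaded share dominates the total read time.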

Question 9

Why do MoE models still need lots of VRAM if only some experts are active?

Accepted Answer

Mixture-of-Experts models like Qwen3 30B-A3B only **read** about 3B parameters per token, which makes them fast. But all 30B parameters must still be **loaded in memory**, because different tokens route to different experts and the engine cannot predict which. So MoE models need the VRAM of their total size but generate at the speed of their active size — the best of one world, the cost of the other.
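A minimal sketch of that split between memory and speed, assuming Q4_K_M at about 0.61 bytes per weight and a purely bandwidth-bound speed ceiling. The function name and the 1008 GB/s bandwidth are illustrative choices.

```python
def moe_profile(total_params_b, active_params_b, bytes_per_weight, bandwidth_gb_s):
    """Return (memory needed in GB, bandwidth-bound tokens/sec ceiling)."""
    # Every expert must be resident, so memory follows the total parameter count.
    memory_gb = total_params_b * bytes_per_weight
    # Only the active parameters are read per token, so speed follows the active count.
    read_per_token_gb = active_params_b * bytes_per_weight
    return memory_gb, bandwidth_gb_s / read_per_token_gb


moe_mem, moe_tps = moe_profile(30, 3, 0.61, 1008)       # 30B total, 3B active
dense_mem, dense_tps = moe_profile(30, 30, 0.61, 1008)  # dense 30B for comparison
print(f"MoE:   {moe_mem:.1f} GB, ceiling {moe_tps:.0f} tokens/sec")
print(f"Dense: {dense_mem:.1f} GB, ceiling {dense_tps:.0f} tokens/sec")
```

Both models need the same memory, while the MoE ceiling is ten times higher in this simplified model. Real speedups are smaller once routing and attention costs are counted.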

Question 10

How accurate is this calculator?

Accepted Answer

Within about 5-10% for typical setups. Architecture specs (layers, KV heads, head dimensions) come from each model's actual config.json on Hugging Face, and quantization sizes match real GGUF file sizes. Real usage varies by inference engine: llama.cpp, vLLM, ExLlama, and MLX each have different overhead, context pre-allocation, and padding behavior. Treat results as a close estimate for planning hardware, not a guarantee.

| Format | Bytes/weight | Quality impact |
|---|---|---|
| FP16 | 2.0 | Reference quality |
| Q8_0 | ~1.06 | Indistinguishable from FP16 |
| Q6_K | ~0.82 | Near-lossless |
| Q5_K_M | ~0.71 | Very good |
| Q4_K_M | ~0.61 | Good (the popular default) |
| Q3_K_M | ~0.50 | Noticeable degradation |
| IQ2_M | ~0.36 | Significant degradation |

LLM VRAM Calculator
