
LLM VRAM Calculator

Calculate how much VRAM any LLM needs to run locally. Pick a model, quantization, and context size — see download size, total memory required, and which GPUs it fits on.

100% Private - Runs Entirely in Your Browser
No data is sent to any server. All processing happens locally on your device.

What Determines How Much Memory an LLM Needs?

Three things consume GPU memory when running a language model:

1. Model weights: the parameters themselves. This is the largest component: parameter count × bytes per parameter. An 8B model at FP16 is 16 GB; the same model quantized to Q4_K_M is about 4.9 GB.

2. KV cache: the attention state for every token in your context window. This grows linearly with context length and depends on the model's architecture: number of layers, KV heads, and head dimension. Modern architectural tricks (grouped-query attention, sliding-window layers, DeepSeek's MLA compression) exist mostly to shrink this number.

3. Runtime overhead: CUDA or Metal context, activation buffers, and memory fragmentation. Typically 5-10% of the model size, with a floor of roughly 0.5-1 GB.

The calculator above computes all three from each model's real architecture (pulled from its config.json on Hugging Face), so the totals match what you'll actually see in nvidia-smi or Activity Monitor.
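As a rough illustration of the weights and overhead terms, the sketch below reproduces the 8B figures quoted above. The function names are hypothetical, GB is taken as 10^9 bytes, and the 6% overhead fraction and 0.75 GB floor are assumed values picked from inside the ranges given; the calculator's own constants may differ.

```python
GB = 1e9  # assumption: decimal gigabytes


def weights_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Weight memory: parameter count x bytes per parameter."""
    return params_billion * 1e9 * bytes_per_weight / GB


def overhead_gb(model_gb: float, fraction: float = 0.06, floor_gb: float = 0.75) -> float:
    """Runtime overhead as a fraction of model size, never below the floor.

    Both defaults are assumptions chosen from the ranges in the text.
    """
    return max(model_gb * fraction, floor_gb)


fp16 = weights_gb(8, 2.0)   # 8B parameters at 2 bytes each
q4 = weights_gb(8, 0.61)    # same model at ~0.61 bytes per weight (Q4_K_M)
print(f"FP16: {fp16:.2f} GB, Q4_K_M: {q4:.2f} GB")  # FP16: 16.00 GB, Q4_K_M: 4.88 GB
print(f"Overhead at Q4_K_M: {overhead_gb(q4):.2f} GB")  # the floor applies here
```

For small models the floor dominates the percentage, which is why the overhead never shrinks to zero.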

Quantization: Trading Quality for VRAM

Quantization is the single most effective way to fit a bigger model on your hardware. It stores weights at lower numeric precision:

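| Format | Bytes/weight | Quality impact |
|--------|--------------|----------------|
| FP16   | 2.0          | Reference quality |
| Q8_0   | ~1.06        | Indistinguishable from FP16 |
| Q6_K   | ~0.82        | Near-lossless |
| Q5_K_M | ~0.71        | Very good |
| Q4_K_M | ~0.61        | Good (the popular default) |
| Q3_K_M | ~0.50        | Noticeable degradation |
| IQ2_M  | ~0.36        | Significant degradation |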

The practical guidance: Q4_K_M is the sweet spot for most use. Larger models tolerate aggressive quantization better than small ones — a 70B model at Q3 usually beats a 14B model at Q8. If you have spare VRAM, step up to Q5_K_M or Q6_K rather than leaving it idle.
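The "step up if you have spare VRAM" advice can be sketched as a simple lookup over the bytes-per-weight figures in the table. The function name and the `reserve_gb` allowance for KV cache and overhead are hypothetical; a real budget depends on the context size you choose.

```python
# Bytes per weight from the table above, ordered from highest precision down.
QUANTS = [
    ("FP16", 2.0), ("Q8_0", 1.06), ("Q6_K", 0.82), ("Q5_K_M", 0.71),
    ("Q4_K_M", 0.61), ("Q3_K_M", 0.50), ("IQ2_M", 0.36),
]


def best_quant_that_fits(params_billion, vram_gb, reserve_gb=2.0):
    """Return the highest-precision format whose weights fit, or None.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    """
    budget_gb = vram_gb - reserve_gb
    for name, bytes_per_weight in QUANTS:
        if params_billion * bytes_per_weight <= budget_gb:
            return name
    return None


print(best_quant_that_fits(32, 24))  # Q4_K_M
print(best_quant_that_fits(14, 16))  # Q6_K
print(best_quant_that_fits(70, 24))  # None
```

This only checks whether the weights fit; it says nothing about whether a smaller model at higher precision would be the better trade.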

The KV cache can be quantized too (llama.cpp's --cache-type-k and --cache-type-v flags, set to q8_0), which roughly halves context memory at negligible quality cost. This is worth doing for long-context work.
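To see where the "halves" figure comes from, here is a minimal sketch of the KV cache size at FP16 versus q8_0. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are an illustrative assumption, the function name is hypothetical, and the comparison assumes both keys and values are stored at the lower precision.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_element):
    """2 (keys and values) x layers x KV heads x head dim x tokens x bytes."""
    total_bytes = 2 * layers * kv_heads * head_dim * context * bytes_per_element
    return total_bytes / 1e9


# Assumed example architecture at a 32K-token context.
fp16 = kv_cache_gb(32, 8, 128, 32_768, 2.0)
q8 = kv_cache_gb(32, 8, 128, 32_768, 1.06)  # ~1.06 bytes per element for q8_0

print(f"FP16 cache: {fp16:.2f} GB")
print(f"q8_0 cache: {q8:.2f} GB ({q8 / fp16:.0%} of FP16)")
```

The ratio is independent of the architecture and context length, since only the bytes-per-element term changes.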

Multi-GPU Setups and Apple Silicon

When one GPU is not enough, you have two fundamentally different paths:

Multiple NVIDIA/AMD GPUs. VRAM stacks: two 24 GB cards hold a ~45 GB model. Layer splitting (llama.cpp, Ollama) is the easy path — it works with mismatched cards and normal PCIe slots, but speed stays at single-GPU levels. Tensor parallelism (vLLM, TensorRT-LLM) actually multiplies bandwidth, but wants identical GPUs and fast interconnect. Used 3090s remain the budget favorite for this: 24 GB and 936 GB/s per card.

Apple Silicon unified memory. A Mac Studio with 128-512 GB of unified memory can hold models that would need 4-8 discrete GPUs. Bandwidth is the trade-off: 546 GB/s (M4 Max) to 819 GB/s (M3 Ultra) versus 1000-3350 GB/s for discrete cards, so generation is slower. For models above ~50 GB, Apple Silicon is often the cheapest hardware that runs them at all.

One thing that does NOT work: combining VRAM across machines over a network, or mixing Apple unified memory with discrete GPUs. The memory pool must be on one machine.
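For the layer-splitting path with mismatched cards, one plausible placement is to hand each GPU a share of layers proportional to its VRAM. The sketch below is an assumption about how such a split could be computed, not a description of what llama.cpp or Ollama actually do; the function name is hypothetical and it treats all layers as equally sized.

```python
def split_layers(n_layers, vram_gb):
    """Assign whole layers to GPUs in proportion to each card's VRAM."""
    total = sum(vram_gb)
    # Floor of each card's proportional share.
    counts = [int(n_layers * v // total) for v in vram_gb]
    # Give leftover layers to the cards with the largest fractional share.
    by_remainder = sorted(
        range(len(vram_gb)),
        key=lambda i: (n_layers * vram_gb[i]) % total,
        reverse=True,
    )
    for i in by_remainder[: n_layers - sum(counts)]:
        counts[i] += 1
    return counts


print(split_layers(80, [24, 16]))     # [48, 32]
print(split_layers(32, [24, 12, 8]))  # [17, 9, 6]
```

Because the layers still run one after another, this placement adds capacity without adding speed, which matches the single-GPU throughput noted above.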

Frequently Asked Questions

Common questions about the LLM VRAM Calculator

As a rule of thumb at the popular Q4_K_M quantization: 7-9B models need about 6-8 GB, 13-14B models need about 10-12 GB, 27-32B models need about 20-24 GB, and 70B models need about 43-48 GB (two 24 GB GPUs). Add more for long context windows — the KV cache grows linearly with context. This calculator computes the exact number for any model, quantization, and context size.

Total VRAM = model weights + KV cache + runtime overhead. Weights = parameter count × bytes per parameter (2 bytes at FP16, ~0.6 bytes at Q4_K_M). KV cache = 2 × layers × KV heads × head dimension × context length × bytes per element; this is the term that grows when you increase context. Overhead (~6%, minimum ~0.75 GB) covers CUDA/Metal buffers, activations, and memory fragmentation.
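The same formula can be turned around to ask how much context fits in a given amount of VRAM. This is a minimal sketch under stated assumptions: the function name is hypothetical, GB means 10^9 bytes, overhead is applied to the weight size, and the example architecture (32 layers, 8 KV heads, head dimension 128) is illustrative rather than taken from any particular config.json.

```python
def max_context_tokens(vram_gb, params_billion, bytes_per_weight,
                       layers, kv_heads, head_dim,
                       kv_bytes=2.0, overhead_frac=0.06, overhead_floor_gb=0.75):
    """Largest context whose KV cache fits after weights and overhead."""
    weights_gb = params_billion * bytes_per_weight
    overhead_gb = max(weights_gb * overhead_frac, overhead_floor_gb)
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    # KV cache cost of one token: keys and values across every layer.
    bytes_per_token = 2 * layers * kv_heads * head_dim * kv_bytes
    return max(0, int(free_bytes // bytes_per_token))


# Assumed example: an 8B model at Q4_K_M on a 24 GB card, FP16 cache.
print(max_context_tokens(24, 8, 0.61, layers=32, kv_heads=8, head_dim=128))

# A model whose weights alone exceed VRAM leaves no room for context.
print(max_context_tokens(24, 70, 0.61, layers=80, kv_heads=8, head_dim=128))  # 0
```

The result is an upper bound on what memory allows; the model's own trained context limit may be lower.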

Quantization stores model weights at lower precision to save memory. Compared to FP16 (2 bytes per weight): Q8_0 roughly halves memory with virtually no quality loss, Q4_K_M cuts it to about 30% with minor quality loss (the most popular choice), and 2-bit quants cut it to about 18% with significant quality loss. A 70B model goes from 141 GB at FP16 to about 43 GB at Q4_K_M: the difference between needing a server rack and two consumer GPUs.

The KV cache stores the attention keys and values for every token in your context window, so the model does not have to recompute them for each new token. It grows linearly with context length — at 8K context it is usually small, but at 128K it can exceed the size of the model weights themselves. Models with grouped-query attention (most modern LLMs), sliding-window layers (Gemma), or MLA compression (DeepSeek) need dramatically less KV cache than older architectures.

Not entirely on the GPU at useful quality. A 70B model at Q4_K_M needs about 43 GB. Your options: 1) Use two 24 GB GPUs (llama.cpp and vLLM split models across GPUs automatically). 2) Run a smaller quant like IQ2_M (~25 GB) — quality suffers noticeably. 3) Offload some layers to system RAM — works but generation slows to a crawl (often under 2 tokens/sec). 4) Use the 70B model's smaller siblings (Qwen3 32B and Gemma 3 27B at Q4 fit in 24 GB and are surprisingly capable).

Yes — VRAM capacity stacks across GPUs. Layer split (llama.cpp/Ollama default) puts different layers on different GPUs; it works with mismatched cards over PCIe but does not increase speed. Tensor parallel (vLLM, TensorRT-LLM) splits every layer across identical GPUs and scales bandwidth almost linearly, but needs fast interconnect. Apple Silicon cannot combine memory across machines — the unified memory ceiling is the limit.

Yes, very well. Apple Silicon shares one memory pool between CPU and GPU, so a 128 GB M4 Max can hold models that would need multiple NVIDIA GPUs. By default macOS lets the GPU use about 75% of unified memory (this calculator accounts for that). The trade-off is bandwidth: M4 Max (546 GB/s) is roughly half an RTX 4090 (1008 GB/s), so generation is slower — but for large models that do not fit in 24 GB of VRAM at all, slow beats impossible.
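The 75% default can be folded into a quick fit check. The helper names below are hypothetical and the fraction is the approximate default quoted above, so treat the result as a planning estimate.

```python
def usable_gb(unified_memory_gb, gpu_fraction=0.75):
    """Memory the GPU can address under the assumed ~75% default."""
    return unified_memory_gb * gpu_fraction


def fits_on_mac(required_gb, unified_memory_gb, gpu_fraction=0.75):
    """True if the model's total requirement fits in GPU-usable memory."""
    return required_gb <= usable_gb(unified_memory_gb, gpu_fraction)


print(usable_gb(128))       # 96.0
print(fits_on_mac(43, 64))  # True: about 48 GB usable
print(fits_on_mac(43, 48))  # False: about 36 GB usable
```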

Inference engines fall back to CPU offloading — some layers run on the GPU, the rest on the CPU from system RAM. It works, but every offloaded layer is bottlenecked by system RAM bandwidth (50-90 GB/s vs 1000+ GB/s on a GPU), so speed drops sharply. A model that is 20% offloaded can be 3-5x slower than one that fits entirely. If you see single-digit tokens/sec, this is usually why.
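A back-of-the-envelope model shows why a small offloaded share costs so much. The sketch assumes generation is purely memory-bandwidth bound, that every weight is read once per token, and that GPU and CPU layers run one after another; the function name and default bandwidths are illustrative values taken from the ranges above.

```python
def slowdown(offloaded_fraction, gpu_gbps=1000.0, cpu_gbps=70.0):
    """Per-token time relative to a model that sits entirely in VRAM."""
    gpu_time = (1 - offloaded_fraction) / gpu_gbps  # share read from VRAM
    cpu_time = offloaded_fraction / cpu_gbps        # share read from system RAM
    return (gpu_time + cpu_time) * gpu_gbps


for cpu_bw in (50.0, 70.0, 90.0):
    print(f"20% offloaded, {cpu_bw:.0f} GB/s RAM: {slowdown(0.2, cpu_gbps=cpu_bw):.1f}x slower")
```

Under these assumptions a 20% offload lands between roughly 3x and 5x, in line with the figure quoted above, because the slow 20% dominates the total time per token.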

Mixture-of-Experts models like Qwen3 30B-A3B only read about 3B parameters per token, which makes them fast. But all 30B parameters must still be loaded in memory, because different tokens route to different experts and the engine cannot predict which. So MoE models need the VRAM of their total size but generate at the speed of their active size — the best of one world, the cost of the other.
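The split between memory cost and speed can be sketched numerically. The function name is hypothetical, and the tokens-per-second figure is only a bandwidth-bound ceiling that ignores KV cache reads, compute, and routing overhead, so real throughput will be lower.

```python
def moe_estimate(total_params_b, active_params_b, bytes_per_weight, bandwidth_gbps):
    """Return (memory in GB, bandwidth-bound tokens/sec ceiling)."""
    memory_gb = total_params_b * bytes_per_weight            # every expert stays loaded
    read_per_token_gb = active_params_b * bytes_per_weight   # only routed experts are read
    return memory_gb, bandwidth_gbps / read_per_token_gb


# Assumed setup: 30B total, 3B active, Q4_K_M, 936 GB/s of memory bandwidth.
moe_mem, moe_tps = moe_estimate(30, 3, 0.61, 936)
dense_mem, dense_tps = moe_estimate(30, 30, 0.61, 936)  # dense model of the same size

print(f"Memory: MoE {moe_mem:.1f} GB vs dense {dense_mem:.1f} GB")
print(f"Speed ceiling ratio: {moe_tps / dense_tps:.0f}x")
```

Both models occupy the same memory, while the ceiling on generation speed scales with the ratio of total to active parameters.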

Within about 5-10% for typical setups. Architecture specs (layers, KV heads, head dimensions) come from each model's actual config.json on Hugging Face, and quantization sizes match real GGUF file sizes. Real usage varies by inference engine: llama.cpp, vLLM, ExLlama, and MLX each have different overhead, context pre-allocation, and padding behavior. Treat results as a close estimate for planning hardware, not a guarantee.
