
What LLM Can I Run?

Detect your GPU with one click and see which LLMs your computer can actually run — ranked by whether they fit in your VRAM, need CPU offloading, or will not run at all.

100% Private - Runs Entirely in Your Browser
No data is sent to any server. All processing happens locally on your device.

How We Decide What "Runs" on Your Hardware

Whether a model runs is a memory question first and a speed question second.

Memory gate: a model needs room for its weights (parameter count × bytes per parameter at the chosen quantization), its KV cache (which grows with context length), and runtime overhead. We check this total against your GPU memory first, then against GPU memory and system RAM combined (CPU offloading).
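The memory gate can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function names, the example model, and the 1 GB figures for KV cache and overhead are all assumptions chosen for the arithmetic.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB: parameter count x bytes per parameter."""
    return params_billion * bits_per_weight / 8.0


def memory_gate(weights: float, kv_cache: float, overhead: float,
                vram_gb: float, ram_gb: float) -> str:
    """Return 'gpu', 'offload', or 'none' depending on where the total fits."""
    needed = weights + kv_cache + overhead
    if needed <= vram_gb:
        return "gpu"        # fits entirely in GPU memory
    if needed <= vram_gb + ram_gb:
        return "offload"    # fits only with system RAM added (CPU offloading)
    return "none"           # does not fit at this quantization


# Hypothetical example: an 8B model at about 4.8 bits per weight,
# with 1 GB of KV cache and 1 GB of runtime overhead (illustrative values).
w = weights_gb(8.0, 4.8)
print(memory_gate(w, 1.0, 1.0, vram_gb=8.0, ram_gb=16.0))  # gpu
print(memory_gate(w, 1.0, 1.0, vram_gb=4.0, ram_gb=16.0))  # offload
print(memory_gate(w, 1.0, 1.0, vram_gb=2.0, ram_gb=4.0))   # none
```

The sketch treats all system RAM as available; a real check would presumably reserve some for the operating system.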

Speed estimate: for models that fit, generation speed is approximately memory bandwidth divided by bytes read per token, derated to 65% for real-world engine efficiency. Models that need CPU offloading are penalized by system RAM bandwidth, which is the actual bottleneck.
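As a rough illustration of that formula, the sketch below divides bandwidth by the bytes read per token and applies the 65% derating. The function name and the example figures (a 1000 GB/s GPU, 5 GB read per token) are assumptions; for a dense model, the bytes read per token are taken to be roughly the size of the quantized weights.

```python
EFFICIENCY = 0.65  # derating factor quoted in the text


def tokens_per_second(bandwidth_gb_s: float, read_per_token_gb: float,
                      efficiency: float = EFFICIENCY) -> float:
    """Estimate generation speed from memory bandwidth alone."""
    return bandwidth_gb_s / read_per_token_gb * efficiency


# Hypothetical GPU with 1000 GB/s of bandwidth running a model that
# reads about 5 GB of weights per generated token.
print(round(tokens_per_second(1000.0, 5.0), 1))  # 130.0
```

Halving the bytes read per token (for example, by choosing a smaller quantization) doubles the estimate, which is why quantization affects speed as well as fit.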

Quantization ladder: we test each model from Q8_0 (near-lossless) down to IQ2_M (heavily compressed), and report the best quality level that fits. A model that only fits at 2-bit quantization is ranked lower because the quality loss is significant.
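The ladder walk can be sketched as a loop from highest to lowest quality that stops at the first level that fits. Only the endpoints (Q8_0, IQ2_M) and Q4_K_M are named on this page; the intermediate rungs and every bits-per-weight value below are approximate assumptions for illustration, as are the function and parameter names.

```python
# (name, approximate bits per weight), ordered from best to worst quality.
# Intermediate rungs and all bit widths are illustrative assumptions.
LADDER = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.85),
    ("Q3_K_M", 3.9),
    ("IQ2_M", 2.7),
]


def best_quant(params_billion: float, budget_gb: float, extra_gb: float = 0.0):
    """Return the highest-quality level whose weights plus extras fit the budget."""
    for name, bits in LADDER:
        needed = params_billion * bits / 8.0 + extra_gb
        if needed <= budget_gb:
            return name
    return None  # nothing fits, even at the lowest rung


# Hypothetical 12B model with 2 GB set aside for KV cache and overhead.
print(best_quant(12.0, budget_gb=12.0, extra_gb=2.0))  # Q6_K
print(best_quant(12.0, budget_gb=8.0, extra_gb=2.0))   # Q3_K_M
print(best_quant(12.0, budget_gb=4.0, extra_gb=2.0))   # None
```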

Reading Your Results: What to Actually Download

Once you know what runs, here is how to act on it:

For chat and general use: pick the largest model in your "runs great" tier. Larger models are smarter; there is rarely a reason to run a 4B model if a 12B model runs well on your hardware.

For coding: coding-tuned models (Qwen3 Coder, Devstral) outperform general models of the same size at code tasks. A 30B MoE coder that runs great beats a 70B general model that runs slowly — latency matters when you are iterating.

For long documents: check the "max context" figure on each card. A model that runs great at 8K context may not handle a 100K-token document. Models with efficient attention (Gemma sliding-window, DeepSeek MLA) degrade least at long context.

Then download it: use the Ollama Command Builder linked on each model card to get the exact pull command for the quantization that fits your hardware.

Frequently Asked Questions

Common questions about What LLM Can I Run?

Your browser exposes the GPU model name through the WebGL and WebGPU APIs (the same way games detect graphics settings). We match that name against our hardware database to get its specs. Everything happens locally in your browser; nothing is sent to a server. For privacy reasons, browsers do not reveal exact memory amounts, which is why we ask you to confirm your VRAM and RAM.
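The matching step might look something like the sketch below. The real tool runs in the browser; Python is used here only to show the lookup logic. The two-entry database, the function name, and the sample renderer string are hypothetical, and the database stores default specs that the user would still confirm.

```python
import re

# Hypothetical two-entry hardware database (default specs per GPU model).
HARDWARE_DB = {
    "rtx 4090": {"vram_gb": 24, "bandwidth_gb_s": 1008},
    "rtx 3090": {"vram_gb": 24, "bandwidth_gb_s": 936},
}


def match_gpu(renderer: str):
    """Find a known GPU model name inside a browser renderer string."""
    normalized = re.sub(r"\s+", " ", renderer.lower())
    # Check longer names first so a more specific entry wins over a prefix.
    for key in sorted(HARDWARE_DB, key=len, reverse=True):
        if key in normalized:
            return key, HARDWARE_DB[key]
    return None  # unknown or hidden renderer: fall back to manual selection


# Example of the kind of string a browser may report (illustrative).
sample = "ANGLE (NVIDIA, NVIDIA GeForce RTX 3090 Direct3D11 vs_5_0 ps_5_0, D3D11)"
print(match_gpu(sample))
print(match_gpu("Apple GPU"))  # None
```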

Runs great: the model fits entirely in your GPU memory at good quality (Q4_K_M or better) and generates at comfortable speed.

Runs with trade-offs: it works, but needs aggressive quantization (quality loss) or runs slowly.

Technically runs: requires CPU offloading or extreme compression. Expect single-digit tokens per second.

Will not run: the model does not fit in your combined GPU memory and system RAM at any quantization.

When a model does not fit entirely in GPU memory, inference engines like llama.cpp can keep some layers in system RAM and run them on the CPU. It works, but system RAM bandwidth (50-90 GB/s) is 10-40x slower than GPU memory bandwidth, so every offloaded layer drags down generation speed. A model that is 40% offloaded typically runs 3-5x slower than one that fits entirely on the GPU.
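A simple way to see why offloading hurts so much is to assume each token's weights are read in two sequential parts, one from GPU memory and one from system RAM. The sketch below uses that simplified model; the function name and the 600 GB/s and 60 GB/s bandwidth figures are assumptions, and real engines may overlap or schedule the work differently.

```python
def offload_slowdown(offloaded_fraction: float,
                     gpu_bw_gb_s: float, ram_bw_gb_s: float) -> float:
    """Slowdown factor versus running fully on the GPU (serial-read model)."""
    full_gpu_time = 1.0 / gpu_bw_gb_s
    split_time = ((1.0 - offloaded_fraction) / gpu_bw_gb_s
                  + offloaded_fraction / ram_bw_gb_s)
    return split_time / full_gpu_time


# Hypothetical setup: GPU memory is 10x faster than system RAM,
# and 40% of the model is offloaded.
print(round(offload_slowdown(0.4, gpu_bw_gb_s=600.0, ram_bw_gb_s=60.0), 1))  # 4.6
```

With a 10x bandwidth gap, this model gives a slowdown of about 4.6x at 40% offload, in line with the 3-5x figure above. Wider gaps produce larger slowdowns under the same assumptions.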

Mixture-of-Experts (MoE) models only read a fraction of their parameters for each token. GPT-OSS 120B reads just 5.1B parameters per token, so it generates roughly as fast as a 5B dense model, but it still needs memory for all 120B parameters. That is why you will sometimes see a 30B MoE model ranked "runs great" while a 32B dense model is marked slow: similar memory needs, very different speeds.
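The split between memory and speed can be made concrete with a small sketch: memory is sized from total parameters, while the speed estimate uses only the active parameters read per token. The 120B and 5.1B figures come from the text above; the 4 bits per weight, the 400 GB/s bandwidth, the 0.65 efficiency default, and the function name are assumptions for illustration.

```python
def moe_profile(total_params_b: float, active_params_b: float,
                bits_per_weight: float, bandwidth_gb_s: float,
                efficiency: float = 0.65):
    """Return (memory needed in GB, estimated tokens per second)."""
    bytes_per_param = bits_per_weight / 8.0
    memory_gb = total_params_b * bytes_per_param          # all experts stay resident
    read_per_token_gb = active_params_b * bytes_per_param  # only active experts are read
    tps = bandwidth_gb_s / read_per_token_gb * efficiency
    return memory_gb, tps


# MoE: 120B total, 5.1B active per token (figures from the text).
moe_mem, moe_tps = moe_profile(120.0, 5.1, bits_per_weight=4.0, bandwidth_gb_s=400.0)
# Dense model of the same total size, for comparison.
dense_mem, dense_tps = moe_profile(120.0, 120.0, bits_per_weight=4.0, bandwidth_gb_s=400.0)

print(moe_mem == dense_mem)              # True: identical memory footprint
print(round(moe_tps / dense_tps, 1))     # speed ratio equals 120 / 5.1
```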

For Llama 3.3 70B at Q4_K_M (about 43 GB), you need roughly 48 GB of GPU memory: two 24 GB GPUs (2x RTX 3090 or 4090), one 48 GB workstation card (RTX 6000 Ada), a Mac with 64 GB+ unified memory, or a datacenter GPU. On a 64 GB Mac it runs at about 5-7 tokens/sec; on 2x RTX 4090 with tensor parallelism, around 20-25 tokens/sec.

Safari and some privacy-focused browsers report a generic "Apple GPU" or hide the renderer string entirely. Some Linux setups report the Mesa driver name instead of the GPU model. If detection fails, just pick your hardware manually from the vendor list — the results are identical.

Yes — rankings are computed at 8K tokens of context, a realistic everyday setting. Each model card also shows the maximum context that fits on your hardware. Long contexts need significantly more memory (the KV cache grows linearly), so a model that fits at 8K may not fit at 64K.
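The linear growth is easy to check with the standard KV cache size formula for a transformer with grouped-query attention. The sketch below is an estimate under stated assumptions: a 16-bit cache (2 bytes per value) and the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128). It does not model sliding-window or MLA attention, which store less.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB; the leading 2 covers keys and values."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Llama 3 70B shape, 16-bit cache (an assumption; engines can quantize the cache).
at_8k = kv_cache_gb(80, 8, 128, context_tokens=8 * 1024)
at_64k = kv_cache_gb(80, 8, 128, context_tokens=64 * 1024)

print(round(at_8k, 2))    # 2.68
print(round(at_64k, 2))   # 21.47
```

Under these assumptions, going from 8K to 64K of context adds roughly 19 GB on top of the weights, which is enough to push a model that fit comfortably out of GPU memory.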

The quantization levels we test (Q8_0 down to IQ2_M) correspond to the standard GGUF files you will find on Hugging Face and in Ollama. When a model is ranked "runs great at Q4_K_M", that is exactly the file variant to download. Use our Ollama Command Builder to get the right pull command.
