Question 1

How does the GPU detection work?

Accepted Answer

Your browser exposes the GPU model name through the WebGL and WebGPU APIs (the same way games detect graphics settings). We match that name against our hardware database to get its specs. Everything happens locally in your browser; nothing is sent to a server. For privacy reasons, browsers do not report exact memory amounts: VRAM is not exposed at all, and system RAM is at best a coarse, capped estimate. That is why we ask you to confirm your VRAM and RAM.
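As a rough illustration of the matching step, the sketch below takes a renderer string of the kind Chromium's ANGLE layer reports through WebGL and looks it up in a tiny table. The tool itself runs in the browser; Python is used here only to show the logic. `GPU_DB`, `match_gpu`, and the sample string are hypothetical, and the real database and matching rules are not described in this FAQ.

```python
import re

# Hypothetical two-entry database. The numbers are the cards' published specs.
GPU_DB = {
    "rtx 4090": {"vram_gb": 24, "bandwidth_gbs": 1008},
    "rtx 3090": {"vram_gb": 24, "bandwidth_gbs": 936},
}


def match_gpu(renderer: str):
    """Return (key, specs) for the first database key found in the string, else None."""
    text = renderer.lower()
    # Try longer keys first so a more specific entry would win over its prefix.
    for key in sorted(GPU_DB, key=len, reverse=True):
        if re.search(r"\b" + re.escape(key) + r"\b", text):
            return key, GPU_DB[key]
    return None


sample = "ANGLE (NVIDIA, NVIDIA GeForce RTX 4090 Direct3D11 vs_5_0 ps_5_0, D3D11)"
print(match_gpu(sample))
print(match_gpu("Apple GPU"))  # generic string: no match, so fall back to manual selection
```

A generic renderer string produces no match, which is the case where choosing the hardware by hand is needed.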

Question 2

What do the tiers mean?

Accepted Answer

- **Runs great**: the model fits entirely in your GPU memory at good quality (Q4_K_M or better) and generates at comfortable speed.
- **Runs with trade-offs**: it works, but needs aggressive quantization (quality loss) or runs slowly.
- **Technically runs**: requires CPU offloading or extreme compression; expect single-digit tokens per second.
- **Will not run**: the model does not fit in your combined GPU memory and system RAM at any quantization.
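The tier logic can be approximated with a memory-only check. This is a simplified sketch, not the tool's actual rule: it ignores generation speed, the `overhead_gb` allowance is an assumed figure, and the function name and the file sizes in the example are hypothetical.

```python
def classify(q4_gb, smallest_gb, vram_gb, ram_gb, overhead_gb=1.5):
    """Memory-only approximation of the four tiers.

    q4_gb       -- size of the Q4_K_M file
    smallest_gb -- size of the most aggressive quantization available
    overhead_gb -- assumed allowance for KV cache and runtime buffers
    """
    if q4_gb + overhead_gb <= vram_gb:
        return "runs great"            # good quality, fully on the GPU
    if smallest_gb + overhead_gb <= vram_gb:
        return "runs with trade-offs"  # fits on the GPU only when heavily quantized
    if smallest_gb + overhead_gb <= vram_gb + ram_gb:
        return "technically runs"      # needs CPU offloading
    return "will not run"


# Hypothetical model: 20 GB at Q4_K_M, 10 GB at its smallest quantization.
for vram, ram in [(24, 32), (12, 32), (8, 16), (4, 4)]:
    print(vram, ram, classify(20, 10, vram, ram))
```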

Question 3

What is CPU offloading and why is it slow?

Accepted Answer

When a model does not fit entirely in GPU memory, inference engines like llama.cpp can keep some layers in system RAM and run them on the CPU. It works, but system RAM bandwidth (50-90 GB/s) is 10-40x lower than GPU memory bandwidth, and generating each token means reading every layer's weights, so every offloaded layer drags down generation speed. A model that is 40% offloaded typically runs 3-5x slower than one that fits entirely on the GPU.
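A back-of-the-envelope model makes the slowdown concrete. The sketch assumes generation is purely bandwidth-bound, meaning every weight is read once per token from wherever it lives, and it ignores compute time and transfers between CPU and GPU. The function names are hypothetical, and the bandwidth figures are illustrative (936 GB/s is the RTX 3090's published memory bandwidth).

```python
def tokens_per_second(model_gb, offload_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """Upper-bound decode speed when every weight is read once per token."""
    gpu_time = model_gb * (1.0 - offload_fraction) / gpu_bw_gbs
    cpu_time = model_gb * offload_fraction / cpu_bw_gbs
    return 1.0 / (gpu_time + cpu_time)


def slowdown(model_gb, offload_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """How many times slower the partly offloaded model is than a fully resident one."""
    full = tokens_per_second(model_gb, 0.0, gpu_bw_gbs, cpu_bw_gbs)
    part = tokens_per_second(model_gb, offload_fraction, gpu_bw_gbs, cpu_bw_gbs)
    return full / part


# 30 GB model, 40% of the weights in system RAM.
print(round(slowdown(30, 0.4, gpu_bw_gbs=936, cpu_bw_gbs=90), 2))  # fast system RAM
print(round(slowdown(30, 0.4, gpu_bw_gbs=936, cpu_bw_gbs=50), 2))  # slower system RAM
```

Under these assumptions the estimate is about 4.8x with 90 GB/s system RAM and about 8x with 50 GB/s, so the penalty depends strongly on how fast the system memory is.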

Question 4

Why do some huge models run faster than smaller ones?

Accepted Answer

Mixture-of-Experts (MoE) models only read a fraction of their parameters for each token. GPT-OSS 120B reads just 5.1B parameters per token, so it generates roughly as fast as a dense 5B model, but it still needs memory for all 120B parameters. That is why you will sometimes see a 30B MoE model ranked "runs great" while a 32B dense model is marked slow: similar memory needs, very different speeds.
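The split between memory and speed can be sketched with two small formulas: memory scales with total parameters, while bandwidth-bound speed scales with the parameters actually read per token. The 4.5 bits per weight and the 1000 GB/s bandwidth are assumed round numbers, the function names are hypothetical, and the speeds are upper bounds that ignore attention, KV cache reads, and routing overhead.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB for a given parameter count and quantization."""
    return params_billion * bits_per_weight / 8


def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gbs):
    """Bandwidth-bound upper limit: only the active parameters are read per token."""
    return bandwidth_gbs / weights_gb(active_params_billion, bits_per_weight)


BITS = 4.5        # assumed average bits per weight for a 4-bit quantization
BANDWIDTH = 1000  # assumed GPU memory bandwidth in GB/s

# MoE: 120B total, 5.1B active per token (figures from the answer above).
print("MoE memory GB:", weights_gb(120, BITS))
print("MoE tok/s bound:", round(decode_tokens_per_second(5.1, BITS, BANDWIDTH)))

# Dense 32B: every parameter is read for every token.
print("Dense memory GB:", weights_gb(32, BITS))
print("Dense tok/s bound:", round(decode_tokens_per_second(32, BITS, BANDWIDTH)))
```

In this estimate the MoE model needs several times more memory than the dense 32B model, yet its speed bound is about six times higher because it reads about six times fewer weights per token.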

Question 5

What hardware do I need to run a 70B model well?

Accepted Answer

For Llama 3.3 70B at Q4_K_M (about 43 GB of weights), you need roughly 48 GB of GPU memory, because the KV cache and runtime buffers need room on top of the weights. That means two 24 GB GPUs (2x RTX 3090 or 4090), one 48 GB workstation card (RTX 6000 Ada), a Mac with 64 GB+ unified memory, or a datacenter GPU. On a 64 GB Mac it runs at about 5-7 tokens/sec; on 2x RTX 4090 with tensor parallelism, around 20-25 tokens/sec.

Question 6

My GPU was not detected. Why?

Accepted Answer

Safari and some privacy-focused browsers report a generic "Apple GPU" or hide the renderer string entirely. Some Linux setups report the Mesa driver name instead of the GPU model. If detection fails, just pick your hardware manually from the vendor list; the results are the same as with automatic detection.

Question 7

Does this account for context length?

Accepted Answer

Yes. Rankings are computed at 8K tokens of context, a realistic everyday setting. Each model card also shows the maximum context that fits on your hardware. Long contexts need significantly more memory (the KV cache grows linearly with context length), so a model that fits at 8K may not fit at 64K.
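To see how quickly the KV cache grows, the sketch below applies the standard size formula (keys and values, per layer, per KV head, per token) to Llama 3.3 70B's published architecture: 80 layers, 8 KV heads, head dimension 128. The 16-bit cache precision is an assumption, the function name is hypothetical, and the formula the tool itself uses is not described in this FAQ.

```python
def kv_cache_gib(context_tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size in GiB: 2 (K and V) x layers x KV heads x head dim x bytes x tokens."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 2**30


for tokens in (8 * 1024, 32 * 1024, 64 * 1024):
    print(tokens, "tokens:", kv_cache_gib(tokens), "GiB")
```

With these values the cache takes 2.5 GiB at 8K tokens and 20 GiB at 64K, which is why a model that fits comfortably at the default context can overflow at a long one.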

Question 8

What about quantized models I download from Hugging Face?

Accepted Answer

The quantization levels we test (Q8_0 down to IQ2_M) correspond to the standard GGUF files you will find on Hugging Face and in Ollama. When a model is ranked "runs great at Q4_K_M", that is exactly the file variant to download. Use our Ollama Command Builder to get the right pull command.
