
LLM GPU Benchmark

Speed test your GPU for AI — measure real memory bandwidth and compute with WebGPU, run an actual LLM in your browser, and see predicted speeds for every popular model on your hardware.

100% Private - Runs Entirely in Your Browser
No data is sent to any server. All processing happens locally on your device.

Reading Your Benchmark Results

The two numbers the quick test produces map directly to LLM behavior:

Memory bandwidth (GB/s) determines generation speed. A model generates one token by reading all of its active parameters from memory. At Q4_K_M quantization, an 8B model requires about 5 GB of reads per token, so 500 GB/s of bandwidth allows at most 100 tokens/sec in theory and roughly 65 tokens/sec after real-world efficiency losses. This is why the bandwidth gauge is the one to watch.
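The arithmetic behind that estimate can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function name is hypothetical, and the 0.65 efficiency factor is an assumption inferred from the 500 GB/s and 65 tokens/sec figures above.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, model_gb: float,
                          efficiency: float = 0.65) -> float:
    """Estimate generation speed for a memory-bound model.

    bandwidth_gb_s: memory bandwidth in GB/s
    model_gb:       data read per token in GB (about 5 for an 8B model at Q4_K_M)
    efficiency:     assumed fraction of peak bandwidth achieved in practice
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s * efficiency / model_gb


if __name__ == "__main__":
    # Theoretical ceiling (efficiency = 1.0) versus the assumed real-world figure.
    print(f"ceiling:   {decode_tokens_per_sec(500, 5, efficiency=1.0):.0f} tokens/sec")
    print(f"realistic: {decode_tokens_per_sec(500, 5):.0f} tokens/sec")
```

Doubling the bandwidth or halving the model size doubles the estimate, which is why smaller quantizations generate faster on the same hardware.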

Compute throughput (TFLOPS) determines prompt processing speed. When you paste a long document, all of its tokens are processed in parallel, and that workload saturates the math units rather than memory. Low compute means long waits before the first word of a response appears, even if generation is fast afterward.
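A rough way to see how compute translates into waiting time is sketched below. It assumes the common rule of thumb of about 2 floating-point operations per parameter per prompt token and ignores attention cost, which grows with prompt length. The function name and the 0.5 efficiency default are hypothetical placeholders, not values from the benchmark.

```python
def prefill_seconds(params_billion: float, prompt_tokens: int,
                    tflops: float, efficiency: float = 0.5) -> float:
    """Estimate the time to process a prompt before the first token appears.

    Assumes ~2 FLOPs per parameter per prompt token (a rule of thumb)
    and that only `efficiency` of peak TFLOPS is actually achieved.
    """
    if tflops <= 0 or efficiency <= 0:
        raise ValueError("tflops and efficiency must be positive")
    flops_needed = 2.0 * params_billion * 1e9 * prompt_tokens
    return flops_needed / (tflops * 1e12 * efficiency)


if __name__ == "__main__":
    # Hypothetical case: 8B model, 4000-token document, 20 TFLOPS GPU.
    print(f"{prefill_seconds(8, 4000, 20):.1f} s before the first token")
```

Under these assumptions, halving the compute throughput doubles the wait, while generation speed afterward is unaffected.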

The browser penalty: WebGPU measurements run 25-35% below what native software achieves on the same hardware. The "estimated native" figures and the predictions table account for this — they show what Ollama or LM Studio would actually deliver.
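To illustrate the correction, the sketch below converts a browser measurement into a native estimate using the 25-35% penalty range quoted above. The function name is hypothetical, and the exact factor the benchmark applies is not stated here, so treat the result as a range rather than a single figure.

```python
def estimated_native_range(measured: float,
                           penalty_low: float = 0.25,
                           penalty_high: float = 0.35) -> tuple[float, float]:
    """Return (lower, upper) native estimates for a WebGPU measurement.

    If the browser runs `penalty` below native, then
    native = measured / (1 - penalty).
    """
    if not (0 <= penalty_low <= penalty_high < 1):
        raise ValueError("penalties must satisfy 0 <= low <= high < 1")
    return measured / (1 - penalty_low), measured / (1 - penalty_high)


if __name__ == "__main__":
    low, high = estimated_native_range(300.0)  # 300 GB/s measured in the browser
    print(f"estimated native bandwidth: {low:.0f}-{high:.0f} GB/s")
```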

Why Run a Real Model Instead of Trusting the Quick Test?

The quick test measures your hardware's raw capabilities. The real model test (Stage 2) measures everything else that affects actual LLM performance:

Kernel efficiency: how well the inference engine's GPU code uses your specific architecture.
Memory access patterns: real attention computation is less cache-friendly than a synthetic benchmark.
Driver behavior: scheduling, power states, and thermal response under sustained load.
The full pipeline: tokenization, sampling, and detokenization overhead.

The gap between your quick-test prediction and your real-model measurement tells you how much software efficiency matters on your system. A small gap means your hardware is well-supported; a large gap usually points to driver issues, thermal throttling, or an integrated GPU being used instead of your discrete one.
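The gap can be expressed as a simple fraction of the prediction. This is a minimal sketch with a hypothetical function name. It deliberately sets no thresholds, since what counts as a "small" or "large" gap is not quantified here.

```python
def efficiency_gap(predicted_tps: float, measured_tps: float) -> float:
    """Fraction by which the real-model speed falls short of the prediction.

    0.0 means the real model matched the quick-test prediction.
    0.5 means it reached only half of the predicted tokens/sec.
    Negative values mean the real model beat the prediction.
    """
    if predicted_tps <= 0:
        raise ValueError("predicted_tps must be positive")
    return (predicted_tps - measured_tps) / predicted_tps


if __name__ == "__main__":
    # Hypothetical numbers: 65 tokens/sec predicted, 52 tokens/sec measured.
    print(f"gap: {efficiency_gap(65, 52):.0%}")
```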

For the most accurate picture of a specific model you plan to run, nothing beats downloading it in Ollama or LM Studio and testing it directly, but this benchmark gives you a close estimate in a fraction of the time.

Frequently Asked Questions

Common questions about the LLM GPU Benchmark

Modern browsers expose your GPU through WebGPU, the successor to WebGL. The quick test runs compute shaders that read a large buffer (measuring memory bandwidth) and execute millions of fused multiply-add operations (measuring compute throughput). The real model test goes further: it loads an actual LLM into your GPU via the WebLLM library and generates text. Everything runs locally — no data leaves your machine.
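Once a shader run has been timed, turning it into the two headline numbers is plain arithmetic. The sketch below shows one plausible way to do it. The function names, buffer size, and timings are hypothetical and not taken from the tool, and a fused multiply-add is counted as 2 floating-point operations, which is the usual convention.

```python
def bandwidth_gb_per_s(bytes_read: int, seconds: float) -> float:
    """Memory bandwidth: bytes moved divided by elapsed time, in GB/s."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_read / seconds / 1e9


def compute_tflops(fma_ops: int, seconds: float) -> float:
    """Compute throughput: each fused multiply-add counts as 2 FLOPs."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2 * fma_ops / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical run: a 1 GiB buffer read 20 times in 0.05 s.
    print(f"{bandwidth_gb_per_s(20 * 1024**3, 0.05):.1f} GB/s")
    # Hypothetical run: 5e11 fused multiply-adds in 0.05 s.
    print(f"{compute_tflops(500_000_000_000, 0.05):.1f} TFLOPS")
```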

Generating each token requires reading every active model parameter from GPU memory, but only doing about 2 math operations per parameter read. GPUs can do hundreds of operations in the time it takes to read one value, so the math units spend most of their time waiting for memory. Your bandwidth number divided by the model size (in bytes) is approximately your maximum tokens per second.
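The comparison can be made concrete by setting the operations per byte a GPU can sustain against the operations per byte that token generation needs. This is a hedged sketch: the function names are hypothetical, the 40 TFLOPS and 500 GB/s figures are illustrative rather than measured, and 0.625 bytes per parameter corresponds to an 8B model stored in about 5 GB.

```python
def hardware_ops_per_byte(tflops: float, bandwidth_gb_s: float) -> float:
    """Math operations the GPU can perform in the time it reads one byte."""
    return (tflops * 1e12) / (bandwidth_gb_s * 1e9)


def workload_ops_per_byte(bytes_per_param: float,
                          ops_per_param: float = 2.0) -> float:
    """Math operations token generation needs per byte of weights read."""
    return ops_per_param / bytes_per_param


def is_memory_bound(tflops: float, bandwidth_gb_s: float,
                    bytes_per_param: float) -> bool:
    """True when the workload needs fewer ops per byte than the GPU offers."""
    return (workload_ops_per_byte(bytes_per_param)
            < hardware_ops_per_byte(tflops, bandwidth_gb_s))


if __name__ == "__main__":
    print(f"hardware: {hardware_ops_per_byte(40, 500):.0f} ops/byte")
    print(f"workload: {workload_ops_per_byte(0.625):.1f} ops/byte")
    print("memory-bound:", is_memory_bound(40, 500, 0.625))
```

With these illustrative numbers the GPU could do far more math per byte than generation asks for, so the math units sit idle while waiting on memory.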

WebGPU adds overhead: stricter security checks, less optimized kernels than hand-tuned CUDA/Metal code, and FP32-only compute paths in many cases. Browser inference typically achieves 65-75% of native performance. The benchmark accounts for this: the "estimated native bandwidth" and predicted model speeds are adjusted to reflect what you would get from Ollama, llama.cpp, or LM Studio on the same hardware.

The real-model test downloads open-weight models (Qwen, Llama, SmolLM) directly from Hugging Face — the same files everyone uses. Your browser caches them in its storage system (IndexedDB/Cache API), so repeat tests skip the download. You can clear them anytime via your browser settings (clear site data). Nothing is installed on your system.

The benchmark requires WebGPU, which is fully supported in Chrome 113+, Edge 113+, and Safari 18+. Firefox has WebGPU support in progress (available in nightly builds). If your browser does not support WebGPU, you can still use our LLM Inference Speed Calculator, which estimates speed from your GPU's published specs instead of measuring it.

Results below your GPU's published specs have several possible causes:
1) Browser overhead: expect 25-35% below spec. This is normal and accounted for.
2) Other tabs or apps using the GPU: close them and re-run.
3) Laptop power management: plug in and set performance mode.
4) On laptops with two GPUs, the browser may be using the integrated GPU instead of the discrete one: check your browser's GPU settings (chrome://gpu).
5) Thermal throttling on thin laptops.

For memory bandwidth (the number that matters most), rough tiers are:
Under 100 GB/s (integrated graphics): only small models run well.
200-500 GB/s (mainstream GPUs, Apple M-series): 7-14B models run great.
500-1000 GB/s (high-end consumer: RTX 4080/4090, M-series Max): 30B models are comfortable.
1000+ GB/s (RTX 5090, datacenter): 70B models become practical.
The predicted speeds table translates your exact number into real model performance.
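As a small illustration, the tiers above can be written as a lookup. The function name is hypothetical. Two choices here are assumptions rather than anything stated above: a value exactly on a boundary (500 or 1000 GB/s) is placed in the higher tier, and the 100-200 GB/s range, for which no guidance is given, is reported as such.

```python
# (lower bound in GB/s, description), checked from the highest tier down.
TIERS = [
    (1000.0, "70B models become practical"),
    (500.0, "30B models are comfortable"),
    (200.0, "7-14B models run great"),
]


def bandwidth_tier(bandwidth_gb_s: float) -> str:
    """Map a memory bandwidth figure to the rough tier it falls in."""
    if bandwidth_gb_s < 0:
        raise ValueError("bandwidth must be non-negative")
    for lower_bound, description in TIERS:
        if bandwidth_gb_s >= lower_bound:
            return description
    if bandwidth_gb_s < 100.0:
        return "only small models run well"
    # The tiers leave 100-200 GB/s unspecified, so no claim is made here.
    return "between tiers (no guidance given for 100-200 GB/s)"


if __name__ == "__main__":
    for bw in (60, 150, 400, 900, 1800):
        print(f"{bw:>5} GB/s: {bandwidth_tier(bw)}")
```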
