NVIDIA publishes its inference wins on eight DGX B300s. Almost nobody has those. So we ran the same ideas — speculative decoding, quantization, offload, flash attention, power limits — on a single consumer RTX 5060 Ti (~$570), plus a 2017 GTX 1080 Ti and a plain CPU, and measured what actually happens. Every result below is reproducible, the charts are built from data committed to an open-source repo, and several of these "tricks" backfire on consumer hardware — which is exactly the part the vendor blogs skip.
This is the hub for the InventiveHQ Lab Local LLM Benchmarks series. It's a living roundup: 12 experiments today, more as we (and the community) run them. If you're new to running models locally, start with our guides on what local LLM performance to expect and running local AI end to end; this series is the measured, hands-on layer underneath them.
The test rig
Every "reference" number across the series was measured on one machine — an Alienware Aurora R7:
| Component | Spec |
|---|---|
| CPU | Intel Core i7-8700 — 6 cores / 12 threads (Coffee Lake) |
| RAM | 64 GB DDR4-2400 (4 × 16 GB, dual-channel) |
| GPU A | NVIDIA RTX 5060 Ti 16 GB (Blackwell, ~$570) — PCIe 3.0 ×8 |
| GPU B | NVIDIA GTX 1080 Ti 11 GB (Pascal, 2017) — multi-GPU experiments |
| Storage | Models on a SATA SSD volume (not NVMe) |
| OS / runtime | Windows Server 2025 · llama.cpp |
In-VRAM throughput is GPU-memory-bandwidth bound, so bus / RAM / SSD speed barely move it — but they shape the offload and CPU results. Numbers are single-user (batch 1) tokens/sec unless noted — the case a local user actually feels.
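The bandwidth-bound claim can be turned into a rough ceiling. Each generated token has to stream the active weights from memory once, so tokens/sec cannot exceed memory bandwidth divided by bytes read per token. The Python sketch below is illustrative only and rests on stated assumptions: roughly 4.5 bits per weight for a Q4-class GGUF, a spec-sheet figure of 448 GB/s for the RTX 5060 Ti (not measured in this series), and the theoretical peak of the dual-channel DDR4-2400 listed in the table. The function and constant names are hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float = 4.5) -> float:
    """Upper bound on batch-1 tokens/sec if decoding is purely bandwidth-bound."""
    # Bytes of weights that must be read to produce one token.
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Theoretical peak of dual-channel DDR4-2400: 2 channels x 8 bytes x 2400 MT/s.
DDR4_2400_DUAL_GB_S = 2 * 8 * 2400 / 1000

# ASSUMPTION: spec-sheet bandwidth for the RTX 5060 Ti 16 GB, not a measurement.
RTX_5060_TI_GB_S = 448.0

for name, bw in [("system RAM", DDR4_2400_DUAL_GB_S), ("GPU VRAM", RTX_5060_TI_GB_S)]:
    print(f"{name}: 7B ceiling ~ {decode_ceiling_tok_s(bw, 7):.1f} tok/s")
```

Real throughput lands below this ceiling because of KV-cache reads, compute, and runtime overhead, so the estimate is mainly useful for ratios: the same model has a far lower ceiling from system RAM than from VRAM, which is why RAM speed shapes the offload and CPU results but not the in-VRAM ones. Under the same assumptions, 3B of active parameters read from this RAM gives a ceiling of about 22.8 tok/s, above the 10.6 tok/s reported for the MoE run in experiment 12.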
The 12 experiments
Each links to its full writeup, with the headline result called out.
| # | Experiment | Headline | Read |
|---|---|---|---|
| 01 | Speculative decoding — NVIDIA's 15× trick on a $570 GPU | 0.27× → 1.85× — it backfired on the fast 7B (CUDA 0.27×) but paid off on a slow 14B (1.85× on math); the right runtime flips by card | Read → |
| 02 | Runner showdown — Ollama vs llama.cpp vs LM Studio | 0.3% vs 10% — all three wrap llama.cpp, but LM Studio adds only ~0.3% overhead over raw llama.cpp while Ollama costs ~10% throughput | Read → |
| 03 | GGUF quantization — where quality falls off a cliff | cliff below Q4 — output fidelity vs Q8 falls from 80% (Q6) to 57% (Q4) to 21% (Q2); the knee is Q4–Q5 | Read → |
| 04 | KV-cache quantization — the q4_0 cache cliff | 8% similar — q8_0 KV is a fine trade, but q4_0 KV wrecked output (8% similar to f16) and won't warn you | Read → |
| 05 | The VRAM cliff — when your model doesn't fit | 2.9 → 43 tok/s — full-CPU is ~15× slower than full-GPU, and the last layers matter most | Read → |
| 06 | Context-length tax — what 2K→32K really costs | +1.7 GB, ~0 speed — the cost is VRAM and a little TTFT, not generation speed | Read → |
| 07 | Flash attention — is -fa actually free? | a no-op — toggling -fa changed nothing, because it's already on by default in current llama.cpp | Read → |
| 08 | Tokens per watt — cap the power, keep the speed | −17% W, +16% eff — capping 180 W → 150 W cost ~0 tok/s while improving efficiency ~16% | Read → |
| 09 | CPU thread scaling — more threads made it slower | peak at 6 cores — throughput plateaus at the physical cores; 12 threads is slower than 8 | Read → |
| 10 | Draft-model size — bigger draft = faster? No | 0.5B wins (1.37×) — a bigger draft accepts far more but costs more per step; the smallest draft wins | Read → |
| 11 | Smallest model that reasons — how small can you go? | 1 → 6 of 10 — graded reasoning becomes useful only at 7B, best at 14B, while speed falls 381 → 43 tok/s | Read → |
| 12 | MoE on CPU — 13B-class answers at 3B speed | 10.6 tok/s · 8/10 — a 30B-A3B MoE runs at a dense 3B's speed on an i7-8700 yet scores 8/10 vs the 3B's 1/10 | Read → |
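The efficiency figure in experiment 08 is throughput divided by power draw (tokens/sec per watt, which is tokens per joule). A minimal Python sketch of that arithmetic follows, using the 180 W and 150 W caps from the table. The 40.0 tok/s throughput is a hypothetical placeholder held constant to isolate the effect of the cap; the measured draw and throughput live in the experiment's own writeup.

```python
def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    """Tokens generated per joule of energy (tok/s divided by W)."""
    return tok_per_s / watts


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# HYPOTHETICAL throughput, assumed unchanged by the power cap.
tok_s = 40.0

power_delta = pct_change(180.0, 150.0)
eff_delta = pct_change(tokens_per_joule(tok_s, 180.0),
                       tokens_per_joule(tok_s, 150.0))
print(f"power {power_delta:+.1f}%, efficiency {eff_delta:+.1f}%")
```

If throughput were perfectly flat and actual draw matched the cap exactly, efficiency would rise by 20%. The reported gain of about 16% is consistent with measured draw or throughput shifting slightly from those nominal values.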
The throughline
Three things show up again and again across all twelve experiments.
One: single-user decoding is memory-bandwidth-bound. This is why extra watts (experiment 08), extra threads (experiment 09), and a bigger declared context (experiment 06) buy little speed; why an 8-year-old card isn't as far behind as its spec sheet suggests; and why a 30B Mixture-of-Experts that reads only ~3B of weights per token runs at 3B speed on a plain CPU.
Two: the datacenter tricks have overhead that only pays off in the right regime. Speculation and big draft models help slow or large targets, not fast small ones — on an already-quick 7B they can make you slower.
Three: the convenient defaults are often the wrong call, and they fail quietly. An aggressive weight quant, a q4 KV cache, "use all the threads," a heavier draft model, the popular wrapper — each can silently cost you quality or speed. The only fix is to measure your own setup.
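As one way to act on "measure your own setup", here is a hedged Python sketch of a CPU thread sweep built on llama.cpp's llama-bench tool. It is not the series' harness. It assumes llama-bench is on your PATH, that its JSON output arrives on stdout as an array of records with an avg_ts (average tokens/sec) field, and that the model path you pass exists. The helper names are hypothetical.

```python
import json
import subprocess


def bench_cmd(model_path: str, threads: int, n_gen: int = 128) -> list[str]:
    """Build a llama-bench invocation for one CPU thread count (JSON output)."""
    return [
        "llama-bench",       # ASSUMPTION: binary is on PATH
        "-m", model_path,
        "-p", "0",           # skip the prompt-processing test
        "-n", str(n_gen),    # measure token generation only
        "-t", str(threads),
        "-ngl", "0",         # keep every layer on the CPU
        "-o", "json",
    ]


def sweep(model_path: str, thread_counts: list[int]) -> list[dict]:
    """Run llama-bench once per thread count and collect its JSON records."""
    records = []
    for t in thread_counts:
        out = subprocess.run(bench_cmd(model_path, t), check=True,
                             capture_output=True, text=True).stdout
        # ASSUMPTION: stdout holds only the JSON array of result records.
        records.extend(json.loads(out))
    return records


def best_by_throughput(records: list[dict]) -> dict:
    """Return the record with the highest average tokens/sec."""
    return max(records, key=lambda r: r["avg_ts"])
```

A call such as `best_by_throughput(sweep("model.gguf", [4, 6, 8, 12]))`, with a placeholder model path, would show whether your machine also peaks at its physical core count, as the i7-8700 did in experiment 09.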
Run it yourself
Every experiment ships its harness, raw results, and writeup, and takes community submissions — run the same suite on your hardware and open a pull request. CI validates it and rebuilds the experiment's page automatically.
- Repo: github.com/InventiveHQ/local-llm-benchmarks
- Contribute your hardware's numbers: see CONTRIBUTING.md. The make_submission.py helper auto-captures your rig into the submission.
Code is MIT; data and writeups are CC BY 4.0. Every number is measured, every chart is hand-authored from that data, and there isn't a single stock benchmark in the bunch.
