InventiveHQ Lab

Local LLM Benchmarks: 12 Findings From Consumer Hardware

NVIDIA publishes its inference wins on 8× DGX B300s. Almost nobody has those. So we ran the same ideas on a single consumer RTX 5060 Ti, a 2017 GTX 1080 Ti, and a CPU — and measured what actually happens. Here are all 12 experiments.

By InventiveHQ Team

NVIDIA publishes its inference wins on eight DGX B300s. Almost nobody has those. So we ran the same ideas — speculative decoding, quantization, offload, flash attention, power limits — on a single consumer RTX 5060 Ti (~$570), plus a 2017 GTX 1080 Ti and a plain CPU, and measured what actually happens. Every result below is reproducible, the charts are built from data committed to an open-source repo, and several of these "tricks" backfire on consumer hardware — which is exactly the part the vendor blogs skip.

This is the hub for the InventiveHQ Lab Local LLM Benchmarks series. It's a living roundup: 12 experiments today, more as we (and the community) run them. If you're new to running models locally, start with our guides on what local LLM performance to expect and running local AI end to end; this series is the measured, hands-on layer underneath them.

The test rig

Every "reference" number across the series was measured on one machine — an Alienware Aurora R7:

| Component | Spec |
| --- | --- |
| CPU | Intel Core i7-8700, 6 cores / 12 threads (Coffee Lake) |
| RAM | 64 GB DDR4-2400 (4 × 16 GB, dual-channel) |
| GPU A | NVIDIA RTX 5060 Ti 16 GB (Blackwell, ~$570), PCIe 3.0 ×8 |
| GPU B | NVIDIA GTX 1080 Ti 11 GB (Pascal, 2017), used for multi-GPU experiments |
| Storage | Models on a SATA SSD volume (not NVMe) |
| OS / runtime | Windows Server 2025 · llama.cpp |

In-VRAM throughput is GPU-memory-bandwidth bound, so bus / RAM / SSD speed barely move it — but they shape the offload and CPU results. Numbers are single-user (batch 1) tokens/sec unless noted — the case a local user actually feels.
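To make "memory-bandwidth bound" concrete, here is a minimal back-of-the-envelope sketch. It assumes each generated token streams the model's weights from memory once, so the ceiling is bandwidth divided by weight size. The bandwidth figures are spec-sheet values we are assuming (only the DDR4 number is derived from the rig table), and the 7B model at about 4.5 bits per weight is a hypothetical footprint, not a measurement from the series.

```python
# Rough decode ceiling for batch-1 generation: every new token has to read
# the active weights once, so tokens/sec <= bandwidth / bytes_read_per_token.
# This ignores KV-cache reads, kernel overhead, and compute, so real numbers
# land below it.

def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec for a memory-bandwidth-bound decoder."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb

# ASSUMPTION: spec-sheet peak bandwidths in GB/s, not measured values.
BANDWIDTH_GB_S = {
    "RTX 5060 Ti (GDDR7)": 448.0,
    "GTX 1080 Ti (GDDR5X)": 484.0,
    # 2 channels x 8 bytes per transfer x 2400 MT/s
    "DDR4-2400 dual-channel": 2 * 8 * 2400 / 1000,
}

# ASSUMPTION: hypothetical 7B model at ~4.5 bits per weight (~3.9 GB).
WEIGHTS_GB = 7e9 * 4.5 / 8 / 1e9

if __name__ == "__main__":
    for name, bandwidth in BANDWIDTH_GB_S.items():
        ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
        print(f"{name:24s} ceiling ~{ceiling:6.1f} tok/s")
```

On paper the 2017 card's memory bandwidth is in the same range as the new one's, which is one plausible reason an old GPU holds up better than its age suggests, while system RAM sits roughly an order of magnitude lower.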

The 12 experiments

Each links to its full writeup, with the headline result called out.

| # | Experiment | Headline | Read |
| --- | --- | --- | --- |
| 01 | Speculative decoding: NVIDIA's 15× trick on a $570 GPU | 0.27× → 1.85×: it backfired on the fast 7B (CUDA 0.27×) but paid off on a slow 14B (1.85× on math); the right runtime flips by card | Read → |
| 02 | Runner showdown: Ollama vs llama.cpp vs LM Studio | 0.3% vs 10%: all three wrap llama.cpp, but LM Studio is ~free over raw while Ollama costs ~10% throughput | Read → |
| 03 | GGUF quantization: where quality falls off a cliff | cliff below Q4: output fidelity vs Q8 drops 80% (Q6) → 57% (Q4) → 21% (Q2); the knee is Q4–Q5 | Read → |
| 04 | KV-cache quantization: the q4_0 cache cliff | 8% similar: q8_0 KV is a fine trade, but q4_0 KV wrecked output (8% similar to f16) and won't warn you | Read → |
| 05 | The VRAM cliff: when your model doesn't fit | 2.9 → 43 tok/s: full-CPU is ~15× slower than full-GPU, and the last layers matter most | Read → |
| 06 | Context-length tax: what 2K→32K really costs | +1.7 GB, ~0 speed: the cost is VRAM and a little TTFT, not generation speed | Read → |
| 07 | Flash attention: is -fa actually free? | a no-op: toggling -fa changed nothing, because it's already on by default in current llama.cpp | Read → |
| 08 | Tokens per watt: cap the power, keep the speed | −17% W, +16% eff: capping 180 W → 150 W cost ~0 tok/s while improving efficiency ~16% | Read → |
| 09 | CPU thread scaling: more threads made it slower | peak at 6 cores: throughput plateaus at the physical cores; 12 threads is slower than 8 | Read → |
| 10 | Draft-model size: bigger draft = faster? No | 0.5B wins (1.37×): a bigger draft accepts far more but costs more per step; the smallest draft wins | Read → |
| 11 | Smallest model that reasons: how small can you go? | 1 → 6 of 10: graded reasoning becomes useful only at 7B, best at 14B, while speed falls 381 → 43 tok/s | Read → |
| 12 | MoE on CPU: 13B-class answers at 3B speed | 10.6 tok/s · 8/10: a 30B-A3B MoE runs at a dense 3B's speed on an i7-8700 yet scores 8/10 vs the 3B's 1/10 | Read → |
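Experiment 05's "the last layers matter most" follows from simple arithmetic. The sketch below is a toy model, not the series' harness: it assumes every layer costs the same, that layers run one after another, and that transfer overhead is zero. It takes experiment 05's two endpoints (43 tok/s full GPU, 2.9 tok/s full CPU) and blends them by the fraction of per-token work left on the CPU.

```python
def blended_tok_s(cpu_fraction: float,
                  gpu_tok_s: float = 43.0,
                  cpu_tok_s: float = 2.9) -> float:
    """Tokens/sec when `cpu_fraction` of the per-token work runs at CPU speed.

    Per-token time is the sum of the GPU part and the CPU part, so the two
    speeds combine harmonically rather than linearly.
    ASSUMPTION: equal-cost layers, sequential execution, no transfer cost.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    seconds_per_token = (1.0 - cpu_fraction) / gpu_tok_s + cpu_fraction / cpu_tok_s
    return 1.0 / seconds_per_token

if __name__ == "__main__":
    for percent in (0, 10, 25, 50, 100):
        speed = blended_tok_s(percent / 100)
        print(f"{percent:3d}% of layers on CPU -> ~{speed:5.1f} tok/s")
```

Under these assumptions, leaving even a tenth of the layers on the CPU costs more than half the full-GPU speed, because the slow part dominates the per-token time. Measured curves will differ, but the shape is the same cliff.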

The throughline

Three things show up again and again across all twelve experiments.

One: single-user decoding is memory-bandwidth-bound. This is why extra watts (experiment 08), extra threads (experiment 09), and a bigger declared context (experiment 06) buy little speed; why an 8-year-old card isn't as far behind as its spec sheet suggests; and why a 30B Mixture-of-Experts that reads only ~3B of weights per token runs at 3B speed on a plain CPU.
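The context-length result (experiment 06) is the flip side of this: a bigger declared context costs memory rather than generation speed. A minimal sketch of where that memory goes, assuming a standard transformer KV cache stored in f16. The layer count, KV-head count, and head dimension below are a hypothetical 7B-class grouped-query-attention config chosen for illustration; the series' actual model may differ.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """Bytes needed to hold keys and values for `n_ctx` tokens.

    The leading 2 covers one key tensor plus one value tensor per layer.
    bytes_per_elem=2.0 corresponds to an f16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# ASSUMPTION: hypothetical 7B-class GQA architecture, not taken from the series.
CONFIG = {"n_layers": 28, "n_kv_heads": 4, "head_dim": 128}

small = kv_cache_bytes(2048, **CONFIG)
large = kv_cache_bytes(32768, **CONFIG)

if __name__ == "__main__":
    print(f"KV cache at 2K context : {small / 1e9:.2f} GB")
    print(f"KV cache at 32K context: {large / 1e9:.2f} GB")
    print(f"extra VRAM for 2K->32K : {(large - small) / 1e9:.2f} GB")
```

For this assumed config the extra memory comes out in the same ballpark as the +1.7 GB headline. Models with more KV heads or layers scale the figure up proportionally.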

Two: the datacenter tricks have overhead that only pays off in the right regime. Speculation and big draft models help slow or large targets, not fast small ones — on an already-quick 7B they can make you slower.
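The regime dependence can be sketched with the standard expected-speedup model for speculative decoding. It is a simplification: it assumes each drafted token is accepted independently with probability `alpha`, that verifying a batch of drafts costs one target step, and that the runtime adds no overhead. The inputs in the example are hypothetical values picked to show both regimes, not measurements from experiments 01 or 10.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock speedup from speculative decoding.

    alpha:      probability that a drafted token is accepted (0 <= alpha < 1)
    gamma:      number of tokens drafted per round
    cost_ratio: draft step time divided by target step time
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 0 or cost_ratio < 0:
        raise ValueError("gamma and cost_ratio must be non-negative")
    # Expected tokens produced per round (accepted drafts plus one from the target).
    tokens_per_round = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of a round in units of one target step: gamma draft steps + 1 verify.
    round_cost = gamma * cost_ratio + 1.0
    return tokens_per_round / round_cost

if __name__ == "__main__":
    # Hypothetical: cheap draft, high acceptance (slow, large target).
    print(f"slow target: {expected_speedup(0.8, 4, 0.05):.2f}x")
    # Hypothetical: draft nearly as costly as the target, low acceptance.
    print(f"fast target: {expected_speedup(0.4, 4, 0.40):.2f}x")
```

When the target is already fast, the draft's relative cost rises and the ratio can drop below 1, which matches the direction of the backfire described above. The same trade-off explains why a larger draft can lose despite a higher acceptance rate.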

Three: the convenient defaults are often the wrong call, and they fail quietly. An aggressive weight quant, a q4 KV cache, "use all the threads," a heavier draft model, the popular wrapper — each can silently cost you quality or speed. The only fix is to measure your own setup.
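One low-effort way to measure your own setup is a thread sweep with `llama-bench`, the benchmarking tool that ships with llama.cpp. The sketch below only builds the command line and runs it if the binary and model are present. It is not the series' harness; the model path and the default thread counts are placeholders we chose, so adjust them to your machine.

```python
import shutil
import subprocess
from pathlib import Path

def llama_bench_cmd(model_path: str,
                    threads=(4, 6, 8, 12),
                    n_gpu_layers: int = 0,
                    n_prompt: int = 512,
                    n_gen: int = 128) -> list:
    """Build a llama-bench command that sweeps CPU thread counts.

    llama-bench accepts comma-separated values and benchmarks each one.
    """
    return [
        "llama-bench",
        "-m", model_path,                            # GGUF model file
        "-p", str(n_prompt),                         # prompt-processing tokens
        "-n", str(n_gen),                            # generated tokens
        "-t", ",".join(str(t) for t in threads),     # thread counts to sweep
        "-ngl", str(n_gpu_layers),                   # layers offloaded to GPU
        "-o", "json",                                # machine-readable output
    ]

if __name__ == "__main__":
    model = "model.gguf"  # HYPOTHETICAL path; point this at your own GGUF file
    cmd = llama_bench_cmd(model)
    print(" ".join(cmd))
    # Only run when the tool and the model are actually available.
    if shutil.which("llama-bench") and Path(model).is_file():
        subprocess.run(cmd, check=True)
```

Setting `n_gpu_layers=0` keeps the run on the CPU, which is the case where thread count matters. Comparing the reported tokens/sec across thread counts shows where your own plateau sits.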

Run it yourself

Every experiment ships its harness, raw results, and writeup, and takes community submissions — run the same suite on your hardware and open a pull request. CI validates it and rebuilds the experiment's page automatically.

Code is MIT; data and writeups are CC BY 4.0. Every number is measured, every chart is hand-authored from that data, and there isn't a single stock benchmark in the bunch.

Frequently Asked Questions

What is the InventiveHQ Lab Local LLM Benchmarks series?

It is an open, growing set of reproducible benchmarks that test datacenter LLM-inference ideas — speculative decoding, quantization, GPU offload, flash attention, power limits, and more — on hardware ordinary people can actually buy. Every experiment ships its harness, its raw measured data, and a self-contained writeup, and every chart is built from numbers committed to the public repo. The whole thing also takes community submissions, so the results grow across real-world hardware.

Why don't your speedups match NVIDIA's headline numbers?

Vendor benchmarks usually run on racks of datacenter accelerators serving many requests at once (large batch sizes), where tricks like speculative decoding compound with massive parallelism. A local user runs one request at a time (batch size 1), where generation is bound by memory bandwidth, not raw compute. That single difference is why a 15× datacenter figure shrinks to a gain of tens of percent, and sometimes a regression, on a single consumer GPU (experiment 01 ranged from 0.27× to 1.85×). The trick is real; the operating regime is completely different.

What hardware were these benchmarks run on?

The reference rig is an Alienware Aurora R7: Intel Core i7-8700 (6 cores / 12 threads), 64 GB DDR4-2400, an NVIDIA RTX 5060 Ti 16 GB (Blackwell) on PCIe 3.0 ×8, and a GTX 1080 Ti 11 GB (Pascal) for the multi-GPU experiments, running llama.cpp on Windows Server 2025 with models stored on a SATA SSD. Because in-VRAM throughput is GPU-memory-bandwidth bound, bus, RAM, and SSD speed barely move the GPU numbers, but they do shape the CPU and offload results.

Are these results reproducible?

Yes — that is the entire point. Every experiment runs on stock llama.cpp releases with publicly available GGUF models, and the benchmark harness, prompt suite, and raw results are all open source. You can run the same suite on your own machine and open a pull request to add your hardware's numbers.

What is the single most important takeaway?

Single-user local decoding is memory-bandwidth-bound. Once that clicks, most of these findings follow from it: extra watts, extra CPU threads, and a bigger declared context buy little or no speed; an 8-year-old GPU isn't as far behind as its spec sheet suggests; and a Mixture-of-Experts model that only reads a few billion parameters per token runs at small-model speed. The second lesson: convenient defaults (an aggressive quant, q4 KV cache, "use all the threads", the popular wrapper) are often the wrong call — and they fail quietly.

How can I contribute my own hardware's results?

Each experiment has a results/community/ inbox. Run the suite, bundle your results with the included make_submission.py helper (it auto-captures your CPU, RAM, GPU, PCIe link, board, and storage), and open a pull request. CI validates the submission and rebuilds the experiment's page automatically. See CONTRIBUTING.md in the repo.
