InventiveHQ Lab · 13 posts

The Context-Length Tax: What Going 2K to 32K Actually Costs
Going from 2K to 32K context cost essentially 0 tok/s of throughput but added 1.7 GB of VRAM. We swept Qwen2.5-Coder-7B across five context sizes: the tax is memory, not speed.

More CPU Threads Made My LLM Slower: A Thread-Scaling Test
Throughput peaked near the 6 physical cores on an i7-8700, and 12 threads ran slower than 8. Here's why memory bandwidth — not core count — decides how fast a CPU runs a local model.

Flash Attention in llama.cpp: -fa Is Free Because It's Already On
We swept llama.cpp's -fa flag across 4K, 16K, and 32K context on an RTX 5060 Ti. Speed and VRAM were identical on and off — because this build already defaults flash attention on.

How Low Can You Quantize a GGUF Model Before Quality Breaks?
We swept Qwen2.5-Coder-7B from Q2 to Q8 on an RTX 5060 Ti. Output fidelity measures 21% at Q2, 57% at Q4, and 80% at Q6: the cliff is below Q4, and the sweet spot is Q4–Q5.

I Capped My GPU to 150W and Barely Lost Any Speed
We cut an RTX 5060 Ti's power limit by 17% and lost essentially zero tokens per second while gaining 16% efficiency. Local LLM decoding is memory-bandwidth-bound, not compute-bound — so watts are mostly wasted.

KV-Cache Quantization: The q4_0 Cliff Your Logs Won't Warn You About
We benchmarked f16 vs q8_0 vs q4_0 KV caches on the same model. q8_0 KV is nearly lossless (81.6% similar). q4_0 KV wrecks output quality (8.3%) while saving VRAM — and your throughput dashboard stays green the whole time.

Local LLM Benchmarks: 12 Findings From Consumer Hardware
NVIDIA publishes its inference wins on 8× DGX B300s. Almost nobody has those. So we ran the same ideas on a single consumer RTX 5060 Ti, a 2017 GTX 1080 Ti, and a CPU — and measured what actually happens. Here are all 12 experiments.

MoE on CPU: 13B-Class Answers at 3B Speed
A 30B-A3B Mixture-of-Experts model runs at dense-3B speed on an 8-year-old i7 CPU (10.6 tok/s) yet scores 8/10 on our graded set versus the 3B's 1/10. Here's why, and how to size your own box for it.

Ollama vs llama.cpp vs LM Studio: The Speed Tax, Measured
LM Studio, Ollama, and raw llama.cpp all run the same engine on the same GPU. We measured what the convenience layer costs: LM Studio adds 0.3%, Ollama adds 10%.

How Small Can a Local LLM Get Before It Can't Reason?
We swept Qwen2.5-Coder from 0.5B to 14B on one GPU. Graded pass rate climbs from 1 to 6 of 10, but you need about 7B before a model can actually chain reasoning steps.

Bigger Draft Model = Faster? A Speculative Decoding Sweep
We swept 0.5B → 3B draft models against a fixed 14B target. The 0.5B won at 1.37× — despite the lowest acceptance rate of the three. Here's why bigger drafts lose.

The VRAM Cliff: 15× Slower the Moment Layers Spill to CPU
We swept -ngl from 0 to 99 on a 14B model: 2.89 → 43 tok/s as it moves onto the GPU. Partial offload is a cliff, not a slope — and the last 8 layers matter most.

llama.cpp Speculative Decoding: Does It Work on Cheap GPUs?
We tested speculative decoding in llama.cpp on an RTX 5060 Ti, a GTX 1080 Ti, and a bare CPU. Real benchmarks: where the draft-model trick helps, and where it backfires.