A common piece of local-AI advice is "give it all your cores." It sounds obviously right — more threads, more parallelism, more speed. So we tested it: we swept thread counts from 1 to 12 on a 6-physical-core CPU and watched the tokens-per-second curve. The result contradicted the advice. Throughput peaked at 8 threads, and running at all 12 logical cores was slower than running at 8.
This is experiment #9 in our open-source local-LLM benchmark series, and like the rest of it, everything below is reproducible on a stock llama.cpp release. Here's exactly where the curve bends, and why core count is the wrong number to optimize for.
The results, up front
We ran Qwen2.5-Coder-3B-Instruct CPU-only on an Intel Core i7-8700 — 6 physical cores, 12 logical (Hyper-Threading on) — and measured mean generation throughput at each thread count over the same 12-prompt suite we use across the series.
| Threads | tok/s | Notes |
|---|---|---|
| 1 | 5.64 | single core |
| 2 | 9.15 | +62% over 1 thread |
| 4 | 11.76 | still climbing |
| 6 | 11.57 | = physical core count; marginally below 4 |
| 8 | 13.0 | measured peak |
| 12 | 12.69 | all logical cores — 2.4% slower than 8 |
The short version: the steep gains all happen while you are filling physical cores. Past that, hyperthreading buys you a small, fragile bonus at best, and the OS default of "use every logical core" is measurably the wrong setting. Below is why.
Why thread count matters for CPU inference
GPU inference is bound by fixed GPU parallelism; you don't pick a thread count. CPU inference is different: throughput scales with thread count, but only up to a ceiling that sits close to your physical core count.
The reason traces straight back to how token generation works. Each forward pass has to stream essentially all of the model's weights from memory to produce a single token, and there is almost no weight reuse between tokens. That makes generation memory-bandwidth-bound, not compute-bound. More threads help only because more cores can cooperate on that memory transfer, and the benefit is largest while each thread has a physical core to itself.
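To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes every generated token streams the full set of weights once, uses the approximate 2.0 GB Q4_K_M file size from our setup, and plugs in a hypothetical 40 GB/s memory bandwidth that you should replace with your own machine's figure. The function names are ours, for illustration only.

```python
# Rough bandwidth-bound model of token generation (a sketch, not a measurement).

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tok/s if each token streams the full weights once."""
    return bandwidth_gb_s / model_size_gb


def implied_bandwidth_gb_s(tokens_per_second: float, model_size_gb: float) -> float:
    """Weight traffic per second implied by a measured tok/s."""
    return tokens_per_second * model_size_gb


MODEL_SIZE_GB = 2.0            # approximate Q4_K_M file size used in this article
ASSUMED_BANDWIDTH_GB_S = 40.0  # hypothetical value; substitute your own hardware's

ceiling = max_tokens_per_second(ASSUMED_BANDWIDTH_GB_S, MODEL_SIZE_GB)
effective = implied_bandwidth_gb_s(13.0, MODEL_SIZE_GB)  # 13.0 tok/s is our measured peak

print(f"ceiling at {ASSUMED_BANDWIDTH_GB_S:.0f} GB/s: {ceiling:.1f} tok/s")
print(f"traffic implied by 13.0 tok/s: {effective:.1f} GB/s")
```

The model ignores caches, KV-cache reads, and compute time, so treat the ceiling as an order-of-magnitude guide rather than a prediction.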
Hyper-Threading (Intel's SMT) lets each physical core present itself as two logical cores to the OS. But those two logical cores share the same execution units, the same L1/L2 cache, and — the part that matters here — the same memory bandwidth. Once every physical core already has a thread, adding more threads via hyperthreading does not give you more bandwidth. It splits the same bandwidth across competing threads and adds scheduling overhead on top.
The practical consequence: tools and OS defaults that set --threads to your logical core count are wrong for LLM inference. Your physical core count (or slightly above) is almost always the sweet spot. This is the same memory-bandwidth ceiling that governs our MoE-on-CPU benchmark and shapes what tokens-per-second to expect from local models generally.
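If you aren't sure how many physical cores you have, the sketch below shows one way to tell them apart from logical CPUs. It assumes the Linux /proc/cpuinfo layout, where each logical CPU block carries a "physical id" and "core id" field; other operating systems, and some ARM kernels, expose this differently. The SAMPLE text and the helper name are ours, for illustration.

```python
import os


def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (physical id, core id) pairs in Linux /proc/cpuinfo text."""
    cores = set()
    physical_id = core_id = None
    # The appended empty line flushes the final block of the file.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            if core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return len(cores)


# Hypothetical excerpt: four logical CPUs sitting on two physical cores.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""

print("physical cores in sample:", count_physical_cores(SAMPLE))
print("logical CPUs on this machine:", os.cpu_count())
```

On a real Linux box you would pass the contents of /proc/cpuinfo instead of SAMPLE; os.cpu_count() always reports logical CPUs, which is exactly the number this article argues against using as a default.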
How it works — physical cores vs hyperthreads
The typical pattern in a CPU thread sweep looks like this: throughput climbs steeply while physical cores are being filled (that's real bandwidth scaling), then flattens or dips once you cross into hyperthread territory, where threads compete for resources they already share.
In the accompanying diagram, each green box is a physical CPU core. The darker stripe inside a box means a hyperthread is co-occupying that core's execution resources and memory bandwidth. At 8 threads (the measured peak) four cores get one dedicated thread each and only two are sharing. At 12 threads every core is double-booked, and throughput drops back.
The setup
We kept the methodology deliberately boring so the numbers are comparable across the series: one small model, CPU-only inference, thread count swept across the full logical-core range. Both the compute threads (-t) and the batch threads (-tb) were set to the same value at each step. Throughput is mean generation tok/s over the shared 12-prompt suite, 256-token output, greedy decoding (temperature 0).
| Component | What we used |
|---|---|
| CPU | Intel Core i7-8700 — 6 physical cores, 12 logical (Hyper-Threading on), 3.20 GHz |
| Model | Qwen2.5-Coder-3B-Instruct-Q4_K_M (~2.0 GB) |
| Runtime | llama.cpp, CPU-only (-ngl 0 --device none) |
| Threads tested | 1, 2, 4, 6, 8, 12 |
| Prompt suite | 12 prompts · code / math / reasoning / summarization / chat · 256-token output each |
No GPU is involved — this is pure CPU inference, the regime most people hit on a laptop or a home server without a discrete card. If you're deciding whether you even need a GPU, our notes on how much VRAM it takes to run an LLM cover the other side of that tradeoff.
Reading the curve
The shape tells the expected story — with one small twist worth flagging honestly.
Throughput climbs sharply from 1 to 2 threads (5.64 → 9.15 tok/s, +62%) and keeps rising through 4 threads (11.76 tok/s). Then it does something slightly odd: 6 threads, exactly the physical core count, is marginally slower than 4 (11.57 vs 11.76 tok/s, about 1.6%). The gap is small enough that we wouldn't read much into it, but a plausible explanation is that with all six cores pulling on the memory controller at once, the controller, not the cores, becomes the bottleneck.
Then 8 threads climbs to the actual measured peak, 13.0 tok/s. A modest number of hyperthreads can help here: when one logical core stalls waiting on memory, the other can keep the execution units busy, so a small amount of HT hides latency and comes out net-positive.
Finally, 12 threads — all logical cores fully double-booked — drops back to 12.69 tok/s, a 2.4% regression from 8. The "use all cores" default is measurably wrong on this machine.
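The percentages quoted in this section are easy to recompute from the results table. A minimal sketch, using only the numbers we measured (the dictionary and function names are ours):

```python
# Mean generation throughput (tok/s) per thread count, from the results table.
RESULTS = {1: 5.64, 2: 9.15, 4: 11.76, 6: 11.57, 8: 13.0, 12: 12.69}


def step_changes(results):
    """Percent change in tok/s between each pair of adjacent thread counts."""
    threads = sorted(results)
    return [
        (prev, cur, (results[cur] / results[prev] - 1) * 100)
        for prev, cur in zip(threads, threads[1:])
    ]


for prev, cur, pct in step_changes(RESULTS):
    print(f"{prev:>2} -> {cur:>2} threads: {pct:+.1f}%")
```

Swap in your own sweep results to see where your curve changes sign.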
What we learned
The physical core count is the inflection point, not the exact peak. On the i7-8700, the measured optimum was 8 threads, two above the 6 physical cores. A little hyperthreading helped, most likely because it let cores hide memory-access latency during stalls. But the bonus doesn't extend further: at 12 threads, with every core double-booked on an already-saturated memory bus, contention and scheduling overhead outweighed the latency-hiding benefit, and tok/s fell by 2.4%. The rule of thumb isn't "stop at physical cores"; it's "start at physical cores and try +2, then stop well before logical cores."
The biggest gains live entirely inside the physical-core zone. Going from 1 thread to 6 took throughput from 5.64 to 11.57 tok/s — a 2.05× gain. The most you could squeeze beyond that, all the way to the 8-thread peak, was another ~12% to reach 13.0 tok/s. Everything interesting happens before the hyperthread boundary; everything after is marginal or negative.
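One way to see the diminishing returns is to divide each speedup by the number of threads that produced it. The sketch below does that with our measured numbers; "efficiency" here is just speedup per thread relative to the single-thread run, a definition we picked for illustration.

```python
# Mean generation throughput (tok/s) per thread count, from the results table.
RESULTS = {1: 5.64, 2: 9.15, 4: 11.76, 6: 11.57, 8: 13.0, 12: 12.69}


def scaling_table(results):
    """Return (threads, speedup, efficiency) rows relative to the smallest thread count."""
    base_threads = min(results)
    base = results[base_threads]
    rows = []
    for threads in sorted(results):
        speedup = results[threads] / base
        efficiency = speedup * base_threads / threads
        rows.append((threads, speedup, efficiency))
    return rows


for threads, speedup, efficiency in scaling_table(RESULTS):
    print(f"{threads:>2} threads: {speedup:.2f}x speedup, {efficiency:.0%} per-thread efficiency")
```

Per-thread efficiency falls at every step, which is what you would expect when the shared memory bus, not the cores, sets the pace.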
Using all logical cores is reliably sub-optimal for this workload. Running at 12 threads, the OS default and many tool defaults, was slower than 8, consistently, not just on one prompt. The gap is small (2.4%) but it always points the wrong way; there was no scenario where 12 beat 8 here. We'd also expect the penalty to grow with how far "use everything" overshoots: on an 8P/16L or 12P/24L chip, defaulting to all logical cores plausibly leaves more on the table than it did on this 6P/12L part, though we haven't measured those yet.
Practical takeaway: set --threads to your physical core count as a starting baseline, then try +2 to see whether a small HT bonus materializes on your CPU. Don't use the logical core count as your default. On an i7-8700 (6 physical / 12 logical) the measured optimum was 8 threads; on most modern desktop CPUs we'd expect the answer to land within about two threads of the physical core count.
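That rule of thumb is simple enough to write down as a helper. This is a minimal sketch of the heuristic described above, not part of our benchmark scripts; the function name and the `extra` parameter are hypothetical.

```python
def candidate_thread_counts(physical: int, logical: int, extra: int = 2) -> list[int]:
    """Thread counts worth testing: physical cores, then physical + extra (capped at logical)."""
    candidates = [physical]
    bumped = min(physical + extra, logical)
    if bumped != physical:
        candidates.append(bumped)
    return candidates


print(candidate_thread_counts(6, 12))  # the i7-8700 from this article
print(candidate_thread_counts(4, 4))   # a CPU without SMT has nothing to add
```

Run your benchmark at each candidate and keep whichever gives the higher tok/s.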
It's worth being clear about scope. This is one model, one CPU, one llama.cpp build. The shape of the curve — climb through physical cores, flatten or dip into hyperthreads — is general, because the underlying memory-bandwidth ceiling is general. But the exact peak shifts with CPU architecture, memory speed, BIOS power settings, OS scheduler, and llama.cpp version. The point isn't "always use 8 threads." It's that the right number is close to your physical core count, and you can find it in about two minutes by sweeping it yourself. If you're just getting started with local models, our guide on how to run an LLM locally and the broader complete guide to running local AI are good companions to this tuning.
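Once you have your own sweep, you still have to pick a number. One reasonable policy, and it is our assumption rather than anything llama.cpp prescribes, is to take the smallest thread count that lands within a small tolerance of the peak, so you don't spend extra threads on a gain that is inside the noise. The sketch below applies that to our results; the function name and the 1% default are hypothetical.

```python
# Mean generation throughput (tok/s) per thread count, from the results table.
RESULTS = {1: 5.64, 2: 9.15, 4: 11.76, 6: 11.57, 8: 13.0, 12: 12.69}


def pick_thread_count(results: dict[int, float], tolerance: float = 0.01) -> int:
    """Smallest thread count whose tok/s is within `tolerance` of the best result."""
    peak = max(results.values())
    good_enough = [t for t, tps in results.items() if tps >= peak * (1 - tolerance)]
    return min(good_enough)


print(pick_thread_count(RESULTS))                  # strict: stay within 1% of the peak
print(pick_thread_count(RESULTS, tolerance=0.10))  # looser: accept up to 10% below the peak
```

A looser tolerance trades a little throughput for fewer busy cores, which can matter if the machine is doing other work at the same time.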
Reproduce it — and send us your numbers
Everything here runs on a stock llama.cpp release with a public Qwen2.5-Coder GGUF. To sweep thread counts on your own CPU:
```bash
# drop Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf into your models dir first
export SPECBENCH_MODELS_DIR=/path/to/models
export SPECBENCH_VULKAN_EXE=/path/to/llama-server
export SPECBENCH_MODEL_BASE=Qwen2.5-Coder-3B-Instruct # default; change to match your file
python scripts/bench.py
python scripts/aggregate.py && python scripts/inject.py
```
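If you only want a quick sanity check before running the full harness, llama.cpp also ships a llama-bench tool that accepts a comma-separated list of thread counts. The sketch below just builds that command line; it is not what our scripts run, and because llama-bench generates synthetic tokens instead of using our 12-prompt suite, its numbers won't match the table exactly. The model path is a placeholder and the helper name is ours.

```python
import shlex


def llama_bench_command(model_path, thread_counts, n_gen=256, exe="llama-bench"):
    """Build a CPU-only llama-bench invocation that sweeps the given thread counts."""
    return [
        exe,
        "-m", model_path,
        "-t", ",".join(str(t) for t in thread_counts),  # comma-separated sweep
        "-p", "0",         # skip the prompt-processing test
        "-n", str(n_gen),  # tokens to generate per run
        "-ngl", "0",       # keep every layer on the CPU
    ]


cmd = llama_bench_command(
    "/path/to/models/Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf",  # placeholder path
    [1, 2, 4, 6, 8, 12],
)
print(shlex.join(cmd))
```

Paste the printed command into a shell, or hand `cmd` to `subprocess.run`, once the path points at a real GGUF file.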
This is where you come in. The whole point of the series is to build a community map of where the thread-scaling peak actually lands across real hardware. We have a 6-core Coffee Lake data point; we want yours. A modern 8P/16L desktop chip? A 16-core workstation? An Apple Silicon laptop? Each one sharpens the picture of how far past physical cores the sweet spot really sits.
Contributing takes one run and one pull request:
```bash
python scripts/make_submission.py --name "my-cpu-label"
# then open a PR adding results/community/my-cpu-label.json
```
If "use all your cores" turned out to be the wrong default on your machine too — or if your CPU behaved differently — that's exactly the kind of result worth sharing.
More from the InventiveHQ Lab: what tokens-per-second to expect from a local LLM, running a Mixture-of-Experts model on CPU, and squeezing tokens-per-watt out of a power-limited GPU.
