How many threads should I set for CPU LLM inference?

Start at your physical core count, not your logical (hyperthreaded) core count. On our 6-physical-core / 12-logical i7-8700, throughput climbed steeply while physical cores were being filled, then flattened. The measured optimum was 8 threads — two above the physical-core count — because a small amount of hyperthreading helps hide memory latency. The reliable rule is "physical cores, then try +2," and never default to all logical cores.

Why did more threads make my model slower?

CPU token generation is bound by memory bandwidth, not raw compute. Every token requires streaming all of the model's weights through memory. Once each physical core already has a thread, adding more threads via hyperthreading does not add bandwidth: the two logical cores on a physical core share the same execution units, caches, and path to memory. So extra threads split the same fixed bandwidth while adding OS scheduling overhead, which can push throughput down.

What is the difference between physical and logical cores for this?

A physical core is real silicon with its own execution units. Hyper-Threading (Intel SMT) presents each physical core as two logical cores to the operating system, but those two logical cores share the same hardware and memory bandwidth. For compute-light, bandwidth-heavy work like LLM token generation, the second logical core on a core mostly competes for bandwidth rather than adding it — which is why logical core count is the wrong number to use.

Does using all my CPU cores speed up llama.cpp?

Not for token generation. On our test, running at 12 threads (all logical cores, which is what the OS and many tools default to) was 2.4% slower than running at 8. The gap is small but it consistently goes the wrong direction. The biggest gains all happened inside the physical-core zone: going from 1 to 6 threads more than doubled throughput; everything past the peak was marginal or negative.

Does thread count change the quality of the model's output?

No. Thread count only affects how fast tokens are generated, not what tokens are produced. You are tuning throughput and latency, never accuracy. The only thing that changes between 1 thread and 12 is wall-clock speed.

More CPU Threads Made My LLM Slower: A Thread-Scaling Test

Do these thread-scaling results apply to other CPUs?

The shape of the curve generalizes: throughput rises while physical cores fill, then flattens or dips in hyperthread territory. The exact peak depends on your CPU architecture, memory speed, BIOS power settings, OS scheduler, and llama.cpp version. The penalty for overshooting is likely larger on CPUs with more cores (for example 8P/16L or 12P/24L), where defaulting to all logical cores adds more surplus threads. Run the sweep on your own hardware to find your number.

A common piece of local-AI advice is "give it all your cores." It sounds obviously right — more threads, more parallelism, more speed. So we tested it: we swept thread counts from 1 to 12 on a 6-physical-core CPU and watched the tokens-per-second curve. The result contradicted the advice. Throughput peaked at 8 threads, and running at all 12 logical cores was slower than running at 8.

This is experiment #9 in our open-source local-LLM benchmark series, and like the rest of it, everything below is reproducible on a stock llama.cpp release. Here's exactly where the curve bends, and why core count is the wrong number to optimize for.

The results, up front

We ran Qwen2.5-Coder-3B-Instruct CPU-only on an Intel Core i7-8700 — 6 physical cores, 12 logical (Hyper-Threading on) — and measured mean generation throughput at each thread count over the same 12-prompt suite we use across the series.

Threads	tok/s	Notes
1	5.64	single core
2	9.15	+62% over 1 thread
4	11.76	still climbing
6	11.57	= physical core count; marginally below 4
8	13.0	measured peak
12	12.69	all logical cores — 2.4% slower than 8
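The percentages in the Notes column can be rechecked with a few lines of Python. This is only a sanity check on the table's arithmetic; the dictionary simply copies the tok/s values above, and the helper name `pct_change` is ours.

```python
# Mean generation throughput (tok/s) by thread count, copied from the table above.
RESULTS = {1: 5.64, 2: 9.15, 4: 11.76, 6: 11.57, 8: 13.0, 12: 12.69}


def pct_change(new, old):
    """Percentage change going from old to new."""
    return (new - old) / old * 100.0


# Thread count with the highest measured throughput.
peak_threads = max(RESULTS, key=RESULTS.get)
# Gain from the second thread, and loss from using every logical core.
gain_2_vs_1 = pct_change(RESULTS[2], RESULTS[1])
drop_12_vs_peak = -pct_change(RESULTS[12], RESULTS[peak_threads])

print(f"peak at {peak_threads} threads")
print(f"2 vs 1 thread: +{gain_2_vs_1:.0f}%")
print(f"12 vs {peak_threads} threads: {drop_12_vs_peak:.1f}% slower")
```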

The short version: the steep gains all happen while you are filling physical cores. Past that, hyperthreading buys you a small, fragile bonus at best, and the OS default of "use every logical core" is measurably the wrong setting. Below is why.

Why thread count matters for CPU inference

GPU inference runs on the GPU's fixed parallelism, so you don't pick a thread count. CPU inference is different: throughput scales with thread count, but only up to a ceiling that sits close to your physical core count.

The reason traces straight back to how token generation works. Each forward pass has to stream all of the model's weights through memory to produce a single token, and there is almost no weight reuse between tokens. That makes generation memory-bandwidth-bound, not compute-bound. More threads help only because more cores can cooperate on that memory transfer, and only while each thread is running on its own dedicated silicon.
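A rough way to see what "memory-bandwidth-bound" implies is a back-of-envelope ceiling: if every token has to read all the weights once, tok/s cannot exceed memory bandwidth divided by model size. The sketch below uses the roughly 2.0 GB model file from this test. The 40 GB/s bandwidth figure is a placeholder assumption, not something we measured, and the estimate ignores KV-cache traffic and cache effects.

```python
def bandwidth_ceiling_tok_s(model_gb, bandwidth_gb_s):
    """Upper bound on tok/s if every token reads all weights once from RAM."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


MODEL_GB = 2.0                  # approximate Q4_K_M file size used in this test
ASSUMED_BANDWIDTH_GB_S = 40.0   # ASSUMPTION: placeholder value, not measured here

ceiling = bandwidth_ceiling_tok_s(MODEL_GB, ASSUMED_BANDWIDTH_GB_S)
print(f"ceiling ~ {ceiling:.1f} tok/s")
```

Under that assumed bandwidth, the measured 13.0 tok/s peak would sit at about 65% of the ceiling. Substitute a bandwidth figure measured on your own machine before reading anything into the ratio.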

Hyper-Threading (Intel's SMT) lets each physical core present itself as two logical cores to the OS. But those two logical cores share the same execution units, the same L1/L2 cache, and — the part that matters here — the same memory bandwidth. Once every physical core already has a thread, adding more threads via hyperthreading does not give you more bandwidth. It splits the same bandwidth across competing threads and adds scheduling overhead on top.

The practical consequence: tools and OS defaults that set --threads to your logical core count are wrong for LLM inference. Your physical core count (or slightly above) is almost always the sweet spot. This is the same memory-bandwidth ceiling that governs our MoE-on-CPU benchmark and shapes what tokens-per-second to expect from local models generally.
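Finding the physical core count is the first step. Python's `os.cpu_count()` reports logical cores, so the sketch below counts unique (physical id, core id) pairs in `/proc/cpuinfo`-style text instead. It assumes a Linux system that exposes those fields; on ARM boards and some virtual machines they are missing, in which case the function returns `None`. The function names are ours, for illustration only.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text.

    Returns None when the fields are missing (common on ARM and some VMs).
    """
    cores = set()
    physical_id = core_id = None
    # The trailing "" guarantees the last record is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical-CPU record.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return len(cores) or None


def detect_cores(path="/proc/cpuinfo"):
    """Return (physical, logical); physical is None if it cannot be determined."""
    logical = os.cpu_count()
    try:
        with open(path) as f:
            physical = count_physical_cores(f.read())
    except OSError:
        physical = None
    return physical, logical


if __name__ == "__main__":
    print(detect_cores())
```

On Windows or macOS the file does not exist, so `detect_cores()` falls back to `None` for the physical count; use the OS's own tools there.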

How it works — physical cores vs hyperthreads

The pattern plays out on any CPU thread sweep: throughput climbs steeply while physical cores are being filled — that's real bandwidth scaling — then flattens or dips the moment you cross into hyperthread territory, where threads compete for resources they already share.

[Figure: core occupancy at three thread counts]
Few threads (2t): 9.2 tok/s. 2 cores active, 4 idle; bandwidth going to waste.
Sweet spot (8t, peak): 13.0 tok/s. 6 cores active, 2 share a hyperthread; bandwidth well utilised.
Too many threads (12t): 12.7 tok/s. 6 cores, 2 threads each; no new bandwidth, and scheduling overhead eats the marginal gain.

Each green box is a physical CPU core. The darker stripe inside a box means a hyperthread is co-occupying that core's execution resources and memory bandwidth. At 8 threads (the measured peak) most cores get one dedicated thread and only two are sharing. At 12 threads every core is double-booked, and throughput drops back.

The setup

We kept the methodology deliberately boring so the numbers are comparable across the series: one small model, CPU-only inference, thread count swept across the full logical-core range. Both the compute threads (-t) and the batch threads (-tb) were set to the same value at each step. Throughput is mean generation tok/s over the shared 12-prompt suite, 256-token output, greedy decoding (temperature 0).
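To make the "same value for -t and -tb at each step" rule concrete, here is a minimal sketch that builds one CPU-only command line per thread count. It only constructs argument lists; it does not launch the server or measure anything, and it is not the series' `scripts/bench.py`, which is not shown here. The executable name and `model.gguf` path are placeholders.

```python
def server_command(exe, model_path, threads):
    """Build a CPU-only llama-server argument list for one sweep step."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return [
        exe,
        "-m", model_path,
        "-t", str(threads),    # compute (generation) threads
        "-tb", str(threads),   # batch threads, kept equal to -t
        "-ngl", "0",           # no layers offloaded to a GPU
        "--device", "none",
    ]


THREADS_TESTED = [1, 2, 4, 6, 8, 12]  # the thread counts used in this test

# Placeholder executable and model path; adjust to your install.
commands = [server_command("llama-server", "model.gguf", t) for t in THREADS_TESTED]
for cmd in commands:
    print(" ".join(cmd))
```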

Component	What we used
CPU	Intel Core i7-8700 — 6 physical cores, 12 logical (Hyper-Threading on), 3.20 GHz
Model	Qwen2.5-Coder-3B-Instruct-Q4_K_M (~2.0 GB)
Runtime	llama.cpp, CPU-only (`-ngl 0 --device none`)
Threads tested	1, 2, 4, 6, 8, 12
Prompt suite	12 prompts · code / math / reasoning / summarization / chat · 256-token output each

No GPU is involved — this is pure CPU inference, the regime most people hit on a laptop or a home server without a discrete card. If you're deciding whether you even need a GPU, our notes on how much VRAM it takes to run an LLM cover the other side of that tradeoff.

Reading the curve

The shape tells the expected story — with one small twist worth flagging honestly.

Throughput climbs sharply from 1 to 2 threads (5.64 → 9.15 tok/s, +62%) and keeps rising through 4 threads (11.76 tok/s). Then it does something slightly odd: 6 threads, exactly the physical core count, is marginally slower than 4 (11.57 vs 11.76 tok/s, about 1.6%). A plausible explanation is that all six cores are contending for the memory controller at once, so the controller, not the cores, is the bottleneck at that point.

Then 8 threads climbs to the actual measured peak, 13.0 tok/s. A modest number of hyperthreads can help here: when one logical core stalls waiting on memory, the other can keep the execution units busy, so a small amount of HT hides latency and comes out net-positive.

Finally, 12 threads — all logical cores fully double-booked — drops back to 12.69 tok/s, a 2.4% regression from 8. The "use all cores" default is measurably wrong on this machine.

What we learned

The physical core count is the inflection point, not the exact peak. On the i7-8700, the measured optimum was 8 threads, two above the 6 physical cores. A little hyperthreading helped because it let cores hide memory-access latency during stalls. But the benefit runs out quickly: beyond 8, adding threads to an already-saturated memory bus made scheduling overhead outweigh any latency-hiding benefit, and tok/s fell. The rule of thumb isn't "stop at physical cores"; it's "start at physical cores and try +2, then stop well before logical cores."

The biggest gains live entirely inside the physical-core zone. Going from 1 thread to 6 took throughput from 5.64 to 11.57 tok/s — a 2.05× gain. The most you could squeeze beyond that, all the way to the 8-thread peak, was another ~12% to reach 13.0 tok/s. Everything interesting happens before the hyperthread boundary; everything after is marginal or negative.
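One way to see how quickly the returns diminish is to divide each speedup by the number of threads that produced it. The sketch below does that for the measured values; "per-thread efficiency" is our label for that ratio here, not a figure reported elsewhere in the series.

```python
# Mean generation throughput (tok/s) by thread count, from the results table.
RESULTS = {1: 5.64, 2: 9.15, 4: 11.76, 6: 11.57, 8: 13.0, 12: 12.69}


def scaling_table(results):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    base = results[1]
    rows = []
    for threads in sorted(results):
        speedup = results[threads] / base
        # Efficiency: share of ideal linear scaling actually achieved.
        rows.append((threads, speedup, speedup / threads))
    return rows


rows = scaling_table(RESULTS)
for threads, speedup, efficiency in rows:
    print(f"{threads:>2} threads: {speedup:.2f}x speedup, "
          f"{efficiency:.0%} per-thread efficiency")
```

Efficiency falls at every step of the sweep, which is what a shared memory-bandwidth ceiling would predict: each added thread contributes less than the one before it.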

Using all logical cores is reliably sub-optimal for this workload. Running at 12 threads, the OS default and many tool defaults, was slower than 8, consistently, not just on one prompt. The gap is small (2.4%) but it always points the wrong way; there was no scenario where 12 beat 8 here. And the penalty plausibly grows with how far "use everything" overshoots: on an 8P/16L or 12P/24L chip, defaulting to all logical cores may leave more on the table than it did on this 6P/12L part.

Practical takeaway: set --threads to your physical core count as a starting baseline, then try +2 to see whether a small HT bonus materializes on your CPU. Never use the logical core count as your default. On an i7-8700 (6 physical / 12 logical) the measured optimum was 8 threads; on most modern desktop CPUs we expect the answer to land near physical cores ± 2.
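The takeaway can be written down as a tiny helper: propose the physical core count and physical + 2 (capped at the logical count), measure both, and keep the faster one. This is a sketch of the rule of thumb rather than a tuned search; the function names are ours, and the `measured` values are the 6-thread and 8-thread results from this test.

```python
def candidate_thread_counts(physical, logical, step=2):
    """Thread counts to try first: physical cores, then physical + step."""
    if not 1 <= physical <= logical:
        raise ValueError("expected 1 <= physical <= logical")
    # The set removes the duplicate when physical + step is capped at physical.
    return sorted({physical, min(physical + step, logical)})


def best_thread_count(measured):
    """Pick the thread count with the highest measured tok/s."""
    return max(measured, key=measured.get)


candidates = candidate_thread_counts(physical=6, logical=12)
# tok/s for those two settings, taken from this test's i7-8700 results.
measured = {6: 11.57, 8: 13.0}
print(candidates, "->", best_thread_count(measured))
```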

It's worth being clear about scope. This is one model, one CPU, one llama.cpp build. The shape of the curve — climb through physical cores, flatten or dip into hyperthreads — is general, because the underlying memory-bandwidth ceiling is general. But the exact peak shifts with CPU architecture, memory speed, BIOS power settings, OS scheduler, and llama.cpp version. The point isn't "always use 8 threads." It's that the right number is close to your physical core count, and you can find it in about two minutes by sweeping it yourself. If you're just getting started with local models, our guide on how to run an LLM locally and the broader complete guide to running local AI are good companions to this tuning.

Reproduce it — and send us your numbers

Everything here runs on a stock llama.cpp release with a public Qwen2.5-Coder GGUF. To sweep thread counts on your own CPU:

# drop Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf into your models dir first
export SPECBENCH_MODELS_DIR=/path/to/models
export SPECBENCH_VULKAN_EXE=/path/to/llama-server
export SPECBENCH_MODEL_BASE=Qwen2.5-Coder-3B-Instruct   # default; change to match your file

python scripts/bench.py
python scripts/aggregate.py && python scripts/inject.py

This is where you come in. The whole point of this series is to build a community map of where the thread-scaling peak actually lands across real hardware. We have a 6-core Coffee Lake data point; we want yours. A modern 8P/16L desktop chip? A 16-core workstation? An Apple Silicon laptop? Each one sharpens the picture of how far past physical cores the sweet spot really sits.

Contributing takes one run and one pull request:

python scripts/make_submission.py --name "my-cpu-label"
# then open a PR adding results/community/my-cpu-label.json

If "use all your cores" turned out to be the wrong default on your machine too — or if your CPU behaved differently — that's exactly the kind of result worth sharing.
