InventiveHQ Lab

How Small Can a Local LLM Get Before It Can't Reason?

We swept Qwen2.5-Coder from 0.5B to 14B on one GPU. The graded pass-rate climbs from 1 of 10 to 6 of 10, but you need about 7B before a model can actually chain reasoning steps.

By InventiveHQ Team

Everyone wants the same thing: the smallest, fastest local model that is still actually useful. Smaller means faster inference, less VRAM, cheaper hardware, and shorter waits. But capability does not scale smoothly with size — there is a threshold. Below it a model is blazing fast and effectively cannot solve anything that takes more than one step; above it, it slows down but starts getting things right.

We wanted to find that threshold precisely, so we swept Qwen2.5-Coder-Instruct across five sizes — 0.5B, 1.5B, 3B, 7B, and 14B — on a single RTX 5060 Ti, measuring both generation speed and a graded reasoning score for each. The result is a sharper knee than we expected. Everything below is reproducible on a stock llama.cpp release, and all of our raw data and the harness are open source.

The results, up front

Pass-rate climbs from 1 of 10 at 0.5B to 6 of 10 at 14B — but every model below 7B is stuck at 0 to 1 of 10, no matter how fast it runs. The 7B is the first size that can actually chain reasoning steps. Here is the whole sweep in one table:

Model   File size   VRAM       Speed       Pass-rate
0.5B    531 MB      729 MB     381 tok/s   1 / 10
1.5B    986 MB      1,259 MB   232 tok/s   0 / 10
3B      1,930 MB    2,205 MB   157 tok/s   1 / 10
7B      4,458 MB    4,457 MB   87 tok/s    4 / 10
14B     8,988 MB    9,179 MB   43 tok/s    6 / 10

The short version: speed is real but cheap, and below about 7 billion parameters it buys you a model that answers wrong very quickly. The quality knee is a cliff, not a slope — and once you internalize where it sits, picking a model for your VRAM budget gets a lot simpler.

What we measured — and what the score means

We ran all five checkpoints through the same harness on the same hardware, collecting two numbers per model:

  • Generation speed (tok/s) — from the shared 12-prompt suite: greedy decoding (temperature 0), 256 output tokens, the server's own predicted_per_second.
  • Pass-rate — a graded set of 10 multi-step math and reasoning problems with known integer answers, scored by exact match on the last integer in the reply. No partial credit.
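The exact-match rule in the second bullet can be sketched in a few lines of Python. This is a minimal illustration, not necessarily how the harness grades: the function names `last_integer` and `grade` are hypothetical, and the handling of thousands separators and minus signs is an assumption about reasonable behavior.

```python
import re


def last_integer(reply: str):
    """Return the last integer that appears in a reply, or None if there is none."""
    # Assumption: treat "1,259" as one number by dropping thousands separators.
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", reply)
    # Optional minus sign, but not one glued to a preceding digit (as in "3-5").
    matches = re.findall(r"(?<!\d)-?\d+", cleaned)
    return int(matches[-1]) if matches else None


def grade(reply: str, expected: int) -> bool:
    """Exact match on the last integer, with no partial credit."""
    return last_integer(reply) == expected


if __name__ == "__main__":
    print(grade("So 120 - 72 = 48, then 48 - 25 = 23 apples remain.", 23))
    print(grade("I think the answer is 24.", 23))
```

One consequence of grading this way: a reply that ends with "23.0" would be read as 0 under this sketch, so a correct chain of reasoning can still score as a miss if the final number is formatted unexpectedly.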

One honest caveat up front, because it shapes how you should read everything that follows:

The pass-rate is a reasoning proxy, not executed code. Running untrusted model-generated code safely inside a benchmark harness is non-trivial, so we approximate with a math and word-problem set. The 10 questions each require chaining two to four arithmetic or logical steps — tracking diminishing inventory, computing compound rates, solving simultaneous equations. The model's answer is graded by exact-answer matching only; nothing it writes is ever run. The sample is small, so treat individual scores as signals rather than precise measurements. But the 0-to-1-of-10 pattern across three consecutive model sizes is consistent enough to be a real signal, not noise.

In other words, this measures whether a model can hold a chain of steps in its head and come out with the right number — the exact capability that separates a coding assistant from a glorified autocomplete. It does not measure syntax, style, or whether code compiles. Keep that scope in mind.

Here is the whole sweep on one chart — pass-rate rising while speed falls:

[Chart: pass-rate (left axis, 0% to 100%) and generation speed in tok/s (right axis, 0 to 400) plotted against model size from 0.5B to 14B, with the quality knee marked at 7B. Qwen2.5-Coder-Instruct family, RTX 5060 Ti, CUDA. Values match the table above.]

The green line is pass-rate (left axis); the teal line is generation speed (right axis). The dashed line marks the 7B quality knee — the first size where pass-rate reaches 4 of 10. Everything to its left is 0 to 1 of 10 regardless of how fast it runs.

The setup

We kept the methodology boring so the numbers are comparable: a single GPU with all layers resident, greedy decoding, 256 tokens per prompt, and the same harness for every size.

Component    What we used
GPU          NVIDIA RTX 5060 Ti (Blackwell, 16 GB, 2025)
Runtime      llama.cpp (CUDA build)
Models       Qwen2.5-Coder-Instruct: 0.5B (Q8_0), 1.5B / 3B (Q4_K_M), 7B (Q4_K_S), 14B (Q4_K_M)
Speed test   Shared 12-prompt suite, greedy (temp 0), 256 tokens; server predicted_per_second
Graded set   10 multi-step math/reasoning problems, exact integer match on the last number in the reply
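For readers who want to see where a number like predicted_per_second comes from, here is a minimal sketch of one timed request against a running llama.cpp server. It assumes the server is listening on its default local address and that the reply carries a `timings` object, as current llama.cpp server builds do; the helper names are hypothetical and this is not the benchmark harness itself.

```python
import json
import urllib.request


def build_payload(prompt: str, n_predict: int = 256) -> dict:
    """Request body for greedy decoding with a fixed output length."""
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0}


def tokens_per_second(response: dict) -> float:
    """Pull the server-reported generation speed out of a completion reply."""
    return float(response["timings"]["predicted_per_second"])


def measure(prompt: str, base_url: str = "http://127.0.0.1:8080") -> float:
    """Send one prompt to /completion and return the reported tok/s.

    base_url is an assumption (the llama.cpp server default); adjust as needed.
    """
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Averaging `measure(...)` over a fixed prompt list would give a per-model figure comparable in spirit to the Speed column, though the exact prompts and averaging used in the sweep live in the repo.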

VRAM is the nvidia-smi used-memory delta on model load, measured on an otherwise-idle GPU, so treat it as approximate. (For more on how quantization changes that footprint, see our explainer on LLM quantization.)
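The used-memory delta described above can be approximated as follows. This is a sketch under stated assumptions: it shells out to `nvidia-smi` with its standard query flags, the helper names are hypothetical, and the caller is responsible for sampling once before and once after the model loads. Note that nvidia-smi reports this field in MiB.

```python
import subprocess


def parse_used_mib(smi_output: str) -> int:
    """Parse `memory.used` output (header and units stripped) into an integer."""
    return int(smi_output.strip().splitlines()[0])


def query_used_mib(gpu_index: int = 0) -> int:
    """Ask nvidia-smi how much memory is currently in use on one GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_used_mib(result.stdout)


def load_delta_mib(before: int, after: int) -> int:
    """Memory attributed to the model: used-after-load minus used-before-load."""
    return after - before
```

Typical use would be `before = query_used_mib()`, start the server and wait for the model to finish loading, then `load_delta_mib(before, query_used_mib())`. Anything else touching the GPU in between ends up in the delta, which is why the measurement is only approximate.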

Reading the curve

The numbers tell a sharp story. At 0.5B, generation flies at 381 tok/s, but the model passes just 1 of 10 reasoning problems, most likely a lucky guess on a 10-question set. The 1.5B model has three times the parameters and scores 0 of 10. The 3B model runs at 157 tok/s, still about 3.7 times the speed of the 14B, and also manages just 1 of 10.

Then something changes at 7B: 4 of 10 correct, the first size that completes a meaningful share of the multi-step chains. The 14B reaches 6 of 10 at 43 tok/s, occupying just under 9.2 GB of VRAM.

The cost of climbing that pass-rate curve is steep. Speed falls from 381 tok/s to 43 tok/s — an 8.9x penalty. File size grows from 531 MB to 8,988 MB — 17x larger. The jump from 7B to 14B alone costs 2x in speed and doubles the VRAM footprint (4,457 MB to 9,179 MB).
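The ratios quoted here follow directly from the results table. A short script makes the arithmetic checkable; the dictionary layout and the `ratio` helper are just one convenient way to hold the table and are not taken from the harness.

```python
# Measured values copied from the results table.
ROWS = {
    "0.5B": {"file_mb": 531, "vram_mb": 729, "tok_s": 381},
    "1.5B": {"file_mb": 986, "vram_mb": 1259, "tok_s": 232},
    "3B": {"file_mb": 1930, "vram_mb": 2205, "tok_s": 157},
    "7B": {"file_mb": 4458, "vram_mb": 4457, "tok_s": 87},
    "14B": {"file_mb": 8988, "vram_mb": 9179, "tok_s": 43},
}


def ratio(numerator: float, denominator: float) -> float:
    """Ratio rounded to one decimal place."""
    return round(numerator / denominator, 1)


speed_penalty = ratio(ROWS["0.5B"]["tok_s"], ROWS["14B"]["tok_s"])      # 8.9
file_growth = ratio(ROWS["14B"]["file_mb"], ROWS["0.5B"]["file_mb"])    # 16.9
speed_7b_vs_14b = ratio(ROWS["7B"]["tok_s"], ROWS["14B"]["tok_s"])      # 2.0
vram_14b_vs_7b = ratio(ROWS["14B"]["vram_mb"], ROWS["7B"]["vram_mb"])   # 2.1

if __name__ == "__main__":
    print(speed_penalty, file_growth, speed_7b_vs_14b, vram_14b_vs_7b)
```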

A note on that 1.5B dip to 0/10: it looks worse than the 0.5B, but with only 10 questions the signal is noisy — the 0.5B's single correct answer may itself be lucky. The consistent signal is that all three models below 7B cluster at 0 to 1 of 10 regardless of ordering. None of them can chain steps reliably.
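One way to put rough numbers on "noisy" is a confidence interval on each pass-rate. The sketch below uses the Wilson score interval, a standard method for small samples that the original analysis did not use. Pooling the three sub-7B models into one group, and the 7B and 14B into another, is our simplifying assumption for illustration, since those are different models and not repeated trials of one.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a pass-rate of passes/total."""
    p = passes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for label, passes, total in [
        ("0.5B", 1, 10),
        ("1.5B", 0, 10),
        ("3B", 1, 10),
        ("7B", 4, 10),
        ("14B", 6, 10),
        ("pooled below 7B", 2, 30),
        ("pooled 7B and 14B", 10, 20),
    ]:
        low, high = wilson_interval(passes, total)
        print(f"{label:>18}: {passes}/{total}  [{low:.2f}, {high:.2f}]")
```

Under these assumptions the single-model intervals are wide and overlap one another, which is why the ordering among the small models should not be read closely. The pooled sub-7B interval, by contrast, sits entirely below the pooled 7B-and-14B interval, which is the clustering signal described above.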

What we learned

Models at or below 3B cannot do multi-step reasoning

Every model under 7B scored 0 or 1 out of 10. The graded set is not exotic — questions like "a store sold 3/5 of 120 apples in the morning and 25 more in the afternoon; how many remain?" require tracking two steps. Sub-7B models fail on these consistently, not occasionally. Speed does not compensate: the 0.5B runs at 381 tok/s and still cannot get past 1/10. Fast wrong answers are still wrong.
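To make "two steps" concrete, here is the apple question worked through as code. It reads "25 more" as 25 additional apples sold in the afternoon, which is our interpretation, and the final value is our own computation, not an answer key taken from the graded set.

```python
total_apples = 120

# Step 1: the morning sale is 3/5 of the starting stock.
sold_morning = total_apples * 3 // 5
left_after_morning = total_apples - sold_morning

# Step 2: subtract the afternoon sale from what was left, not from the original total.
sold_afternoon = 25
remaining = left_after_morning - sold_afternoon

if __name__ == "__main__":
    print(sold_morning, left_after_morning, remaining)
```

The intermediate value `left_after_morning` is the state a model has to carry forward; losing track of it is the failure mode this kind of question probes.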

The knee is at 7B — and it is a cliff, not a slope

The jump from 3B (1/10) to 7B (4/10) is the largest qualitative change in the sweep. At 7B, the model tracks intermediate state through a chain of reasoning steps — something the smaller models fail to do consistently. This is not "a little better"; it is the difference between a model that basically cannot reason and one that actually solves problems. If you need multi-step reasoning at all, 7B is the floor, not one option among equals.

Even 14B isn't perfect — and this is a small proxy

The 14B model scores 6 of 10 — the best result, but it still fails on 4 questions. Some failures may be genuinely hard problems; some may be prompt sensitivity on a small set. The 10-question graded set is a signal, not a full benchmark. A real code-execution sandbox with a larger, diverse suite would give a more complete picture. That said, this experiment answers the core question cleanly: models below 7B cannot reliably chain reasoning steps, and the proxy is consistent enough to trust that finding.

Speed and size costs are steep and non-linear

From 0.5B to 14B, speed falls 8.9x and file size grows 17x. The non-linearity matters: the 3B file is about 3.6 times the size of the 0.5B file (1,930 MB versus 531 MB, even though the 0.5B uses the larger Q8_0 quant) but no better on pass-rate. The 14B is double the file and VRAM of the 7B for two more correct answers (4/10 to 6/10, a 50% relative improvement). There is no cheap step between 3B and 7B in this family: you commit to the full size jump or you gain nothing.

Picking a model for your VRAM budget

This is the part that actually changes what you download. Map your card's free VRAM onto the knee:

  • Under ~4 GB VRAM (or no GPU): you are limited to sub-7B models. That is fine for single-step work — autocomplete, short lookups, simple code completion, summarization where precision matters less — but do not ask these models to reason through anything multi-step. A 0.5B at 381 tok/s is a fast, cheap autocomplete engine, not a problem-solver.
  • ~6 GB VRAM: you clear the floor. The 7B (Q4_K_S) used about 4.5 GB here, leaving room for context. This is the smallest setup that genuinely reasons: slower at 87 tok/s, but it actually solves problems. For most people this is the sweet spot, and it is exactly where a model like Qwen2.5-Coder 7B runs comfortably on Ollama.
  • 12–16 GB VRAM: the 14B (Q4_K_M, ~9.2 GB) fits with headroom and buys you the extra reliability (6/10 vs 4/10) at another 2x speed penalty. Worth it when correctness matters more than latency.
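The budget tiers above can be expressed as a small lookup over the measured footprints. This is a sketch, not a sizing tool: `pick_model` is a hypothetical helper, and the 1,500 MB default headroom for context and runtime overhead is an assumed figure, not something measured in this experiment.

```python
# (model, measured VRAM in MB, passes out of 10), ordered smallest first.
MEASURED = [
    ("0.5B", 729, 1),
    ("1.5B", 1259, 0),
    ("3B", 2205, 1),
    ("7B", 4457, 4),
    ("14B", 9179, 6),
]


def pick_model(free_vram_mb: int, min_passes: int = 0, headroom_mb: int = 1500):
    """Return the smallest measured model that meets the pass bar and fits.

    headroom_mb is an assumed allowance for context and overhead.
    Returns None when nothing both fits and clears the bar.
    """
    for name, vram_mb, passes in MEASURED:
        if passes >= min_passes and vram_mb + headroom_mb <= free_vram_mb:
            return name
    return None


if __name__ == "__main__":
    print(pick_model(6000, min_passes=4))   # 7B
    print(pick_model(4000, min_passes=4))   # None
    print(pick_model(16000, min_passes=6))  # 14B
```

Returning the smallest qualifying model, instead of the largest that fits, mirrors the rule of thumb in this section: size up only as far as the task's reasoning bar requires.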

If you are still deciding what your hardware can run at all, our guides on what tokens-per-second to expect locally and running local AI end to end cover the sizing math in more depth. The one-line rule from this experiment: pick the smallest model that clears your task's reasoning bar — and that bar sits at 7B.

The smaller models are not useless. They are fast and capable for single-step tasks. Just do not ask them to reason through a problem that takes more than one step, because the data is unambiguous: below 7B, they can't.

Reproduce it — and send us your numbers

Everything here runs on a stock llama.cpp release with public Qwen2.5-Coder-Instruct GGUF files. Download the checkpoints into your models directory, then run the sweep — missing models are skipped automatically:

# run the model-size sweep — missing models are skipped automatically
python scripts/bench.py --backend cuda --gpu 0
python scripts/aggregate.py   # writes results/data.json
python scripts/inject.py      # bakes data into site/index.html

This is experiment #11 in our open-source local-LLM benchmark series, and the whole point is to build a community map of where the reasoning floor really sits across different model families and hardware. We have an RTX 5060 Ti and Qwen2.5-Coder; we want your card and your model family. A different quant, a Llama or Mistral sweep, an Apple Silicon run — every data point sharpens the picture. The benchmark, the graded question set, and our raw results live in the smallest-model-that-can-code experiment; submissions are licensed CC BY 4.0 and folded into a growing comparison.


New to running models on your own hardware? Start with our guides on running an LLM locally, what local performance to expect, and LLM quantization.

Frequently Asked Questions

What is the smallest local model that can actually reason?

In our sweep of Qwen2.5-Coder-Instruct from 0.5B to 14B, 7B was the first size that passed a meaningful share of the multi-step reasoning problems: 4 of 10 on our graded set, versus 0 to 1 of 10 for every model below it. The 14B did better still at 6 of 10. The practical floor for any task that requires chaining two or more steps is about 7 billion parameters. Below that, models are fast but cannot track intermediate state through a problem.

Are smaller models useless then?

No. Sub-7B models are genuinely fast (the 0.5B ran at 381 tok/s on our RTX 5060 Ti), and they are perfectly capable for single-step work: autocomplete, short lookups, simple code completion, and summarization where precision matters less. They only fall apart when you ask them to reason through a problem that takes more than one step. Match the model to the job and a small model is a great tool; ask it to plan or solve, and it will fail fast and confidently.

How did you measure 'can it code' without running the code?

We did not execute model-generated code. Running untrusted code safely inside a benchmark harness is non-trivial, so we used a reasoning proxy instead. The graded set is 10 multi-step math and word problems with known integer answers, and each reply is scored by exact match on the last integer it produces. No partial credit, no code execution. It measures whether a model can chain arithmetic and logical steps, which is the capability that separates a useful coding model from a fast autocomplete engine. A real code-execution sandbox is a planned future addition.

Why did the 1.5B model score worse (0/10) than the 0.5B (1/10)?

With only 10 questions, individual scores are noisy. The 0.5B's single correct answer was probably a lucky guess, and the 1.5B's zero is within the same noise band. The signal that matters is not the exact ordering — it is that all three models below 7B cluster at 0 to 1 of 10. None of them can chain reasoning steps reliably, regardless of which one happens to luck into a point.

How much VRAM do I need to cross the 7B reasoning floor?

The 7B model (Q4_K_S) used about 4.5 GB of VRAM in our test, so a 6 GB card clears it with room for context. The 14B (Q4_K_M) used about 9.2 GB, which fits comfortably on a 12 GB or 16 GB card. If you are stuck under roughly 4 GB of VRAM, you are limited to sub-7B models — fine for single-step tasks, but not for reasoning.

Does this mean bigger is always better?

Up to a point. The cost of climbing the pass-rate curve is steep and non-linear: from 0.5B to 14B, speed fell 8.9x and file size grew 17x. The 3B file is about 3.6 times the size of the 0.5B file but no better on pass-rate, and the 14B is double the size and VRAM of the 7B for a jump from 4/10 to 6/10. There is no cheap step between 3B and 7B: you commit to the full size jump or you gain nothing. We did not test anything above 14B, so this sweep cannot say how returns behave beyond that point. Within the range we measured, pick the smallest model that clears your task's reasoning bar.

Can I reproduce this on my own hardware?

Yes. Everything runs on a stock llama.cpp release with public Qwen2.5-Coder-Instruct GGUF files. Download the checkpoints, run the sweep script, and the harness skips any models you don't have. The benchmark code, the graded question set, and our raw results are all in the open-source repo, and we collect community results from other hardware by pull request.
