Everyone wants the same thing: the smallest, fastest local model that is still actually useful. Smaller means faster inference, less VRAM, cheaper hardware, and shorter waits. But capability does not scale smoothly with size — there is a threshold. Below it a model is blazing fast and effectively cannot solve anything that takes more than one step; above it, it slows down but starts getting things right.
We wanted to find that threshold precisely, so we swept Qwen2.5-Coder-Instruct across five sizes — 0.5B, 1.5B, 3B, 7B, and 14B — on a single RTX 5060 Ti, measuring both generation speed and a graded reasoning score for each. The result is a sharper knee than we expected. Everything below is reproducible on a stock llama.cpp release, and all of our raw data and the harness are open source.
The results, up front
Pass-rate climbs from 1 of 10 at 0.5B to 6 of 10 at 14B — but every model below 7B is stuck at 0 to 1 of 10, no matter how fast it runs. The 7B is the first size that can actually chain reasoning steps. Here is the whole sweep in one table:
| Model | File size | VRAM | Speed | Pass-rate |
|---|---|---|---|---|
| 0.5B | 531 MB | 729 MB | 381 tok/s | 1 / 10 |
| 1.5B | 986 MB | 1,259 MB | 232 tok/s | 0 / 10 |
| 3B | 1,930 MB | 2,205 MB | 157 tok/s | 1 / 10 |
| 7B | 4,458 MB | 4,457 MB | 87 tok/s | 4 / 10 |
| 14B | 8,988 MB | 9,179 MB | 43 tok/s | 6 / 10 |
The short version: speed is real but cheap, and below about 7 billion parameters it buys you a model that answers wrong very quickly. The quality knee is a cliff, not a slope — and once you internalize where it sits, picking a model for your VRAM budget gets a lot simpler.
What we measured — and what the score means
We ran all five checkpoints through the same harness on the same hardware, collecting two numbers per model:
- Generation speed (tok/s): from the shared 12-prompt suite, using greedy decoding (temperature 0), 256 output tokens, and the server's own predicted_per_second.
- Pass-rate: a graded set of 10 multi-step math and reasoning problems with known integer answers, scored by exact match on the last integer in the reply. No partial credit.
One honest caveat up front, because it shapes how you should read everything that follows:
The pass-rate is a reasoning proxy, not executed code. Running untrusted model-generated code safely inside a benchmark harness is non-trivial, so we approximate with a math and word-problem set. The 10 questions each require chaining two to four arithmetic or logical steps — tracking diminishing inventory, computing compound rates, solving simultaneous equations. The model's answer is graded by exact-answer matching only; nothing it writes is ever run. The sample is small, so treat individual scores as signals rather than precise measurements. But the 0-to-1-of-10 pattern across three consecutive model sizes is consistent enough to be a real signal, not noise.
In other words, this measures whether a model can hold a chain of steps in its head and come out with the right number — the exact capability that separates a coding assistant from a glorified autocomplete. It does not measure syntax, style, or whether code compiles. Keep that scope in mind.
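The exact-match grading described above is simple enough to sketch in a few lines. This is an illustration, not the harness's actual code: the names `last_integer` and `grade` are hypothetical, and treating commas as thousands separators is an assumption about how a reply such as "1,250" should be read.

```python
import re

# Optional minus sign, then digits; commas are allowed so "1,250" is one token.
_INTEGER = re.compile(r"-?\d[\d,]*")


def last_integer(reply):
    """Return the last integer that appears in a reply, or None if there is none."""
    matches = _INTEGER.findall(reply)
    if not matches:
        return None
    # Assumption: commas are thousands separators and can be dropped.
    return int(matches[-1].replace(",", ""))


def grade(reply, expected):
    """Exact match on the last integer in the reply; no partial credit."""
    return last_integer(reply) == expected


reply = "Morning: 3/5 of 120 is 72. Afternoon: 25 more. 120 - 72 - 25 = 23 remain."
print(grade(reply, 23))
```

One consequence of grading this way: a decimal such as "3.5" is read as 5, so the approach only suits question sets whose answers are integers.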
Figure: the whole sweep on one chart, with pass-rate rising while speed falls. The green line is pass-rate (left axis); the teal line is generation speed (right axis). The dashed line marks the 7B quality knee, the first size where pass-rate reaches 4 of 10. Everything to its left is 0 to 1 of 10 regardless of how fast it runs.
The setup
We kept the methodology boring so the numbers are comparable: a single GPU with all layers resident, greedy decoding, 256 tokens per prompt, and the same harness for every size.
| Component | What we used |
|---|---|
| GPU | NVIDIA RTX 5060 Ti (Blackwell, 16 GB, 2025) |
| Runtime | llama.cpp (CUDA build) |
| Models | Qwen2.5-Coder-Instruct: 0.5B (Q8_0), 1.5B / 3B (Q4_K_M), 7B (Q4_K_S), 14B (Q4_K_M) |
| Speed test | Shared 12-prompt suite, greedy (temp 0), 256 tokens — server predicted_per_second |
| Graded set | 10 multi-step math/reasoning problems, exact integer match on the last number in the reply |
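As a rough sketch of how the speed number in the table could be collected, the snippet below posts prompts to a running llama.cpp server and reads `predicted_per_second` from the `timings` object in each reply. It assumes the server's `/completion` endpoint on the default local port. The helper names, and the choice to average across prompts, are hypothetical and not taken from the harness.

```python
import json
import urllib.request


def completion_payload(prompt):
    """Request body for greedy decoding with a fixed 256-token budget."""
    return {"prompt": prompt, "n_predict": 256, "temperature": 0}


def generation_speed(response):
    """Pull the server-reported generation speed (tok/s) out of one response."""
    return response["timings"]["predicted_per_second"]


def mean_speed(responses):
    """Average tok/s across the suite (averaging is an assumption)."""
    speeds = [generation_speed(r) for r in responses]
    return sum(speeds) / len(speeds)


def run_suite(prompts, url="http://127.0.0.1:8080/completion"):
    """POST each prompt to a running server and return the mean tok/s."""
    responses = []
    for prompt in prompts:
        body = json.dumps(completion_payload(prompt)).encode("utf-8")
        request = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as reply:
            responses.append(json.load(reply))
    return mean_speed(responses)
```

Calling `run_suite(["Write a function that reverses a string."])` against a loaded model returns a single tok/s figure. Nothing in the block touches the network until `run_suite` is called.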
VRAM is the nvidia-smi used-memory delta on model load, measured on an otherwise-idle GPU, so treat it as approximate. (For more on how quantization changes that footprint, see our explainer on LLM quantization.)
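A minimal sketch of that delta measurement, assuming `nvidia-smi` is on the PATH and queried through its CSV output. The function names are hypothetical, and nvidia-smi reports memory in MiB.

```python
import subprocess


def parse_used_mib(output):
    """Parse the first line of nvidia-smi's CSV output into an integer (MiB)."""
    return int(output.strip().splitlines()[0])


def used_mib(gpu=0):
    """Ask nvidia-smi how much memory is currently in use on one GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={gpu}",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_used_mib(output)


def load_delta(before_mib, after_mib):
    """Memory attributed to the model: usage after load minus usage before."""
    return after_mib - before_mib
```

In use, call `used_mib()` once on the idle GPU, load the model, call it again, and pass both readings to `load_delta`. Anything else that allocates GPU memory between the two readings ends up in the delta, which is one reason to treat the figure as approximate.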
Reading the curve
The numbers tell a sharp story. At 0.5B, generation flies at 381 tok/s, but the model passes just 1 of 10 reasoning problems, which on a 10-question set may well be a lucky guess. The 1.5B model is three times larger and scores 0 of 10. The 3B model runs at 157 tok/s, still about 3.7 times the speed of the 14B, and also manages just 1 of 10.
Then something changes at 7B: 4 of 10 correct, the first size where the model follows multi-step chains more than once in ten tries. The 14B reaches 6 of 10 at 43 tok/s, occupying just under 9.2 GB of VRAM.
The cost of climbing that pass-rate curve is steep. Speed falls from 381 tok/s to 43 tok/s — an 8.9x penalty. File size grows from 531 MB to 8,988 MB — 17x larger. The jump from 7B to 14B alone costs 2x in speed and doubles the VRAM footprint (4,457 MB to 9,179 MB).
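The ratios quoted above follow directly from the results table. The short check below recomputes them; the `SWEEP` dictionary simply restates the table, and the `ratio` helper is a hypothetical name.

```python
# Figures copied from the results table.
SWEEP = {
    "0.5B": {"file_mb": 531, "vram_mb": 729, "tok_s": 381, "passes": 1},
    "1.5B": {"file_mb": 986, "vram_mb": 1259, "tok_s": 232, "passes": 0},
    "3B": {"file_mb": 1930, "vram_mb": 2205, "tok_s": 157, "passes": 1},
    "7B": {"file_mb": 4458, "vram_mb": 4457, "tok_s": 87, "passes": 4},
    "14B": {"file_mb": 8988, "vram_mb": 9179, "tok_s": 43, "passes": 6},
}


def ratio(metric, numerator, denominator):
    """How many times larger one model's metric is than another's."""
    return SWEEP[numerator][metric] / SWEEP[denominator][metric]


print(f"speed, 0.5B vs 14B: {ratio('tok_s', '0.5B', '14B'):.1f}x")      # 8.9x
print(f"file size, 14B vs 0.5B: {ratio('file_mb', '14B', '0.5B'):.1f}x")  # 16.9x
print(f"speed, 7B vs 14B: {ratio('tok_s', '7B', '14B'):.1f}x")          # 2.0x
print(f"VRAM, 14B vs 7B: {ratio('vram_mb', '14B', '7B'):.1f}x")         # 2.1x
```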
A note on that 1.5B dip to 0/10: it looks worse than the 0.5B, but with only 10 questions the signal is noisy — the 0.5B's single correct answer may itself be lucky. The consistent signal is that all three models below 7B cluster at 0 to 1 of 10 regardless of ordering. None of them can chain steps reliably.
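To put a number on how noisy a 10-question score is, one option is a Wilson score interval around each pass-rate. This is a sketch the original experiment does not include: the `wilson_interval` helper and the 95% level (z = 1.96) are choices made here for illustration.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass-rate of `passes` out of `n`."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Pass counts from the results table, each out of 10 questions.
for name, passes in [("0.5B", 1), ("1.5B", 0), ("3B", 1), ("7B", 4), ("14B", 6)]:
    low, high = wilson_interval(passes, 10)
    print(f"{name}: {passes}/10, 95% interval {low:.2f} to {high:.2f}")
```

The intervals are wide at this sample size, which is why the ordering among the three smallest models should not be over-read.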
What we learned
Models at or below 3B cannot do multi-step reasoning
Every model under 7B scored 0 or 1 out of 10. The graded set is not exotic — questions like "a store sold 3/5 of 120 apples in the morning and 25 more in the afternoon; how many remain?" require tracking two steps. Sub-7B models fail on these consistently, not occasionally. Speed does not compensate: the 0.5B runs at 381 tok/s and still cannot get past 1/10. Fast wrong answers are still wrong.
The knee is at 7B — and it is a cliff, not a slope
The jump from 3B (1/10) to 7B (4/10) is the largest qualitative change in the sweep. At 7B, the model tracks intermediate state through a chain of reasoning steps — something the smaller models fail to do consistently. This is not "a little better"; it is the difference between a model that basically cannot reason and one that actually solves problems. If you need multi-step reasoning at all, 7B is the floor, not one option among equals.
Even 14B isn't perfect — and this is a small proxy
The 14B model scores 6 of 10 — the best result, but it still fails on 4 questions. Some failures may be genuinely hard problems; some may be prompt sensitivity on a small set. The 10-question graded set is a signal, not a full benchmark. A real code-execution sandbox with a larger, diverse suite would give a more complete picture. That said, this experiment answers the core question cleanly: models below 7B cannot reliably chain reasoning steps, and the proxy is consistent enough to trust that finding.
Speed and size costs are steep and non-linear
From 0.5B to 14B, speed falls 8.9x and file size grows 17x. The non-linearity matters: the 3B is about 3.6 times the file size of the 0.5B but no better on pass-rate. The 14B is double the file and VRAM of the 7B for a 50% relative improvement in pass-rate (4/10 to 6/10). There is no cheap step between 3B and 7B: you commit to the full size jump or you gain nothing.
Picking a model for your VRAM budget
This is the part that actually changes what you download. Map your card's free VRAM onto the knee:
- Under ~4 GB VRAM (or no GPU): you are limited to sub-7B models. That is fine for single-step work — autocomplete, short lookups, simple code completion, summarization where precision matters less — but do not ask these models to reason through anything multi-step. A 0.5B at 381 tok/s is a fast, cheap autocomplete engine, not a problem-solver.
- ~6 GB VRAM: you clear the floor. The 7B (Q4_K_S) used about 4.5 GB here, leaving room for context. This is the smallest setup that genuinely reasons: slower at 87 tok/s, but it actually solves problems. For most people this is the sweet spot, and it is exactly where a 7B model like Qwen2.5-Coder runs comfortably on Ollama.
- 12–16 GB VRAM: the 14B (Q4_K_M, ~9.2 GB) fits with headroom and buys you the extra reliability (6/10 vs 4/10) at another 2x speed penalty. Worth it when correctness matters more than latency.
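The budget mapping above can be written as a small lookup. This is a sketch under stated assumptions: the VRAM and pass figures come from the results table, while the `pick_model` name and the 1,500 MB default headroom for context are hypothetical and worth tuning for your own context length.

```python
# (name, VRAM in MB on load, passes out of 10), ordered smallest to largest.
MODELS = [
    ("0.5B", 729, 1),
    ("1.5B", 1259, 0),
    ("3B", 2205, 1),
    ("7B", 4457, 4),
    ("14B", 9179, 6),
]


def pick_model(free_vram_mb, min_passes=0, headroom_mb=1500):
    """Return the smallest model that clears the pass bar and fits in VRAM.

    headroom_mb is an assumed allowance for context, not a measured value.
    Returns None when nothing qualifies.
    """
    for name, vram_mb, passes in MODELS:
        if passes >= min_passes and vram_mb + headroom_mb <= free_vram_mb:
            return name
    return None


print(pick_model(6 * 1024, min_passes=4))   # multi-step work on a 6 GB card
print(pick_model(4 * 1024, min_passes=4))   # same bar, 4 GB card
print(pick_model(16 * 1024, min_passes=6))  # higher bar, 16 GB card
```

Because the list is scanned from smallest to largest, the function returns the fastest model that meets the bar, and `None` signals that the card cannot host one.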
If you are still deciding what your hardware can run at all, our guides on what tokens-per-second to expect locally and running local AI end to end cover the sizing math in more depth. The one-line rule from this experiment: pick the smallest model that clears your task's reasoning bar — and that bar sits at 7B.
The smaller models are not useless. They are fast and capable for single-step tasks. Just do not ask them to reason through a problem that takes more than one step: on this graded set, every model below 7B scored 0 or 1 of 10.
Reproduce it — and send us your numbers
Everything here runs on a stock llama.cpp release with public Qwen2.5-Coder-Instruct GGUF files. Download the checkpoints into your models directory, then run the sweep — missing models are skipped automatically:
```bash
# run the model-size sweep (missing models are skipped automatically)
python scripts/bench.py --backend cuda --gpu 0
python scripts/aggregate.py  # writes results/data.json
python scripts/inject.py     # bakes data into site/index.html
```
This is experiment #11 in our open-source local-LLM benchmark series, and the whole point is to build a community map of where the reasoning floor really sits across different model families and hardware. We have an RTX 5060 Ti and Qwen2.5-Coder; we want your card and your model family. A different quant, a Llama or Mistral sweep, an Apple Silicon run — every data point sharpens the picture. The benchmark, the graded question set, and our raw results live in the smallest-model-that-can-code experiment; submissions are licensed CC BY 4.0 and folded into a growing comparison.
New to running models on your own hardware? Start with our guides on running an LLM locally, what local performance to expect, and LLM quantization.
