In our first local-LLM benchmark, we established the one number that decides whether speculative decoding is worth turning on: the acceptance rate — how often the big target model agrees with the small draft model's guesses. But that experiment left a tempting question hanging. If a higher acceptance rate is good, and a bigger draft model guesses better, then surely a bigger draft model is faster?
So we tested it directly. We fixed the target model — Qwen2.5-Coder-14B-Instruct (Q4_K_M), baseline 42.45 tok/s on an RTX 5060 Ti — and varied only the draft model: 0.5B, 1.5B, 3B. The same family, the same task suite, the same hardware. Just the draft size changing. Everything below is reproducible on a stock llama.cpp release, and the raw data and harness are open source.
The result is the opposite of the intuition: the smallest draft won, and the biggest draft — the one that guessed almost perfectly — finished last.
The results, up front
Here is the whole sweep in one table. The target is fixed; only the draft model changes. Speedup is each draft's throughput divided by the 42.45 tok/s no-speculation baseline.
| Draft model | File size | Throughput | Speedup | Acceptance rate | VRAM used |
|---|---|---|---|---|---|
| 0.5B (Q8_0) | 531 MB | 58.25 tok/s | 1.37× | 26% | 9,769 MB |
| 1.5B (Q4_K_M) | 986 MB | 56.01 tok/s | 1.32× | 40% | 10,295 MB |
| 3B (Q4_K_M) | 1,930 MB | 52.91 tok/s | 1.25× | 87% | 11,244 MB |
Read it twice. Acceptance climbs from 26% to 87% as the draft grows, a huge improvement in guessing accuracy. And net speedup falls the entire way: 1.37× → 1.32× → 1.25×. The 3B draft agrees with the target on almost nine of every ten drafted tokens, and it is still the slowest of the three. The smallest, worst-guessing draft is the fastest overall.
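To make the Speedup column easy to check, here is a minimal sketch that recomputes it from the throughput figures in the table. The names `SWEEP` and `speedup` are ours, chosen for illustration; they are not part of the benchmark harness.

```python
# Recompute the Speedup column from the table: throughput / baseline.
BASELINE_TOK_S = 42.45  # target only, no draft (from the table above)

# (draft label, throughput in tok/s, acceptance rate), copied from the table
SWEEP = [
    ("0.5B Q8_0", 58.25, 0.26),
    ("1.5B Q4_K_M", 56.01, 0.40),
    ("3B Q4_K_M", 52.91, 0.87),
]


def speedup(throughput_tok_s, baseline_tok_s=BASELINE_TOK_S):
    """Net speedup relative to the no-speculation baseline."""
    return throughput_tok_s / baseline_tok_s


for label, tok_s, acceptance in SWEEP:
    print(f"{label:12s} acceptance={acceptance:.0%} speedup={speedup(tok_s):.2f}x")

# The fastest draft and the most accurate draft are different rows.
fastest = max(SWEEP, key=lambda row: row[1])
most_accurate = max(SWEEP, key=lambda row: row[2])
print("fastest:", fastest[0], "| most accurate:", most_accurate[0])
```

Sorting by throughput and by acceptance picks opposite ends of the table, which is the whole result in two lines.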
Why this happens: it's a product, not a sum
The reason traces straight back to the overhead-vs-acceptance tradeoff from experiment #1. Speculative decoding is not free. Every draft round costs something — the draft model has to run before the target can verify. So your total wall-clock time is governed by two factors multiplied together:
- Acceptance — how many tokens you get to keep per expensive target pass. Higher is better.
- Draft cost — how long the draft model takes to propose those tokens. Lower is better.
A bigger draft improves the first number and worsens the second. The question is which one moves faster. On a single consumer GPU at batch size 1, the answer is unambiguous: draft cost rises faster than acceptance pays back.
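One way to put rough numbers on that tradeoff is a toy per-round model. It is a sketch under loud assumptions, not how llama.cpp accounts for time: we assume every round drafts exactly `n_draft` tokens, that the reported acceptance rate is the fraction of drafted tokens kept, that the target's verification pass costs the same as one ordinary target step, and that the target always contributes one token per round. All function names are ours.

```python
# Toy model (assumption, not llama.cpp's accounting):
#   tokens per round = 1 + accept_frac * n_draft
#   cost per round   = 1 + n_draft * draft_cost_ratio   (in target-step units)

def modeled_speedup(accept_frac, n_draft, draft_cost_ratio):
    """Speedup predicted by the toy model."""
    tokens_per_round = 1 + accept_frac * n_draft
    round_cost = 1 + n_draft * draft_cost_ratio
    return tokens_per_round / round_cost


def implied_draft_cost_ratio(accept_frac, n_draft, measured_speedup):
    """Invert the model: what per-token draft cost explains a measured speedup?"""
    tokens_per_round = 1 + accept_frac * n_draft
    return (tokens_per_round / measured_speedup - 1) / n_draft


BASELINE_TOK_S = 42.45
N_DRAFT = 5  # matches the 5-tokens-per-round setting used in the sweep

# (label, acceptance rate, throughput in tok/s) from the results table
SWEEP = [("0.5B", 0.26, 58.25), ("1.5B", 0.40, 56.01), ("3B", 0.87, 52.91)]

implied = {}
for label, acceptance, tok_s in SWEEP:
    measured = tok_s / BASELINE_TOK_S
    implied[label] = implied_draft_cost_ratio(acceptance, N_DRAFT, measured)
    print(f"{label}: implied cost per drafted token = {implied[label]:.3f} target steps")
```

Under these assumptions, tokens kept per round grow from 2.3 to 5.35 between the 0.5B and the 3B, a bit more than 2×, while the implied cost per drafted token grows close to 5×. The implied figure lumps every overhead the model ignores (verification of a multi-token batch, early-stopped drafts) into the draft cost, so treat it as an upper-level accounting aid rather than a measurement.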
Going from 0.5B to 3B is a 6× increase in parameter count. Those extra weights have to be read out of GPU memory on every single draft step, and single-user decoding is bound by memory bandwidth, not compute, exactly as we found measuring the tokens-per-second you can expect locally. Even with the 0.5B stored at Q8_0 and the 3B at Q4_K_M, the 3B file is still roughly 3.6× larger than the 0.5B (1,930 MB vs 531 MB), and it shares the same memory bus as the 14B target. It guesses beautifully. It just takes too long to do it.
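The bandwidth argument can be sketched as a lower bound: if every weight is streamed from VRAM once per decode step, a step can take no less than file size divided by memory bandwidth. The bandwidth value below is our assumption (a nominal spec-sheet figure for this class of card), not something measured in this benchmark; the ratios between drafts do not depend on it.

```python
# Lower bound on one decode step for a bandwidth-bound model:
# time to stream all weights from VRAM once.
ASSUMED_BANDWIDTH_GB_S = 448.0  # assumption: nominal spec figure, not measured here

# File sizes in MB from the article (target is the approximate ~9 GB figure)
SIZES_MB = {"0.5B Q8_0": 531, "1.5B Q4_K_M": 986, "3B Q4_K_M": 1930, "14B target": 9000}


def min_read_ms(size_mb, bandwidth_gb_s=ASSUMED_BANDWIDTH_GB_S):
    """Milliseconds needed to read size_mb megabytes at bandwidth_gb_s GB/s."""
    seconds = (size_mb / 1000) / bandwidth_gb_s
    return seconds * 1000


for label, size_mb in SIZES_MB.items():
    print(f"{label:12s} >= {min_read_ms(size_mb):.2f} ms per step")

# Relative per-step cost of the drafts; the bandwidth cancels out of this ratio.
ratio_3b_vs_05b = SIZES_MB["3B Q4_K_M"] / SIZES_MB["0.5B Q8_0"]
print(f"3B vs 0.5B per-step read cost: {ratio_3b_vs_05b:.1f}x")
```

This bound ignores KV-cache reads, kernel launch overhead, and compute, so real steps are slower than it suggests. It is only meant to show why draft cost scales with bytes on disk rather than with how well the draft guesses.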
The size–accuracy tradeoff, visualized
Each draft size strikes a different balance between acceptance and per-step cost. Read the diagram left to right: acceptance climbs steeply as the draft grows, but so does draft cost — and cost wins. Net speedup peaks at the smallest model.
Figure: Acceptance rate and per-step cost both rise with draft model size, but cost grows faster. The 0.5B wins on net speedup despite having the lowest acceptance rate.
The setup — boring on purpose
We kept the methodology deliberately dull so the only thing changing is the variable under test. The target is fixed: Qwen2.5-Coder-14B-Instruct-Q4_K_M (~9 GB), CUDA backend, GPU 0. We measure a plain no-speculation baseline first (42.45 tok/s), then sweep three drafts, all with the same flags — --spec-type draft-simple --spec-draft-n-max 5 -ngld 999 (5 tokens drafted per round, draft model fully on GPU). Greedy decoding, 256 tokens per prompt, the shared 12-prompt suite spanning code, math, reasoning, summarization, and chat.
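For readers who want to see the sweep as launch commands, here is a hedged sketch that assembles one server command per configuration. The speculation flags are copied verbatim from the paragraph above. Everything else is our assumption: the `llama-server` binary name, the `models` directory, the GGUF filenames, and fully offloading the target with `-ngl 999`. The sketch only builds argument lists; it does not start anything.

```python
from pathlib import Path

MODELS_DIR = Path("models")  # hypothetical location

# Hypothetical filenames; use whatever your downloaded GGUFs are called.
TARGET = MODELS_DIR / "Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf"
DRAFTS = {
    "0.5B": MODELS_DIR / "Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf",
    "1.5B": MODELS_DIR / "Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf",
    "3B": MODELS_DIR / "Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf",
}

# Flags exactly as quoted in the article: 5 drafted tokens per round,
# draft model fully on the GPU.
SPEC_FLAGS = ["--spec-type", "draft-simple", "--spec-draft-n-max", "5", "-ngld", "999"]


def server_command(target, draft=None, binary="llama-server"):
    """Build the argument list for one run; draft=None is the baseline."""
    cmd = [binary, "-m", str(target), "-ngl", "999"]  # -ngl 999 is our assumption
    if draft is not None:
        cmd += ["-md", str(draft), *SPEC_FLAGS]
    return cmd


runs = {"baseline": server_command(TARGET)}
for name, path in DRAFTS.items():
    runs[name] = server_command(TARGET, path)

for name, cmd in runs.items():
    print(name, "->", " ".join(cmd))
```

Keeping the baseline and the draft runs in one function makes it harder to change a flag in one configuration and forget it in another, which is the point of the boring methodology.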
| Component | What we used |
|---|---|
| GPU | RTX 5060 Ti — Blackwell (sm_120), 16 GB |
| Runtime | llama.cpp, CUDA backend |
| Target model | Qwen2.5-Coder-14B-Instruct-Q4_K_M (~9 GB) |
| Baseline | 42.45 tok/s — target only, no draft |
| Drafts swept | 0.5B Q8_0 (531 MB) · 1.5B Q4_K_M (986 MB) · 3B Q4_K_M (1,930 MB) |
A note on the draft quantizations: the 0.5B runs at Q8_0 and the larger two at Q4_K_M, which is the practical way you'd actually deploy each — a 0.5B is small enough to keep at near-full precision, while bigger drafts get quantized harder to stay cheap. (If quantization's quality cost is new to you, our GGUF quantization quality benchmark measures exactly what you give up at each level.)
What the sweep teaches
Acceptance rises steeply, and nonlinearly, with draft size. Going from 0.5B to 3B takes acceptance from 26% to 87%. A bigger same-family draft really does predict the target's tokens far more reliably; the capacity difference is real and large. It is also back-loaded: 0.5B → 1.5B adds 14 points, while 1.5B → 3B adds another 47. Most of the accuracy gain is concentrated at the largest, most expensive draft, where it does you the least good.
Net speedup falls with every step. 1.37× → 1.32× → 1.25×. Each size increase costs roughly 0.05–0.07× of net speedup. You are trading wall-clock time for guessing accuracy you don't need: accepting more tokens per round helps only if drafting them is cheap.
VRAM is not the binding constraint. This is the tell. Even the 3B draft adds only about 1,475 MB over the 0.5B configuration (9,769 MB vs 11,244 MB total). Fitting a bigger draft was never the problem. A modest VRAM footprint still carries a compute-and-bandwidth cost per draft step, and that cost is what erodes the speedup — not memory capacity. If you were sizing this by VRAM headroom alone, you'd reach for the biggest draft that fits and end up slowest.
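The step-by-step figures quoted in this section can be reproduced from the results table with a few subtractions. A minimal sketch follows; `ROWS` and `step_deltas` are illustrative names of ours.

```python
# (label, acceptance in %, net speedup, total VRAM in MB) from the results table
ROWS = [
    ("0.5B", 26, 1.37, 9769),
    ("1.5B", 40, 1.32, 10295),
    ("3B", 87, 1.25, 11244),
]


def step_deltas(rows):
    """Change in each metric when moving up one draft size."""
    deltas = []
    for prev, cur in zip(rows, rows[1:]):
        deltas.append({
            "step": f"{prev[0]} -> {cur[0]}",
            "acceptance_points": cur[1] - prev[1],
            "speedup_change": round(cur[2] - prev[2], 2),
            "vram_added_mb": cur[3] - prev[3],
        })
    return deltas


DELTAS = step_deltas(ROWS)
for d in DELTAS:
    print(d)

total_vram_added = sum(d["vram_added_mb"] for d in DELTAS)
print("VRAM added from 0.5B to 3B:", total_vram_added, "MB")
```

Laid out this way, the step with the largest acceptance gain is also the step with the largest speedup loss, while the VRAM column stays well inside a 16 GB card throughout.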
The 0.5B is the sweet spot here — not because it guesses well. It clearly doesn't; 26% is the worst of the three. It wins because it is cheap enough that even 26% acceptance is meaningful leverage on a 14B target that only manages 42.45 tok/s on its own. A 531 MB draft is nearly free next to a 9 GB target, and that ratio — slow target, fast draft — is what determines the winner.
The rule: pick the smallest competent draft
If you want one line to take away, it is this:
Pick the smallest draft model that keeps acceptance in a productive range — then stop. A cheap draft amortizes better than an accurate one. On this setup, a 0.5B drafting for a 14B already delivers 1.37×; scaling up only costs you speedup.
"Competent" is doing real work in that sentence. A draft that's too small or from the wrong family can guess so badly that even its low cost can't save it — that's why a same-family pairing (Qwen-Coder drafting for Qwen-Coder) matters, and why the smallest model that can actually code is a useful reference point for the floor. But once you clear that competence bar, smaller is the safer bet on a single GPU. Start at the bottom of the size range and only move up if acceptance is so low that you're regressing.
Two caveats keep this honest. First, acceptance is task-dependent: code and math are predictable and accept well; open-ended chat is a token-by-token coin flip, and the whole curve shifts. Second, the crossover depends on the whole setup: change the target family, the task mix, or the GPU and the sweet spot can move. The principle is durable; the exact winning size is not. When in doubt, run the sweep yourself. It takes minutes.
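The rule above can be written as a small selection routine: walk the drafts from smallest to largest and stop at the first one that beats the baseline. This is our reading of the rule, not code from the harness; `pick_draft` and the `min_gain` threshold are hypothetical names, and the second example uses made-up numbers purely to show the "move up only if you're regressing" branch.

```python
def pick_draft(results, baseline_tok_s, min_gain=1.0):
    """Return the smallest draft whose speedup exceeds min_gain, else None.

    results: iterable of (size_in_billions, label, throughput_tok_s).
    """
    for size_b, label, tok_s in sorted(results):
        if tok_s / baseline_tok_s > min_gain:
            return label
    return None  # no draft helps: run the target without speculation


# Measured sweep from this article
measured = [(3.0, "3B", 52.91), (0.5, "0.5B", 58.25), (1.5, "1.5B", 56.01)]
choice = pick_draft(measured, baseline_tok_s=42.45)
print("measured sweep ->", choice)

# Made-up numbers: the smallest draft regresses, so the rule moves up one size.
hypothetical = [(0.5, "0.5B", 39.0), (1.5, "1.5B", 47.0), (3.0, "3B", 45.0)]
fallback = pick_draft(hypothetical, baseline_tok_s=42.45)
print("hypothetical sweep ->", fallback)
```

The routine deliberately does not search for the global maximum. That mirrors the advice to start at the bottom and stop early; if you have already run the full sweep, taking the highest throughput directly is just as reasonable.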
Reproduce it — and send us your numbers
Everything here runs on a stock llama.cpp release with public Qwen2.5-Coder GGUF models. Drop the target and any draft GGUFs into your models directory, then:
```bash
# Run the sweep; missing draft models are simply skipped
python scripts/bench.py --gpu 0
python scripts/aggregate.py && python scripts/inject.py
```
Watch predicted_per_second in the server's JSON for throughput, and the draft-acceptance lines in the log to see why you landed where you did.
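If you would rather collect the throughput programmatically than read it by eye, a sketch like the following pulls `predicted_per_second` out of a completion response and averages it across prompts. We are assuming the field sits under a `timings` object in the response body; check your server version's output. The sample bodies are hand-written for illustration, not captured logs.

```python
import json
from statistics import mean


def throughput_from_response(body):
    """Extract decode throughput (tok/s) from one completion response body.

    Assumes the field lives at timings.predicted_per_second.
    """
    data = json.loads(body)
    return float(data["timings"]["predicted_per_second"])


# Hand-written sample bodies (illustrative values, not real measurements)
samples = [
    '{"content": "...", "timings": {"predicted_n": 256, "predicted_per_second": 57.0}}',
    '{"content": "...", "timings": {"predicted_n": 256, "predicted_per_second": 59.5}}',
]

per_prompt = [throughput_from_response(body) for body in samples]
average_tok_s = mean(per_prompt)
print(f"mean throughput over {len(per_prompt)} prompts: {average_tok_s:.2f} tok/s")
```

Averaging per-prompt throughput is one reasonable aggregation; dividing total generated tokens by total decode time is another, and the two can differ slightly when prompts vary in length.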
This is experiment #10 in our open-source local-LLM benchmark series, and the sequel to experiment #1 on speculative decoding — read that first for the full overhead-vs-acceptance framing. We have one rig's numbers; the interesting question is whether the crossover sits at the same place on yours. A 4090 with bandwidth to spare might tolerate a bigger draft. An Apple Silicon Mac might not. Every data point sharpens the map. The contribution flow is a single pull request — scripts/make_submission.py auto-captures your hardware — and submitted data is licensed CC BY 4.0.
New to running models on your own hardware? Start with our guides on running an LLM locally and the complete guide to running local AI, then come back and pick a draft model the cheap way.
