What is the draft model in speculative decoding?

The draft model is a small, fast model that cheaply guesses the next several tokens, which the large "target" model then verifies in a single parallel pass. Accepted guesses are produced for free; wrong guesses are replaced by the target's own token, so the output is identical to running the target alone. The draft model only affects speed, never quality — and its size is the variable we sweep in this experiment.
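To make the accept-or-replace mechanism concrete, here is a minimal toy sketch. The two "models" are hypothetical stand-in functions, not real LLMs, and verification runs token by token where a real runtime would do it in one batched target pass. It is not llama.cpp's implementation; it only illustrates why greedy output matches the target's.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[int]], int]  # context -> next token (greedy pick)


def target_next(ctx: Sequence[int]) -> int:
    """Hypothetical 'target': an arbitrary deterministic next-token rule."""
    return (sum(ctx) * 31 + len(ctx) * 7) % 11


def draft_next(ctx: Sequence[int]) -> int:
    """Hypothetical 'draft': agrees with the target except at every third position."""
    tok = target_next(ctx)
    return tok if len(ctx) % 3 else (tok + 1) % 11


def greedy(model: Model, prompt: Sequence[int], n: int) -> List[int]:
    """Plain autoregressive decoding with a single model."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative(target: Model, draft: Model, prompt: Sequence[int],
                n: int, k: int = 5) -> Tuple[List[int], float]:
    """Greedy speculative decoding; returns (tokens, accepted / drafted)."""
    out = list(prompt)
    accepted = drafted = 0
    while len(out) - len(prompt) < n:
        # 1. The draft proposes k tokens, one after another.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        drafted += k
        # 2. The target checks each proposed token in order.
        ctx = list(out)
        for tok in proposal:
            want = target(ctx)
            if tok != want:
                ctx.append(want)  # first wrong guess: keep the target's token
                break
            ctx.append(tok)       # correct guess: keep it
            accepted += 1
        else:
            ctx.append(target(ctx))  # all k accepted: target adds one more
        out = ctx
    return out[len(prompt):len(prompt) + n], accepted / drafted


tokens, rate = speculative(target_next, draft_next, prompt=[1, 2], n=32)
```

Every token that ends up in the output is one the target itself would have chosen at that position, which is why `tokens` equals `greedy(target_next, [1, 2], 32)` regardless of how good or bad the draft is.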

Does a bigger draft model make speculative decoding faster?

Not on a single consumer GPU. We swept 0.5B, 1.5B, and 3B draft models against a fixed Qwen2.5-Coder-14B target on an RTX 5060 Ti, and net speedup fell with every size increase: 1.37× → 1.32× → 1.25×. A bigger draft does accept more tokens per round, but it also costs more to run on every draft step, and that per-step cost rises faster than the acceptance benefit. The smallest draft won.

Why did the 3B draft lose despite 87% acceptance?

The 3B draft agreed with the target 87% of the time, yet finished slowest at 52.91 tok/s (1.25×). A high acceptance rate is necessary but not sufficient. Wall-clock time depends on acceptance and draft cost together, and a 3B model sharing the same GPU memory bandwidth as a 14B target is expensive to run on every draft cycle. That overhead outweighs its accuracy advantage.

What draft model size should I use for speculative decoding?

Pick the smallest same-family draft that keeps acceptance in a productive range. On our setup a 0.5B drafting for a 14B target already delivered 1.37× at just 26% acceptance, because the draft is so cheap relative to the target. Scaling up to 1.5B or 3B only cost speedup. The rule: a cheap-and-competent draft beats an expensive-and-accurate one at batch size 1.

Is VRAM the reason bigger drafts are slower?

No. Even the 3B draft only added about 1,475 MB over the 0.5B configuration (9,769 MB vs 11,244 MB total). Fitting a larger draft was never the problem. The bottleneck is compute and memory bandwidth per draft step — every extra weight has to move through the hardware on each speculative round, and that time cost is what erodes the speedup.

Do these draft-size results depend on the task and hardware?

Yes. Acceptance rate is both model-dependent and task-dependent — structured outputs like code and math are far more predictable than open-ended chat — and the crossover point shifts with your target family, task mix, and GPU. The principle holds (cheap draft + slow target = win), but the exact sweet spot can move, so re-measure if you change any of those variables.

Bigger Draft Model = Faster? A Speculative Decoding Sweep

In our first local-LLM benchmark, we established the one number that decides whether speculative decoding is worth turning on: the acceptance rate — how often the big target model agrees with the small draft model's guesses. But that experiment left a tempting question hanging. If a higher acceptance rate is good, and a bigger draft model guesses better, then surely a bigger draft model is faster?

So we tested it directly. We fixed the target model — Qwen2.5-Coder-14B-Instruct (Q4_K_M), baseline 42.45 tok/s on an RTX 5060 Ti — and varied only the draft model: 0.5B, 1.5B, 3B. The same family, the same task suite, the same hardware. Just the draft size changing. Everything below is reproducible on a stock llama.cpp release, and the raw data and harness are open source.

The result is the opposite of the intuition: the smallest draft won, and the biggest draft — the one that guessed almost perfectly — finished last.

The results, up front

Here is the whole sweep in one table. The target is fixed; only the draft model changes. Speedup is each draft's throughput divided by the 42.45 tok/s no-speculation baseline.

Draft model	File size	Throughput	Speedup	Acceptance rate	VRAM used
0.5B (Q8_0)	531 MB	58.25 tok/s	1.37×	26%	9,769 MB
1.5B (Q4_K_M)	986 MB	56.01 tok/s	1.32×	40%	10,295 MB
3B (Q4_K_M)	1,930 MB	52.91 tok/s	1.25×	87%	11,244 MB

Read it twice. Acceptance climbs from 26% to 87% as the draft grows — a huge improvement in guessing accuracy. And net speedup falls the entire way: 1.37× → 1.32× → 1.25×. The 3B draft agrees with the target nearly every token, and it is still the slowest of the three. The smallest, worst-guessing draft is the fastest overall.
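The speedup column can be re-derived from the throughput column. The short check below uses only the numbers from the table; the variable and function names are ours for illustration and are not part of the benchmark harness.

```python
# Throughput figures copied from the results table (tok/s).
BASELINE_TPS = 42.45  # target only, no draft
DRAFT_TPS = {
    "0.5B (Q8_0)": 58.25,
    "1.5B (Q4_K_M)": 56.01,
    "3B (Q4_K_M)": 52.91,
}


def speedup(tps: float, baseline: float = BASELINE_TPS) -> float:
    """Net speedup = speculative throughput / no-speculation baseline."""
    return tps / baseline


speedups = {name: round(speedup(tps), 2) for name, tps in DRAFT_TPS.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```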

Why this happens: it's a product, not a sum

The reason traces straight back to the overhead-vs-acceptance tradeoff from experiment #1. Speculative decoding is not free. Every draft round costs something — the draft model has to run before the target can verify. So your total wall-clock time is governed by two factors multiplied together:

Acceptance — how many tokens you get to keep per expensive target pass. Higher is better.
Draft cost — how long the draft model takes to propose those tokens. Lower is better.

A bigger draft improves the first number and worsens the second. The question is which one moves faster. On a single consumer GPU at batch size 1, the answer is unambiguous: draft cost rises faster than acceptance pays back.
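One way to see how the two factors interact is the usual idealized model of speculative decoding. It assumes each drafted token is accepted independently with probability `alpha`, that the draft always proposes `gamma` tokens, and that one draft step costs a fraction `c` of one target pass. These are simplifying assumptions, the values below are hypothetical, and the model is not fit to our measurements: the acceptance rate llama.cpp reports is not the same quantity as `alpha`.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass under i.i.d. acceptance."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, gamma: int, c: float) -> float:
    """Idealized speedup: tokens gained per round / cost of a round.

    A round costs gamma draft steps (each c target-passes) plus one target pass.
    """
    return expected_tokens_per_round(alpha, gamma) / (gamma * c + 1.0)


# Hypothetical drafts, chosen only to illustrate the tradeoff.
cheap = modeled_speedup(alpha=0.60, gamma=5, c=0.05)     # less accurate, nearly free
accurate = modeled_speedup(alpha=0.90, gamma=5, c=0.35)  # more accurate, costly

print(f"cheap draft:    {cheap:.2f}x")
print(f"accurate draft: {accurate:.2f}x")
```

In this sketch the more accurate draft yields more tokens per round, but its higher per-step cost inflates the denominator enough that the cheap draft still comes out ahead.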

Going from 0.5B to 3B is a 6× increase in parameter count. Because the 0.5B runs at Q8_0 and the 3B at Q4_K_M, the gap in bytes is smaller, but the 3B file is still roughly 3.6× larger than the 0.5B (1,930 MB vs 531 MB). Those weights have to be read out of GPU memory on every single draft step, over the same memory bus the 14B target uses, and single-user decoding is bound by memory bandwidth, not compute, exactly as we found measuring the tokens-per-second you can expect locally. The 3B guesses beautifully. It just takes too long to do it.
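A rough back-of-envelope estimate shows the scale of that cost. It assumes per-token time is proportional to the bytes of weights read (a pure bandwidth-bound view that ignores KV cache traffic and kernel overhead), treats the "~9 GB" target as 9,000 MB, and assumes the draft always proposes the full 5 tokens per round. Treat the output as an order-of-magnitude guide, not a measurement.

```python
TARGET_MB = 9000      # "~9 GB" target from the setup; approximate
DRAFT_N_MAX = 5       # tokens drafted per round in this sweep
DRAFT_MB = {          # draft file sizes from the results table
    "0.5B Q8_0": 531,
    "1.5B Q4_K_M": 986,
    "3B Q4_K_M": 1930,
}


def draft_round_cost(draft_mb: float, target_mb: float = TARGET_MB,
                     n_draft: int = DRAFT_N_MAX) -> float:
    """Cost of one full draft round, in units of one target pass.

    Assumes time per step scales with weight bytes read (bandwidth-bound).
    """
    return n_draft * draft_mb / target_mb


costs = {name: draft_round_cost(mb) for name, mb in DRAFT_MB.items()}
for name, cost in costs.items():
    print(f"{name}: one draft round costs about {cost:.2f} target passes")
```

Under these assumptions, a full 5-token round from the 3B draft moves roughly as many bytes as one pass of the target itself, while the 0.5B draft moves less than a third of that.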

The size–accuracy tradeoff, visualized

Each draft size strikes a different balance between acceptance and per-step cost. Read the diagram left to right: acceptance climbs steeply as the draft grows, but so does draft cost — and cost wins. Net speedup peaks at the smallest model.

Figure: Acceptance rate and per-step cost both rise with draft model size, but cost grows faster. The 0.5B wins on net speedup despite having the lowest acceptance rate.

The setup — boring on purpose

We kept the methodology deliberately dull so the only thing changing is the variable under test. The target is fixed: Qwen2.5-Coder-14B-Instruct-Q4_K_M (~9 GB), CUDA backend, GPU 0. We measure a plain no-speculation baseline first (42.45 tok/s), then sweep three drafts, all with the same flags — --spec-type draft-simple --spec-draft-n-max 5 -ngld 999 (5 tokens drafted per round, draft model fully on GPU). Greedy decoding, 256 tokens per prompt, the shared 12-prompt suite spanning code, math, reasoning, summarization, and chat.

Component	What we used
GPU	RTX 5060 Ti — Blackwell (sm_120), 16 GB
Runtime	llama.cpp, CUDA backend
Target model	Qwen2.5-Coder-14B-Instruct-Q4_K_M (~9 GB)
Baseline	42.45 tok/s — target only, no draft
Drafts swept	0.5B Q8_0 (531 MB) · 1.5B Q4_K_M (986 MB) · 3B Q4_K_M (1,930 MB)

A note on the draft quantizations: the 0.5B runs at Q8_0 and the larger two at Q4_K_M, which is the practical way you'd actually deploy each — a 0.5B is small enough to keep at near-full precision, while bigger drafts get quantized harder to stay cheap. (If quantization's quality cost is new to you, our GGUF quantization quality benchmark measures exactly what you give up at each level.)

What the sweep teaches

Acceptance rises steeply, and nonlinearly, with draft size. Going from 0.5B to 3B takes acceptance from 26% to 87%. A bigger same-family draft really does predict the target's tokens far more reliably; the capacity difference is real and large. The gain is also concentrated at the top end: 0.5B → 1.5B adds 14 points, while 1.5B → 3B adds another 47. Most of the accuracy gain arrives at the size that is most expensive to run.

Net speedup falls with every step. 1.37× → 1.32× → 1.25×. Each size increase costs roughly 0.05–0.07× of net speedup. You are trading wall-clock time for better guessing accuracy you don't need — accepting more tokens per round matters only if accepting them is cheap.

VRAM is not the binding constraint. This is the tell. Even the 3B draft adds only about 1,475 MB over the 0.5B configuration (9,769 MB vs 11,244 MB total). Fitting a bigger draft was never the problem. A modest VRAM footprint still carries a compute-and-bandwidth cost per draft step, and that cost is what erodes the speedup — not memory capacity. If you were sizing this by VRAM headroom alone, you'd reach for the biggest draft that fits and end up slowest.
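The step-to-step changes quoted in these three points can be checked directly against the table. The snippet below is a small arithmetic sketch using only the published figures; the names are illustrative and not taken from the harness.

```python
# (draft, acceptance %, speedup x, VRAM used in MB) from the results table
ROWS = [
    ("0.5B", 26, 1.37, 9769),
    ("1.5B", 40, 1.32, 10295),
    ("3B", 87, 1.25, 11244),
]


def step_deltas(rows):
    """Change in acceptance, speedup, and VRAM between consecutive draft sizes."""
    deltas = []
    for prev, cur in zip(rows, rows[1:]):
        deltas.append((
            f"{prev[0]} -> {cur[0]}",
            cur[1] - prev[1],            # acceptance, percentage points
            round(cur[2] - prev[2], 2),  # net speedup
            cur[3] - prev[3],            # VRAM, MB
        ))
    return deltas


for step, d_acc, d_speed, d_vram in step_deltas(ROWS):
    print(f"{step}: acceptance {d_acc:+d} pts, speedup {d_speed:+.2f}x, VRAM {d_vram:+d} MB")
```

Acceptance and VRAM both rise at each step while speedup falls, and the two VRAM steps sum to the 1,475 MB difference between the 0.5B and 3B configurations.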

The 0.5B is the sweet spot here — not because it guesses well. It clearly doesn't; 26% is the worst of the three. It wins because it is cheap enough that even 26% acceptance is meaningful leverage on a 14B target that only manages 42.45 tok/s on its own. A 531 MB draft is nearly free next to a 9 GB target, and that ratio — slow target, fast draft — is what determines the winner.

The rule: pick the smallest competent draft

If you want one line to take away, it is this:

Pick the smallest draft model that keeps acceptance in a productive range — then stop. A cheap draft amortizes better than an accurate one. On this setup, a 0.5B drafting for a 14B already delivers 1.37×; scaling up only costs you speedup.

"Competent" is doing real work in that sentence. A draft that's too small or from the wrong family can guess so badly that even its low cost can't save it — that's why a same-family pairing (Qwen-Coder drafting for Qwen-Coder) matters, and why the smallest model that can actually code is a useful reference point for the floor. But once you clear that competence bar, smaller is the safer bet on a single GPU. Start at the bottom of the size range and only move up if acceptance is so low that you're regressing.

Two caveats keep this honest. First, acceptance is task-dependent: code and math are predictable and accept well, while open-ended chat is closer to a token-by-token coin flip, and the whole curve shifts with the task. Second, the crossover is setup-dependent: change the target family, the task mix, or the GPU and the sweet spot can move. The principle is durable; the exact winning size is not. When in doubt, run the sweep yourself; it takes minutes.

Reproduce it — and send us your numbers

Everything here runs on a stock llama.cpp release with public Qwen2.5-Coder GGUF models. Drop the target and any draft GGUFs into your models directory, then:

# Run the sweep — missing draft models are simply skipped
python scripts/bench.py --gpu 0
python scripts/aggregate.py && python scripts/inject.py

Watch predicted_per_second in the server's JSON for throughput, and the draft-acceptance lines in the log to see why you landed where you did.
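If you would rather read that number programmatically than eyeball it, a sketch along these lines may help. It assumes a llama.cpp server is already running and that the response carries `predicted_per_second` inside a `timings` object; the URL, port, and request fields are our assumptions about a default server setup, and none of the helper names come from the benchmark scripts.

```python
import json
import urllib.request

BASELINE_TPS = 42.45  # no-speculation baseline from this article


def tokens_per_second(response: dict) -> float:
    """Pull generation throughput out of a server response.

    Assumes the layout {"timings": {"predicted_per_second": ...}}.
    """
    return float(response["timings"]["predicted_per_second"])


def speedup_vs_baseline(response: dict, baseline: float = BASELINE_TPS) -> float:
    """Net speedup of one run relative to the target-only baseline."""
    return tokens_per_second(response) / baseline


def run_completion(prompt: str,
                   url: str = "http://127.0.0.1:8080/completion",  # assumed default
                   n_predict: int = 256) -> dict:
    """Send one greedy completion request and return the parsed JSON."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict,
                       "temperature": 0}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Usage, with a server running:
#   result = run_completion("Write a function that reverses a linked list.")
#   print(tokens_per_second(result), speedup_vs_baseline(result))
```

Average over the whole prompt suite rather than trusting a single request, since throughput and acceptance both vary by task.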

This is experiment #10 in our open-source local-LLM benchmark series, and the sequel to experiment #1 on speculative decoding — read that first for the full overhead-vs-acceptance framing. We have one rig's numbers; the interesting question is whether the crossover sits at the same place on yours. A 4090 with bandwidth to spare might tolerate a bigger draft. An Apple Silicon Mac might not. Every data point sharpens the map. The contribution flow is a single pull request — scripts/make_submission.py auto-captures your hardware — and submitted data is licensed CC BY 4.0.

New to running models on your own hardware? Start with our guides on running an LLM locally and the complete guide to running local AI, then come back and pick a draft model the cheap way.
