llama.cpp Speculative Decoding: Does It Work on Cheap GPUs?

We tested speculative decoding in llama.cpp on an RTX 5060 Ti, a GTX 1080 Ti, and a bare CPU. Real benchmarks: where the draft-model trick helps, and where it backfires.

By InventiveHQ Team

In June 2026, NVIDIA published a striking benchmark: its new dFlash system delivered "up to 15× higher throughput" using speculative decoding on eight DGX B300s. That number is real — but it describes a rack of datacenter accelerators serving many users at once. It got us wondering about the other 99% of people running local models: what does the same trick do on one GPU you can actually buy, or one you already own?

So we ran the experiment. We took the core mechanism behind dFlash — speculative decoding — and tested it in llama.cpp on three very different machines: a brand-new RTX 5060 Ti (Blackwell, ~$570), an 8-year-old GTX 1080 Ti (Pascal, 2017), and a plain CPU with no GPU at all. Everything below is reproducible on a stock llama.cpp release, and all of our raw data and the benchmark harness are open source.

The results, up front

Here's the whole experiment in one table — the best speculative-decoding speedup we measured on each setup, against its own no-speculation baseline:

| Hardware | Model | Baseline | Best speculative result | Best speedup |
| --- | --- | --- | --- | --- |
| RTX 5060 Ti (Blackwell, 2025) | Qwen2.5-Coder 14B | 42.0 tok/s | 1.85× on math | up to 1.85× |
| RTX 5060 Ti (Blackwell, 2025) | Qwen2.5-Coder 7B | 84.7 tok/s | 0.95× (barely break-even) | ⚠️ regressed |
| GTX 1080 Ti (Pascal, 2017) | Qwen2.5-Coder 7B | 50.5 tok/s | 1.16× on math | up to 1.16× |
| CPU only (no GPU) | Qwen2.5-Coder 3B | 12.9 tok/s | 2.03× on math | up to 2.03× |

The short version: speculative decoding does work on consumer hardware — but the gains are tens of percent to ~2×, not the 15× headline, and on an already-fast small model it can actually make you slower. Below is exactly why, and how to make sure you land on the winning side.

What is speculative decoding? (the 60-second version)

Normally, an LLM generates text one token at a time: predict a token, feed it back in, predict the next. Each step has to read the entire model from memory, which is why local generation is slow — you are bottlenecked on memory bandwidth, not raw math.

Speculative decoding adds a second, much smaller draft model. On each cycle:

  1. The small draft model quickly guesses the next several tokens.
  2. The big target model verifies all of those guesses in a single parallel pass.
  3. Every guess the target agrees with is accepted for free; the first wrong guess is replaced by the target's own prediction.

Figure: animated diagram of speculative decoding. A small draft model proposes a block of tokens and the target model verifies them in one parallel pass; the first four are accepted, the fifth is replaced by the target's own token, and the sixth is discarded.

Because the target model verifies everything, the output is identical to running the target model alone — we confirmed this by hashing the generations. The only thing that varies is speed, and speed depends entirely on how often the draft model's guesses get accepted. That single number — the acceptance rate — turns out to be the whole game.
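The three steps above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for real models (plain functions that return a next token), decoding is greedy, and verification is written as a sequential loop where a real engine would score the whole block in one batched pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token sequence to its greedy next token


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the target model emits one token per step."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], float]:
    """Draft k tokens, check them against the target, keep the agreeing prefix."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    proposed = accepted = 0
    while len(seq) < goal:
        # Step 1: the draft model guesses k tokens ahead.
        ctx = list(seq)
        guesses = []
        for _ in range(k):
            guesses.append(draft(ctx))
            ctx.append(guesses[-1])
        # Step 2: the target checks each guess (sequential stand-in for one batched pass).
        for guess in guesses:
            if len(seq) >= goal:
                break
            expected = target(seq)
            proposed += 1
            seq.append(expected)  # Step 3: the emitted token is always the target's
            if guess != expected:
                break  # first mismatch: the rest of the drafted block is discarded
            accepted += 1
    return seq, (accepted / proposed if proposed else 0.0)


# Hypothetical toy models: the draft agrees with the target except at every 4th position.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 7 + 3) % 13


def toy_draft(seq: List[int]) -> int:
    tok = toy_target(seq)
    return tok if len(seq) % 4 else (tok + 1) % 13


baseline = greedy_decode(toy_target, [1], 20)
spec, acceptance_rate = speculative_decode(toy_target, toy_draft, [1], 20)
print("identical output:", spec == baseline, "| acceptance rate:", acceptance_rate)
```

Because the emitted token is always the target's own choice, the two sequences match no matter how bad the draft is. Only the acceptance rate, and therefore the number of target passes needed, changes.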

The setup — reproducible on any GPU

We kept the methodology deliberately boring so the numbers are comparable: a single GPU with all layers resident, a 4096-token context, greedy decoding (temperature 0), 256 tokens generated per prompt, and a fixed 12-prompt suite spanning code, math, reasoning, summarization, and chat. Throughput is the server's own predicted_per_second — pure generation rate.

| Component | What we used |
| --- | --- |
| GPU A | RTX 5060 Ti, Blackwell (sm_120), 16 GB, 2025, ~$570 |
| GPU B | GTX 1080 Ti, Pascal (sm_61), 11 GB, 2017 |
| Target model | Qwen2.5-Coder-7B-Instruct (Q4_K_S, ~4.3 GB) |
| Draft model | Qwen2.5-Coder-0.5B-Instruct (Q8_0, ~0.5 GB) |
| Runtime | llama.cpp, both the CUDA-12.4 and Vulkan Windows builds |

One important wrinkle: we tested two backends — NVIDIA's CUDA build and the vendor-neutral Vulkan build — and the difference between them turned out to matter as much as speculation itself.

Result 1: the generational baseline

First, a sanity check with no speculation — just raw generation speed, eight years of GPU progress apart.

Figure: baseline generation throughput with no speculation, in tokens/sec. The RTX 5060 Ti (2025) reaches 84.7 on CUDA and 80.4 on Vulkan; the GTX 1080 Ti (2017) reaches 50.5 on CUDA and 57.3 on Vulkan.

Here is the first surprise: at single-user generation, a brand-new Blackwell card is only about 1.48× faster than a GPU from 2017 when each card runs its faster backend (84.7 tok/s on CUDA versus 57.3 tok/s on Vulkan). Even CUDA against CUDA, the gap is about 1.68× (84.7 versus 50.5 tok/s). Eight years of silicon and datacenter-grade marketing, and the best-case gap is less than 1.5×. The reason ties straight back to the mechanism: single-user decoding is memory-bandwidth-bound, and memory bandwidth has improved far more slowly than raw compute (FLOPS). The 1080 Ti's 484 GB/s of bandwidth is still very much in the game. (If you're sizing hardware, our guides on how much VRAM you need to run an LLM and what tokens-per-second to expect go deeper on this.)
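A back-of-the-envelope check makes the bandwidth argument concrete. The sketch below assumes a deliberately crude model that is ours, not a measurement: every generated token reads the full set of weights from memory exactly once, and KV-cache traffic, kernel launch cost, and everything else are ignored. The 484 GB/s and ~4.3 GB figures come from the text above; `decode_ceiling_tok_s` is a hypothetical helper name.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on batch-1 decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


# GTX 1080 Ti bandwidth and the Q4_K_S 7B model size quoted in this article.
ceiling = decode_ceiling_tok_s(bandwidth_gb_s=484.0, model_gb=4.3)

# Measured 1080 Ti baselines from the chart above.
measured = {"CUDA": 50.5, "Vulkan": 57.3}
for backend, tok_s in measured.items():
    print(f"{backend}: {tok_s} tok/s is {tok_s / ceiling:.0%} of a ~{ceiling:.0f} tok/s ceiling")
```

Under that assumption the 2017 card tops out at roughly 113 tok/s for this model, and it already achieves about half of that. No amount of extra compute raises the ceiling; only more bandwidth or a smaller model does.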

Result 2: does speculation actually help? (The overhead surprise)

Now we turn speculation on. Here's where it gets interesting — and where most "just add a draft model" advice falls apart.

On the 7B target model, adding the 0.5B draft model slowed the RTX 5060 Ti down on the CUDA build — to 0.27×, a nearly 4× regression — despite perfectly healthy acceptance rates. The Vulkan build on the same card barely broke even at 0.95×. The culprit isn't the draft model's guesses; it's per-cycle overhead. When your target model is already fast (~85 tok/s), the fixed cost of orchestrating each speculative round can exceed the time it saves.

The fix is counterintuitive: use a bigger, slower target model. Swap the 7B for a 14B (which still fits comfortably in the 5060 Ti's 16 GB), and the exact same CUDA build flips to a 1.4× overall win — because now each generation step takes long enough that the overhead is amortized.

Figure: overall speedup from draft-model speculation on the RTX 5060 Ti, relative to 1.0× (no change). The 7B model regresses to 0.27× on CUDA and 0.95× on Vulkan, but a 14B model on the same CUDA build reaches 1.40×.

The lesson: speculative decoding is not a free lunch at batch size 1. It carries real overhead, and whether you come out ahead depends on the target model being slow enough — and the backend being efficient enough — to pay that overhead back. The vendor-neutral Vulkan build consistently incurred less overhead than CUDA on the fast 7B, which was not what we expected from "the optimized path."
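One way to reason about that trade-off is a simple cost model. The sketch below is an illustration, not a fit to our data: it assumes each guess is accepted independently with the same probability, that verifying a block costs about one normal target step, and that every cycle pays a fixed orchestration overhead. The draft time, overhead, and acceptance rate in the example are made-up placeholders; only the two baseline speeds (84.7 and 42.0 tok/s) come from our measurements.

```python
def expected_tokens_per_cycle(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per draft/verify cycle, including the target's own token."""
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def estimated_speedup(t_target_ms: float, t_draft_ms: float, overhead_ms: float,
                      accept_rate: float, k: int) -> float:
    """Speedup over plain decoding, where one token costs t_target_ms."""
    cycle_ms = k * t_draft_ms + t_target_ms + overhead_ms
    return expected_tokens_per_cycle(accept_rate, k) * t_target_ms / cycle_ms


# Per-token target times derived from the measured baselines on the RTX 5060 Ti.
t_7b_ms = 1000.0 / 84.7
t_14b_ms = 1000.0 / 42.0

# Placeholder assumptions: 2 ms per drafted token, 10 ms fixed overhead, 70% acceptance.
for name, t_ms in [("7B", t_7b_ms), ("14B", t_14b_ms)]:
    print(name, round(estimated_speedup(t_ms, 2.0, 10.0, accept_rate=0.7, k=5), 2))
```

With identical draft cost, overhead, and acceptance rate, the slower target always comes out further ahead, because the fixed per-cycle cost is a smaller fraction of each cycle. Drop the acceptance rate toward zero and both configurations fall below 1.0×.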

Result 3: it's all about the task

Even with the right model, speculation's benefit swings wildly by what you ask it to do. Here's the 14B model on the 5060 Ti, broken down by task category:

Figure: 14B speculative-decoding speedup by task on the RTX 5060 Ti, relative to 1.0× (no change). Math 1.85×, code 1.57×, reasoning 1.37×, summarization 0.97×, chat 0.92×.

The pattern is consistent across every machine we tested: the more predictable the text, the bigger the win. Code and math are full of structure the draft model can anticipate (closing brackets, operators, boilerplate), so acceptance rates are high. Open-ended chat is the opposite — every token is a genuine choice, the draft model guesses wrong more often, and you can even fall slightly behind the baseline. Speculative decoding isn't a global "make it faster" switch; it's a bet that pays off in proportion to how guessable your output is.

Bonus: it even works on a CPU with no GPU

We almost didn't run this one. With no GPU at all — just a CPU and a 3B model — baseline generation crawls at 12.9 tok/s. Speculation lifted it to 22.1 tok/s overall (1.72×), and on math it hit 2.03× — the single biggest speedup in the entire experiment.

Figure: CPU-only speculative-decoding speedup by task with a 3B model, relative to 1.0× (no change). Math 2.03×, code 1.96×, reasoning 1.74×, summarization 1.69×, chat 1.01×.

This makes sense once you internalize the bandwidth story: a CPU is even more starved for memory bandwidth than a GPU, so the slow per-step generation gives speculation plenty of headroom to amortize its overhead. If you're running models on a laptop or a home server without a discrete GPU, speculative decoding may be the single highest-leverage flag you can flip.

What about Ollama and LM Studio?

We benchmarked llama.cpp directly because it exposes the speculative-decoding knobs cleanly, but the same mechanism shows up in the popular front-ends — most of them are llama.cpp under the hood. LM Studio exposes a draft-model "speculative decoding" toggle in its UI, and Ollama has been adding draft-model support as the underlying engine matures. The takeaways here transfer directly: pick a same-family draft model, expect the biggest wins on code and math, and don't be surprised if a small, already-fast target model gets slower. (For a fuller comparison of the runners, see Ollama vs LM Studio vs llama.cpp.)

What we learned

  1. Not a free lunch at batch 1. On consumer hardware serving one request, speculation buys you tens of percent, peaking at ~1.85× on our GPUs and 2.03× on the CPU, not the 15× headline, and only on the right tasks.
  2. Overhead can backfire on fast models. Speculative decoding has per-cycle overhead. On an already-fast 7B it lost on CUDA (0.27×); on a slower 14B it won (1.4×). Match the trick to a model slow enough to pay it back.
  3. The task is everything. Code and math win big; open-ended chat barely moves. Acceptance rate determines the result.
  4. N-gram speculation is a safe default. The model-free n-gram (prompt-lookup) variant almost never hurts and needs zero extra download — a free win for repetitive workloads.
  5. The generational gap is smaller than the spec sheet. A 2025 Blackwell card beat a 2017 Pascal card by only 1.48× at single-user decode, because this workload is bandwidth-bound, not compute-bound.

How this compares to NVIDIA's 15×

None of this contradicts NVIDIA. dFlash's 15× comes from speculative decoding compounding with very large batch sizes across eight DGX B300s — serving many users at once, where the GPUs are compute-bound and parallel verification pays off enormously. That is a fundamentally different regime from a single person generating one response on one card. The mechanism is identical; the math changes completely once you go from batch-1 latency to high-throughput serving. Both things are true: 15× in the datacenter, and a careful ~1.4–1.85× on your desk.

Reproduce it — and send us your numbers

Everything here runs on a stock llama.cpp release with public Qwen2.5-Coder GGUF models. The fastest way to try it on your own hardware:

# With a same-family draft model
llama-server -m qwen2.5-coder-14b-instruct-q4_k_m.gguf \
  -md qwen2.5-coder-0.5b-instruct-q8_0.gguf \
  --spec-type draft-simple --spec-draft-n-max 5 -ngl 999

# Or model-free n-gram speculation (no draft download needed)
llama-server -m qwen2.5-coder-14b-instruct-q4_k_m.gguf \
  --spec-type ngram-cache --spec-draft-n-max 5 -ngl 999

Watch predicted_per_second in the server's JSON response for throughput, and the draft acceptance lines in the logs to see why you got the speed you did.
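If you would rather script that check than read logs, a minimal Python sketch follows. Several details are assumptions you should verify against your llama.cpp build: that the server listens on `http://localhost:8080`, that the `/completion` endpoint accepts `prompt`, `n_predict`, and `temperature`, and that the response carries the generated text in `content` and the rate in `timings.predicted_per_second`. The helper names are ours, not part of the benchmark harness.

```python
import hashlib
import json
import urllib.request


def request_completion(prompt: str, url: str = "http://localhost:8080/completion") -> dict:
    """POST one greedy, 256-token completion to a running llama-server (assumed endpoint)."""
    payload = {"prompt": prompt, "n_predict": 256, "temperature": 0}
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize(response: dict) -> tuple:
    """Return (tokens per second, SHA-256 of the generated text) for one response."""
    tok_s = response["timings"]["predicted_per_second"]
    digest = hashlib.sha256(response["content"].encode("utf-8")).hexdigest()
    return tok_s, digest


def compare(baseline: dict, speculative: dict) -> dict:
    """Speedup of the speculative run, plus whether both runs produced the same text."""
    base_tok_s, base_hash = summarize(baseline)
    spec_tok_s, spec_hash = summarize(speculative)
    return {"speedup": spec_tok_s / base_tok_s, "same_output": base_hash == spec_hash}
```

To use it, start the server without speculation, call `request_completion` with a prompt, restart with the draft model, repeat with the same prompt, and pass both responses to `compare`. A `same_output` of `False` under greedy decoding would be worth investigating before trusting any speedup number.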

This is where you come in. This benchmark is experiment #1 in our open-source local-LLM benchmark series, and the whole point is to build a community map of where speculative decoding helps — and hurts — across real consumer hardware. We have Blackwell, Pascal, and CPU data; we want yours. An RTX 4090? An Apple Silicon Mac? An old mining card? Every data point sharpens the picture.

Contributing takes three steps and one pull request:

cd experiments/speculative-decoding

# 1. Run the benchmark (run only what you can — a single file is a great contribution)
python scripts/bench.py --gpu 0 --backends cuda,vulkan --out results/results.json

# 2. Bundle your hardware info + results into a community submission
python scripts/make_submission.py --name "rtx4090-ryzen7950x" --notes "llama.cpp b9999, Q4_K_M"

# 3. Open a pull request adding just your results/community/<name>.json file

Full instructions are in the repo's CONTRIBUTING guide. Submitted data is licensed CC BY 4.0, and we'll fold it into a growing community comparison. If speculative decoding made your local models faster — or surprised you by making them slower — that's exactly the kind of result worth sharing.


New to running models on your own hardware? Start with our guides on running an LLM locally, choosing between Ollama, LM Studio, and llama.cpp, and LLM quantization.

Frequently Asked Questions


What is speculative decoding?

Speculative decoding speeds up large-language-model generation by pairing a big, accurate "target" model with a small, fast "draft" model. The draft model cheaply guesses several tokens ahead; the target model then verifies all of those guesses in a single parallel pass instead of generating one token at a time. Accepted guesses are free speedups, and any wrong guess is simply replaced by the target model's own prediction, so the output is identical to what the target model would have produced alone. It is a latency optimization, not a quality trade-off.

Does speculative decoding reduce output quality?

No. This is the key property that makes it attractive. The large target model still verifies every token, so the final text is byte-for-byte identical to running the target model normally. We confirmed this by hashing the output of speculative and non-speculative runs, and they matched. You are only trading some extra compute and memory for lower latency, never accuracy.

Why don't I see NVIDIA's 15× on my own GPU?

NVIDIA's 15× figure comes from dFlash running across eight DGX B300s serving many requests at once (large batch sizes), where speculative decoding compounds with massive parallelism. On a single consumer GPU serving one request at a time (batch size 1), decoding is bound by memory bandwidth, not compute, so the realistic gains are tens of percent, up to about 1.85× in our best case with a 14B model on a math task. The trick is real; the 15× headline just describes a very different operating regime.

When does speculative decoding help the most?

It helps most when (1) the target model is large/slow enough that per-step overhead is small relative to generation time, and (2) the task is predictable enough that the draft model's guesses are frequently accepted. Code and math benefited the most in our tests; open-ended chat benefited the least. On an already-fast small model, the overhead of running a draft model can actually make generation slower.

What is n-gram speculation?

N-gram (or "prompt lookup") speculation is a model-free variant: instead of a draft model, it predicts the next tokens by looking for repeated patterns already present in the prompt and prior output. It requires no extra model download and almost never hurts performance, which makes it a safe default, especially for tasks with lots of repetition like editing code or summarizing a document you pasted in.
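The lookup idea fits in a few lines. The sketch below is a simplified illustration of the prompt-lookup technique, not llama.cpp's n-gram code: `ngram_draft` is a hypothetical helper that finds the most recent earlier occurrence of the last `n` tokens and proposes whatever followed it, up to `k` tokens. The target model would then verify those proposals exactly as it verifies a draft model's guesses.

```python
from typing import List


def ngram_draft(tokens: List[str], n: int = 3, k: int = 5) -> List[str]:
    """Propose up to k tokens by matching the last n tokens against earlier context."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan right to left so the most recent earlier match wins; skip the tail itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []  # no repeat found: fall back to normal decoding for this step


context = ["for", "i", "in", "range", "(", "n", ")", ":",
           "total", "+=", "i", "for", "i", "in"]
print(ngram_draft(context))
```

Here the trailing `for i in` already appeared at the start of the context, so the drafter proposes the tokens that followed it last time. When nothing repeats it proposes nothing, which is why this variant costs so little on workloads where it cannot help.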

Can I reproduce these results myself?

Yes, that is the whole point. Everything runs on a stock llama.cpp release with publicly available Qwen2.5-Coder GGUF models. The benchmark harness, the 12-prompt suite, and our raw results are all in the open-source repo, and we are actively collecting community results from other hardware via pull request. If you have a GPU (or even just a CPU), your numbers are welcome.
