If you've installed Ollama or LM Studio to run a model on your own machine, here's something that surprises most people: you're already running llama.cpp. Both tools ship it as their inference engine. So when you pit "Ollama vs LM Studio vs llama.cpp," you are not watching three engines race. You are watching one engine wearing three different costumes — and the only thing being measured is what each costume weighs.
That makes a question most comparison posts skip suddenly answerable with a number. Everyone benchmarks these tools on features — model hubs, GUIs, Docker support. Almost nobody measures the speed tax of the convenience layer. Since the model, the quantization, and the GPU are identical across all three, any difference in throughput is pure overhead: wrapper version lag, default flags, and the HTTP/scheduling/templating layers each tool stacks on top of the shared engine.
So we measured it. Same 7B model, same GPU, same 12-prompt suite, all three runners hit over their OpenAI-compatible endpoints. The headline finding, up front:
LM Studio costs you 0.3% — within noise. Ollama costs you ~10% — a real, consistent throughput tax. Raw llama.cpp is the ceiling. Everything here is reproducible on a stock setup, and all of our raw data and the harness are open source.
Why "they're all llama.cpp" is the whole story
What you are actually choosing between when you pick a runner is which packaging sits between you and the raw binary:
- llama.cpp —
llama-server, a REST server with no extras. The bare engine. - LM Studio — a GUI application that bundles llama.cpp (plus Apple's MLX on Macs).
- Ollama — a model-management daemon with its own scheduler and template engine, with llama.cpp underneath for GGUF models.
The packaging is the variable. The engine is shared. That's what makes the "convenience tax" precisely answerable instead of hand-wavy. Since the GGUF file, the quant, and the GPU kernels are the same in every run, a speed gap can only come from three places, all of which the wrapper decides for you:
- Which llama.cpp version it bundles (wrappers often lag upstream).
- Default flags — flash-attention, KV-cache type, batch size, context length,
-ngl(how many layers go on the GPU). - Its own overhead — the HTTP server, prompt templating, and scheduling.
We can't pin Ollama or LM Studio to a specific llama.cpp build, and that is exactly the point: their bundled version and their defaults are what you're measuring. When you type ollama run, you're accepting whatever engine build and flags Ollama chose for you. This experiment puts a number on the consequences.
How it works: three wrappers, one engine
Here's the architecture in one picture — three packaging layers stacked on a single shared inference slab, with each one's measured overhead:
All three runners load the same GGUF file and call into the same GPU kernels. The overhead bars (scaled to a 20% maximum) tell the whole story at a glance: LM Studio's bar is barely visible, Ollama's is real.
The setup
We kept the methodology deliberately boring so the numbers are comparable. All three runners serve the identical model and quant — Qwen2.5-Coder-7B (Q4) — over their OpenAI-compatible endpoints. We hit each with the same 12-prompt suite spanning code, math, reasoning, summarization, and chat; greedy decoding (temperature 0); 256 tokens generated per prompt; and we time the full response end-to-end.
The metric is wall-clock tokens/sec — completion_tokens / elapsed. This matters. A runner's own self-reported generation rate can look nearly identical across all three, because the engine is shared. Wall-clock time is the fair cross-runner metric precisely because it counts whatever overhead each layer adds — the HTTP round-trip, the prompt template, the scheduler — not just the model's internal token rate. It measures what actually reaches you.
Reference test rig. Every number on this page was measured on one machine — an Alienware Aurora R7. CPU: Intel Core i7-8700 (6 cores / 12 threads, Coffee Lake). RAM: 64 GB DDR4-2400. GPU: NVIDIA RTX 5060 Ti 16 GB (Blackwell), running at PCIe 3.0 x8. OS: Windows Server 2025 (build 26100). In-VRAM throughput is GPU-memory-bandwidth bound, so the bus, RAM, and SSD speed barely move it — but they would shape offload and CPU results. Running this yourself?
scripts/make_submission.pyauto-captures your rig into the submission.
One honest limitation up front: this is a single-device result. Only the RTX 5060 Ti was measured. We wanted to include the GTX 1080 Ti and a CPU-only leg, but LM Studio's device-pinning isn't scriptable headless — you can't reliably tell it "use GPU 1 only" from a script without clicking through the GUI. So the multi-device matrix is a gap, not a conclusion, and we're collecting other hardware via pull request.
The results
Here's the whole experiment in one table — overall wall-clock throughput on the RTX 5060 Ti, and each runner's overhead against the raw-llama.cpp ceiling:
| Runner | Overall tok/s | vs llama.cpp | Overhead |
|---|---|---|---|
| llama.cpp (direct) | 77.0 | 1.00× | — |
| LM Studio | 76.8 | 1.00× | 0.3% |
| Ollama | 69.1 | 0.90× | 10.3% |
llama.cpp direct hits 77.0 tok/s. LM Studio scores 76.8 — a gap of 0.2 tok/s, or 0.3%, comfortably within measurement noise on a 12-prompt run. Ollama measures 69.1 tok/s, or 10.3% slower than the raw binary.
What's striking is how consistent the Ollama tax is. It isn't a fluke on one task category — it holds across all five:
| Task | llama.cpp | LM Studio | Ollama |
|---|---|---|---|
| Code | 76.9 | 77.5 | 70.5 |
| Math | 76.8 | 77.1 | 69.6 |
| Reasoning | 77.1 | 77.0 | 69.3 |
| Summarization | 76.4 | 75.0 | 65.3 |
| Chat | 78.2 | 77.1 | 69.8 |
LM Studio lands within about 2% of llama.cpp everywhere — sometimes a hair faster, sometimes a hair slower, which is exactly what noise looks like. Ollama sits 8–14% behind on every single category. That consistency is the tell: a per-task fluke would move around. A flat ~10% gap across the board is the signature of a structural cost — something the wrapper does on every request, regardless of what you ask.
What we learned
LM Studio adds ~0% overhead — it's llama.cpp with a GUI
At 76.8 versus 77.0 tok/s, LM Studio's convenience layer costs you three-tenths of a percent. That is within noise. The GUI, the model downloader, and the OpenAI-compatible server are effectively free from a performance standpoint — LM Studio doesn't insert a meaningful scheduling or buffering layer on top of llama.cpp's own server path. If you've been avoiding it because a GUI "must" be slower than the command line, this is your permission to stop worrying. You get the polish at the raw engine's speed.
Ollama adds ~10% overhead — the packaging is real
At 69.1 tok/s, Ollama runs 7.9 tok/s behind raw llama.cpp. The likely sources are exactly the things Ollama exists to provide: its daemon layer (model lifecycle, context multiplexing, request queuing), its prompt-template engine, and potentially a bundled llama.cpp that's behind the upstream release or running different runtime defaults. None of this is unreasonable engineering for what Ollama promises — zero-config model management, a model hub, multi-model serving. But it has a measurable, repeatable cost.
The gap is packaging, not the model or the engine
All three runners loaded the same GGUF file and ran the same GPU kernels. The performance difference is purely what each wrapper decides to do on top. That's an empowering conclusion, because it means the gap is predictable and probably temporary. As Ollama ships newer llama.cpp builds and tunes its defaults, the overhead should narrow. The engine is not the variable — the wrapper is. If you understand that, you can reason about a runner you haven't even benchmarked yet.
So which runner should you pick?
Here's the practical part — what to actually do with this.
Pick raw llama.cpp if you're benchmarking, scripting a pipeline, or squeezing out maximum throughput, and you don't mind managing flags yourself. It's the ceiling at 77.0 tok/s with full control. The tradeoff is friction: no GUI, no model hub, and you wire up -ngl, flash-attention, and context by hand. If you're standing up an OpenAI-compatible endpoint for an app to call, this is also the cleanest target.
Pick LM Studio if you want a graphical app. At 76.8 tok/s — 0.3% behind raw — you get a polished model downloader, a chat interface, and an OpenAI-compatible server, and on this hardware the convenience is essentially free. For most people who want to click rather than configure, this is the sweet spot.
Pick Ollama if you value its ecosystem: pull-by-name model management, the Modelfile system, the Docker-friendly daemon, and the large community hub. The 10% throughput cost is real, but 69.1 tok/s on a 7B model is still fast and perfectly comfortable for interactive use. If Ollama's UX genuinely saves you setup time — and for a lot of workflows it does — that's a fair trade. Just know you're paying it, and know it'll probably shrink with future releases.
One more thing worth saying plainly: a 10% throughput difference is not what should decide this for a casual user. If Ollama's ollama pull workflow gets you running in two minutes and raw llama.cpp would cost you an afternoon of flag-tuning, the afternoon dwarfs the 10%. Optimize for your friction, not the benchmark — the benchmark just tells you the price tag so you can choose with open eyes. For a feature-by-feature breakdown beyond raw speed, see our fuller Ollama vs LM Studio vs llama.cpp comparison, and if you're sizing a machine, what tokens-per-second to expect locally goes deeper on why this workload is bandwidth-bound in the first place.
Why is this workload bandwidth-bound, and why does that matter for the wrapper question? Single-user decoding reads the entire model from memory on every token, so you're limited by memory bandwidth, not raw compute. That's why the engine is hard to beat and why the wrapper is where the only real variance lives — everyone's hitting the same memory wall, so the gap is whatever happens before the request reaches it.
Reproduce this — and send us your hardware
Everything here runs on a stock setup with the same public Qwen2.5-Coder-7B GGUF served three ways. To reproduce it, pin all three runners to the same device (the same model and quant), then run the suite tagged with that device:
# pin each runner to the target device, then:
python scripts/bench.py --device 5060ti
python scripts/aggregate.py && python scripts/inject.py
This is where you come in. This is experiment #2 in our open-source local-LLM benchmark series, and the single-device result above is begging to be filled in. We measured one card; the relative ordering may shift on other hardware — especially if Ollama's bundled llama.cpp has better support for an older or slower device than the build we hit. An RTX 4090? An Apple Silicon Mac (where LM Studio's MLX path comes into play)? An old 1080 Ti? A pure CPU box? Every data point sharpens the picture.
Run scripts/bench.py, let scripts/make_submission.py auto-capture your rig, and open a pull request adding just your results file. Full per-runner pinning recipes and submission instructions live in the experiment's README and the repo's CONTRIBUTING guide. Code is MIT, submitted data is CC BY 4.0, and we'll fold every run into a growing community comparison. If your numbers confirm the 10% tax — or upend it on hardware we didn't test — that's exactly the kind of result worth sharing.
New to running models on your own machine? Start with our guides on running an LLM locally, choosing between Ollama, LM Studio, and llama.cpp, and llama.cpp speculative decoding on consumer GPUs.
