There are at least ten serious ways to run an LLM on your own hardware, and they all describe themselves the same way: "run powerful AI models locally." That's true of all of them and helpful for choosing none of them.
Here's the comparison that actually matters: what each tool is, who it's for, and the handful of real differences hiding under the marketing.
Before reading about runners — check what your hardware can actually run. The tool below detects your GPU and ranks 55 popular models by whether they'll work on your machine:
What LLM Can I Run?
Detect your GPU with one click and see which LLMs your computer can actually run — Llama, Gemma, Qwen, DeepSeek and 50+ more, ranked by whether they fit in your VRAM, need CPU offloading, or won't run at all.
Open the full What LLM Can I Run? →The Mental Model: Engines, Apps, and Servers
Every tool in this space falls into one of three layers, and most confusion comes from comparing across layers:
Engines do the actual math. llama.cpp (C++, runs everywhere) and MLX (Apple's framework for Apple Silicon) are the two that matter for consumer hardware. You can use them directly, but most people don't.
Apps wrap an engine in a friendly experience. Ollama, LM Studio, Jan, GPT4All, and KoboldCpp are all wrappers around llama.cpp (and increasingly MLX on Macs). When you compare "Ollama vs LM Studio performance," you're mostly comparing the same engine with different paint.
Serving systems are built for many simultaneous users. vLLM (and friends like SGLang and TensorRT-LLM) trade simplicity for throughput — they're what you graduate to when your local project becomes a production service.
One sentence from this section worth remembering: Ollama and LM Studio are experience layers; llama.cpp and MLX are engines; vLLM is a serving system.
The Comparison Table
| Tool | Interface | Engine | Best for | API server | Multi-user |
|---|---|---|---|---|---|
| Ollama | CLI | llama.cpp + MLX | Developers, background server | OpenAI-compatible (port 11434) | Limited |
| LM Studio | GUI (+ lms CLI) | llama.cpp + MLX | Model discovery, desktop chat | OpenAI-compatible (port 1234) | Limited |
| llama.cpp | CLI / library | — (it is the engine) | Maximum control, embedded use | Built-in (llama-server) | Limited |
| vLLM | Python / Docker | Custom (PagedAttention) | Production serving | OpenAI-compatible | Excellent |
| Jan | GUI | llama.cpp | Privacy-focused desktop chat | OpenAI-compatible | No |
| GPT4All | GUI | llama.cpp | Document Q&A (built-in RAG) | OpenAI-compatible | No |
| KoboldCpp | GUI (browser) | llama.cpp | Creative writing, roleplay | Own API + OpenAI-compatible | No |
| LocalAI | Server | Multiple backends | API gateway over many backends | OpenAI-compatible | Moderate |
| llamafile | Single executable | llama.cpp | Zero-install portability | Built-in | No |
| mlx-lm | CLI / Python | MLX | Maximum Apple Silicon speed | Basic | No |
Ollama: The Default for Developers
Ollama won the developer mindshare war by copying Docker's interface: ollama pull llama3.1, ollama run llama3.1, done. It runs as a background service with an OpenAI-compatible API, which means every AI app, IDE plugin, and framework that speaks "OpenAI" can point at your machine instead.
What's genuinely good: the three-minute setup; the model library with sane default quantizations; Modelfiles for packaging custom system prompts; and on Macs, the new MLX backend that roughly doubled generation speed on recent M-series chips.
What to know: Ollama defaults to a small context window (often 4K) regardless of what the model supports — the #1 cause of "why is it forgetting my conversation." It also exposes relatively few tuning knobs; that's the price of simplicity.
Pick Ollama if: you're a developer, you want models available to other apps via API, or you want the shortest path from zero to working.
Skip the syntax memorization — this builds your Ollama commands, Modelfiles, and server config interactively:
Ollama Command Builder
Build Ollama commands without memorizing syntax: run, pull, and create commands for any model, complete Modelfiles with parameters and system prompts, and server environment configuration — with built-in VRAM checks.
Open the full Ollama Command Builder →LM Studio: The Best Way to Browse Models
LM Studio is what you show someone who's never run a local model. It's a polished desktop app whose killer feature is the built-in model browser: search Hugging Face from inside the app, and it tells you which quantization of which model will fit your hardware before you download multiple gigabytes.
What's genuinely good: the hardware-aware download recommendations; per-model configuration through a GUI instead of flags; the lms CLI for headless use once you outgrow the GUI; and an OpenAI-compatible local server, so it can do double duty as a development backend.
What to know: it's not open source (free for personal and commercial use, but the code is closed). The GUI also consumes a bit of memory you might prefer to give to the model on RAM-constrained machines.
Pick LM Studio if: you want a GUI, you're not sure which models or quantizations to try, or you're on a Mac and want MLX speed without touching a terminal.
llama.cpp: The Engine Itself
Everything above runs on llama.cpp under the hood. Running it directly gets you: every tuning flag that exists (batch size, RoPE scaling, KV cache quantization, exact GPU layer splitting across mismatched cards), support for hardware nothing else supports, and zero overhead from wrapper layers.
What it costs you: you compile it (or grab releases), manage GGUF files yourself, and read documentation that assumes you know what --rope-freq-scale means.
Pick llama.cpp if: you need a tuning flag the wrappers don't expose, you're embedding inference in your own software, or you're squeezing the last 10% out of unusual hardware.
vLLM: When It's Not Just You Anymore
Everything above is built for one user at a time. vLLM is built for throughput: its PagedAttention memory management and continuous batching serve many concurrent requests from the same GPU — benchmarks consistently show 16-20x Ollama's multi-user throughput, turning 4-second response times under load into 250ms.
What it costs you: a Python environment, a real NVIDIA or AMD GPU, no GGUF support (it uses its own quantization formats like AWQ and GPTQ), and ~30 minutes of setup instead of 3.
Pick vLLM if: you're serving an application with real users, you're batch-processing thousands of documents, or "tokens per second per dollar" appears in your planning documents. Our self-hosted LLM cost calculator assumes vLLM-class serving when computing whether self-hosting beats APIs at high volume.
The Rest of the Field, Honestly
Jan — open-source LM Studio alternative with 160K+ GitHub stars. The most privacy-conscious of the GUI options (offline-first by design). Pick it over LM Studio if open source matters to you.
GPT4All — desktop app whose differentiator is LocalDocs: point it at a folder of PDFs and chat with them, fully offline. The 2026 release added on-device reasoning with tool calling. Pick it for private document Q&A without building a RAG pipeline.
KoboldCpp — purpose-built for creative writing and roleplay, with context-management features (World Info, Author's Note, Memory) that chat-focused tools lack. The fiction-writing community's standard.
LocalAI — not a runner but a router: one OpenAI-compatible endpoint in front of multiple backends (llama.cpp, vLLM, image models, audio models). Pick it when you're orchestrating several model types behind one API.
llamafile — an entire model packed into a single executable. Double-click, chat. No install, no dependencies. Perfect for handing a model to someone on a USB stick; not built for daily driving.
mlx-lm — Apple's own inference tooling. The fastest option on Apple Silicon, but command-line only and minimal. Most Mac users get MLX speed through LM Studio or Ollama instead, which now use it as a backend.
The Decision Framework
Answer three questions:
1. Who's using it?
- Just me, interactively → Ollama (developers) or LM Studio (everyone else)
- Just me, but other apps connect to it → Ollama
- Multiple people or an application → vLLM
2. What hardware?
- Apple Silicon Mac → LM Studio or Ollama (both MLX-accelerated now)
- NVIDIA gaming PC → any of them; Ollama is the easy default
- Server with datacenter GPUs → vLLM
- Potato → llamafile with a small model, or reconsider (check what your machine can run)
3. What's the actual job?
- Coding assistant backend → Ollama
- Exploring what local models can do → LM Studio
- Chat with my documents → GPT4All
- Creative writing → KoboldCpp
- Production API → vLLM
- Embedded in my own product → llama.cpp
The most common correct answer is "Ollama and LM Studio, both" — LM Studio to find and evaluate models, Ollama to serve the winner to your other tools. They coexist fine; just remember each keeps its own copy of downloaded models.
Before You Download Anything
Whichever runner you pick, the constraint that actually determines your experience is hardware — specifically how much GPU/unified memory you have and its bandwidth. A perfect runner with a model that doesn't fit gives you 2 tokens per second of frustration.
Three of our tools answer the hardware questions in order:
- What LLM Can I Run? — detect your GPU, see what fits
- LLM VRAM Calculator — exact memory math for any model, quantization, and context size
- LLM Inference Speed Calculator — what tokens/sec to expect before you download 40 GB to find out
Then grab Ollama or LM Studio, pull the best model from your "runs great" tier, and you're running AI on your own hardware in ten minutes.