This is the hub article for our local-AI series. Each section links to a deep-dive. Start here, then follow the thread that matches your question.
Why run AI locally?
There are four concrete reasons, and none of them are hype.
Privacy. Your prompts and the model's outputs never leave the machine. For source code, customer records, legal documents, or PHI, that is the difference between "we use AI" and "we can't use AI."
Cost. On hardware you already own, inference has no per-token fee. A cloud call to GPT-5.5 costs $5 per million input tokens and $30 per million output; Claude Opus 4.8 is $5 / $25. At volume those numbers compound. The self-hosted LLM cost calculator shows where your break-even against cloud APIs actually falls — it's usually sooner than people expect for high-volume, repetitive workloads.
Control. No rate limits, no surprise deprecations (the cloud lineup churns constantly — o3 and the original GPT-5 are already being retired), no model swapped out from under your prompts. It runs offline, on a plane, in an air-gapped facility.
The honest trade-off: you own the hardware and the ops, and you are capped at what your GPU can hold. A 24 GB RTX 4090 runs an 8B model beautifully but cannot fit a 70B at Q4 without spilling to system RAM (which is slow). Knowing where that ceiling sits is most of the battle — and it's what the rest of this guide is about.
The four pieces of a local-AI stack
Every local-AI setup is the same four layers. Get the mental model and everything else falls into place: you pick a runtime, point it at a model packaged in a format, and run it on your hardware. Each layer constrains the one below it.
1. The runtime (what you install)
The runtime is the program that loads weights into memory and answers requests. The four that matter are Ollama, LM Studio, llama.cpp, and vLLM. They share one superpower: every one of them exposes an OpenAI-compatible API, so the same client code works against any of them (and against the cloud). Full comparison in Ollama vs LM Studio vs llama.cpp; the quick decision table:
| Runtime | Best for | Interface | API port | Native format | Apple Silicon |
|---|---|---|---|---|---|
| Ollama | Fastest start, CLI + scripting | Command line + REST | 11434 (/v1 OpenAI) | GGUF | MLX backend (0.19+, 32 GB+) |
| LM Studio | Polished desktop GUI | App + local server | 1234 (/v1, also /v1/messages) | GGUF | MLX (default when available) |
| llama.cpp | Max control, the raw engine | CLI + llama-server | 8080 (configurable) | GGUF (originated it) | Metal / MLX-backed builds |
| vLLM | High-throughput, many users | Python server | 8000 (/v1 OpenAI) | safetensors / AWQ / GPTQ / FP8 | Not the target (datacenter GPU) |
One gotcha worth internalizing now: Ollama defaults to a 2048-token context window. For real coding or long-document work you must raise it (OLLAMA_CONTEXT_LENGTH=8192 or higher) or the model silently truncates. The ollama-config-generator sets this correctly for you.
2. The model & its format
The model is the weights (Llama 4, DeepSeek-V3, Qwen, Mistral, Gemma). Its size in parameters — 8B, 13B, 70B — is the single biggest driver of how much memory you need. The format is how those weights are packaged on disk. For desktop runtimes that's almost always GGUF, the quantized container originated by the llama.cpp project and now the de-facto standard. vLLM is the exception: it wants HuggingFace safetensors (FP16/BF16, AWQ, GPTQ, FP8) and treats GGUF as experimental, single-file-only, and explicitly recommends you use llama.cpp instead if GGUF is all you have. On Macs there's also Apple's native MLX format, now the high-performance backend under both Ollama and LM Studio.
3. Quantization (the size/quality dial)
Quantization shrinks weights from 16-bit floats down to 8, 6, 5, 4, even 2 bits each. Fewer bits per weight means a smaller file and less VRAM, at some cost to quality. The GGUF naming (Q4_K_M, Q5_K_M, Q8_0…) encodes exactly how aggressive the squeeze is. The headline numbers:
| Quant | Effective bytes/weight | 8B weights | 70B weights | Quality vs FP16 |
|---|---|---|---|---|
| FP16 / BF16 | 2.0 | ~16 GB | ~140 GB | Reference (lossless) |
| Q8_0 | ~1.06 | ~8.5 GB | ~74 GB | Indistinguishable |
| Q6_K | ~0.82 | ~6.6 GB | ~57 GB | Near-lossless |
| Q5_K_M | ~0.71 | ~5.7 GB | ~50 GB | Near-lossless |
| Q4_K_M ⭐ | ~0.55 | ~4.4–4.9 GB | ~38–43 GB | +1–3% perplexity — the default |
| Q3_K_M | ~0.49 | ~3.9 GB | ~34 GB | Usable, noticeably weaker |
| Q2_K | ~0.41 | ~3.3 GB | ~29 GB | Clearly degraded (last resort) |
Q4_K_M is the right default for almost everyone — about +3.3% perplexity on Llama-3.1-8B (ppl ~7.56 vs ~7.32 FP16) for roughly a quarter of the FP16 size. Quality stays mild down to ~4-bit, then degrades sharply below it. Full decoder ring (what _K, _S/_M/_L, and the IQ i-quants mean) in the LLM quantization deep-dive.
4. The hardware (will it run?)
This is the question everyone actually has. Total memory needed is weights + KV cache + ~1–2 GB overhead, and the weights number comes straight from the table above. The fastest way to answer it: run what-llm-can-i-run — it detects your GPU in one click and tells you which models fit — or punch a specific model and GPU into the llm-vram-calculator. Both are covered in the How much VRAM do you need? deep-dive.
How to actually get started
The 10-minute path on most machines: install Ollama, pull a model, chat. Raise the context window first so you don't get silently truncated.
# 1. install (macOS/Linux) and pull an 8B model in Q4_K_M
ollama pull llama3.1:8b
# 2. give it a real context window, not the 2048 default
OLLAMA_CONTEXT_LENGTH=8192 ollama serve &
# 3. talk to it — or hit the OpenAI-compatible API on :11434/v1
ollama run llama3.1:8b "Summarize the attached changelog in five bullets."
If you'd rather not memorize flags, the ollama-command-builder generates the exact run/pull/create commands and Modelfiles for you. Prefer a GUI with no terminal at all? Install LM Studio, search a model, click download, click load. Want to try before installing anything? The local-ai-chat tool runs a model entirely in your browser via WebGPU. Step-by-step walkthrough in How to run an LLM locally.
Sizing your hardware
Weights are the easy part. The piece people forget is the KV cache — the memory that holds attention state for every token in the context window, and it grows linearly with context length:
kv_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_elem
For a Llama-3 70B-class model (80 layers, 8 KV heads via GQA, head_dim 128, FP16 KV) that's ~0.3125 MB per token:
| Context | KV cache (FP16) | KV cache (FP8/INT8) |
|---|---|---|
| 4K | ~1.25 GB | ~0.625 GB |
| 32K | ~10 GB | ~5 GB |
| 128K | ~40 GB | ~20 GB |
At 128K context the KV cache alone (~40 GB) is as large as the quantized weights. Two levers tame it: GQA (grouped-query attention — Llama-3 70B's 64 query heads collapse to 8 KV heads, an 8× reduction; without it, 128K would need ~320 GB) and quantized KV cache (FP8/INT8 K and V halves it again). On a Mac, unified memory changes the math entirely — the GPU can address up to 128 GB, so an M4 Max can hold a 70B model that no single consumer NVIDIA card can. Use the llm-vram-calculator to combine weights + KV + overhead for your exact context length; how much VRAM walks the formula end to end.
When one GPU isn't enough
When a model won't fit one card, you have three escalating options.
Multiple GPUs in one box. Text transformers shard cleanly because they're a stack of identical decoder layers — big, regular matmuls with an obvious split dimension. llama.cpp's --split-mode does this (layer splits layers across GPUs, row/tensor splits the weight matrices); vLLM does tensor parallelism natively. Worth knowing: this is why LLMs split well and diffusion image/video models don't — their U-Net graphs have skip connections and a serial denoise loop that wreck tensor parallelism. Details in splitting LLM models across GPUs.
Multiple machines. llama.cpp's RPC backend pipeline-parallelizes a model across N hosts; exo clusters heterogeneous devices (Macs, PCs, even Raspberry Pis) peer-to-peer, pooling memory to fit models too large for any one box — e.g. DeepSeek-V3 671B at 8-bit across 8× M4 Mac minis. The catch: clustering buys you capacity, not speed (throughput is bounded by the slowest link). See clustering machines for local AI.
The operational answer. Hardware aside, the real production question is: how do my apps reach this? The clean answer is one OpenAI-compatible endpoint in front of your hardware. Every runtime here speaks that protocol, so you point your app at a single URL and swap models underneath. When you outgrow a single node — you need failover, edge caching, or burst capacity for the occasional hard request — that endpoint is where a gateway like WideAreaAI (an edge-first AI gateway that routes each request: edge cache → your own llama.cpp hardware → cloud burst failover) fits in. Own your baseline, burst to the cloud — edge-first, cloud when you choose. The endpoint pattern itself is covered in an OpenAI-compatible endpoint for your local LLM and edge caching for LLM requests.
What performance to expect
Decode speed is memory-bandwidth-bound, not FLOPs-bound. The mental model:
tok/s ≈ memory_bandwidth ÷ active_bytes_read_per_token (real-world ≈ 50–80% of this)
A dense model reads roughly all its weight bytes per token, so doubling bandwidth roughly doubles tok/s. Ballpark single-stream decode (Q4):
| GPU | Bandwidth | 8B Q4 (~5 GB) | 70B Q4 (~40 GB) |
|---|---|---|---|
| RTX 4090 (24 GB) | ~1008 GB/s | ~90–140 tok/s | doesn't fit → offload ~15–20 |
| RTX 5090 (32 GB) | ~1792 GB/s | ~140–220 tok/s | needs 2× → ~30–45 |
| A100 80GB | ~2039 GB/s | ~120–200 tok/s | ~25–40 tok/s |
| H100 80GB | ~3350 GB/s | ~180–280 tok/s | ~40–60 tok/s |
| Apple M4 Max | ~546 GB/s | ~40–80 tok/s | ~8–13 tok/s |
These are order-of-magnitude estimates — actual numbers swing with engine, batch size, flash-attention, quant kernel, and context length. To pin yours down, the llm-inference-speed-calculator estimates tok/s from bandwidth/model/quant, and llm-gpu-benchmark measures your real WebGPU bandwidth and throughput in-browser. Deep-dive: what performance to expect from a local LLM.
Local AI for businesses & regulated industries
For an MSP, a law firm, a clinic, or a financial-services team, "the prompt never leaves our network" is not a feature — it's the entry ticket. On-prem local AI keeps PHI, privileged documents, and source code inside your perimeter, satisfies data-residency requirements, and removes the per-token meter for high-volume internal workloads. The realistic architecture for most regulated shops is hybrid: a local baseline on owned hardware for the 95% of requests that are routine, with a deliberate, audited path to a cloud model for the hard 5% — rather than sending everything to a third party by default.
That hybrid pattern is exactly where an edge-first AI gateway earns its place: WideAreaAI gives your apps one OpenAI-compatible LLM endpoint and routes each request edge cache → your own GPU/llama.cpp node (a markup-free baseline with no per-token fees on hardware you own) → cloud burst only when you choose. It does request-level routing, failover, and edge caching across whole nodes — not model-splitting across machines. Full compliance framing in on-prem AI for regulated industries.
The local-AI toolkit (free tools)
Every tool below runs in your browser — no signup, nothing uploaded. They map directly onto the four layers above.
| Tool | What it answers |
|---|---|
| what-llm-can-i-run | One-click GPU detection → which LLMs you can run |
| llm-vram-calculator | VRAM any model needs; GPU fit, multi-GPU, Apple Silicon |
| fine-tuning-vram-calculator | VRAM for full / LoRA / QLoRA training |
| llm-inference-speed-calculator | Estimate tok/s from bandwidth/model/quant; compare GPUs |
| llm-gpu-benchmark | Real in-browser WebGPU bandwidth + tokens/sec |
| self-hosted-llm-cost-calculator | Cloud API cost vs own-hardware break-even |
| llm-token-counter | Token counts + cost + context-fit across models |
| ollama-command-builder | Build ollama run/pull/create commands + Modelfiles |
| ollama-config-generator | Modelfile + server env + OpenAI-compatible snippet |
| local-ai-chat | In-browser WebGPU chat (try a model with zero install) |
| private-ai-summarizer | In-browser WebGPU summarizer (text never leaves the tab) |
Conclusion
Local AI comes down to four layers — runtime, model, format, hardware — and one question at each: what do I install, which weights, how aggressively quantized, and will it fit. Get those right and you have private, unmetered, offline AI on hardware you control, with a clean OpenAI-compatible endpoint your apps already know how to call.
Pick your next thread:
- Choosing a runtime? → Ollama vs LM Studio vs llama.cpp
- Sizing hardware? → How much VRAM do you need? + llm-vram-calculator
- Squeezing a model smaller? → LLM quantization explained
- Just want it running? → How to run an LLM locally
- Scaling past one GPU? → Splitting across GPUs · Clustering machines
- Putting it in production? → An OpenAI-compatible endpoint · On-prem AI for regulated industries