Why run an LLM locally instead of using a cloud API?

Privacy (prompts and outputs never leave your machine), zero per-token cost once you own the hardware, no rate limits, offline operation, and full data-residency/compliance control. The trade-offs are real: you manage the hardware, and you cap out at whatever your GPU or unified memory can hold. A 24 GB card runs an 8B model comfortably but won't fit a 70B at Q4 without offloading or a second GPU.

What do I actually need to run a local LLM?

Three things: a runtime (Ollama, LM Studio, llama.cpp, or vLLM), a model file in a supported format (usually GGUF for desktop runtimes, safetensors for vLLM), and enough VRAM or unified memory for the model weights plus the KV cache for your context length plus 1–2 GB of runtime overhead. Use the VRAM calculator and 'What LLM Can I Run?' tools to size it before you download a 40 GB file.

How much VRAM do I need?

Roughly: (parameters × bytes-per-weight from your quantization) + KV cache for your context + ~1–2 GB overhead. An 8B model at Q4_K_M is about 4.4–4.9 GB of weights; a 70B at Q4_K_M is ~38–43 GB and needs about 48 GB total at modest context — so a single 48 GB card or two 24 GB GPUs. The VRAM article and llm-vram-calculator do this math for any model and GPU.

Is a local LLM as good as GPT-5.5 or Claude Opus?

On the hardest reasoning and coding tasks, frontier cloud models still lead. But strong open-weight models like Llama 4 and DeepSeek-V3 are genuinely capable for summarization, extraction, classification, RAG, and most day-to-day work. The local win is privacy, cost, and control — and a hybrid setup (local baseline, cloud burst for the hard 5%) captures both sides.

Which runtime should I start with?

Ollama if you want a one-command CLI and an OpenAI-compatible API on port 11434. LM Studio if you want a polished desktop GUI. llama.cpp if you want maximum control and the raw engine. vLLM if you're serving many concurrent users on datacenter GPUs. Most people should start with Ollama or LM Studio and graduate to llama.cpp or vLLM only when they hit a wall.

Does running AI locally work on a Mac?

Yes, and Apple Silicon is one of the best value paths because unified memory lets the GPU address up to 128 GB. An M4 Max runs an 8B model at ~40–80 tok/s and can fit a 70B in memory (at ~8–13 tok/s). Both Ollama (0.19+) and LM Studio now use Apple's MLX backend on Apple Silicon for a meaningful speedup over the old Metal path.

Running Local AI: The Complete Guide to Self-Hosting LLMs on Your Own Hardware

This is the hub article for our local-AI series. Each section links to a deep-dive. Start here, then follow the thread that matches your question.

Why run AI locally?

There are four concrete reasons, and none of them are hype.

Privacy. Your prompts and the model's outputs never leave the machine. For source code, customer records, legal documents, or PHI, that is the difference between "we use AI" and "we can't use AI."

Cost. On hardware you already own, inference has no per-token fee. A cloud call to GPT-5.5 costs $5 per million input tokens and $30 per million output; Claude Opus 4.8 is $5 / $25. At volume those numbers compound. The self-hosted LLM cost calculator shows where your break-even against cloud APIs actually falls — it's usually sooner than people expect for high-volume, repetitive workloads.

Control. No rate limits, no surprise deprecations (the cloud lineup churns constantly — o3 and the original GPT-5 are already being retired), no model swapped out from under your prompts. It runs offline, on a plane, in an air-gapped facility.

The honest trade-off: you own the hardware and the ops, and you are capped at what your GPU can hold. A 24 GB RTX 4090 runs an 8B model beautifully but cannot fit a 70B at Q4 without spilling to system RAM (which is slow). Knowing where that ceiling sits is most of the battle — and it's what the rest of this guide is about.

The four pieces of a local-AI stack

Every local-AI setup is the same four layers. Get the mental model and everything else falls into place: you pick a runtime, point it at a model packaged in a format, and run it on your hardware. Each layer constrains the one below it.

1. The runtime (what you install)

The runtime is the program that loads weights into memory and answers requests. The four that matter are Ollama, LM Studio, llama.cpp, and vLLM. They share one superpower: every one of them exposes an OpenAI-compatible API, so the same client code works against any of them (and against the cloud). Full comparison in Ollama vs LM Studio vs llama.cpp; the quick decision table:

Runtime	Best for	Interface	API port	Native format	Apple Silicon
Ollama	Fastest start, CLI + scripting	Command line + REST	11434 (`/v1` OpenAI)	GGUF	MLX backend (0.19+, 32 GB+)
LM Studio	Polished desktop GUI	App + local server	1234 (`/v1`, also `/v1/messages`)	GGUF	MLX (default when available)
llama.cpp	Max control, the raw engine	CLI + `llama-server`	8080 (configurable)	GGUF (originated it)	Metal / MLX-backed builds
vLLM	High-throughput, many users	Python server	8000 (`/v1` OpenAI)	safetensors / AWQ / GPTQ / FP8	Not the target (datacenter GPU)

One gotcha worth internalizing now: Ollama defaults to a 2048-token context window. For real coding or long-document work you must raise it (OLLAMA_CONTEXT_LENGTH=8192 or higher) or the model silently truncates. The ollama-config-generator sets this correctly for you.

2. The model & its format

The model is the weights (Llama 4, DeepSeek-V3, Qwen, Mistral, Gemma). Its size in parameters — 8B, 13B, 70B — is the single biggest driver of how much memory you need. The format is how those weights are packaged on disk. For desktop runtimes that's almost always GGUF, the quantized container originated by the llama.cpp project and now the de-facto standard. vLLM is the exception: it wants HuggingFace safetensors (FP16/BF16, AWQ, GPTQ, FP8) and treats GGUF as experimental, single-file-only, and explicitly recommends you use llama.cpp instead if GGUF is all you have. On Macs there's also Apple's native MLX format, now the high-performance backend under both Ollama and LM Studio.

3. Quantization (the size/quality dial)

Quantization shrinks weights from 16-bit floats down to 8, 6, 5, 4, even 2 bits each. Fewer bits per weight means a smaller file and less VRAM, at some cost to quality. The GGUF naming (Q4_K_M, Q5_K_M, Q8_0…) encodes exactly how aggressive the squeeze is. The headline numbers:

Quant	Effective bytes/weight	8B weights	70B weights	Quality vs FP16
FP16 / BF16	2.0	~16 GB	~140 GB	Reference (lossless)
Q8_0	~1.06	~8.5 GB	~74 GB	Indistinguishable
Q6_K	~0.82	~6.6 GB	~57 GB	Near-lossless
Q5_K_M	~0.71	~5.7 GB	~50 GB	Near-lossless
Q4_K_M ⭐	~0.55	~4.4–4.9 GB	~38–43 GB	+1–3% perplexity — the default
Q3_K_M	~0.49	~3.9 GB	~34 GB	Usable, noticeably weaker
Q2_K	~0.41	~3.3 GB	~29 GB	Clearly degraded (last resort)

Q4_K_M is the right default for almost everyone — about +3.3% perplexity on Llama-3.1-8B (ppl ~7.56 vs ~7.32 FP16) for roughly a quarter of the FP16 size. Quality stays mild down to ~4-bit, then degrades sharply below it. Full decoder ring (what _K, _S/_M/_L, and the IQ i-quants mean) in the LLM quantization deep-dive.

4. The hardware (will it run?)

This is the question everyone actually has. Total memory needed is weights + KV cache + ~1–2 GB overhead, and the weights number comes straight from the table above. The fastest way to answer it: run what-llm-can-i-run — it detects your GPU in one click and tells you which models fit — or punch a specific model and GPU into the llm-vram-calculator. Both are covered in the How much VRAM do you need? deep-dive.

How to actually get started

The 10-minute path on most machines: install Ollama, pull a model, chat. Raise the context window first so you don't get silently truncated.

# 1. install (macOS/Linux) and pull an 8B model in Q4_K_M
ollama pull llama3.1:8b

# 2. give it a real context window, not the 2048 default
OLLAMA_CONTEXT_LENGTH=8192 ollama serve &

# 3. talk to it — or hit the OpenAI-compatible API on :11434/v1
ollama run llama3.1:8b "Summarize the attached changelog in five bullets."

If you'd rather not memorize flags, the ollama-command-builder generates the exact run/pull/create commands and Modelfiles for you. Prefer a GUI with no terminal at all? Install LM Studio, search a model, click download, click load. Want to try before installing anything? The local-ai-chat tool runs a model entirely in your browser via WebGPU. Step-by-step walkthrough in How to run an LLM locally.

Sizing your hardware

Weights are the easy part. The piece people forget is the KV cache — the memory that holds attention state for every token in the context window, and it grows linearly with context length:

kv_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_elem

For a Llama-3 70B-class model (80 layers, 8 KV heads via GQA, head_dim 128, FP16 KV) that's ~0.3125 MB per token:

Context	KV cache (FP16)	KV cache (FP8/INT8)
4K	~1.25 GB	~0.625 GB
32K	~10 GB	~5 GB
128K	~40 GB	~20 GB

At 128K context the KV cache alone (~40 GB) is as large as the quantized weights. Two levers tame it: GQA (grouped-query attention — Llama-3 70B's 64 query heads collapse to 8 KV heads, an 8× reduction; without it, 128K would need ~320 GB) and quantized KV cache (FP8/INT8 K and V halves it again). On a Mac, unified memory changes the math entirely — the GPU can address up to 128 GB, so an M4 Max can hold a 70B model that no single consumer NVIDIA card can. Use the llm-vram-calculator to combine weights + KV + overhead for your exact context length; how much VRAM walks the formula end to end.

When one GPU isn't enough

When a model won't fit one card, you have three escalating options.

Multiple GPUs in one box. Text transformers shard cleanly because they're a stack of identical decoder layers — big, regular matmuls with an obvious split dimension. llama.cpp's --split-mode does this (layer splits layers across GPUs, row/tensor splits the weight matrices); vLLM does tensor parallelism natively. Worth knowing: this is why LLMs split well and diffusion image/video models don't — their U-Net graphs have skip connections and a serial denoise loop that wreck tensor parallelism. Details in splitting LLM models across GPUs.

Multiple machines. llama.cpp's RPC backend pipeline-parallelizes a model across N hosts; exo clusters heterogeneous devices (Macs, PCs, even Raspberry Pis) peer-to-peer, pooling memory to fit models too large for any one box — e.g. DeepSeek-V3 671B at 8-bit across 8× M4 Mac minis. The catch: clustering buys you capacity, not speed (throughput is bounded by the slowest link). See clustering machines for local AI.

The operational answer. Hardware aside, the real production question is: how do my apps reach this? The clean answer is one OpenAI-compatible endpoint in front of your hardware. Every runtime here speaks that protocol, so you point your app at a single URL and swap models underneath. When you outgrow a single node — you need failover, edge caching, or burst capacity for the occasional hard request — that endpoint is where a gateway like WideAreaAI (an edge-first AI gateway that routes each request: edge cache → your own llama.cpp hardware → cloud burst failover) fits in. Own your baseline, burst to the cloud — edge-first, cloud when you choose. The endpoint pattern itself is covered in an OpenAI-compatible endpoint for your local LLM and edge caching for LLM requests.

What performance to expect

Decode speed is memory-bandwidth-bound, not FLOPs-bound. The mental model:

tok/s ≈ memory_bandwidth ÷ active_bytes_read_per_token   (real-world ≈ 50–80% of this)

A dense model reads roughly all its weight bytes per token, so doubling bandwidth roughly doubles tok/s. Ballpark single-stream decode (Q4):

GPU	Bandwidth	8B Q4 (~5 GB)	70B Q4 (~40 GB)
RTX 4090 (24 GB)	~1008 GB/s	~90–140 tok/s	doesn't fit → offload ~15–20
RTX 5090 (32 GB)	~1792 GB/s	~140–220 tok/s	needs 2× → ~30–45
A100 80GB	~2039 GB/s	~120–200 tok/s	~25–40 tok/s
H100 80GB	~3350 GB/s	~180–280 tok/s	~40–60 tok/s
Apple M4 Max	~546 GB/s	~40–80 tok/s	~8–13 tok/s

These are order-of-magnitude estimates — actual numbers swing with engine, batch size, flash-attention, quant kernel, and context length. To pin yours down, the llm-inference-speed-calculator estimates tok/s from bandwidth/model/quant, and llm-gpu-benchmark measures your real WebGPU bandwidth and throughput in-browser. Deep-dive: what performance to expect from a local LLM.

Local AI for businesses & regulated industries

For an MSP, a law firm, a clinic, or a financial-services team, "the prompt never leaves our network" is not a feature — it's the entry ticket. On-prem local AI keeps PHI, privileged documents, and source code inside your perimeter, satisfies data-residency requirements, and removes the per-token meter for high-volume internal workloads. The realistic architecture for most regulated shops is hybrid: a local baseline on owned hardware for the 95% of requests that are routine, with a deliberate, audited path to a cloud model for the hard 5% — rather than sending everything to a third party by default.

That hybrid pattern is exactly where an edge-first AI gateway earns its place: WideAreaAI gives your apps one OpenAI-compatible LLM endpoint and routes each request edge cache → your own GPU/llama.cpp node (a markup-free baseline with no per-token fees on hardware you own) → cloud burst only when you choose. It does request-level routing, failover, and edge caching across whole nodes — not model-splitting across machines. Full compliance framing in on-prem AI for regulated industries.

The local-AI toolkit (free tools)

Every tool below runs in your browser — no signup, nothing uploaded. They map directly onto the four layers above.

Tool	What it answers
what-llm-can-i-run	One-click GPU detection → which LLMs you can run
llm-vram-calculator	VRAM any model needs; GPU fit, multi-GPU, Apple Silicon
fine-tuning-vram-calculator	VRAM for full / LoRA / QLoRA training
llm-inference-speed-calculator	Estimate tok/s from bandwidth/model/quant; compare GPUs
llm-gpu-benchmark	Real in-browser WebGPU bandwidth + tokens/sec
self-hosted-llm-cost-calculator	Cloud API cost vs own-hardware break-even
llm-token-counter	Token counts + cost + context-fit across models
ollama-command-builder	Build `ollama run`/`pull`/`create` commands + Modelfiles
ollama-config-generator	Modelfile + server env + OpenAI-compatible snippet
local-ai-chat	In-browser WebGPU chat (try a model with zero install)
private-ai-summarizer	In-browser WebGPU summarizer (text never leaves the tab)

Conclusion

Local AI comes down to four layers — runtime, model, format, hardware — and one question at each: what do I install, which weights, how aggressively quantized, and will it fit. Get those right and you have private, unmetered, offline AI on hardware you control, with a clean OpenAI-compatible endpoint your apps already know how to call.

Pick your next thread:

Choosing a runtime? → Ollama vs LM Studio vs llama.cpp
Sizing hardware? → How much VRAM do you need? + llm-vram-calculator
Squeezing a model smaller? → LLM quantization explained
Just want it running? → How to run an LLM locally
Scaling past one GPU? → Splitting across GPUs · Clustering machines
Putting it in production? → An OpenAI-compatible endpoint · On-prem AI for regulated industries

Running Local AI: The Complete Guide to Self-Hosting LLMs on Your Own Hardware

Why run AI locally?

The four pieces of a local-AI stack

1. The runtime (what you install)

2. The model & its format

3. Quantization (the size/quality dial)

4. The hardware (will it run?)

How to actually get started

Sizing your hardware

When one GPU isn't enough

What performance to expect

Local AI for businesses & regulated industries

The local-AI toolkit (free tools)

Conclusion

Frequently Asked Questions

Let's turn this knowledge into action

What LLM Can I Run?

LLM VRAM Calculator

Self-Hosted LLM Cost Calculator

Ollama vs LM Studio vs llama.cpp: Which Local LLM Runner Should You Use?

Run Qwen3-Coder 100% Locally with Ollama (Hardware Requirements + Step-by-Step)

Understanding LLM Tokens: How AI Models Count Words

How to Run an LLM Locally: A Step-by-Step Guide for Beginners

What Is GGUF? Local AI Model Formats Explained (GGUF, safetensors, MLX, GPTQ, AWQ)

LLM Quantization Explained: How to Shrink Models Without Wrecking Quality

Running Local AI: The Complete Guide to Self-Hosting LLMs on Your Own Hardware

Why run AI locally?

The four pieces of a local-AI stack

1. The runtime (what you install)

2. The model & its format

3. Quantization (the size/quality dial)

4. The hardware (will it run?)

How to actually get started

Sizing your hardware

When one GPU isn't enough

What performance to expect

Local AI for businesses & regulated industries

The local-AI toolkit (free tools)

Conclusion

Frequently Asked Questions

Let's turn this knowledge into action

Related Tools

What LLM Can I Run?

LLM VRAM Calculator

Self-Hosted LLM Cost Calculator

Related Articles

Ollama vs LM Studio vs llama.cpp: Which Local LLM Runner Should You Use?

Run Qwen3-Coder 100% Locally with Ollama (Hardware Requirements + Step-by-Step)

Understanding LLM Tokens: How AI Models Count Words

How to Run an LLM Locally: A Step-by-Step Guide for Beginners

What Is GGUF? Local AI Model Formats Explained (GGUF, safetensors, MLX, GPTQ, AWQ)

LLM Quantization Explained: How to Shrink Models Without Wrecking Quality