Skip to main content
Home/Blog/Running Local AI: The Complete Guide to Self-Hosting LLMs on Your Own Hardware
Artificial Intelligence

Running Local AI: The Complete Guide to Self-Hosting LLMs on Your Own Hardware

Everything you need to run large language models on hardware you own — runtimes, model formats, quantization, VRAM math, multi-GPU, Apple Silicon, and how to serve it all behind one endpoint. The hub for our local-AI series.

By InventiveHQ Team

This is the hub article for our local-AI series. Each section links to a deep-dive. Start here, then follow the thread that matches your question.

Why run AI locally?

There are four concrete reasons, and none of them are hype.

Privacy. Your prompts and the model's outputs never leave the machine. For source code, customer records, legal documents, or PHI, that is the difference between "we use AI" and "we can't use AI."

Cost. On hardware you already own, inference has no per-token fee. A cloud call to GPT-5.5 costs $5 per million input tokens and $30 per million output; Claude Opus 4.8 is $5 / $25. At volume those numbers compound. The self-hosted LLM cost calculator shows where your break-even against cloud APIs actually falls — it's usually sooner than people expect for high-volume, repetitive workloads.

Control. No rate limits, no surprise deprecations (the cloud lineup churns constantly — o3 and the original GPT-5 are already being retired), no model swapped out from under your prompts. It runs offline, on a plane, in an air-gapped facility.

The honest trade-off: you own the hardware and the ops, and you are capped at what your GPU can hold. A 24 GB RTX 4090 runs an 8B model beautifully but cannot fit a 70B at Q4 without spilling to system RAM (which is slow). Knowing where that ceiling sits is most of the battle — and it's what the rest of this guide is about.

The four pieces of a local-AI stack

Every local-AI setup is the same four layers. Get the mental model and everything else falls into place: you pick a runtime, point it at a model packaged in a format, and run it on your hardware. Each layer constrains the one below it.

The four-layer local-AI stack: Runtime selects a Model, packaged in a Format, executed on Hardware 1 · Runtime Ollama · LM Studio · llama.cpp · vLLM — what you install & the API you call ↓ loads 2 · Model Llama 4 · DeepSeek-V3 · Qwen · Mistral · Gemma — the weights & their size (params) ↓ packaged as 3 · Format + Quantization GGUF (Q4_K_M…Q8_0) · safetensors · MLX — the size/quality dial ↓ runs on 4 · Hardware NVIDIA GPU VRAM · Apple Silicon unified memory · CPU + RAM

1. The runtime (what you install)

The runtime is the program that loads weights into memory and answers requests. The four that matter are Ollama, LM Studio, llama.cpp, and vLLM. They share one superpower: every one of them exposes an OpenAI-compatible API, so the same client code works against any of them (and against the cloud). Full comparison in Ollama vs LM Studio vs llama.cpp; the quick decision table:

RuntimeBest forInterfaceAPI portNative formatApple Silicon
OllamaFastest start, CLI + scriptingCommand line + REST11434 (/v1 OpenAI)GGUFMLX backend (0.19+, 32 GB+)
LM StudioPolished desktop GUIApp + local server1234 (/v1, also /v1/messages)GGUFMLX (default when available)
llama.cppMax control, the raw engineCLI + llama-server8080 (configurable)GGUF (originated it)Metal / MLX-backed builds
vLLMHigh-throughput, many usersPython server8000 (/v1 OpenAI)safetensors / AWQ / GPTQ / FP8Not the target (datacenter GPU)

One gotcha worth internalizing now: Ollama defaults to a 2048-token context window. For real coding or long-document work you must raise it (OLLAMA_CONTEXT_LENGTH=8192 or higher) or the model silently truncates. The ollama-config-generator sets this correctly for you.

2. The model & its format

The model is the weights (Llama 4, DeepSeek-V3, Qwen, Mistral, Gemma). Its size in parameters — 8B, 13B, 70B — is the single biggest driver of how much memory you need. The format is how those weights are packaged on disk. For desktop runtimes that's almost always GGUF, the quantized container originated by the llama.cpp project and now the de-facto standard. vLLM is the exception: it wants HuggingFace safetensors (FP16/BF16, AWQ, GPTQ, FP8) and treats GGUF as experimental, single-file-only, and explicitly recommends you use llama.cpp instead if GGUF is all you have. On Macs there's also Apple's native MLX format, now the high-performance backend under both Ollama and LM Studio.

3. Quantization (the size/quality dial)

Quantization shrinks weights from 16-bit floats down to 8, 6, 5, 4, even 2 bits each. Fewer bits per weight means a smaller file and less VRAM, at some cost to quality. The GGUF naming (Q4_K_M, Q5_K_M, Q8_0…) encodes exactly how aggressive the squeeze is. The headline numbers:

QuantEffective bytes/weight8B weights70B weightsQuality vs FP16
FP16 / BF162.0~16 GB~140 GBReference (lossless)
Q8_0~1.06~8.5 GB~74 GBIndistinguishable
Q6_K~0.82~6.6 GB~57 GBNear-lossless
Q5_K_M~0.71~5.7 GB~50 GBNear-lossless
Q4_K_M~0.55~4.4–4.9 GB~38–43 GB+1–3% perplexity — the default
Q3_K_M~0.49~3.9 GB~34 GBUsable, noticeably weaker
Q2_K~0.41~3.3 GB~29 GBClearly degraded (last resort)

Q4_K_M is the right default for almost everyone — about +3.3% perplexity on Llama-3.1-8B (ppl ~7.56 vs ~7.32 FP16) for roughly a quarter of the FP16 size. Quality stays mild down to ~4-bit, then degrades sharply below it. Full decoder ring (what _K, _S/_M/_L, and the IQ i-quants mean) in the LLM quantization deep-dive.

4. The hardware (will it run?)

This is the question everyone actually has. Total memory needed is weights + KV cache + ~1–2 GB overhead, and the weights number comes straight from the table above. The fastest way to answer it: run what-llm-can-i-run — it detects your GPU in one click and tells you which models fit — or punch a specific model and GPU into the llm-vram-calculator. Both are covered in the How much VRAM do you need? deep-dive.

How to actually get started

The 10-minute path on most machines: install Ollama, pull a model, chat. Raise the context window first so you don't get silently truncated.

# 1. install (macOS/Linux) and pull an 8B model in Q4_K_M
ollama pull llama3.1:8b

# 2. give it a real context window, not the 2048 default
OLLAMA_CONTEXT_LENGTH=8192 ollama serve &

# 3. talk to it — or hit the OpenAI-compatible API on :11434/v1
ollama run llama3.1:8b "Summarize the attached changelog in five bullets."

If you'd rather not memorize flags, the ollama-command-builder generates the exact run/pull/create commands and Modelfiles for you. Prefer a GUI with no terminal at all? Install LM Studio, search a model, click download, click load. Want to try before installing anything? The local-ai-chat tool runs a model entirely in your browser via WebGPU. Step-by-step walkthrough in How to run an LLM locally.

Sizing your hardware

Weights are the easy part. The piece people forget is the KV cache — the memory that holds attention state for every token in the context window, and it grows linearly with context length:

kv_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_elem

For a Llama-3 70B-class model (80 layers, 8 KV heads via GQA, head_dim 128, FP16 KV) that's ~0.3125 MB per token:

ContextKV cache (FP16)KV cache (FP8/INT8)
4K~1.25 GB~0.625 GB
32K~10 GB~5 GB
128K~40 GB~20 GB

At 128K context the KV cache alone (~40 GB) is as large as the quantized weights. Two levers tame it: GQA (grouped-query attention — Llama-3 70B's 64 query heads collapse to 8 KV heads, an 8× reduction; without it, 128K would need ~320 GB) and quantized KV cache (FP8/INT8 K and V halves it again). On a Mac, unified memory changes the math entirely — the GPU can address up to 128 GB, so an M4 Max can hold a 70B model that no single consumer NVIDIA card can. Use the llm-vram-calculator to combine weights + KV + overhead for your exact context length; how much VRAM walks the formula end to end.

When one GPU isn't enough

When a model won't fit one card, you have three escalating options.

Multiple GPUs in one box. Text transformers shard cleanly because they're a stack of identical decoder layers — big, regular matmuls with an obvious split dimension. llama.cpp's --split-mode does this (layer splits layers across GPUs, row/tensor splits the weight matrices); vLLM does tensor parallelism natively. Worth knowing: this is why LLMs split well and diffusion image/video models don't — their U-Net graphs have skip connections and a serial denoise loop that wreck tensor parallelism. Details in splitting LLM models across GPUs.

Multiple machines. llama.cpp's RPC backend pipeline-parallelizes a model across N hosts; exo clusters heterogeneous devices (Macs, PCs, even Raspberry Pis) peer-to-peer, pooling memory to fit models too large for any one box — e.g. DeepSeek-V3 671B at 8-bit across 8× M4 Mac minis. The catch: clustering buys you capacity, not speed (throughput is bounded by the slowest link). See clustering machines for local AI.

The operational answer. Hardware aside, the real production question is: how do my apps reach this? The clean answer is one OpenAI-compatible endpoint in front of your hardware. Every runtime here speaks that protocol, so you point your app at a single URL and swap models underneath. When you outgrow a single node — you need failover, edge caching, or burst capacity for the occasional hard request — that endpoint is where a gateway like WideAreaAI (an edge-first AI gateway that routes each request: edge cache → your own llama.cpp hardware → cloud burst failover) fits in. Own your baseline, burst to the cloud — edge-first, cloud when you choose. The endpoint pattern itself is covered in an OpenAI-compatible endpoint for your local LLM and edge caching for LLM requests.

What performance to expect

Decode speed is memory-bandwidth-bound, not FLOPs-bound. The mental model:

tok/s ≈ memory_bandwidth ÷ active_bytes_read_per_token   (real-world ≈ 50–80% of this)

A dense model reads roughly all its weight bytes per token, so doubling bandwidth roughly doubles tok/s. Ballpark single-stream decode (Q4):

GPUBandwidth8B Q4 (~5 GB)70B Q4 (~40 GB)
RTX 4090 (24 GB)~1008 GB/s~90–140 tok/sdoesn't fit → offload ~15–20
RTX 5090 (32 GB)~1792 GB/s~140–220 tok/sneeds 2× → ~30–45
A100 80GB~2039 GB/s~120–200 tok/s~25–40 tok/s
H100 80GB~3350 GB/s~180–280 tok/s~40–60 tok/s
Apple M4 Max~546 GB/s~40–80 tok/s~8–13 tok/s

These are order-of-magnitude estimates — actual numbers swing with engine, batch size, flash-attention, quant kernel, and context length. To pin yours down, the llm-inference-speed-calculator estimates tok/s from bandwidth/model/quant, and llm-gpu-benchmark measures your real WebGPU bandwidth and throughput in-browser. Deep-dive: what performance to expect from a local LLM.

Local AI for businesses & regulated industries

For an MSP, a law firm, a clinic, or a financial-services team, "the prompt never leaves our network" is not a feature — it's the entry ticket. On-prem local AI keeps PHI, privileged documents, and source code inside your perimeter, satisfies data-residency requirements, and removes the per-token meter for high-volume internal workloads. The realistic architecture for most regulated shops is hybrid: a local baseline on owned hardware for the 95% of requests that are routine, with a deliberate, audited path to a cloud model for the hard 5% — rather than sending everything to a third party by default.

That hybrid pattern is exactly where an edge-first AI gateway earns its place: WideAreaAI gives your apps one OpenAI-compatible LLM endpoint and routes each request edge cache → your own GPU/llama.cpp node (a markup-free baseline with no per-token fees on hardware you own) → cloud burst only when you choose. It does request-level routing, failover, and edge caching across whole nodes — not model-splitting across machines. Full compliance framing in on-prem AI for regulated industries.

The local-AI toolkit (free tools)

Every tool below runs in your browser — no signup, nothing uploaded. They map directly onto the four layers above.

ToolWhat it answers
what-llm-can-i-runOne-click GPU detection → which LLMs you can run
llm-vram-calculatorVRAM any model needs; GPU fit, multi-GPU, Apple Silicon
fine-tuning-vram-calculatorVRAM for full / LoRA / QLoRA training
llm-inference-speed-calculatorEstimate tok/s from bandwidth/model/quant; compare GPUs
llm-gpu-benchmarkReal in-browser WebGPU bandwidth + tokens/sec
self-hosted-llm-cost-calculatorCloud API cost vs own-hardware break-even
llm-token-counterToken counts + cost + context-fit across models
ollama-command-builderBuild ollama run/pull/create commands + Modelfiles
ollama-config-generatorModelfile + server env + OpenAI-compatible snippet
local-ai-chatIn-browser WebGPU chat (try a model with zero install)
private-ai-summarizerIn-browser WebGPU summarizer (text never leaves the tab)

Conclusion

Local AI comes down to four layers — runtime, model, format, hardware — and one question at each: what do I install, which weights, how aggressively quantized, and will it fit. Get those right and you have private, unmetered, offline AI on hardware you control, with a clean OpenAI-compatible endpoint your apps already know how to call.

Pick your next thread:

Frequently Asked Questions

Find answers to common questions

Privacy (prompts and outputs never leave your machine), zero per-token cost once you own the hardware, no rate limits, offline operation, and full data-residency/compliance control. The trade-offs are real: you manage the hardware, and you cap out at whatever your GPU or unified memory can hold. A 24 GB card runs an 8B model comfortably but won't fit a 70B at Q4 without offloading or a second GPU.

Three things: a runtime (Ollama, LM Studio, llama.cpp, or vLLM), a model file in a supported format (usually GGUF for desktop runtimes, safetensors for vLLM), and enough VRAM or unified memory for the model weights plus the KV cache for your context length plus 1–2 GB of runtime overhead. Use the VRAM calculator and 'What LLM Can I Run?' tools to size it before you download a 40 GB file.

Roughly: (parameters × bytes-per-weight from your quantization) + KV cache for your context + ~1–2 GB overhead. An 8B model at Q4_K_M is about 4.4–4.9 GB of weights; a 70B at Q4_K_M is ~38–43 GB and needs about 48 GB total at modest context — so a single 48 GB card or two 24 GB GPUs. The VRAM article and llm-vram-calculator do this math for any model and GPU.

On the hardest reasoning and coding tasks, frontier cloud models still lead. But strong open-weight models like Llama 4 and DeepSeek-V3 are genuinely capable for summarization, extraction, classification, RAG, and most day-to-day work. The local win is privacy, cost, and control — and a hybrid setup (local baseline, cloud burst for the hard 5%) captures both sides.

Ollama if you want a one-command CLI and an OpenAI-compatible API on port 11434. LM Studio if you want a polished desktop GUI. llama.cpp if you want maximum control and the raw engine. vLLM if you're serving many concurrent users on datacenter GPUs. Most people should start with Ollama or LM Studio and graduate to llama.cpp or vLLM only when they hit a wall.

Yes, and Apple Silicon is one of the best value paths because unified memory lets the GPU address up to 128 GB. An M4 Max runs an 8B model at ~40–80 tok/s and can fit a 70B in memory (at ~8–13 tok/s). Both Ollama (0.19+) and LM Studio now use Apple's MLX backend on Apple Silicon for a meaningful speedup over the old Metal path.

Let's turn this knowledge into action

Our experts can help you apply these insights to your specific situation. No sales pitch — just a technical conversation.