Running a large language model on your own machine is no longer a research-lab exercise. The tooling has matured to the point where you can go from nothing installed to a private ChatGPT-style chat in about ten minutes, with no cloud account, no API key, and no per-token billing. Everything stays on hardware you own.
This guide walks a beginner through it twice: once with LM Studio (a desktop app, no terminal required) and once with Ollama (one command in a shell). Both are built on the same underlying engine — llama.cpp — and both run the same GGUF model files, so you are choosing an interface, not a quality tier.
What you need before you start
Three things: a computer with enough memory, one runtime, and one model file.
The memory requirement is the only real gate. Local models are sized by parameter count (8B = 8 billion parameters), and the file size depends on quantization: how many bits each parameter is stored in. The practical default, Q4_K_M, averages about 0.6 bytes per weight (a little under 5 bits), so an 8-billion-parameter model is roughly 4.9 GB on disk and needs about 8 GB of memory once you add the context buffer (KV cache) and ~1–2 GB of runtime overhead.
Here is what each tier of model realistically requires. "Memory" means GPU VRAM on a PC, or unified memory on Apple Silicon (where the CPU and GPU share one pool).
| Model size (Q4_K_M) | File size | Min. memory to run well | Example hardware | What it's good for |
|---|---|---|---|---|
| 1–3B | ~1–2 GB | 4–6 GB RAM (CPU is fine) | Any modern laptop | Quick Q&A, autocomplete, drafts |
| 7–8B | ~4.9 GB | 8 GB VRAM / unified | RTX 3060 12 GB, M-series Mac 16 GB | General chat, coding help, the beginner default |
| 13–14B | ~7.5–8 GB | 12 GB VRAM / unified | RTX 4070, M-series Mac 24 GB | Stronger reasoning, longer answers |
| 70B | ~38–43 GB | ~48 GB total | 2× RTX 4090, 1× 48 GB card, or 64 GB+ Mac | Near-frontier quality, serious work |
You do not need a GPU to start. A 7–8B model runs on CPU and system RAM — just more slowly. A GPU (or Apple Silicon's unified memory) is what turns a sluggish few tokens per second into a fast, conversational stream.
Before downloading anything, confirm your machine's limits with the What LLM Can I Run? tool — it detects your GPU in one click and tells you which models fit — and the LLM VRAM Calculator for exact memory math at a given context length.
Step 1 — Pick a runtime
For a beginner, the choice is LM Studio or Ollama. They are the two most popular front ends, both free, both cross-platform, both wrapping llama.cpp.
| | LM Studio | Ollama |
|---|---|---|
| Interface | Full desktop GUI | Command line (+ REST API) |
| Best for | Non-technical users, click-to-chat | Developers, scripting, app integration |
| Model discovery | Built-in searchable catalog | ollama pull / model library |
| Install footprint | Larger (Electron app) | Lightweight CLI + background server |
| OpenAI-compatible API | Yes — port 1234 | Yes — port 11434 |
| Apple Silicon acceleration | MLX backend (~30–50% faster than Metal) | MLX backend in 0.19+ (32 GB+ Macs) |
| License | Free, closed-source | Free, open-source |
| Default context window | Set in the UI | 2048 tokens unless overridden |
The short version: want an app you click? Use LM Studio. Comfortable in a terminal, or planning to wire the model into code? Use Ollama. For a deeper feature-by-feature breakdown, see Ollama vs LM Studio vs llama.cpp.
The rest of this guide covers both paths. Follow whichever matches your choice.
Step 2 — Check what your hardware can run
Skipping this step is the most common beginner mistake: people pull a 70B model onto a 16 GB laptop, watch it spill into system RAM (or fail to load), and conclude local AI is "too slow." It isn't — they just picked a model that doesn't fit.
The quick rule of thumb for Q4_K_M: model size in GB ≈ parameter count in billions × 0.6, then leave a few GB of headroom for context and overhead. An 8B model needs ~5 GB of weights plus buffer, so target 8 GB of free memory. A 13B model needs ~8 GB of weights, so target ~12 GB.
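The rule of thumb can be written as a few lines of Python. This is a rough sketch, not a replacement for a proper calculator: the 0.6 bytes-per-weight default is an assumption chosen because it reproduces the ~4.9 GB file size quoted for an 8B model at Q4_K_M, and the KV-cache and overhead defaults are illustrative placeholders.

```python
def estimate_memory_gb(params_billions, bytes_per_weight=0.6,
                       kv_cache_gb=1.0, overhead_gb=1.5):
    """Return (weights_gb, total_gb) for a quantized model.

    All defaults are assumptions for a Q4_K_M model at a short context;
    adjust them for your own quantization and context length.
    """
    weights_gb = params_billions * bytes_per_weight
    total_gb = weights_gb + kv_cache_gb + overhead_gb
    return weights_gb, total_gb


def fits(params_billions, free_memory_gb, **kwargs):
    """True if the estimated total fits in the given free memory."""
    _, total_gb = estimate_memory_gb(params_billions, **kwargs)
    return total_gb <= free_memory_gb


for size in (8, 13, 70):
    weights, total = estimate_memory_gb(size)
    print(f"{size}B: ~{weights:.1f} GB weights, ~{total:.1f} GB total")
```

Calling `fits(70, 16)` returns `False`, which is the 70B-on-a-16-GB-laptop mistake described above caught before the download starts.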
For anything beyond a back-of-the-envelope estimate — long context windows, multi-GPU setups, or Apple unified memory — use the LLM VRAM Calculator, which accounts for the KV cache that grows as your conversation gets longer. To estimate how fast a given model will run on your card before you download it, the LLM Inference Speed Calculator turns memory bandwidth into expected tokens/sec.
Step 3 — Install and download a model
Path A — LM Studio (GUI)
- Download LM Studio from its site and install it like any desktop app (Windows, macOS, or Linux).
- Open it and use the search/discover screen to find a model; type something like Llama 3.1 8B Instruct.
- Pick the Q4_K_M GGUF build from the list of quantizations (LM Studio shows file sizes and flags which ones fit your machine).
- Click Download. When it finishes, select it from the model dropdown to load it.
That's it — no terminal involved.
Path B — Ollama (CLI)
Install Ollama, then pull and run a model in a single command:
# macOS / Linux — install
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com instead
# Pull and run an 8B model (downloads on first run, ~4.9 GB)
ollama run llama3.1:8b
The first ollama run downloads the model, then drops you straight into an interactive prompt. To download without chatting, use ollama pull llama3.1:8b. If you're unsure of the exact flags or want to script pulls, the Ollama Command Builder generates the right ollama run/pull/create commands for you.
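If you want to script pulls yourself, one possible approach is to ask the local Ollama server which models are already installed and pull only the missing ones. The sketch below assumes the server is running on its default port and uses Ollama's `/api/tags` endpoint to list local models; the helper names (`installed_models`, `pull_command`, `ensure_model`) are hypothetical.

```python
import json
import subprocess
import urllib.request


def installed_models(tags_response):
    """Extract model names from a parsed /api/tags response (a dict)."""
    return {m["name"] for m in tags_response.get("models", [])}


def pull_command(model):
    """Build the CLI invocation that downloads a model without chatting."""
    return ["ollama", "pull", model]


def ensure_model(model, base_url="http://localhost:11434"):
    """Pull `model` only if the local server does not already list it."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        tags = json.load(resp)
    if model not in installed_models(tags):
        subprocess.run(pull_command(model), check=True)


# Example (requires Ollama to be installed and running):
# ensure_model("llama3.1:8b")
```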
Which quantization to grab? Stick with Q4_K_M unless you have a reason not to. It is the best quality-per-gigabyte for almost everyone: only about a 3% perplexity increase over full FP16 precision, at a bit under a third of the file size. Go higher (Q5_K_M, Q6_K, or Q8_0) only if you have spare memory and want maximum fidelity; drop lower (Q3_K_M, Q2_K) only when a model otherwise won't fit, because quality degrades sharply below 4 bits.
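That selection policy can be sketched in Python. The bits-per-weight values below are ballpark assumptions (real GGUF files vary by model and llama.cpp version), and the 3 GB headroom default is an illustrative placeholder for context and runtime overhead.

```python
# Approximate average bits per weight; ballpark assumptions, not exact values.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}
ORDER = ["Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]  # small to large


def file_size_gb(params_billions, quant):
    """Estimated file size: parameters x bits per weight / 8 bits per byte."""
    return params_billions * APPROX_BITS_PER_WEIGHT[quant] / 8


def pick_quant(params_billions, free_memory_gb, headroom_gb=3.0,
               max_fidelity=False):
    """Return a quantization name, or None if nothing fits."""
    budget = free_memory_gb - headroom_gb
    fitting = [q for q in ORDER if file_size_gb(params_billions, q) <= budget]
    if not fitting:
        return None
    if max_fidelity:
        return fitting[-1]  # largest quantization that still fits
    # Default policy: Q4_K_M when it fits, otherwise the largest smaller one.
    return "Q4_K_M" if "Q4_K_M" in fitting else fitting[-1]


print(pick_quant(8, 8))                       # tight fit
print(pick_quant(8, 16, max_fidelity=True))   # spare memory
print(pick_quant(8, 7))                       # must drop below 4 bits
```

A `None` result means even the smallest quantization does not fit, and a smaller parameter count is the better move.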
Step 4 — Run your first chat
In LM Studio, type into the chat box and press enter — the experience mirrors a hosted chatbot. In Ollama, you're already at the >>> prompt after ollama run; type a message and press enter. Try something concrete:
Summarize the difference between TCP and UDP in three bullet points.
As the answer streams in, you'll see a tokens/sec readout (LM Studio shows it in the UI; Ollama prints stats when you run with --verbose). A token is about three-quarters of a word — roughly 4 characters — so 100 tokens is about 75 words. Anything above ~10 tok/s reads faster than most people; 30+ tok/s feels instant.
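To make the tokens-per-second readout easier to interpret, the conversion above can be sketched as a small function. The 0.75 words-per-token factor comes from the rule of thumb in the text. The comparison point of roughly 200 to 300 words per minute for silent reading is an assumption used only for context.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: 100 tokens is about 75 words


def words_per_minute(tokens_per_second):
    """Convert a decode speed in tokens/sec to approximate words/min."""
    return tokens_per_second * WORDS_PER_TOKEN * 60


for tps in (3, 10, 30):
    print(f"{tps} tok/s is about {words_per_minute(tps):.0f} words/min")
```

At 10 tok/s the model produces about 450 words per minute, which is why that speed already outpaces most readers.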
How to tell it's GPU-accelerated: the speed itself is the tell. Local decoding is memory-bandwidth-bound, so generation tracks how fast your hardware can read the model's weights, not how many FLOPs it has. The chart below shows representative single-stream decode speeds for an 8B model at Q4 across common hardware. If your 8B model runs at single digits, it's falling back to CPU; if it's in the dozens-to-hundreds, the GPU (or Apple unified memory) is doing the work.
These are order-of-magnitude midpoints synthesized from the bandwidth-bound model and community benchmarks; actual numbers vary with engine, flash-attention, and context length. The point isn't the exact figure — it's the shape: an 8B model is fast on almost anything with a GPU.
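The bandwidth-bound model can be sketched as a one-line estimate: each generated token requires reading roughly all of the weights once, so tokens per second is about memory bandwidth divided by model size. The bandwidth figures and the 0.7 efficiency factor below are illustrative assumptions, not measurements.

```python
def estimate_decode_tps(bandwidth_gb_s, model_size_gb, efficiency=0.7):
    """Rough single-stream decode speed for a bandwidth-bound model.

    efficiency is an assumed fraction of peak bandwidth actually achieved.
    """
    return bandwidth_gb_s * efficiency / model_size_gb


MODEL_SIZE_GB = 4.9  # 8B model at Q4_K_M, per the figure used in this guide

# Illustrative bandwidth classes (assumed values, check your own hardware).
examples = {
    "CPU with ~50 GB/s system RAM": 50,
    "GPU with ~360 GB/s VRAM": 360,
    "GPU with ~1000 GB/s VRAM": 1000,
}

for name, bandwidth in examples.items():
    tps = estimate_decode_tps(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: ~{tps:.0f} tok/s")
```

Under these assumptions the CPU case lands in single digits and the GPU cases land in the dozens to low hundreds.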
Step 5 — Connect it to your apps (optional)
The moment local AI gets genuinely useful is when your own scripts and tools can call it. Both runtimes expose an OpenAI-compatible HTTP API, so any client written for OpenAI works by changing the base URL and using any placeholder API key.
| Runtime | Base URL | Default port |
|---|---|---|
| Ollama | http://localhost:11434/v1 | 11434 |
| LM Studio | http://localhost:1234/v1 | 1234 |
A minimal example with the OpenAI Python SDK pointed at a local Ollama server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any non-empty string works locally
)

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
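Because both runtimes speak the same protocol, switching between them is mostly a matter of changing the base URL. The sketch below shows this with only the Python standard library, posting to the OpenAI-style `/chat/completions` path. The helper names and the `"Bearer local"` placeholder key are hypothetical. The model name you pass for LM Studio must match whatever identifier your LM Studio install shows for the loaded model.

```python
import json
import urllib.request

# Base URLs from the table above.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def build_chat_request(runtime, model, prompt):
    """Build an OpenAI-style chat completion request for a local runtime."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URLS[runtime]}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer local",  # placeholder key
        },
        method="POST",
    )


def chat(runtime, model, prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(runtime, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server with the model available):
# print(chat("ollama", "llama3.1:8b", "Say hello in one sentence."))
```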
To generate ready-made server config — a Modelfile, environment variables, and an OpenAI-compatible snippet — use the Ollama Config Generator. For more on wiring a local model into production code and the gotchas involved, see Ollama vs LM Studio vs llama.cpp. If you'd rather not install anything at all, Local AI Chat runs a model directly in your browser via WebGPU.
Troubleshooting
Out of memory / model won't load. You picked a model bigger than your free memory. Drop to a smaller parameter count or a lower quantization (Q4_K_M → Q3_K_M), close other GPU-hungry apps, and re-check the fit with the LLM VRAM Calculator. Remember the file size is only part of it — the KV cache and ~1–2 GB of runtime overhead sit on top.
Generation is slow (a few tok/s). The model is running partly or entirely on the CPU instead of the GPU, usually because it didn't fully fit in VRAM and some layers spilled into system RAM. Use a model that fits entirely in your GPU/unified memory. See How much VRAM do I need to run an LLM? for the full memory breakdown.
The model forgets the conversation quickly. Ollama defaults to a 2048-token context window. Raise it by setting OLLAMA_CONTEXT_LENGTH=8192 (or higher) before starting the server. Be aware that larger context costs memory: the KV cache grows linearly with the number of tokens.
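To see what that linear growth means in practice, here is a rough sketch of the KV cache arithmetic. The default layer and head counts are assumptions matching the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128), and the cache is assumed to be stored unquantized in 16-bit values. Other models and cache settings will differ.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV cache size for a given number of context tokens.

    The leading 2 accounts for one key and one value vector per layer.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens: {gib:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so moving from 2048 to 8192 tokens raises it from about 0.25 GiB to about 1 GiB.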
It answers oddly or repeats itself. You may be on too aggressive a quantization. Q2_K and below are clearly degraded — step up to at least Q4_K_M. If you're already there, try a different, current instruct model.
Next steps
Once your first 8B model is running smoothly, the natural progressions are:
- Move up a size class. If a 70B fits your hardware (≈48 GB total memory), the quality jump is real — confirm fit first with What LLM Can I Run?.
- Tune for speed and context. Raise the context window, enable flash-attention, and consider a quantized KV cache to push long conversations further without running out of memory.
- Compare local vs cloud economics. If you're weighing buying a GPU against paying API fees, the Self-Hosted LLM Cost Calculator shows the break-even point.
- Go deeper. The complete guide to running local AI covers serving to other machines, hardware selection, and production setups.
You now have a private, offline, fee-free language model running on hardware you already own — the foundation everything else builds on.