How to Run an LLM Locally: A Step-by-Step Guide for Beginners

Run a large language model on your own computer in about ten minutes — no cloud, no API keys, no per-token fees. Pick a runtime, download a model, and chat privately on hardware you own.

By InventiveHQ Team

Running a large language model on your own machine is no longer a research-lab exercise. The tooling matured to the point where you can go from nothing installed to a private ChatGPT-style chat in about ten minutes — no cloud account, no API key, no per-token billing. Everything stays on hardware you own.

This guide walks a beginner through it twice: once with LM Studio (a desktop app, no terminal required) and once with Ollama (one command in a shell). Both are built on the same underlying engine — llama.cpp — and both run the same GGUF model files, so you are choosing an interface, not a quality tier.

What you need before you start

Three things: a computer with enough memory, one runtime, and one model file.

The memory requirement is the only real gate. Local models are sized by parameter count (8B = 8 billion parameters), and the file size depends on quantization: how many bits each parameter is stored in. The practical default, Q4_K_M, uses about 0.55–0.6 bytes per weight, so an 8-billion-parameter model is roughly 4.4–4.9 GB on disk (Llama 3.1 8B comes to about 4.9 GB) and needs about 8 GB of memory once you add the context buffer (KV cache) and ~1–2 GB of runtime overhead.

Here is what each tier of model realistically requires. "Memory" means GPU VRAM on a PC, or unified memory on Apple Silicon (where the CPU and GPU share one pool).

| Model size (Q4_K_M) | File size | Min. memory to run well | Example hardware | What it's good for |
|---|---|---|---|---|
| 1–3B | ~1–2 GB | 4–6 GB RAM (CPU is fine) | Any modern laptop | Quick Q&A, autocomplete, drafts |
| 7–8B | ~4.9 GB | 8 GB VRAM / unified | RTX 3060 12 GB, M-series Mac 16 GB | General chat, coding help, the beginner default |
| 13–14B | ~7.5–8 GB | 12 GB VRAM / unified | RTX 4070, M-series Mac 24 GB | Stronger reasoning, longer answers |
| 70B | ~38–43 GB | ~48 GB total | 2× RTX 4090, 1× 48 GB card, or 64 GB+ Mac | Near-frontier quality, serious work |

You do not need a GPU to start. A 7–8B model runs on CPU and system RAM — just more slowly. A GPU (or Apple Silicon's unified memory) is what turns a sluggish few tokens per second into a fast, conversational stream.

Before downloading anything, confirm your machine's limits with the What LLM Can I Run? tool — it detects your GPU in one click and tells you which models fit — and the LLM VRAM Calculator for exact memory math at a given context length.

Step 1 — Pick a runtime

For a beginner, the choice is LM Studio or Ollama. They are the two most popular front ends, both free, both cross-platform, both wrapping llama.cpp.

| | LM Studio | Ollama |
|---|---|---|
| Interface | Full desktop GUI | Command line (+ REST API) |
| Best for | Non-technical users, click-to-chat | Developers, scripting, app integration |
| Model discovery | Built-in searchable catalog | ollama pull / model library |
| Install footprint | Larger (Electron app) | Lightweight CLI + background server |
| OpenAI-compatible API | Yes, port 1234 | Yes, port 11434 |
| Apple Silicon acceleration | MLX backend (~30–50% faster than Metal) | MLX backend in 0.19+ (32 GB+ Macs) |
| License | Free, closed-source | Free, open-source |
| Chat memory by default | Set in the UI | 2048 tokens unless overridden |

The short version: want an app you click? Use LM Studio. Comfortable in a terminal, or planning to wire the model into code? Use Ollama. For a deeper feature-by-feature breakdown, see Ollama vs LM Studio vs llama.cpp.

The rest of this guide covers both paths. Follow whichever matches your choice.

Step 2 — Check what your hardware can run

Skipping this step is the most common beginner mistake: people pull a 70B model onto a 16 GB laptop, watch it spill into system RAM (or fail to load), and conclude local AI is "too slow." It isn't — they just picked a model that doesn't fit.

The quick rule of thumb for Q4_K_M: model size in GB ≈ parameter count in billions × 0.55 to 0.6, then leave a few GB of headroom for context and overhead. An 8B model needs ~4.4–4.9 GB of weights plus buffer → target 8 GB of free memory. A 13B needs ~7–8 GB of weights → target ~12 GB.
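The rule of thumb reduces to a few lines of arithmetic. The sketch below is a rough estimator, not a replacement for a proper calculator. The function names are made up for illustration, and the 3 GB default headroom is an assumption standing in for "a few GB". The default of 0.55 bytes per weight is the rule-of-thumb figure; pass a slightly higher value for a more conservative estimate.

```python
def q4_weights_gb(params_billion: float, bytes_per_weight: float = 0.55) -> float:
    """Approximate Q4_K_M weight size in GB using the rule of thumb."""
    return params_billion * bytes_per_weight


def memory_target_gb(params_billion: float, headroom_gb: float = 3.0,
                     bytes_per_weight: float = 0.55) -> float:
    """Weights plus headroom for the KV cache and runtime overhead.

    headroom_gb is an assumed placeholder for "a few GB"; raise it for
    long context windows.
    """
    return q4_weights_gb(params_billion, bytes_per_weight) + headroom_gb


def fits(params_billion: float, free_memory_gb: float) -> bool:
    """True if the estimated memory target fits in the given free memory."""
    return memory_target_gb(params_billion) <= free_memory_gb


for size in (8, 13, 70):
    print(f"{size}B: ~{q4_weights_gb(size):.1f} GB of weights, "
          f"target ~{memory_target_gb(size):.1f} GB free")
```

Treat the result as a lower bound: a long conversation can need more headroom than the default assumes.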

For anything beyond a back-of-the-envelope estimate — long context windows, multi-GPU setups, or Apple unified memory — use the LLM VRAM Calculator, which accounts for the KV cache that grows as your conversation gets longer. To estimate how fast a given model will run on your card before you download it, the LLM Inference Speed Calculator turns memory bandwidth into expected tokens/sec.
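To see why the KV cache matters, here is a rough sketch of the usual size formula for a transformer with grouped-query attention: one key tensor and one value tensor per layer, per KV head, per token. The default architecture numbers (32 layers, 8 KV heads, head dimension 128, as published for Llama 3.1 8B) and the 16-bit cache are assumptions that do not come from this guide, and a given runtime may allocate memory differently.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys + values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so it scales linearly with context length and sits on top of the weights.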

Step 3 — Install and download a model

Path A — LM Studio (GUI)

  1. Download LM Studio from its site and install it like any desktop app (Windows, macOS, or Linux).
  2. Open it and use the search/discover screen to find a model — type something like Llama 3.1 8B Instruct.
  3. Pick the Q4_K_M GGUF build from the list of quantizations (LM Studio shows file sizes and flags which ones fit your machine).
  4. Click Download. When it finishes, select it from the model dropdown to load it.

That's it — no terminal involved.

Path B — Ollama (CLI)

Install Ollama, then pull and run a model in a single command:

# macOS / Linux — install
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com instead

# Pull and run an 8B model (downloads on first run, ~4.9 GB)
ollama run llama3.1:8b

The first ollama run downloads the model, then drops you straight into an interactive prompt. To download without chatting, use ollama pull llama3.1:8b. If you're unsure of the exact flags or want to script pulls, the Ollama Command Builder generates the right ollama run/pull/create commands for you.
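If you plan to script pulls, it helps to check what is already on disk first. A minimal sketch, assuming a running Ollama server on its default port and using its `GET /api/tags` endpoint, which lists local models. The helper names are hypothetical, and only the standard library is used.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask a running Ollama server which models are already pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Usage (requires a running Ollama server):
#   if "llama3.1:8b" not in list_local_models():
#       print("Run: ollama pull llama3.1:8b")
```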

Which quantization to grab? Stick with Q4_K_M unless you have a reason not to. It is the best quality-per-gigabyte for almost everyone: only about a 3% perplexity increase over full FP16 precision, at roughly a quarter the file size. Go higher (Q5_K_M, Q6_K, or Q8_0) only if you have spare memory and want maximum fidelity; drop lower (Q3_K_M, Q2_K) only when a model otherwise won't fit, because quality degrades sharply below 4 bits.

Step 4 — Run your first chat

In LM Studio, type into the chat box and press enter — the experience mirrors a hosted chatbot. In Ollama, you're already at the >>> prompt after ollama run; type a message and press enter. Try something concrete:

Summarize the difference between TCP and UDP in three bullet points.

As the answer streams in, you'll see a tokens/sec readout (LM Studio shows it in the UI; Ollama prints stats when you run with --verbose). A token is about three-quarters of a word, or roughly 4 characters, so 100 tokens is about 75 words. Anything above ~10 tok/s streams faster than most people read; 30+ tok/s feels instant.
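The reading-speed comparison is easy to check with the three-quarters-of-a-word rule. The sketch below converts tokens/sec into words per minute. For context, adult silent reading is commonly cited at roughly 200–300 words per minute; that range is an outside figure, not something this guide measures.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: one token is about 3/4 of a word


def words_per_minute(tokens_per_sec: float) -> float:
    """Convert a generation rate in tokens/sec to words per minute."""
    return tokens_per_sec * WORDS_PER_TOKEN * 60


for rate in (5, 10, 30):
    print(f"{rate} tok/s is about {words_per_minute(rate):.0f} words per minute")
```

At 10 tok/s the model produces about 450 words per minute, comfortably ahead of that reading range.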

How to tell it's GPU-accelerated: the speed itself is the tell. Local decoding is memory-bandwidth-bound, so generation tracks how fast your hardware can read the model's weights, not how many FLOPs it has. The chart below shows representative single-stream decode speeds for an 8B model at Q4 across common hardware. If your 8B model runs at single digits, it's falling back to CPU; if it's in the dozens-to-hundreds, the GPU (or Apple unified memory) is doing the work.

Figure: approximate single-stream decode speed for an 8B model at Q4, by hardware (tokens/sec). M3 Max ~45; M4 Max ~60; RTX 4090 ~115; A100 80GB ~160; RTX 5090 ~180; H100 80GB ~230.

These are order-of-magnitude midpoints synthesized from the bandwidth-bound model and community benchmarks; actual numbers vary with engine, flash-attention, and context length. The point isn't the exact figure — it's the shape: an 8B model is fast on almost anything with a GPU.
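The bandwidth-bound model can be written down directly: each generated token requires reading roughly all of the weights once, so memory bandwidth divided by model size gives a ceiling on tokens/sec. The sketch below assumes that simplification. The `efficiency` factor is a hypothetical fudge factor chosen for illustration, not a measured constant, and the ~1008 GB/s figure is the commonly quoted RTX 4090 bandwidth.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_gb


def estimated_tok_s(bandwidth_gb_s: float, model_gb: float,
                    efficiency: float = 0.55) -> float:
    """Ceiling scaled by an assumed real-world efficiency factor."""
    return decode_ceiling_tok_s(bandwidth_gb_s, model_gb) * efficiency


# RTX 4090 (~1008 GB/s) with an 8B Q4_K_M model (~4.9 GB)
ceiling = decode_ceiling_tok_s(1008, 4.9)
estimate = estimated_tok_s(1008, 4.9)
print(f"ceiling ~{ceiling:.0f} tok/s, rough estimate ~{estimate:.0f} tok/s")
```

The same arithmetic shows why larger models are slower on the same card: a model with eight times the bytes has one eighth of the ceiling.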

Step 5 — Connect it to your apps (optional)

The moment local AI gets genuinely useful is when your own scripts and tools can call it. Both runtimes expose an OpenAI-compatible HTTP API, so any client written for OpenAI works by changing the base URL and using any placeholder API key.

| Runtime | Base URL | Default port |
|---|---|---|
| Ollama | http://localhost:11434/v1 | 11434 |
| LM Studio | http://localhost:1234/v1 | 1234 |

A minimal example with the OpenAI Python SDK pointed at a local Ollama server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any non-empty string works locally
)

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)

To generate ready-made server config — a Modelfile, environment variables, and an OpenAI-compatible snippet — use the Ollama Config Generator. For more on wiring a local model into production code and the gotchas involved, see Ollama vs LM Studio vs llama.cpp. If you'd rather not install anything at all, Local AI Chat runs a model directly in your browser via WebGPU.
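The same call also works without the SDK, which makes it clear that only the base URL differs between the two runtimes. This is a minimal standard-library sketch. The helper names and the `"lmstudio"` / `"ollama"` keys are hypothetical, and it assumes the chosen runtime's server is running with a model loaded.

```python
import json
import urllib.request

# Base URLs from the table in this step
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def build_chat_request(runtime: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a local runtime."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URLS[runtime]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer placeholder",  # any placeholder key
        },
        method="POST",
    )


def chat(runtime: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(runtime, model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(chat("ollama", "llama3.1:8b", "Say hello in one sentence."))
```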

Troubleshooting

Out of memory / model won't load. You picked a model bigger than your free memory. Drop to a smaller parameter count or a lower quantization (Q4_K_M → Q3_K_M), close other GPU-hungry apps, and re-check the fit with the LLM VRAM Calculator. Remember the file size is only part of it — the KV cache and ~1–2 GB of runtime overhead sit on top.

Generation is slow (a few tok/s). The model is running on CPU instead of the GPU, usually because it didn't fully fit in VRAM and spilled into system RAM. Use a model that fits entirely in your GPU/unified memory. See How much VRAM do I need to run an LLM? for the full memory breakdown.
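With Ollama you can check for a spill programmatically. The sketch below assumes the server's `GET /api/ps` endpoint (the API behind `ollama ps`) reports a total `size` and a `size_vram` for each loaded model. Those field names reflect the Ollama API documentation as I understand it, so verify them against your version. The helper names are hypothetical.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model that resides in GPU memory (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    return model_entry.get("size_vram", 0) / size if size else 0.0


def report(models: list) -> list:
    """One human-readable line per loaded model."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} in GPU memory"
            for m in models]


def running_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


# Usage (requires a running Ollama server with a model loaded):
#   print("\n".join(report(running_models())))
```

Anything noticeably below 100% suggests part of the model is being served from system RAM.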

The model forgets the conversation quickly. Ollama defaults to a 2048-token context window. Raise it by setting OLLAMA_CONTEXT_LENGTH=8192 (or higher) before starting the server. Be aware that larger context costs memory: the KV cache grows linearly with the number of tokens.
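The context window can also be set per request rather than server-wide. A minimal sketch, assuming Ollama's native `/api/chat` endpoint and its `options.num_ctx` field; the helper names are hypothetical, and the 8192 default simply mirrors the example value above.

```python
import json
import urllib.request


def build_ollama_chat(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a native Ollama chat payload with an explicit context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }


def send(payload: dict, base_url: str = "http://localhost:11434") -> str:
    """POST the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["message"]["content"]


# Usage (requires a running Ollama server):
#   print(send(build_ollama_chat("llama3.1:8b", "Summarize our chat so far.")))
```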

It answers oddly or repeats itself. You may be on too aggressive a quantization. Q2_K and below are clearly degraded — step up to at least Q4_K_M. If you're already there, try a different, current instruct model.

Next steps

Once your first 8B model is running smoothly, the natural progressions are:

  • Move up a size class. If a 70B fits your hardware (≈48 GB total memory), the quality jump is real; confirm the fit first with the What LLM Can I Run? tool.
  • Tune for speed and context. Raise the context window, enable flash-attention, and consider a quantized KV cache to push long conversations further without running out of memory.
  • Compare local vs cloud economics. If you're weighing buying a GPU against paying API fees, the Self-Hosted LLM Cost Calculator shows the break-even point.
  • Go deeper. The complete guide to running local AI covers serving to other machines, hardware selection, and production setups.

You now have a private, offline, fee-free language model running on hardware you already own — the foundation everything else builds on.

Frequently Asked Questions

Find answers to common questions

What's the easiest way to run an LLM locally?

LM Studio if you want a graphical app: install it, search for a model, click download, and chat, with no terminal involved. Ollama if you're comfortable with a command line: install it, run ollama run llama3.1:8b, and you're talking to the model in one command. Both are built on top of llama.cpp and run the same GGUF model files, so quality is identical; you're only choosing the interface.

Do I need a GPU to run an LLM locally?

No. A small quantized model (1B–8B at Q4) runs entirely on CPU and system RAM, just slower, often at a handful of tokens per second. A discrete GPU or Apple Silicon's unified memory speeds generation up dramatically because local decoding is memory-bandwidth-bound: an RTX 4090 (~1008 GB/s) will run an 8B model at roughly 90–140 tokens/sec versus single-digit-to-low-double-digit speeds on CPU.

Which model should I download first?

A current 8B-class instruct model at Q4_K_M. That weighs about 4.9 GB on disk, fits in 8 GB of VRAM or unified memory with room for context, and gives near-FP16 quality (Q4_K_M adds only ~3% perplexity over full precision). Llama 3.1 8B is the canonical first download. Use the What LLM Can I Run? tool to confirm it fits your machine before pulling anything bigger.

How much does it cost to run an LLM locally?

The runtimes (LM Studio, Ollama, llama.cpp) and the open-weight models are free to download and use. You pay only for the hardware you already own and the electricity it draws; there are no per-token API fees and no usage caps. The trade-off is that you provide the compute, so very large models need expensive GPUs or a high-memory Mac.

What is quantization, and which level should I pick?

Quantization shrinks a model by storing each weight in fewer bits, trading a little accuracy for a much smaller file. Q4_K_M (~0.55–0.6 bytes/weight effective) is the default sweet spot: near-lossless quality at roughly a quarter the size of FP16. Step up to Q5_K_M or Q6_K if you have spare VRAM and want maximum fidelity; only drop to Q3_K_M or Q2_K when nothing else fits, since quality falls off sharply below 4 bits.

What is a token, and what does tokens per second mean?

A token is roughly three-quarters of a word (about 4 characters), so 100 tokens ≈ 75 words. Tokens/sec is how fast the model generates text. Anything above ~10 tok/s streams faster than most people read, and 30+ tok/s feels instant. Decode speed scales with memory bandwidth, not raw compute, which is why an 8B model is fast on almost any GPU but a 70B model crawls without a high-bandwidth GPU or 80 GB card.

Can I connect a local model to my own apps and code?

Yes. Both Ollama and LM Studio expose an OpenAI-compatible HTTP API, so any code or tool that talks to OpenAI works by changing the base URL. Ollama serves on port 11434 (/v1/chat/completions); LM Studio's local server runs on port 1234. Point your client at http://localhost:11434/v1 or http://localhost:1234/v1, use any placeholder API key, and call it like the OpenAI endpoint.

Why does the model forget earlier parts of the conversation?

Ollama defaults to a 2048-token context window, which is small for long chats or document work. Raise it by setting the OLLAMA_CONTEXT_LENGTH environment variable (for example 8192 or higher) before starting the server, or set the context length per request. Larger context uses more memory because the KV cache grows linearly with the number of tokens.
