Pick the right DeepSeek variant for your hardware
The single biggest source of confusion with "running DeepSeek locally" is that "DeepSeek" is not one model. Before you pull anything, get the taxonomy straight, because it decides whether you need a $300 GPU or a $30,000 server.
There are two things people mean:
- The full models — DeepSeek-V3 (the general/base "deepseek-chat" line, ~V3.1) and DeepSeek-R1 (the reasoning model). These are 671B-parameter Mixture-of-Experts models. They are genuinely frontier-class, and they are open-weight, but at 671B parameters they do not fit on any single consumer GPU. Even quantized to 4-bit they need on the order of 350–400 GB of memory.
- The R1 distills: DeepSeek-R1-Distill-Qwen-1.5B/7B/14B/32B and DeepSeek-R1-Distill-Llama-8B/70B. These are not DeepSeek's own architecture. They are existing open models (Alibaba's Qwen2.5 and Meta's Llama 3 families) fine-tuned on reasoning traces generated by the full R1. You get R1-style "thinking" output at a size you can host on a laptop or a single gaming GPU. This is what the vast majority of local DeepSeek setups actually run.
The second axis is reasoning vs. base. R1 and its distills are reasoning models: they emit a long chain-of-thought (often wrapped in <think> tags) before the final answer. That makes them excellent at math, logic, and multi-step debugging — and slower and more token-hungry for everything else, because those thinking tokens are real tokens your GPU has to generate. DeepSeek-V3 is the non-reasoning general model; among the distills, treat them all as reasoning variants.
Honesty note: when someone says "I'm running DeepSeek on my 4090," they are almost always running a distill — a Qwen or Llama model wearing R1's reasoning, not the 671B original. That's not a downgrade to be ashamed of; it's the only sane choice on a single consumer card. Just know which one you have.
To size any of these against a specific card before you download a 20 GB file, run the numbers through our llm-vram-calculator or let what-llm-can-i-run detect your GPU and tell you which variants fit.
Hardware requirements
VRAM has three parts: weights + KV cache + ~1–2 GB runtime overhead. Weights dominate, and they follow a simple formula:
weights_bytes ≈ num_params × bytes_per_weight
At the recommended Q4_K_M quant, effective bytes-per-weight is about 0.55, so a 7-billion-parameter model is roughly 7e9 × 0.55 ≈ 3.9 GB of weights, plus KV cache and overhead. R1-Distill-Qwen-7B actually has ~7.6B parameters (7.6e9 × 0.55 ≈ 4.2 GB), which is why it lands nearer ~4.4 GB in the table below. The table below maps each common DeepSeek variant to the GPU class that holds it comfortably at Q4_K_M with a usable (8K–16K) context window. "Fits" means the whole model lives in VRAM, with no slow CPU offload.
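As a minimal sketch of the weight formula above, the helper below multiplies a parameter count by the ~0.55 bytes-per-weight figure quoted for Q4_K_M. The function name and the decimal-GB convention (1 GB = 1e9 bytes) are illustrative assumptions, not part of any tool mentioned here.

```python
# Rough weight-size estimate: weights_bytes ≈ num_params × bytes_per_weight.
Q4_K_M_BYTES_PER_WEIGHT = 0.55  # approximate effective figure quoted in the text


def weights_gb(num_params: float,
               bytes_per_weight: float = Q4_K_M_BYTES_PER_WEIGHT) -> float:
    """Return the estimated weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_weight / 1e9


if __name__ == "__main__":
    # Nominal parameter counts; real checkpoints are often slightly larger.
    for label, params in [("7B", 7e9), ("14B", 14e9), ("32B", 32e9)]:
        print(f"{label}: ~{weights_gb(params):.1f} GB of weights at Q4_K_M")
```

Because the estimate uses nominal parameter counts, expect it to come in a little under the file sizes you actually download.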
| DeepSeek variant | Params | Q4_K_M weights | Total VRAM (8–16K ctx) | Card / tier that fits |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | ~1.1 GB | ~3 GB | 8 GB (RTX 3050/4060), any modern laptop |
| R1-Distill-Qwen-7B | 7B | ~4.4 GB | ~7–8 GB | 12 GB (RTX 3060/4070) |
| R1-Distill-Llama-8B | 8B | ~4.9 GB | ~8 GB | 12 GB (RTX 3060/4070) |
| R1-Distill-Qwen-14B | 14B | ~8.0 GB | ~12 GB | 16 GB (RTX 4060 Ti 16G / 4080) |
| R1-Distill-Qwen-32B | 32B | ~19 GB | ~22–24 GB | 24 GB (RTX 3090/4090) |
| R1-Distill-Llama-70B | 70B | ~40–43 GB | ~48 GB | 48 GB (1× RTX 6000 Ada) or 2× 24 GB |
| DeepSeek-V3 / R1 (full) | 671B MoE | ~350–400 GB | 400 GB+ | Multi-GPU server or device cluster |
A few practical reads of this table:
- 8 GB cards are real but cramped: the 1.5B distill runs fine, and a 7B/8B distill fits only if you drop to a smaller context or accept some CPU spillover.
- 12 GB is the consumer sweet spot for the 7B/8B distills with room for a real 8K–16K context.
- 16 GB lands the 14B distill, which is a noticeable step up in reasoning quality.
- 24 GB (RTX 4090) is the practical ceiling for a single consumer card: it holds the 32B distill, the strongest variant most people will run locally.
- 48 GB holds the 70B distill on one workstation card, or you split it across 2× 24 GB GPUs with llama.cpp's layer split.
- The full 671B model is out of reach for a single consumer box. You either rent a multi-GPU server (8× 80 GB-class cards) or pool unified memory across a cluster of Apple Silicon Macs (the exo/clustering route). By the weight formula above, 671B at 8-bit is about 671 GB of weights alone, so a 512 GB pool only holds a roughly 4-bit quant. See our companion pieces on multi-GPU and clustering for that path.
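The tier column of the table can be turned into a small lookup. The sketch below is only an illustration: the `TIERS` list copies the "Card / tier that fits" values from the table, the tags follow the Ollama naming used later in this guide, and `largest_fitting` is a hypothetical helper name.

```python
from typing import Optional

# (Ollama tag, smallest card tier in GB that holds it comfortably at Q4_K_M),
# copied from the "Card / tier that fits" column of the table above.
TIERS = [
    ("deepseek-r1:1.5b", 8),
    ("deepseek-r1:7b", 12),
    ("deepseek-r1:8b", 12),
    ("deepseek-r1:14b", 16),
    ("deepseek-r1:32b", 24),
    ("deepseek-r1:70b", 48),
]


def largest_fitting(vram_gb: float) -> Optional[str]:
    """Return the largest distill whose tier fits in vram_gb, or None."""
    fitting = [tag for tag, tier in TIERS if tier <= vram_gb]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for vram in (8, 12, 16, 24, 48):
        print(f"{vram} GB -> {largest_fitting(vram)}")
```

The lookup is deliberately conservative: it uses the comfortable tier rather than the bare minimum, so a smaller card may still run a variant with a reduced context or some CPU offload.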
When a larger model only fits at a very low quant, prefer the next size down at a higher quant: a 14B at Q6_K will often beat a 32B squeezed to Q3. Quality degrades only mildly down to ~4 bits and drops off sharply below that.
KV cache is the other consumer of VRAM, and it grows linearly with context. For these distill-sized models with GQA it is modest at short context (roughly 0.5–2 GB at 8K, depending on the variant), but it adds up fast on the larger models: an f16 cache for a 32B distill is on the order of 8 GB at 32K tokens and 16 GB at 64K, so plan for it or quantize the KV cache. The llm-vram-calculator folds KV and overhead in for you so you are not guessing.
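To make the linear growth concrete, here is a minimal sketch of the usual KV-cache size estimate for a GQA transformer: one K and one V tensor per layer, each holding `n_kv_heads × head_dim` values per token. The layer and head counts in the example are assumptions based on the published Qwen2.5 7B and 32B configurations; check the `config.json` of the model you actually run.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """Estimate KV-cache size in bytes (bytes_per_element=2.0 means f16)."""
    # Factor 2: every layer stores both a K tensor and a V tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed shapes: (layers, KV heads, head dim). Verify against config.json.
    shapes = {"7B-class (Qwen2.5)": (28, 4, 128), "32B-class (Qwen2.5)": (64, 8, 128)}
    for name, (layers, kv_heads, dim) in shapes.items():
        for ctx in (8192, 65536):
            size = kv_cache_bytes(layers, kv_heads, dim, ctx) / GIB
            print(f"{name} @ {ctx} tokens: {size:.2f} GiB")
```

Doubling the context doubles the result, which is why a long-context run can cost more VRAM than stepping up a quant level.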
Step-by-step: running with Ollama
Ollama is the fastest path. It wraps llama.cpp, manages GGUF downloads, and exposes both a native REST API and OpenAI-compatible routes.
1. Install and pull a variant. Pick the tag that matches your VRAM from the table above:
# 8B distill — good default for a 12 GB card
ollama run deepseek-r1:8b
# others: deepseek-r1:1.5b · :7b · :14b · :32b · :70b
The ollama run command pulls a Q4_K_M GGUF on first use, then drops you into a chat. The model streams a <think>...</think> reasoning block before its answer — that is expected for R1 distills.
2. Fix the context window. This is the one mistake everyone makes. Ollama's default context has long been 2048 tokens (recent releases raise it to 4096), and either value silently truncates long prompts and cuts off reasoning chains mid-thought. Raise it:
# server-wide, before launching
export OLLAMA_CONTEXT_LENGTH=16384
ollama serve
Or bake it into a Modelfile per-model:
cat > Modelfile <<'EOF'
FROM deepseek-r1:14b
PARAMETER num_ctx 16384
PARAMETER temperature 0.6
EOF
ollama create deepseek-r1-14b-16k -f Modelfile
ollama run deepseek-r1-14b-16k
DeepSeek recommends a temperature around 0.6 for R1 — higher tends to produce rambling chains, lower can collapse the reasoning. Our ollama-command-builder generates these pull/run commands and Modelfiles for any tag and context length so you don't have to memorize the flags.
3. Use the API. Ollama serves a REST API on port 11434 and OpenAI-compatible endpoints under /v1:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1:8b",
"messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
}'
Because that endpoint speaks the OpenAI dialect, any client or SDK that points at base_url=http://localhost:11434/v1 works unchanged — point your existing tooling at it and you are done.
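One caveat on the context fix from step 2: the OpenAI-compatible route has no field for context size, so a per-request override has to go through Ollama's native `/api/chat` endpoint and its `options` object. The sketch below builds such a request with the standard library only. The helper names `build_chat_request` and `send` are illustrative, and it assumes a server on the default port with the model already pulled.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # native (non-OpenAI) chat route


def build_chat_request(model: str, prompt: str, num_ctx: int = 16384,
                       temperature: float = 0.6) -> urllib.request.Request:
    """Build a non-streaming chat request that overrides num_ctx for this call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["message"]["content"]
```

Calling `send(build_chat_request("deepseek-r1:8b", "Prove that sqrt(2) is irrational."))` would return the reply text. For tooling that must stay on the `/v1` route, the Modelfile or the environment variable from step 2 remains the way to set the window.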
Step-by-step: running with llama.cpp
Reach for llama.cpp directly when you want control Ollama hides: a specific quant (IQ-quants to squeeze a bigger variant into tight VRAM), explicit multi-GPU splitting, or KV-cache quantization for long context.
1. Build it (with CUDA here; use -DGGML_METAL=ON on Apple Silicon):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
2. Get a GGUF. Pull a quantized DeepSeek distill from a Hugging Face GGUF repo (the bartowski and unsloth repos publish the full quant ladder). Match the file to your VRAM:
# example: 14B distill, Q4_K_M
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf --local-dir ./models
3. Run the server. llama-server exposes its own OpenAI-compatible API on port 8080:
# -c sets the context length explicitly; -ngl 99 offloads all layers to the GPU
./build/bin/llama-server \
  -m ./models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  -c 16384 \
  -ngl 99 \
  --host 0.0.0.0 --port 8080
Multi-GPU. To hold the 70B distill on 2× 24 GB cards, llama.cpp's default --split-mode layer assigns contiguous layer ranges to each GPU (low inter-GPU traffic). For two cards of unequal size, bias the split with --tensor-split 60,40. Use row/tensor split mode only if you have fast NVLink-class interconnect; for most consumer pairs, layer split is the right default. For the full 671B across separate machines you are into llama.cpp's RPC backend or a clustering framework — a different article.
Tight-VRAM tricks. If a variant is just over your budget, an IQ4_XS or IQ3 I-quant fits where Q4_K_M won't, at slightly slower decode. You can also quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 to roughly halve KV memory at long context; quantizing the V cache requires flash attention (-fa) to be enabled.
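For the unequal-card case in the multi-GPU paragraph, the `--tensor-split` proportions can be derived from each card's VRAM. This is a small sketch under that assumption; `tensor_split_arg` and `server_command` are hypothetical helper names, and the flags are the ones already used in this section.

```python
def tensor_split_arg(vram_gb_per_gpu: list[float]) -> str:
    """Turn per-GPU VRAM sizes into a --tensor-split value such as '60,40'."""
    total = sum(vram_gb_per_gpu)
    return ",".join(str(round(100 * vram / total)) for vram in vram_gb_per_gpu)


def server_command(model_path: str, vram_gb_per_gpu: list[float],
                   context: int = 16384) -> list[str]:
    """Assemble a llama-server argv list with layer split across the given GPUs."""
    return [
        "./build/bin/llama-server",
        "-m", model_path,
        "-c", str(context),
        "-ngl", "99",
        "--split-mode", "layer",
        "--tensor-split", tensor_split_arg(vram_gb_per_gpu),
    ]


if __name__ == "__main__":
    # A 24 GB card paired with a 16 GB card; the model path is a placeholder.
    print(" ".join(server_command("./models/model.gguf", [24, 16])))
```

Splitting purely by VRAM is a starting point; if the first GPU also drives a display, you may want to bias the ratio slightly away from it.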
Performance expectations
Decode speed is memory-bandwidth-bound, not compute-bound. A dense model reads (most of) its weights from VRAM for every token, so the ceiling is roughly:
tok/s ≈ memory_bandwidth ÷ bytes_read_per_token
Real throughput lands at ~50–80% of that ceiling. The practical implication: tokens/sec scales with your GPU's bandwidth, not its FLOPs, and bigger models are proportionally slower because there are more weight bytes to stream per token. The table below gives single-stream (batch=1) Q4 ballparks.
| Variant (Q4) | RTX 4090 (24 GB, ~1008 GB/s) | RTX 5090 (32 GB, ~1792 GB/s) | Apple M4 Max (~546 GB/s) |
|---|---|---|---|
| 7B / 8B distill | ~90–140 tok/s | ~140–220 tok/s | ~40–80 tok/s |
| 14B distill | ~50–80 tok/s | ~90–130 tok/s | ~25–45 tok/s |
| 32B distill | ~25–40 tok/s | ~45–70 tok/s | ~12–22 tok/s |
| 70B distill | doesn't fit on 1 card; 2× 4090: ~18–25 tok/s | doesn't fit on 1 card; 2× 5090: ~30–45 tok/s | ~8–13 tok/s (unified RAM) |
These are order-of-magnitude estimates from the bandwidth model plus community benchmarks; your real numbers shift with engine, flash-attention, quant kernel, and context length (longer context = slower, because attention has to scan a bigger KV cache). For a tighter estimate on your exact card, our llm-inference-speed-calculator derives tok/s from bandwidth, and llm-gpu-benchmark measures your real WebGPU bandwidth in-browser.
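The bandwidth model behind those estimates is simple enough to sketch. The functions below apply the ceiling formula and the ~50–80% efficiency band from this section; the function names are illustrative, and the example plugs in the RTX 4090 bandwidth and the 32B distill's ~19 GB of Q4_K_M weights from the tables above.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed: memory bandwidth / bytes read per token."""
    return bandwidth_gb_s / gb_read_per_token


def realistic_range(ceiling_tok_s: float, low: float = 0.5,
                    high: float = 0.8) -> tuple[float, float]:
    """Scale the ceiling by the ~50-80% efficiency band quoted in the text."""
    return ceiling_tok_s * low, ceiling_tok_s * high


if __name__ == "__main__":
    ceiling = decode_ceiling_tok_s(1008, 19)  # RTX 4090, 32B distill at Q4_K_M
    low, high = realistic_range(ceiling)
    print(f"ceiling ~{ceiling:.0f} tok/s, realistic ~{low:.0f}-{high:.0f} tok/s")
```

For a dense model, `gb_read_per_token` is approximately the size of the quantized weights, which is why the result for this example lands close to the table's 32B figure for the 4090.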
The reasoning caveat matters here. An R1 distill emits a long thinking block before its answer, so wall-clock time to a useful response is longer than the raw tok/s suggests — a "20 tok/s" 32B distill might spend 800 tokens reasoning before 200 tokens of answer. If you mostly want fast code or chat rather than deep reasoning, a same-size non-reasoning model finishes sooner; our companion guide on running Qwen3-Coder locally covers that trade-off for coding workloads specifically.
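The wall-clock effect in that example is easy to quantify. This minimal sketch divides total generated tokens by decode speed; it ignores prompt-processing time, and `seconds_to_answer` is an illustrative name.

```python
def seconds_to_answer(think_tokens: int, answer_tokens: int, tok_per_s: float) -> float:
    """Decode time for the thinking block plus the answer (prefill ignored)."""
    return (think_tokens + answer_tokens) / tok_per_s


if __name__ == "__main__":
    reasoning = seconds_to_answer(800, 200, 20)  # the example from the text
    direct = seconds_to_answer(0, 200, 20)       # same answer with no thinking block
    print(f"with reasoning: {reasoning:.0f} s, without: {direct:.0f} s")
```

At the same 20 tok/s, the reasoning run takes five times as long to deliver the same 200-token answer.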
Connecting it to your tools
Both Ollama (:11434/v1) and llama-server (:8080/v1) speak the OpenAI dialect, so wiring DeepSeek into an existing app is usually a two-line change: swap the base_url to your local endpoint and set any non-empty API key. Agent frameworks, IDE plugins, and chat UIs that accept a custom OpenAI base URL work unchanged.
from openai import OpenAI

# Any non-empty api_key works; the local server does not check it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Refactor this function for readability."}],
)
print(resp.choices[0].message.content)
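Most tools want only the final answer, not the thinking block. The sketch below separates the two, assuming the reply text contains an inline `<think>...</think>` block as described earlier; some server versions or settings may return the reasoning in a separate field instead, in which case this step is unnecessary. `split_reasoning` is an illustrative helper name.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no think block exists."""
    match = THINK_RE.search(text)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>\nCheck the loop bounds first.\n</think>\n\nUse enumerate() here."
    reasoning, answer = split_reasoning(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Keeping the reasoning around rather than discarding it is useful for debugging prompts, even if you only show the answer to users.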
Two things to plan for once you go past a single laptop. First, one machine serves a single stream well but handles concurrency poorly: Ollama and llama-server are single-node, so for many simultaneous users you want a throughput engine (vLLM) or multiple nodes behind a router. Second, a local box has no failover: when it's asleep, rebooting, or saturated, requests fail. If you're weighing local cost against the API, our self-hosted-llm-cost-calculator shows the break-even point against DeepSeek's own hosted API (roughly $0.14 input / $0.28 output per 1M tokens for V3, and $0.55 / $2.19 for R1), and the llm-token-counter helps you estimate the token volume that decision hinges on.
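As a rough sketch of that break-even arithmetic, the snippet below prices a token volume at the API rates quoted above and divides a hardware cost by the monthly API bill. The prices are the approximate figures from this section, the hardware cost and token volumes in the example are purely hypothetical, and electricity and maintenance are ignored.

```python
# Approximate (input, output) USD prices per 1M tokens, as quoted in the text.
PRICES_PER_M = {
    "deepseek-v3": (0.14, 0.28),
    "deepseek-r1": (0.55, 2.19),
}


def api_cost_usd(model: str, input_tokens: float, output_tokens: float) -> float:
    """Cost of a given token volume on the hosted API."""
    price_in, price_out = PRICES_PER_M[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


def breakeven_months(hardware_usd: float, monthly_api_usd: float) -> float:
    """Months until a one-off hardware cost equals cumulative API spend."""
    return hardware_usd / monthly_api_usd


if __name__ == "__main__":
    # Hypothetical workload: 50M input and 20M output tokens per month on R1.
    monthly = api_cost_usd("deepseek-r1", 50e6, 20e6)
    # Hypothetical hardware budget of 1600 USD.
    print(f"API: ${monthly:.2f}/month, break-even in {breakeven_months(1600, monthly):.1f} months")
```

Remember that a reasoning model's thinking tokens count as output tokens, so estimate output volume generously.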
Conclusion
The rule is simple: match the variant to the silicon, and start smaller than you think. On a 12 GB card, run the 7B/8B R1 distill; on a 16 GB card, the 14B; on a 24 GB RTX 4090, the 32B, which is the strongest DeepSeek most people will ever run locally. Reserve the 70B distill for a 48 GB workstation or a 2× 24 GB pair, and treat the full 671B V3/R1 as a server-or-cluster project, not a desktop one. Pull it with Ollama in one command, raise Ollama's short default context window before you do anything serious, and reach for llama.cpp when you need a specific quant or multi-GPU control. Size it first with the llm-vram-calculator and what-llm-can-i-run so your first download is the one that actually fits.