
Run DeepSeek Locally: Hardware Requirements and Step-by-Step Setup

How to self-host DeepSeek models on your own hardware — which variant and quantization to pick for your VRAM, how to run it with Ollama or llama.cpp, and what performance to expect.

By InventiveHQ Team

Pick the right DeepSeek variant for your hardware

The single biggest source of confusion with "running DeepSeek locally" is that "DeepSeek" is not one model. Before you pull anything, get the taxonomy straight, because it decides whether you need a $300 GPU or a $30,000 server.

There are two things people mean:

  • The full models — DeepSeek-V3 (the general/base "deepseek-chat" line, ~V3.1) and DeepSeek-R1 (the reasoning model). These are 671B-parameter Mixture-of-Experts models. They are genuinely frontier-class, and they are open-weight, but at 671B parameters they do not fit on any single consumer GPU. Even quantized to 4-bit they need on the order of 350–400 GB of memory.
  • The R1 distills: DeepSeek-R1-Distill-Qwen-1.5B/7B/14B/32B and DeepSeek-R1-Distill-Llama-8B/70B. These are not DeepSeek's own architecture. They are existing open models (Alibaba's Qwen2.5 and Meta's Llama-3) fine-tuned on reasoning traces generated by the full R1. You get R1-style "thinking" output at a size you can host on a laptop or a single gaming GPU. This is what the vast majority of local DeepSeek setups actually run.

The second axis is reasoning vs. base. R1 and its distills are reasoning models: they emit a long chain-of-thought (often wrapped in <think> tags) before the final answer. That makes them excellent at math, logic, and multi-step debugging — and slower and more token-hungry for everything else, because those thinking tokens are real tokens your GPU has to generate. DeepSeek-V3 is the non-reasoning general model; among the distills, treat them all as reasoning variants.

Honesty note: when someone says "I'm running DeepSeek on my 4090," they are almost always running a distill — a Qwen or Llama model wearing R1's reasoning, not the 671B original. That's not a downgrade to be ashamed of; it's the only sane choice on a single consumer card. Just know which one you have.

To size any of these against a specific card before you download a 20 GB file, run the numbers through our llm-vram-calculator or let what-llm-can-i-run detect your GPU and tell you which variants fit.

Hardware requirements

VRAM has three parts: weights + KV cache + ~1–2 GB runtime overhead. Weights dominate, and they follow a simple formula:

weights_bytes ≈ num_params × bytes_per_weight

At the recommended Q4_K_M quant, effective bytes-per-weight is about 0.55, so a 7-billion-parameter model is roughly 7e9 × 0.55 ≈ 3.9 GB of weights, plus KV cache and overhead (the R1-Distill-Qwen-7B is actually ~7.6B parameters, which is why it lands at ~4.4 GB in the table below). The table below maps each common DeepSeek variant to the GPU class that holds it comfortably at Q4_K_M with a usable (8K–16K) context window. "Fits" means the whole model lives in VRAM — no slow CPU offload.
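The sizing formula is easy to turn into a quick calculator. The following is a minimal sketch, not a definitive sizing tool: the function name `estimate_vram_gb` and the default KV-cache and overhead figures are illustrative assumptions, and 1 GB is taken as 1e9 bytes.

```python
def estimate_vram_gb(num_params: float,
                     bytes_per_weight: float = 0.55,  # approx. Q4_K_M, per the text
                     kv_cache_gb: float = 0.5,        # assumed placeholder, depends on context
                     overhead_gb: float = 1.5) -> dict:  # assumed midpoint of ~1-2 GB
    """Rough VRAM estimate: weights + KV cache + runtime overhead."""
    weights_gb = num_params * bytes_per_weight / 1e9
    total_gb = weights_gb + kv_cache_gb + overhead_gb
    return {"weights_gb": round(weights_gb, 2), "total_gb": round(total_gb, 2)}


if __name__ == "__main__":
    for label, params in [("7B nominal", 7e9), ("14B", 14e9), ("32B", 32e9)]:
        print(label, estimate_vram_gb(params))
```

Treat the result as a floor rather than a promise: a longer context window raises the KV-cache term, and you want some headroom on top.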

| DeepSeek variant | Params | Q4_K_M weights | Total VRAM (8–16K ctx) | Card / tier that fits |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | ~1.1 GB | ~3 GB | 8 GB (RTX 3050/4060), any modern laptop |
| R1-Distill-Qwen-7B | 7B | ~4.4 GB | ~7–8 GB | 12 GB (RTX 3060/4070) |
| R1-Distill-Llama-8B | 8B | ~4.9 GB | ~8 GB | 12 GB (RTX 3060/4070) |
| R1-Distill-Qwen-14B | 14B | ~8.0 GB | ~12 GB | 16 GB (RTX 4060 Ti 16G / 4080) |
| R1-Distill-Qwen-32B | 32B | ~19 GB | ~22–24 GB | 24 GB (RTX 3090/4090) |
| R1-Distill-Llama-70B | 70B | ~40–43 GB | ~48 GB | 48 GB (1× RTX 6000 Ada) or 2× 24 GB |
| DeepSeek-V3 / R1 (full) | 671B MoE | ~350–400 GB | 400 GB+ | Multi-GPU server or device cluster |

A few practical reads of this table:

  • 8 GB cards are real but cramped: the 1.5B distill runs fine, and a 7B/8B distill fits only if you drop to a smaller context or accept some CPU spillover.
  • 12 GB is the consumer sweet spot for the 7B/8B distills with room for a real 8K–16K context.
  • 16 GB lands the 14B distill, which is a noticeable step up in reasoning quality.
  • 24 GB (RTX 4090) is the practical ceiling for a single consumer card: it holds the 32B distill, the strongest variant most people will run locally.
  • 48 GB holds the 70B distill on one workstation card, or you split it across 2× 24 GB GPUs with llama.cpp's layer split.
  • The full 671B model is out of reach for a single consumer GPU box. You either rent a multi-GPU server (8× 80 GB-class cards) or pool unified memory across a cluster of Apple Silicon Macs (the exo/clustering route). Budget roughly 400 GB of pooled memory at 4-bit; at 8-bit the weights alone are about 670 GB. See our companion pieces on multi-GPU and clustering for that path.
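The tier list above can be expressed as a small lookup. This is a sketch under stated assumptions: the tier numbers are copied from the "Card / tier that fits" column of the table, the tags are the Ollama tags used later in this guide, and `best_variants` is a hypothetical helper name. It is deliberately conservative and ignores the "cramped but possible" cases.

```python
# Minimum card tier (GB) per Ollama tag, taken from the sizing table
# (Q4_K_M, roughly 8-16K context, whole model resident in VRAM).
MIN_CARD_GB = {
    "deepseek-r1:1.5b": 8,
    "deepseek-r1:7b": 12,
    "deepseek-r1:8b": 12,
    "deepseek-r1:14b": 16,
    "deepseek-r1:32b": 24,
    "deepseek-r1:70b": 48,
}


def best_variants(vram_gb: float) -> list[str]:
    """Return the tag(s) in the largest tier that fits entirely in vram_gb."""
    fitting = {tag: need for tag, need in MIN_CARD_GB.items() if need <= vram_gb}
    if not fitting:
        return []  # below the 8 GB tier nothing fits comfortably
    top_tier = max(fitting.values())
    return [tag for tag, need in fitting.items() if need == top_tier]


if __name__ == "__main__":
    for card in (8, 12, 16, 24, 48):
        print(f"{card} GB ->", best_variants(card))
```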

If you have VRAM headroom, spend it on a higher quant of a model that fits rather than squeezing a bigger model below 4 bits: a 14B at Q6_K will usually beat a 32B crushed to Q3. Quality degrades only mildly down to ~4 bits and drops off sharply below that.

[Figure: Total VRAM required by DeepSeek variant at Q4_K_M, versus common GPU memory tiers. Plotted values for the DeepSeek-R1 distills (Q4_K_M, ~8–16K context): 1.5B ≈ 3 GB, 7B ≈ 8 GB, 14B ≈ 12 GB, 32B ≈ 24 GB, 70B ≈ 48 GB, against the 12 GB, 24 GB (RTX 4090), and 48 GB workstation tiers. The full 671B is off the chart at 400 GB+.]

KV cache is the other consumer of VRAM, and it grows linearly with context. For these distill-sized models with GQA it is modest at modest context (on the order of 0.5–2 GB at 8K for the 7B–32B distills at 16-bit precision), but it adds up fast: if you push a 32B distill to 64K+ tokens, budget well over 10 GB extra or quantize the KV cache. The llm-vram-calculator folds KV and overhead in for you so you are not guessing.
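To see why context length matters, you can compute KV-cache size directly from a model's shape. This is a minimal sketch using the standard per-token formula (keys plus values, per layer, per KV head). The layer and head counts in the usage example are assumptions based on the Qwen2.5-class base models; confirm them against the `config.json` of the checkpoint you actually download.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """KV-cache size in GB (1 GB = 1e9 bytes) for one sequence.

    bytes_per_elem: 2.0 for a 16-bit cache, roughly 1.0 for an 8-bit cache.
    """
    # Factor of 2 covers both the key and the value tensor.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 1e9


if __name__ == "__main__":
    # Assumed shapes (verify in config.json): 7B-class = 28 layers, 4 KV heads;
    # 32B-class = 64 layers, 8 KV heads; head_dim 128 for both.
    print("7B  @ 8K :", round(kv_cache_gb(28, 4, 128, 8192), 2), "GB")
    print("32B @ 64K:", round(kv_cache_gb(64, 8, 128, 65536), 2), "GB")
    print("32B @ 64K, 8-bit KV:", round(kv_cache_gb(64, 8, 128, 65536, 1.0), 2), "GB")
```

The cost is linear in every argument, so doubling the context doubles the cache, and halving the bytes per element halves it.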

Step-by-step: running with Ollama

Ollama is the fastest path. It wraps llama.cpp, manages GGUF downloads, and exposes both a native REST API and OpenAI-compatible routes.

1. Install and pull a variant. Pick the tag that matches your VRAM from the table above:

# 8B distill — good default for a 12 GB card
ollama run deepseek-r1:8b

# others: deepseek-r1:1.5b · :7b · :14b · :32b · :70b

The ollama run command pulls a Q4_K_M GGUF on first use, then drops you into a chat. The model streams a <think>...</think> reasoning block before its answer — that is expected for R1 distills.

2. Fix the context window. This is the one mistake everyone makes. Ollama's documented default context is 2048 tokens, which silently truncates long prompts and cuts off reasoning chains mid-thought. Raise it:

# server-wide, before launching
export OLLAMA_CONTEXT_LENGTH=16384
ollama serve

Or bake it into a Modelfile per-model:

cat > Modelfile <<'EOF'
FROM deepseek-r1:14b
PARAMETER num_ctx 16384
PARAMETER temperature 0.6
EOF
ollama create deepseek-r1-14b-16k -f Modelfile
ollama run deepseek-r1-14b-16k

DeepSeek recommends a temperature around 0.6 for R1 — higher tends to produce rambling chains, lower can collapse the reasoning. Our ollama-command-builder generates these pull/run commands and Modelfiles for any tag and context length so you don't have to memorize the flags.
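A third option is to set the context per request through Ollama's native REST API, which accepts an `options` object alongside the messages. The sketch below assumes a default local install on port 11434 and uses only the standard library; `build_chat_payload` and `chat` are hypothetical helper names, not part of any SDK.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # native (non-OpenAI) endpoint


def build_chat_payload(model: str, prompt: str,
                       num_ctx: int = 16384, temperature: float = 0.6) -> dict:
    """Build a non-streaming chat request with an explicit context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


def chat(model: str, prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(model, prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running, `chat("deepseek-r1:8b", "Prove that sqrt(2) is irrational.")` returns the reply text. The per-request option is handy when only some calls need a long context, since a larger window costs VRAM.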

3. Use the API. Ollama serves a REST API on port 11434 and OpenAI-compatible endpoints under /v1:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:8b",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
  }'

Because that endpoint speaks the OpenAI dialect, any client or SDK that points at base_url=http://localhost:11434/v1 works unchanged — point your existing tooling at it and you are done.
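When you call an R1 distill from code, you usually want the final answer without the `<think>...</think>` block in front of it. Here is a minimal sketch of one way to separate the two, assuming the reasoning arrives inline in the message text as described above; `split_reasoning` is a hypothetical helper name.

```python
import re

# Non-greedy match across newlines for the first <think>...</think> block.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw R1-style completion."""
    match = _THINK_RE.search(text)
    if match is None:
        # No complete think block: treat everything as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>\nAssume sqrt(2) = p/q in lowest terms...\n</think>\n\nIt is irrational."
    reasoning, answer = split_reasoning(raw)
    print("reasoning chars:", len(reasoning))
    print("answer:", answer)
```

If the closing tag is missing, which can happen when generation is cut off by the context limit, the function returns the raw text as the answer so nothing is silently dropped.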

Step-by-step: running with llama.cpp

Reach for llama.cpp directly when you want control Ollama hides: a specific quant (IQ-quants to squeeze a bigger variant into tight VRAM), explicit multi-GPU splitting, or KV-cache quantization for long context.

1. Build it (with CUDA here; use -DGGML_METAL=ON on Apple Silicon):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

2. Get a GGUF. Pull a quantized DeepSeek distill from a Hugging Face GGUF repo (the bartowski and unsloth repos publish the full quant ladder). Match the file to your VRAM:

# example: 14B distill, Q4_K_M
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf --local-dir ./models

3. Run the server. llama-server exposes its own OpenAI-compatible API on port 8080:

# -c: context length; set it explicitly rather than relying on a default
# -ngl 99: offload all layers to the GPU
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
./build/bin/llama-server \
  -m ./models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  -c 16384 \
  -ngl 99 \
  --host 0.0.0.0 --port 8080

Multi-GPU. To hold the 70B distill on 2× 24 GB cards, llama.cpp's default --split-mode layer assigns contiguous layer ranges to each GPU (low inter-GPU traffic). For two cards of unequal size, bias the split with --tensor-split 60,40. Use row/tensor split mode only if you have fast NVLink-class interconnect; for most consumer pairs, layer split is the right default. For the full 671B across separate machines you are into llama.cpp's RPC backend or a clustering framework — a different article.
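For unequal cards, a reasonable starting point is to split in proportion to each card's VRAM. The helper below is a sketch with a hypothetical name, `tensor_split_arg`; it only formats the proportions for the `--tensor-split` flag mentioned above and leaves tuning to you.

```python
def tensor_split_arg(vram_gb: list[float]) -> str:
    """Format a --tensor-split argument proportional to per-GPU VRAM."""
    if not vram_gb or any(v <= 0 for v in vram_gb):
        raise ValueError("need at least one GPU with positive VRAM")
    total = sum(vram_gb)
    # llama.cpp treats the values as relative proportions, so percentages work.
    shares = [round(100 * v / total) for v in vram_gb]
    return "--tensor-split " + ",".join(str(s) for s in shares)


if __name__ == "__main__":
    print(tensor_split_arg([24, 16]))
    print(tensor_split_arg([24, 24]))
```

A 24 GB plus 16 GB pair yields the 60,40 split used in the example above. If the first GPU also drives your display, shift a few points away from it to leave room for the desktop.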

Tight-VRAM tricks. If a variant is just over your budget, an I-quant such as IQ4_XS or an IQ3-class file fits where Q4_K_M won't, at slightly slower decode. You can also quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 to roughly halve KV memory at long context; note that quantizing the V cache requires flash attention to be enabled.

Performance expectations

Decode speed is memory-bandwidth-bound, not compute-bound. A dense model reads (most of) its weights from VRAM for every token, so the ceiling is roughly:

tok/s ≈ memory_bandwidth ÷ bytes_read_per_token

Real throughput lands at ~50–80% of that ceiling. The practical implication: tokens/sec scales with your GPU's bandwidth, not its FLOPs, and bigger models are proportionally slower because there are more weight bytes to stream per token. The table below gives single-stream (batch=1) Q4 ballparks.

| Variant (Q4) | RTX 4090 (24 GB, ~1008 GB/s) | RTX 5090 (32 GB, ~1792 GB/s) | Apple M4 Max (~546 GB/s) |
|---|---|---|---|
| 7B / 8B distill | ~90–140 tok/s | ~140–220 tok/s | ~40–80 tok/s |
| 14B distill | ~50–80 tok/s | ~90–130 tok/s | ~25–45 tok/s |
| 32B distill | ~25–40 tok/s | ~45–70 tok/s | ~12–22 tok/s |
| 70B distill | doesn't fit on one card; 2× 4090: ~18–25 tok/s | needs two cards; 2× 5090: ~30–45 tok/s | ~8–13 tok/s (unified RAM) |

These are order-of-magnitude estimates from the bandwidth model plus community benchmarks; your real numbers shift with engine, flash-attention, quant kernel, and context length (longer context = slower, because attention has to scan a bigger KV cache). For a tighter estimate on your exact card, our llm-inference-speed-calculator derives tok/s from bandwidth, and llm-gpu-benchmark measures your real WebGPU bandwidth in-browser.
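The bandwidth model is simple enough to sketch in a few lines. This assumes a dense model whose full weight file is read once per generated token, and it applies the ~50–80% efficiency band quoted above; `decode_tok_per_s` is a hypothetical helper name, and the inputs in the example are the RTX 4090 bandwidth and 32B weight size from the tables in this guide.

```python
def decode_tok_per_s(bandwidth_gb_s: float, weights_gb: float,
                     efficiency: tuple[float, float] = (0.5, 0.8)) -> dict:
    """Estimate single-stream decode speed from memory bandwidth.

    ceiling = bandwidth / bytes read per token (approximated by weight size).
    low/high apply the assumed real-world efficiency band.
    """
    ceiling = bandwidth_gb_s / weights_gb
    low, high = efficiency
    return {"ceiling": ceiling, "low": ceiling * low, "high": ceiling * high}


if __name__ == "__main__":
    est = decode_tok_per_s(bandwidth_gb_s=1008, weights_gb=19)  # 32B distill on a 4090
    print({k: round(v, 1) for k, v in est.items()})
```

For the 32B distill on a 4090 this lands close to the range in the table. Smaller models tend to fall further below the ceiling, because fixed per-token costs matter more when there are fewer weight bytes to stream.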

The reasoning caveat matters here. An R1 distill emits a long thinking block before its answer, so wall-clock time to a useful response is longer than the raw tok/s suggests. A "20 tok/s" 32B distill might spend 800 tokens reasoning before 200 tokens of answer: that is 40 seconds of thinking before the answer starts and 50 seconds end to end. If you mostly want fast code or chat rather than deep reasoning, a same-size non-reasoning model finishes sooner; our companion guide on running Qwen3-Coder locally covers that trade-off for coding workloads specifically.

Connecting it to your tools

Both Ollama (:11434/v1) and llama-server (:8080/v1) speak the OpenAI dialect, so wiring DeepSeek into an existing app is usually a two-line change: swap the base_url to your local endpoint and set any non-empty API key. Agent frameworks, IDE plugins, and chat UIs that accept a custom OpenAI base URL work unchanged.

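from openai import OpenAI

# Point the standard OpenAI client at the local Ollama endpoint.
# The client requires an API key, but Ollama ignores its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Refactor this function for readability."}],
)
print(resp.choices[0].message.content)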

Two things to plan for once you go past a single laptop. First, one machine serves one stream well but concurrency poorly — Ollama and llama-server are single-node; for many simultaneous users you want a throughput engine (vLLM) or multiple nodes behind a router. Second, a local box has no failover — when it's asleep, rebooting, or saturated, requests fail. If you're weighing local cost against the API, our self-hosted-llm-cost-calculator shows the break-even point against DeepSeek's own hosted API (roughly $0.14/$0.28 per 1M tokens for V3, $0.55/$2.19 for R1), and the llm-token-counter helps you estimate the token volume that decision hinges on.

Conclusion

The rule is simple: match the variant to the silicon, and start smaller than you think. On a 12 GB card, run the 7B/8B R1 distill; on a 16 GB card, the 14B; on a 24 GB RTX 4090, the 32B — that last one is the strongest DeepSeek most people will ever run locally. Reserve the 70B distill for a 48 GB workstation or a 2× 24 GB pair, and treat the full 671B V3/R1 as a server-or-cluster project, not a desktop one. Pull it with Ollama in one command, fix the 2048-token context default before you do anything serious, and reach for llama.cpp when you need a specific quant or multi-GPU control. Size it first with the llm-vram-calculator and what-llm-can-i-run so your first download is the one that actually fits.

Frequently Asked Questions


It depends entirely on which DeepSeek you mean. The DeepSeek-R1 distilled models — Qwen and Llama backbones fine-tuned on R1's reasoning traces — run on consumer GPUs: the 7B/8B distills fit a 12 GB card at Q4_K_M, the 14B fits 16 GB, and the 32B fits a 24 GB RTX 4090. The full 671B DeepSeek-V3/R1 is a different universe: even at 4-bit it needs roughly 350–400 GB of memory, meaning a multi-GPU server or a clustered pool of unified-memory Macs. Use our llm-vram-calculator and What LLM Can I Run? tools to size your exact card before downloading anything.

For local use on a single consumer GPU, run one of the DeepSeek-R1 distills sized to your VRAM — DeepSeek-R1-Distill-Qwen-7B/14B/32B or the Llama-8B/70B distills. These are NOT the full DeepSeek model; they are smaller open models (Qwen2.5, Llama-3) trained to imitate R1's chain-of-thought, so you get reasoning-style output at a size you can actually host. The full 671B DeepSeek-V3 (general/base) and DeepSeek-R1 (reasoning) only make sense on multi-GPU servers or device clusters. Pick reasoning distills for math/logic/code; if you just want fast general chat, a non-reasoning model of the same size will be quicker since it won't spend tokens 'thinking'.

Install Ollama, then ollama run deepseek-r1:8b (or :7b, :14b, :32b, :70b) — Ollama pulls a Q4_K_M GGUF and starts a chat session plus a REST API on port 11434 with OpenAI-compatible routes under /v1. The critical gotcha: Ollama's default context window is only 2048 tokens, which truncates long prompts and clips reasoning chains. Raise it with OLLAMA_CONTEXT_LENGTH=16384 (or set num_ctx in a Modelfile). Our ollama-command-builder generates the exact pull/run commands and Modelfile for your chosen tag and context length.

The R1 distills are strong at structured reasoning (math, logic puzzles, and step-by-step debugging) because they were trained on R1's chain-of-thought. The trade-off is latency and token cost: a reasoning model emits a long 'thinking' block before its answer, so a simple question can burn hundreds or thousands of extra tokens. For pure code generation and editing, a coding-specialized model (such as a Qwen3-Coder model) is often faster and more concise; see our companion guide on running Qwen3-Coder locally. Use the R1 distill when the problem genuinely needs reasoning, not for every prompt.
