Skip to main content
Home/Blog/Run Qwen3-Coder 100% Locally with Ollama (Hardware Requirements + Step-by-Step)
Developer Tools

Run Qwen3-Coder 100% Locally with Ollama (Hardware Requirements + Step-by-Step)

A practical, engineer-to-engineer guide to self-hosting Qwen3-Coder with Ollama, LM Studio, or llama.cpp — including a hardware/VRAM table, quantization picks, and how to wire it into Qwen Code over an OpenAI-compatible endpoint.

By Sean

The free ride is over. Qwen Code's free Qwen OAuth tier dropped from 1,000 to 100 requests per day in mid-April 2026, and the free OAuth service itself was discontinued shortly after — pushing everyone toward API keys or paid plans. If you were relying on that tier to run an agentic coding loop all day, 100 requests evaporate before lunch.

The good news: Qwen3-Coder ships as open weights, and a 30B-class model now runs on a single consumer GPU. Self-hosting it means no rate limits, no per-token billing, and no code leaving your machine. This guide covers the hardware you need, which quantization to pick, how to run it with Ollama (and the alternatives), and how to wire it into Qwen Code over a local OpenAI-compatible endpoint.

Why the 30B runs on a laptop GPU

Qwen3-Coder comes in two open-weight sizes, both Mixture-of-Experts (MoE):

  • Qwen3-Coder-30B-A3B-Instruct — 30.5B total params, but only ~3.3B activate per token.
  • Qwen3-Coder-480B-A35B-Instruct — the 480B flagship, ~35B activated per token.

The MoE architecture is the whole trick. Even though the 30B has 30 billion parameters sitting in memory, each token only routes through ~3.3B of them, so inference is fast relative to the model's size. That's why a "30B" model runs efficiently on hardware that would choke on a dense 30B. Both variants support a native 256K (262K) token context window, extendable toward ~1M via YaRN — Unsloth even publishes dedicated 1M-context GGUF builds.

Hardware requirements

Here's the part everyone actually scrolls to. VRAM figures below are for Qwen3-Coder-30B-A3B; the 480B numbers are order-of-magnitude.

SetupMemory neededWhat runsNotes
30B @ IQ2_XXS~8.8 GB30B (minimal quality)Last-resort tiny quant
30B @ Q4_K_M~18.7 GB30B (recommended)Best size/quality tradeoff
30B @ Q5_K_S~21.4 GB30BMarginal quality gain
30B @ Q6_K~25.6 GB30BExceeds a 24GB 4090
30B @ BF16~61.5 GB30B (full precision)Workstation/server only
30B @ Q4 + full 262K context~31.5 GB30B + max contextContext alone adds up to ~12.8 GB
480B @ UD-Q2_K_XL~150 GB480B (6+ tok/s)Mac Studio 512GB / multi-GPU
480B (Ollama)~250 GB+480BHigh-end only

A few practical readings of this table:

  • Minimum usable bar (30B): ~18GB of combined VRAM+RAM (or 18GB system RAM) for 6+ tok/s at a 4-bit quant.
  • A 24GB GPU (RTX 3090/4090): comfortably runs 30B at Q4_K_M with some context. It cannot hold Q6_K or the full 262K context simultaneously without offloading (e.g. llama.cpp -ot to push experts onto CPU).
  • Apple Silicon: 32GB unified memory handles 30B at Q4; 64GB gives you real context headroom.

If you take one number away: ~18GB of memory gets you a working local Qwen3-Coder 30B. Everything above that buys you bigger context windows or marginal quality, not a different model.

Pick your runtime

There are three mainstream ways to self-host, all of which expose an OpenAI-compatible endpoint that Qwen Code, Cline, and Continue can talk to.

RuntimeBest forEndpointWatch out
OllamaEasiest setup, auto quantlocalhost:11434/v1Default context is tiny
LM StudioGUI, MLX on Maclocalhost:1234/v1Click "Start Server" to expose API
llama.cppMaximum controlyour --portMust pass --jinja for tools

Ollama is the path of least resistance. Install it, then:

ollama pull qwen3-coder:30b
ollama run qwen3-coder:30b

Ollama auto-handles the GGUF download (~19GB for the 30B), the quantization, and serves a local API at localhost:11434. The qwen3-coder:30b tag is the default latest. If you want Unsloth's dynamic quant for slightly better quality at the same size, pull it straight from Hugging Face:

ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL

Critical: fix the context window. Ollama's default is commonly 4096 tokens — useless for an agent reading multiple files. Bump it to at least ~64000 via one of:

  • OLLAMA_CONTEXT_LENGTH environment variable
  • A Modelfile PARAMETER num_ctx 64000 (this takes precedence over the env var)
  • A per-run override or UI slider

Option 2: LM Studio (GUI + Mac MLX)

If you prefer a GUI or you're on a Mac and want MLX-optimized inference:

  1. Install from lmstudio.ai.
  2. Search qwen3-coder-30b and pick Qwen3 Coder 30B A3B Instruct (MLX on Mac, GGUF on Windows/Linux).
  3. Load the model.
  4. Click Start Server — it exposes an OpenAI-compatible endpoint at http://localhost:1234/v1.

LM Studio has historically been one of the more reliable options for Qwen tool calling, which matters for agentic use.

Option 3: llama.cpp (maximum control)

For the most control over flags, offload, and context:

./llama.cpp/llama-cli \
  -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL \
  --jinja -ngl 99 --ctx-size 32768 \
  --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05

The --jinja flag is non-negotiable if you want tool/function calling — without it, agentic tool calls silently fail. -ngl 99 offloads all layers to GPU; drop it or use -ot to keep experts on CPU when you're VRAM-constrained. Run llama-server instead of llama-cli to expose an HTTP endpoint.

Sampling settings

Whatever runtime you choose, use the recommended sampling parameters for Qwen3-Coder:

Temperature: 0.7
Top-P:       0.8
Top-K:       20
Min-P:       0.0
Repetition Penalty: 1.05

Wire it into Qwen Code

Install the CLI (requires Node.js):

npm install -g @qwen-code/qwen-code@latest

Qwen Code is an open-source terminal coding agent (a Gemini-CLI fork adapted for Qwen). The simple wiring method is environment variables — set these in a project .env, ~/.qwen/.env, or your shell:

export OPENAI_API_KEY="local"                    # any non-empty string
export OPENAI_BASE_URL="http://localhost:11434/v1"  # Ollama; LM Studio uses :1234/v1
export OPENAI_MODEL="qwen3-coder:30b"

The /v1 suffix on the base URL is required. For LM Studio, swap the port to 1234 and use the model id LM Studio shows.

Newer Qwen Code versions favor a modelProviders block in ~/.qwen/settings.json instead of the flat OPENAI_* variables — with per-provider envKey, baseUrl, id, and generationConfig fields (an example uses envKey: "OLLAMA_API_KEY" with baseUrl: "http://localhost:11434/v1"). The exact config differs by version, and both methods appear in current docs, so match whichever your installed build expects.

The same OpenAI-compatible endpoint works with Cline, Continue, and other VS Code agents — point them at localhost:11434/v1 (or :1234/v1) and select the local model. You're not locked into Qwen Code.

The 480B reality check

Can you run the 480B flagship locally? Technically yes; practically, only on serious hardware. It needs ~150GB of memory for 6+ tok/s with Unsloth's UD-Q2_K_XL dynamic quant, or ~250GB+ per Ollama's estimate. The download for qwen3-coder:480b is around 290GB. That's a Mac Studio with 512GB unified memory or a multi-GPU server — not a desktop. If you want flagship quality without the hardware, Ollama offers qwen3-coder:480b-cloud to offload to Ollama Cloud, which is a different cost/privacy tradeoff than true self-hosting. If you'd rather keep a local-first setup but still have a managed fallback, Wide Area AI is an OpenAI-compatible gateway that routes inference to your own nodes first and only fails over to cloud providers when local hardware is unavailable — so your CLI points at one base URL and requests served locally cost nothing per token.

Bottom line

With the free Qwen OAuth tier gone, self-hosting Qwen3-Coder is the obvious move for anyone running agentic coding loops at volume — no rate limits, no per-token bills, no code leaving your box. The 30B-A3B model is the sweet spot: ~18GB of memory and it runs, a 24GB GPU runs it well at Q4_K_M, and Apple Silicon with 32GB+ handles it via MLX. Use Ollama for the fastest start, bump the context window past the tiny default, remember --jinja if you go the llama.cpp route for tool calling, and point Qwen Code (or Cline/Continue) at localhost:11434/v1. The 480B flagship stays in workstation/cloud territory — but for daily coding, the local 30B is more than enough.

Frequently Asked Questions

Find answers to common questions

For the 30B-A3B model, Unsloth's rule of thumb is ~18GB of combined memory (VRAM + RAM, or 18GB system RAM) for usable speed of 6+ tokens/sec at a 4-bit quant. At Q4_K_M the weights are about 18.7GB. The flagship 480B model is a different universe: roughly 150GB of unified memory or RAM for 6+ tok/s using Unsloth's dynamic UD-Q2_K_XL quant, or ~250GB+ per Ollama — realistic only on a Mac Studio with 512GB unified memory or a multi-GPU server.

Yes, at Q4_K_M (~18.7GB) a 24GB card runs the 30B model comfortably with room for some context. What a 24GB card cannot do is hold Q6_K (~25.6GB) or the full 262K-token context at the same time — full context adds up to ~12.8GB on top of the weights (≈31.5GB total at Q4). For higher quants or huge context you need more VRAM or CPU/RAM offload. On a Mac, 32GB unified memory handles the 30B at Q4 and 64GB gives you comfortable context headroom.

Q4_K_M is the standard best size-to-quality tradeoff for Ollama and LM Studio. If you want a bit more quality at similar size, Unsloth's dynamic UD-Q4_K_XL for the 30B reportedly nearly matched full BF16 quality in third-party testing and is their recommended pick. Going up to Q5_K_S (~21.4GB) or Q6_K (~25.6GB) costs more VRAM for diminishing returns. Sub-4-bit quants like IQ2_XXS shrink it to ~8.8GB but with noticeable quality loss.

Install Ollama, then run "ollama pull qwen3-coder:30b" followed by "ollama run qwen3-coder:30b". Ollama handles the GGUF download and quantization automatically and exposes a local API at localhost:11434. The 30B tag is the default (also "qwen3-coder:latest"), around a 19GB download. You can also pull Unsloth's dynamic quant directly with "ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL".

Ollama's default context is small — commonly 4096 tokens, though newer versions scale the default by available VRAM. For agentic coding you want at least ~64000. Set it via the OLLAMA_CONTEXT_LENGTH environment variable, a Modelfile "PARAMETER num_ctx", a per-run override, or a UI slider. Note that a Modelfile num_ctx takes precedence over the env var. Larger context costs more memory, so balance it against your VRAM.

Agentic coding needs function/tool calling, which depends on the correct chat template. For llama.cpp or llama-server you MUST pass --jinja or tool calling silently fails. Unsloth's current 30B GGUF uploads already bundle the chat-template and tool-calling fixes (resolved around August 2025). The official Ollama qwen3-coder model is built for agentic/tool use, but older notes reported Ollama struggled with Qwen tool-call templates — so verify tool calling actually works on your Ollama version before trusting it.

The simple method is environment variables: set OPENAI_API_KEY to any non-empty string (it is local), OPENAI_BASE_URL to your endpoint (http://localhost:11434/v1 for Ollama or http://localhost:1234/v1 for LM Studio — the /v1 suffix is required), and OPENAI_MODEL to the local model id. Put them in a project .env, ~/.qwen/.env, or export them. Newer Qwen Code versions prefer a "modelProviders" block in ~/.qwen/settings.json instead; the exact config differs by version, so check the docs for the build you installed.

The 480B flagship is the stronger model, but the 30B-A3B is genuinely capable for everyday coding and runs on a single consumer GPU thanks to its MoE design (only ~3.3B params activate per token). For most local agentic work the 30B is the practical choice; the 480B is reserved for high-end workstations or Ollama Cloud offload. Expect 6+ tok/s on hardware that meets the minimum memory bar, and faster on a 24GB GPU.

Building Something Great?

Our development team builds secure, scalable applications. From APIs to full platforms, we turn your ideas into production-ready software.