What hardware do I actually need to run Qwen3-Coder locally?

For the 30B-A3B model, Unsloth's rule of thumb is ~18GB of combined memory (VRAM + RAM, or 18GB system RAM) for usable speed of 6+ tokens/sec at a 4-bit quant. At Q4_K_M the weights are about 18.7GB. The flagship 480B model is a different universe: roughly 150GB of unified memory or RAM for 6+ tok/s using Unsloth's dynamic UD-Q2_K_XL quant, or ~250GB+ per Ollama — realistic only on a Mac Studio with 512GB unified memory or a multi-GPU server.

Can I run Qwen3-Coder 30B on a 24GB GPU like an RTX 3090/4090?

Yes, at Q4_K_M (~18.7GB) a 24GB card runs the 30B model comfortably with room for some context. What a 24GB card cannot do is hold Q6_K (~25.6GB) or the full 262K-token context at the same time — full context adds up to ~12.8GB on top of the weights (≈31.5GB total at Q4). For higher quants or huge context you need more VRAM or CPU/RAM offload. On a Mac, 32GB unified memory handles the 30B at Q4 and 64GB gives you comfortable context headroom.

Which GGUF quantization should I pick?

Q4_K_M is the standard best size-to-quality tradeoff for Ollama and LM Studio. If you want a bit more quality at similar size, Unsloth's dynamic UD-Q4_K_XL for the 30B reportedly nearly matched full BF16 quality in third-party testing and is their recommended pick. Going up to Q5_K_S (~21.4GB) or Q6_K (~25.6GB) costs more VRAM for diminishing returns. Sub-4-bit quants like IQ2_XXS shrink it to ~8.8GB but with noticeable quality loss.

How do I install and run Qwen3-Coder with Ollama?

Install Ollama, then run "ollama pull qwen3-coder:30b" followed by "ollama run qwen3-coder:30b". Ollama handles the GGUF download and quantization automatically and exposes a local API at localhost:11434. The 30B tag is the default (also "qwen3-coder:latest"), around a 19GB download. You can also pull Unsloth's dynamic quant directly with "ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL".

How do I increase Ollama's context window for coding?

Ollama's default context is small — 2048 tokens by default (newer builds may raise this for some models). For agentic coding you want at least ~64000. Set it via the OLLAMA_CONTEXT_LENGTH environment variable, a Modelfile "PARAMETER num_ctx", a per-run override, or a UI slider. Note that a Modelfile num_ctx takes precedence over the env var. Larger context costs more memory, so balance it against your VRAM.

Why isn't tool calling / file editing working?

Agentic coding needs function/tool calling, which depends on the correct chat template. For llama.cpp or llama-server you MUST pass --jinja or tool calling silently fails. Unsloth's current 30B GGUF uploads already bundle the chat-template and tool-calling fixes (resolved around August 2025). The official Ollama qwen3-coder model is built for agentic/tool use, but older notes reported Ollama struggled with Qwen tool-call templates — so verify tool calling actually works on your Ollama version before trusting it.

How do I connect Qwen Code CLI to my local model?

The simple method is environment variables: set OPENAI_API_KEY to any non-empty string (it is local), OPENAI_BASE_URL to your endpoint (http://localhost:11434/v1 for Ollama or http://localhost:1234/v1 for LM Studio — the /v1 suffix is required), and OPENAI_MODEL to the local model id. Put them in a project .env, ~/.qwen/.env, or export them. Newer Qwen Code versions prefer a "modelProviders" block in ~/.qwen/settings.json instead; the exact config differs by version, so check the docs for the build you installed.

Is the local 30B good enough versus the 480B or cloud Qwen-Coder?

The 480B flagship is the stronger model, but the 30B-A3B is genuinely capable for everyday coding and runs on a single consumer GPU thanks to its MoE design (only ~3.3B params activate per token). For most local agentic work the 30B is the practical choice; the 480B is reserved for high-end workstations or Ollama Cloud offload. Expect 6+ tok/s on hardware that meets the minimum memory bar, and faster on a 24GB GPU.

Run Qwen3-Coder 100% Locally with Ollama (Hardware Requirements + Step-by-Step)

The free ride is over. Qwen Code's free Qwen OAuth tier dropped from 1,000 to 100 requests per day in mid-April 2026, and the free OAuth service itself was discontinued shortly after — pushing everyone toward API keys or paid plans. If you were relying on that tier to run an agentic coding loop all day, 100 requests evaporate before lunch.

The good news: Qwen3-Coder ships as open weights, and a 30B-class model now runs on a single consumer GPU. Self-hosting it means no rate limits, no per-token billing, and no code leaving your machine. This guide covers the hardware you need, which quantization to pick, how to run it with Ollama (and the alternatives), and how to wire it into Qwen Code over a local OpenAI-compatible endpoint.

Why the 30B runs on a laptop GPU

Qwen3-Coder comes in two open-weight sizes, both Mixture-of-Experts (MoE):

Qwen3-Coder-30B-A3B-Instruct — 30.5B total params, but only ~3.3B activate per token.
Qwen3-Coder-480B-A35B-Instruct — the 480B flagship, ~35B activated per token.

The MoE architecture is the whole trick. Even though the 30B has 30 billion parameters sitting in memory, each token only routes through ~3.3B of them, so inference is fast relative to the model's size. That's why a "30B" model runs efficiently on hardware that would choke on a dense 30B. Both variants support a native 256K (262K) token context window, extendable toward ~1M via YaRN — Unsloth even publishes dedicated 1M-context GGUF builds.

Hardware requirements

Here's the part everyone actually scrolls to. VRAM figures below are for Qwen3-Coder-30B-A3B; the 480B numbers are order-of-magnitude.

Setup	Memory needed	What runs	Notes
30B @ IQ2_XXS	~8.8 GB	30B (minimal quality)	Last-resort tiny quant
30B @ Q4_K_M	~18.7 GB	30B (recommended)	Best size/quality tradeoff
30B @ Q5_K_S	~21.4 GB	30B	Marginal quality gain
30B @ Q6_K	~25.6 GB	30B	Exceeds a 24GB 4090
30B @ BF16	~61.5 GB	30B (full precision)	Workstation/server only
30B @ Q4 + full 262K context	~31.5 GB	30B + max context	Context alone adds up to ~12.8 GB
480B @ UD-Q2_K_XL	~150 GB	480B (6+ tok/s)	Mac Studio 512GB / multi-GPU
480B (Ollama)	~250 GB+	480B	High-end only

A few practical readings of this table:

Minimum usable bar (30B): ~18GB of combined VRAM+RAM (or 18GB system RAM) for 6+ tok/s at a 4-bit quant.
A 24GB GPU (RTX 3090/4090): comfortably runs 30B at Q4_K_M with some context. It cannot hold Q6_K or the full 262K context simultaneously without offloading (e.g. llama.cpp -ot to push experts onto CPU).
Apple Silicon: 32GB unified memory handles 30B at Q4; 64GB gives you real context headroom.

Don't want to do the arithmetic by hand? The LLM VRAM calculator sizes weights plus KV cache for a given quant and context length and tells you whether it fits one card, needs multi-GPU, or fits Apple Silicon unified memory. If you'd rather start from your hardware, what LLM can I run detects your GPU and lists the models — and quants — that fit.

If you take one number away: ~18GB of memory gets you a working local Qwen3-Coder 30B. Everything above that buys you bigger context windows or marginal quality, not a different model.

Pick your runtime

There are three mainstream ways to self-host, all of which expose an OpenAI-compatible endpoint that Qwen Code, Cline, and Continue can talk to.

Runtime	Best for	Endpoint	Watch out
Ollama	Easiest setup, auto quant	`localhost:11434/v1`	Default context is tiny
LM Studio	GUI, MLX on Mac	`localhost:1234/v1`	Click "Start Server" to expose API
llama.cpp	Maximum control	your `--port`	Must pass `--jinja` for tools

Option 1: Ollama (recommended for most)

Ollama is the path of least resistance. Install it, then:

ollama pull qwen3-coder:30b
ollama run qwen3-coder:30b

Ollama auto-handles the GGUF download (~19GB for the 30B), the quantization, and serves a local API at localhost:11434. The qwen3-coder:30b tag is the default latest. If you want Unsloth's dynamic quant for slightly better quality at the same size, pull it straight from Hugging Face:

ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL

Critical: fix the context window. Ollama's default is 2048 tokens by default (newer builds may raise this for some models) — useless for an agent reading multiple files. Bump it to at least ~64000 via one of:

OLLAMA_CONTEXT_LENGTH environment variable
A Modelfile PARAMETER num_ctx 64000 (this takes precedence over the env var)
A per-run override or UI slider

Option 2: LM Studio (GUI + Mac MLX)

If you prefer a GUI or you're on a Mac and want MLX-optimized inference:

Install from lmstudio.ai.
Search qwen3-coder-30b and pick Qwen3 Coder 30B A3B Instruct (MLX on Mac, GGUF on Windows/Linux).
Load the model.
Click Start Server — it exposes an OpenAI-compatible endpoint at http://localhost:1234/v1.

LM Studio has historically been one of the more reliable options for Qwen tool calling, which matters for agentic use.

Option 3: llama.cpp (maximum control)

For the most control over flags, offload, and context:

./llama.cpp/llama-cli \
  -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL \
  --jinja -ngl 99 --ctx-size 32768 \
  --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05

The --jinja flag is non-negotiable if you want tool/function calling — without it, agentic tool calls silently fail. -ngl 99 offloads all layers to GPU; drop it or use -ot to keep experts on CPU when you're VRAM-constrained. Run llama-server instead of llama-cli to expose an HTTP endpoint.

Sampling settings

Whatever runtime you choose, use the recommended sampling parameters for Qwen3-Coder:

Temperature: 0.7
Top-P:       0.8
Top-K:       20
Min-P:       0.0
Repetition Penalty: 1.05

Generate a Modelfile, the pull/run commands, server-tuning env vars, and the OpenAI-compatible snippet to connect any coding agent to your local model:

Open the Ollama Config Generator toolFree, in your browser on inventivehq.com →

Loading interactive tool...

JavaScript Required

This interactive tool requires JavaScript to function. Please enable JavaScript in your browser to use the full features.

The tool description and documentation above provide information about this tool's capabilities. For the best experience, please enable JavaScript and refresh the page.

Wire it into Qwen Code

Install the CLI (requires Node.js):

npm install -g @qwen-code/qwen-code@latest

Qwen Code is an open-source terminal coding agent (a Gemini-CLI fork adapted for Qwen). The simple wiring method is environment variables — set these in a project .env, ~/.qwen/.env, or your shell:

export OPENAI_API_KEY="local"                    # any non-empty string
export OPENAI_BASE_URL="http://localhost:11434/v1"  # Ollama; LM Studio uses :1234/v1
export OPENAI_MODEL="qwen3-coder:30b"

The /v1 suffix on the base URL is required. For LM Studio, swap the port to 1234 and use the model id LM Studio shows.

Newer Qwen Code versions favor a modelProviders block in ~/.qwen/settings.json instead of the flat OPENAI_* variables — with per-provider envKey, baseUrl, id, and generationConfig fields (an example uses envKey: "OLLAMA_API_KEY" with baseUrl: "http://localhost:11434/v1"). The exact config differs by version, and both methods appear in current docs, so match whichever your installed build expects.

The same OpenAI-compatible endpoint works with Cline, Continue, and other VS Code agents — point them at localhost:11434/v1 (or :1234/v1) and select the local model. You're not locked into Qwen Code.

The 480B reality check

Can you run the 480B flagship locally? Technically yes; practically, only on serious hardware. It needs ~150GB of memory for 6+ tok/s with Unsloth's UD-Q2_K_XL dynamic quant, or ~250GB+ per Ollama's estimate. The download for qwen3-coder:480b is around 290GB. That's a Mac Studio with 512GB unified memory or a multi-GPU server — not a desktop. If you want flagship quality without the hardware, Ollama offers qwen3-coder:480b-cloud to offload to Ollama Cloud, which is a different cost/privacy tradeoff than true self-hosting. If you want to keep running on your own hardware but still have a managed fallback, Wide Area AI is an OpenAI-compatible gateway that routes inference to your own nodes first and only fails over to cloud providers when local hardware is unavailable — so your CLI points at one base URL and requests served locally cost nothing per token.

Bottom line

With the free Qwen OAuth tier gone, self-hosting Qwen3-Coder is the obvious move for anyone running agentic coding loops at volume — no rate limits, no per-token bills, no code leaving your box. The 30B-A3B model is the sweet spot: ~18GB of memory and it runs, a 24GB GPU runs it well at Q4_K_M, and Apple Silicon with 32GB+ handles it via MLX. Use Ollama for the fastest start, bump the context window past the tiny default, remember --jinja if you go the llama.cpp route for tool calling, and point Qwen Code (or Cline/Continue) at localhost:11434/v1. The 480B flagship stays in workstation/cloud territory — but for daily coding, the local 30B is more than enough.

Run Qwen3-Coder 100% Locally with Ollama (Hardware Requirements + Step-by-Step)

Why the 30B runs on a laptop GPU

Hardware requirements

Pick your runtime

Option 1: Ollama (recommended for most)

Option 2: LM Studio (GUI + Mac MLX)

Option 3: llama.cpp (maximum control)

Sampling settings

Wire it into Qwen Code

The 480B reality check

Bottom line

Frequently Asked Questions

Is Qwen Code Still Free? The 2026 Free-Tier Shutdown (and 3 Free Alternatives)

Qwen Code API Key Setup: Model Studio, OpenRouter & Local — Without the 401 Errors

Best Practices for AI Coding CLIs in Production

Run Qwen3-Coder 100% Locally with Ollama (Hardware Requirements + Step-by-Step)

Frequently Asked Questions

Free tools you can use right now

Related articles

Is Qwen Code Still Free? The 2026 Free-Tier Shutdown (and 3 Free Alternatives)

Qwen Code API Key Setup: Model Studio, OpenRouter & Local — Without the 401 Errors

Best Practices for AI Coding CLIs in Production