The free ride is over. Qwen Code's free Qwen OAuth tier dropped from 1,000 to 100 requests per day in mid-April 2026, and the free OAuth service itself was discontinued shortly after — pushing everyone toward API keys or paid plans. If you were relying on that tier to run an agentic coding loop all day, 100 requests evaporate before lunch.
The good news: Qwen3-Coder ships as open weights, and a 30B-class model now runs on a single consumer GPU. Self-hosting it means no rate limits, no per-token billing, and no code leaving your machine. This guide covers the hardware you need, which quantization to pick, how to run it with Ollama (and the alternatives), and how to wire it into Qwen Code over a local OpenAI-compatible endpoint.
Why the 30B runs on a laptop GPU
Qwen3-Coder comes in two open-weight sizes, both Mixture-of-Experts (MoE):
- Qwen3-Coder-30B-A3B-Instruct — 30.5B total params, but only ~3.3B activate per token.
- Qwen3-Coder-480B-A35B-Instruct — the 480B flagship, ~35B activated per token.
The MoE architecture is the whole trick. Even though the 30B has 30 billion parameters sitting in memory, each token only routes through ~3.3B of them, so inference is fast relative to the model's size. That's why a "30B" model runs efficiently on hardware that would choke on a dense 30B. Both variants support a native 256K (262K) token context window, extendable toward ~1M via YaRN — Unsloth even publishes dedicated 1M-context GGUF builds.
Hardware requirements
Here's the part everyone actually scrolls to. VRAM figures below are for Qwen3-Coder-30B-A3B; the 480B numbers are order-of-magnitude.
| Setup | Memory needed | What runs | Notes |
|---|---|---|---|
| 30B @ IQ2_XXS | ~8.8 GB | 30B (minimal quality) | Last-resort tiny quant |
| 30B @ Q4_K_M | ~18.7 GB | 30B (recommended) | Best size/quality tradeoff |
| 30B @ Q5_K_S | ~21.4 GB | 30B | Marginal quality gain |
| 30B @ Q6_K | ~25.6 GB | 30B | Exceeds a 24GB 4090 |
| 30B @ BF16 | ~61.5 GB | 30B (full precision) | Workstation/server only |
| 30B @ Q4 + full 262K context | ~31.5 GB | 30B + max context | Context alone adds up to ~12.8 GB |
| 480B @ UD-Q2_K_XL | ~150 GB | 480B (6+ tok/s) | Mac Studio 512GB / multi-GPU |
| 480B (Ollama) | ~250 GB+ | 480B | High-end only |
A few practical readings of this table:
- Minimum usable bar (30B): ~18GB of combined VRAM+RAM (or 18GB system RAM) for 6+ tok/s at a 4-bit quant.
- A 24GB GPU (RTX 3090/4090): comfortably runs 30B at Q4_K_M with some context. It cannot hold Q6_K or the full 262K context simultaneously without offloading (e.g. llama.cpp
-otto push experts onto CPU). - Apple Silicon: 32GB unified memory handles 30B at Q4; 64GB gives you real context headroom.
If you take one number away: ~18GB of memory gets you a working local Qwen3-Coder 30B. Everything above that buys you bigger context windows or marginal quality, not a different model.
Pick your runtime
There are three mainstream ways to self-host, all of which expose an OpenAI-compatible endpoint that Qwen Code, Cline, and Continue can talk to.
| Runtime | Best for | Endpoint | Watch out |
|---|---|---|---|
| Ollama | Easiest setup, auto quant | localhost:11434/v1 | Default context is tiny |
| LM Studio | GUI, MLX on Mac | localhost:1234/v1 | Click "Start Server" to expose API |
| llama.cpp | Maximum control | your --port | Must pass --jinja for tools |
Option 1: Ollama (recommended for most)
Ollama is the path of least resistance. Install it, then:
ollama pull qwen3-coder:30b
ollama run qwen3-coder:30b
Ollama auto-handles the GGUF download (~19GB for the 30B), the quantization, and serves a local API at localhost:11434. The qwen3-coder:30b tag is the default latest. If you want Unsloth's dynamic quant for slightly better quality at the same size, pull it straight from Hugging Face:
ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL
Critical: fix the context window. Ollama's default is commonly 4096 tokens — useless for an agent reading multiple files. Bump it to at least ~64000 via one of:
OLLAMA_CONTEXT_LENGTHenvironment variable- A Modelfile
PARAMETER num_ctx 64000(this takes precedence over the env var) - A per-run override or UI slider
Option 2: LM Studio (GUI + Mac MLX)
If you prefer a GUI or you're on a Mac and want MLX-optimized inference:
- Install from lmstudio.ai.
- Search
qwen3-coder-30band pick Qwen3 Coder 30B A3B Instruct (MLX on Mac, GGUF on Windows/Linux). - Load the model.
- Click Start Server — it exposes an OpenAI-compatible endpoint at
http://localhost:1234/v1.
LM Studio has historically been one of the more reliable options for Qwen tool calling, which matters for agentic use.
Option 3: llama.cpp (maximum control)
For the most control over flags, offload, and context:
./llama.cpp/llama-cli \
-hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL \
--jinja -ngl 99 --ctx-size 32768 \
--temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05
The --jinja flag is non-negotiable if you want tool/function calling — without it, agentic tool calls silently fail. -ngl 99 offloads all layers to GPU; drop it or use -ot to keep experts on CPU when you're VRAM-constrained. Run llama-server instead of llama-cli to expose an HTTP endpoint.
Sampling settings
Whatever runtime you choose, use the recommended sampling parameters for Qwen3-Coder:
Temperature: 0.7
Top-P: 0.8
Top-K: 20
Min-P: 0.0
Repetition Penalty: 1.05
Wire it into Qwen Code
Install the CLI (requires Node.js):
npm install -g @qwen-code/qwen-code@latest
Qwen Code is an open-source terminal coding agent (a Gemini-CLI fork adapted for Qwen). The simple wiring method is environment variables — set these in a project .env, ~/.qwen/.env, or your shell:
export OPENAI_API_KEY="local" # any non-empty string
export OPENAI_BASE_URL="http://localhost:11434/v1" # Ollama; LM Studio uses :1234/v1
export OPENAI_MODEL="qwen3-coder:30b"
The /v1 suffix on the base URL is required. For LM Studio, swap the port to 1234 and use the model id LM Studio shows.
Newer Qwen Code versions favor a modelProviders block in ~/.qwen/settings.json instead of the flat OPENAI_* variables — with per-provider envKey, baseUrl, id, and generationConfig fields (an example uses envKey: "OLLAMA_API_KEY" with baseUrl: "http://localhost:11434/v1"). The exact config differs by version, and both methods appear in current docs, so match whichever your installed build expects.
The same OpenAI-compatible endpoint works with Cline, Continue, and other VS Code agents — point them at localhost:11434/v1 (or :1234/v1) and select the local model. You're not locked into Qwen Code.
The 480B reality check
Can you run the 480B flagship locally? Technically yes; practically, only on serious hardware. It needs ~150GB of memory for 6+ tok/s with Unsloth's UD-Q2_K_XL dynamic quant, or ~250GB+ per Ollama's estimate. The download for qwen3-coder:480b is around 290GB. That's a Mac Studio with 512GB unified memory or a multi-GPU server — not a desktop. If you want flagship quality without the hardware, Ollama offers qwen3-coder:480b-cloud to offload to Ollama Cloud, which is a different cost/privacy tradeoff than true self-hosting. If you'd rather keep a local-first setup but still have a managed fallback, Wide Area AI is an OpenAI-compatible gateway that routes inference to your own nodes first and only fails over to cloud providers when local hardware is unavailable — so your CLI points at one base URL and requests served locally cost nothing per token.
Bottom line
With the free Qwen OAuth tier gone, self-hosting Qwen3-Coder is the obvious move for anyone running agentic coding loops at volume — no rate limits, no per-token bills, no code leaving your box. The 30B-A3B model is the sweet spot: ~18GB of memory and it runs, a 24GB GPU runs it well at Q4_K_M, and Apple Silicon with 32GB+ handles it via MLX. Use Ollama for the fastest start, bump the context window past the tiny default, remember --jinja if you go the llama.cpp route for tool calling, and point Qwen Code (or Cline/Continue) at localhost:11434/v1. The 480B flagship stays in workstation/cloud territory — but for daily coding, the local 30B is more than enough.