Qwen2.5-Coder is one of the strongest open-weight coding models, and it runs well on consumer hardware. The trick to using it from your existing tools is to serve it behind an OpenAI-compatible API. Once Qwen-Coder answers at /v1/chat/completions, any OpenAI-compatible client (Codex CLI, Aider, Cline, or the OpenAI SDK) can use it by changing a single base URL. This guide shows two ways to do that, with Ollama and with llama.cpp, and how to connect your CLI.
Why an OpenAI-Compatible Endpoint
The OpenAI API has become the de facto contract for LLM tooling. Rather than rewrite your workflow around a new model, you wrap the model in that contract:
- No tool changes. Codex CLI keeps working; only the base URL changes.
- Private by default. Code never leaves your machine or network.
- No per-token cost. After the hardware, routine inference is free.
- Drop-in swapping. The same client can target local Qwen today and a gateway tomorrow.
Both Ollama and llama.cpp expose this contract for you, so most of the work is choosing a model size and starting a server.
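To illustrate what "only the base URL changes" means in practice, here is a minimal sketch that builds the same chat request for two targets using only Python's standard library. The gateway hostname is a hypothetical placeholder, and the request is built but never sent, so the sketch runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="local"):
    """Build (but do not send) a POST to {base_url}/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Same client code, two targets: only the base URL differs.
local = build_chat_request("http://localhost:11434/v1", "qwen2.5-coder:14b", "hello")
# Hypothetical gateway hostname, used purely for illustration.
gateway = build_chat_request("https://gateway.example.internal/v1", "qwen2.5-coder:14b", "hello")
```

Passing either object to urllib.request.urlopen would send it; the payload is byte-for-byte identical in both cases.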
Choose a Model Size
Pick the largest variant your hardware runs comfortably:
| Model | Approx. size | VRAM / unified memory | Best for |
|---|---|---|---|
| qwen2.5-coder:7b | ~4.5GB | 8GB | Routine edits, explanations |
| qwen2.5-coder:14b | ~9GB | 16GB | Balanced quality and speed |
| qwen2.5-coder:32b | ~20GB | 24GB+ | Complex tasks, near cloud quality |
For most developers the 14B variant is the sweet spot: good code quality at interactive speed on a 16GB GPU or a 32GB Apple Silicon Mac.
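The table can be read as a simple lookup: take the largest variant whose memory requirement you meet. The sketch below encodes the table's thresholds as-is; pick_variant is a hypothetical helper, and on unified-memory machines you would likely want to pass in less than the full total, since the OS and other apps share that memory.

```python
# (model tag, minimum VRAM / unified memory in GB), copied from the table, largest first.
VARIANTS = [
    ("qwen2.5-coder:32b", 24),
    ("qwen2.5-coder:14b", 16),
    ("qwen2.5-coder:7b", 8),
]


def pick_variant(memory_gb):
    """Return the largest variant that fits in memory_gb, or None if none fits."""
    for tag, min_gb in VARIANTS:
        if memory_gb >= min_gb:
            return tag
    return None


print(pick_variant(16))  # qwen2.5-coder:14b
```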
Method 1: Ollama (Simplest)
Ollama downloads a pre-quantized build of the model and serves it, exposing an OpenAI-compatible API automatically.
Install and Pull
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Pull a Qwen-Coder model:
ollama pull qwen2.5-coder:14b
Start the Server
ollama serve
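Ollama now serves an OpenAI-compatible API at http://localhost:11434/v1. (If the command reports that the address is already in use, Ollama is already running as a desktop app or background service, and you can skip this step.) Confirm it: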
curl http://localhost:11434/v1/models
You should see qwen2.5-coder:14b in the list.
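If you would rather script this check than read the JSON by eye, a sketch along these lines may help. It assumes the response follows the OpenAI list shape (a top-level "data" array of objects with an "id" field); the function names are illustrative.

```python
import json
import urllib.request


def served_model_ids(models_response):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_models(base_url="http://localhost:11434/v1"):
    """GET {base_url}/models and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/models", timeout=5) as resp:
        return json.load(resp)


# Offline demonstration with a sample payload in the assumed shape.
sample = {"object": "list", "data": [{"id": "qwen2.5-coder:14b", "object": "model"}]}
print("qwen2.5-coder:14b" in served_model_ids(sample))  # True
```

With the server running, served_model_ids(fetch_models()) returns the live list.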
Method 2: llama.cpp (More Control)
llama.cpp gives you fine control over quantization, context length, and GPU offload. Its llama-server binary exposes the OpenAI-compatible routes.
Build or Install
# macOS
brew install llama.cpp
# Or build from source for the latest features
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release
Download a GGUF Model
Grab a quantized Qwen2.5-Coder GGUF (for example a Q4_K_M build) from Hugging Face, then start the server:
# Binds to loopback only; read "Serving Beyond localhost" before widening --host
llama-server \
  -m ./qwen2.5-coder-14b-instruct-q4_k_m.gguf \
  --port 8080 \
  --host 127.0.0.1 \
  -c 8192 \
  -ngl 99  # offload all layers to GPU if you have the VRAM
This exposes an OpenAI-compatible endpoint at http://localhost:8080/v1. Verify:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-coder",
"messages": [{"role": "user", "content": "write a fizzbuzz in python"}]
}'
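The curl call returns a JSON document, and the generated text sits at choices[0].message.content. A small sketch of pulling it out in Python follows; the sample response is hand-written for illustration, and the "error" branch assumes the server reports failures in the OpenAI-style {"error": {"message": ...}} shape.

```python
def first_reply(completion):
    """Return the assistant text from a chat completion response dict."""
    if "error" in completion:
        # Assumed OpenAI-style error envelope.
        raise RuntimeError(completion["error"].get("message", "unknown error"))
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample in the chat completion shape, not real model output.
sample_completion = {
    "object": "chat.completion",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "print('fizzbuzz')"}}
    ],
}
print(first_reply(sample_completion))
```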
Connect Codex CLI to Your Endpoint
With the server running, point Codex at it. Use whichever base URL matches your method (Ollama on 11434, llama.cpp on 8080).
Quick: Environment Variables
export OPENAI_BASE_URL="http://localhost:11434/v1" # or :8080/v1 for llama.cpp
export OPENAI_API_KEY="local" # any non-empty string
codex --model qwen2.5-coder:14b "explain this function"
Persistent: A Custom Provider
Define a provider in ~/.codex/config.toml so you can switch on demand:
# ~/.codex/config.toml
[model_providers.qwen-local]
name = "Local Qwen-Coder"
base_url = "http://localhost:11434/v1"
env_key = "LOCAL_API_KEY"
[profiles.qwen]
model_provider = "qwen-local"
model = "qwen2.5-coder:14b"
Then set the key variable named in env_key and run Codex with that profile:
export LOCAL_API_KEY="local"
codex --profile qwen "add error handling to this file"
The same base URL works for Aider (--openai-api-base), Cline (custom base URL in settings), and the OpenAI SDK, because they all share the contract.
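For the OpenAI SDK case, a minimal sketch might look like the following. It assumes the third-party openai Python package is installed; ENDPOINTS and the helper names are illustrative, and the ports are the ones used earlier in this guide.

```python
# Base URLs for the two serving methods described in this guide.
ENDPOINTS = {
    "ollama": "http://localhost:11434/v1",
    "llama.cpp": "http://localhost:8080/v1",
}


def client_kwargs(method, api_key="local"):
    """Constructor arguments for an OpenAI client aimed at a local server."""
    return {"base_url": ENDPOINTS[method], "api_key": api_key}


def ask(method, model, prompt):
    """Send one user message and return the reply text (needs a running server)."""
    from openai import OpenAI  # third-party package: pip install openai

    client = OpenAI(**client_kwargs(method))
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

With Ollama running, ask("ollama", "qwen2.5-coder:14b", "explain this function") would return the model's answer as a string.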
Serving Beyond localhost
A model on localhost only helps the machine it runs on. Teams usually want one GPU to serve several developers, which means exposing the endpoint on the network. Do this carefully:
- Bind to your LAN, not the public internet. Set --host to the interface you intend, and keep the server behind your firewall or VPN.
- Add authentication. By default neither server asks for credentials; put a reverse proxy or gateway in front to require a token (llama-server can also check a key passed with its --api-key option).
- Add TLS. Terminate HTTPS at a proxy so credentials are not sent in the clear.
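To make the authentication point concrete, here is a minimal sketch of a token-checking proxy built only on Python's standard library. UPSTREAM, EXPECTED_TOKEN, and the listening port are placeholder assumptions. The sketch buffers each upstream response before relaying it, so streamed completions arrive in one piece, and it does not add TLS; for real deployments an established reverse proxy or gateway is the better choice.

```python
import hmac
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "http://127.0.0.1:11434"  # assumption: Ollama on the same host
EXPECTED_TOKEN = "change-me"         # placeholder; load from a secret store in practice


def is_authorized(header_value, expected_token):
    """True only for an 'Authorization: Bearer <token>' header with the right token."""
    if not header_value or not header_value.startswith("Bearer "):
        return False
    presented = header_value[len("Bearer "):]
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())


class AuthProxy(BaseHTTPRequestHandler):
    def _forward(self):
        if not is_authorized(self.headers.get("Authorization"), EXPECTED_TOKEN):
            self.send_error(401, "missing or invalid bearer token")
            return
        length = int(self.headers.get("Content-Length") or 0)
        body = self.rfile.read(length) if length else None
        upstream = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            method=self.command,
            headers={"Content-Type": self.headers.get("Content-Type", "application/json")},
        )
        try:
            with urllib.request.urlopen(upstream, timeout=300) as resp:
                status, payload = resp.status, resp.read()
                content_type = resp.headers.get("Content-Type", "application/json")
        except urllib.error.HTTPError as err:
            # Relay upstream error responses (4xx/5xx) unchanged.
            status, payload = err.code, err.read()
            content_type = err.headers.get("Content-Type", "application/json")
        except urllib.error.URLError:
            self.send_error(502, "upstream unreachable")
            return
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    do_GET = _forward
    do_POST = _forward


def serve(bind_host, port=8000):
    """Listen on one specific interface, e.g. serve("192.168.1.20")."""
    ThreadingHTTPServer((bind_host, port), AuthProxy).serve_forever()
```

Clients would then use the proxy's address as their base URL and the shared token as their API key.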
This is where an edge-first gateway earns its place. Instead of every developer hard-coding a LAN IP, they point Codex at one stable OpenAI-compatible URL. The gateway handles TLS and auth, caches repeated prompts at the edge, serves eligible requests from the Qwen-Coder GPU you own (free), and bursts to a cloud model only when the local box is busy, offline, or a task needs more capability. Wide Area AI provides exactly this layer, turning a single local Qwen server into a resilient, shareable endpoint without changing anything on the client side.
Troubleshooting
Codex Cannot Connect
- Confirm the server is up: curl http://localhost:11434/v1/models.
- Ensure your base URL includes the /v1 suffix.
- Match the port to your method (11434 for Ollama, your chosen port for llama.cpp).
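A missing /v1 suffix is easy to guard against in scripts. The sketch below is a deliberately naive, hypothetical helper: it appends /v1 when the path does not already end with it and drops a trailing slash, but it does not try to repair URLs that already include a full route such as /v1/chat/completions.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_base_url(url):
    """Return url with any trailing slash removed and a /v1 path suffix ensured."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    if not path.endswith("/v1"):
        path += "/v1"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(normalize_base_url("http://localhost:11434"))  # http://localhost:11434/v1
```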
Model Not Found
- List served models with the /v1/models call.
- Use the exact ID, including the version tag (qwen2.5-coder:14b).
- For Ollama, pull the model first: ollama pull qwen2.5-coder:14b.
Responses Are Slow
- Offload more layers to the GPU (-ngl in llama.cpp) or confirm Ollama is using the GPU (ollama ps shows the CPU/GPU split for loaded models).
- Drop to a smaller variant (7B) or a more aggressive quantization (for example Q4_K_M instead of Q8_0).
- Reduce the context window if you set it very large.
Out of Memory
- Use a smaller model or a more aggressive quantization.
- Lower the context size (-c in llama.cpp).
- Close other GPU-heavy applications.
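To see why lowering the context size frees memory, it can help to estimate the key/value cache, which grows linearly with context length. The sketch below uses the standard formula for a transformer KV cache. The architecture numbers in the example (48 layers, 8 KV heads, head dimension 128) are an assumption for the 14B model; check them against the model's config.json. The 2-byte default assumes a 16-bit cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence of context_tokens tokens."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed architecture values for the 14B variant; verify before relying on them.
full = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, context_tokens=8192)
half = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, context_tokens=4096)
print(f"{full / 2**30:.2f} GiB at 8192 tokens, {half / 2**30:.2f} GiB at 4096")
```

Under these assumptions, halving -c halves the cache, on top of whatever the model weights already occupy.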