How to Serve Qwen-Coder as an OpenAI-Compatible Endpoint

Run Qwen2.5-Coder locally with llama.cpp or Ollama and expose it as an OpenAI-compatible API. Point Codex CLI, Aider, or the OpenAI SDK at your own GPU for private, no-cost coding.

11 min read · Updated June 2026

Qwen2.5-Coder is one of the strongest open-weight coding models, and it runs well on consumer hardware. To use it from your existing tools, serve it behind an OpenAI-compatible API. Once Qwen-Coder answers at /v1/chat/completions, any OpenAI-compatible client (Codex CLI, Aider, Cline, or the OpenAI SDK) can use it by changing a single base URL. This guide shows two ways to serve the model and how to connect your CLI.

Why an OpenAI-Compatible Endpoint

The OpenAI API has become the de facto contract for LLM tooling. Rather than rewrite your workflow around a new model, you wrap the model in that contract:

  • No tool changes. Codex CLI keeps working; only the base URL changes.
  • Private by default. Code never leaves your machine or network.
  • No per-token cost. Once you own the hardware, routine inference costs only electricity.
  • Drop-in swapping. The same client can target local Qwen today and a gateway tomorrow.

Both Ollama and llama.cpp expose this contract for you, so most of the work is choosing a model size and starting a server.
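
To make the "only the base URL changes" point concrete, here is a minimal sketch using only the Python standard library. The helper name build_chat_request and the gateway URL are illustrative assumptions, not part of any tool's API. The sketch builds the request without sending it, so it runs even with no server up.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "local") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Same code, two targets: only base_url differs.
local = build_chat_request("http://localhost:11434/v1", "qwen2.5-coder:14b", "hi")
# Hypothetical gateway URL, for illustration only.
remote = build_chat_request("https://gateway.example.com/v1", "qwen2.5-coder:14b", "hi")
```

Once a server is running, urllib.request.urlopen(local) sends the request. The body is identical for both targets.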

Choose a Model Size

Pick the largest variant your hardware runs comfortably:

| Model | Approx. size | VRAM / unified memory | Best for |
|---|---|---|---|
| qwen2.5-coder:7b | ~4.5GB | 8GB | Routine edits, explanations |
| qwen2.5-coder:14b | ~9GB | 16GB | Balanced quality and speed |
| qwen2.5-coder:32b | ~20GB | 24GB+ | Complex tasks, near cloud quality |

For most developers the 14B variant is the sweet spot: good code quality at interactive speed on a 16GB GPU or a 32GB Apple Silicon Mac.
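
The selection rule can be sketched as a small lookup. The thresholds below are copied from the table above. The function name pick_variant is hypothetical, and the memory figure you pass in should be what you can actually dedicate to the model (on a Mac, unified memory is shared with the OS and other apps).

```python
from typing import Optional

# (model tag, minimum memory in GB) from the table above, largest first.
VARIANTS = [
    ("qwen2.5-coder:32b", 24),
    ("qwen2.5-coder:14b", 16),
    ("qwen2.5-coder:7b", 8),
]


def pick_variant(memory_gb: float) -> Optional[str]:
    """Return the largest variant whose memory requirement fits, else None."""
    for tag, min_gb in VARIANTS:
        if memory_gb >= min_gb:
            return tag
    return None


print(pick_variant(16))  # prints qwen2.5-coder:14b
```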

Method 1: Ollama (Simplest)

Ollama downloads a pre-quantized build of the model, serves it, and exposes an OpenAI-compatible API automatically.

Install and Pull

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Pull a Qwen-Coder model:

ollama pull qwen2.5-coder:14b

Start the Server

ollama serve

Ollama now serves an OpenAI-compatible API at http://localhost:11434/v1. Confirm it:

curl http://localhost:11434/v1/models

You should see qwen2.5-coder:14b in the list.
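
If you want to script that check, the sketch below parses the model list in Python. It assumes the response follows the OpenAI list shape (a "data" array of objects with an "id" field). The function names are illustrative.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]


def fetch_models(base_url: str = "http://localhost:11434/v1") -> dict:
    """GET /models from a running server (requires the server to be up)."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/models", timeout=5) as resp:
        return json.load(resp)


# Offline illustration with a hand-written response in the assumed shape.
sample = {"object": "list", "data": [{"id": "qwen2.5-coder:14b", "object": "model"}]}
print("qwen2.5-coder:14b" in served_model_ids(sample))  # prints True
```

Against a live server, call served_model_ids(fetch_models()) instead of using the sample.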

Method 2: llama.cpp (More Control)

llama.cpp gives you fine control over quantization, context length, and GPU offload. Its llama-server binary exposes the OpenAI-compatible routes.

Build or Install

# macOS
brew install llama.cpp

# Or build from source for the latest features
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release

Download a GGUF Model

Grab a quantized Qwen2.5-Coder GGUF (for example a Q4_K_M build) from Hugging Face, then start the server:

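# Binds to loopback only. Read "Serving Beyond localhost" before using a LAN address.
llama-server \
  -m ./qwen2.5-coder-14b-instruct-q4_k_m.gguf \
  --port 8080 \
  --host 127.0.0.1 \
  -c 8192 \
  -ngl 99          # -c sets the context size; -ngl 99 offloads all layers to GPU if you have the VRAM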

This exposes an OpenAI-compatible endpoint at http://localhost:8080/v1. Verify:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder",
    "messages": [{"role": "user", "content": "write a fizzbuzz in python"}]
  }'
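
The same check from Python, as a rough sketch: send one message and pull the reply text out of the response. It assumes the response follows the standard OpenAI chat completion shape (choices[0].message.content). The names ask and reply_text are illustrative.

```python
import json
import urllib.request


def ask(base_url: str, prompt: str, model: str = "qwen2.5-coder") -> dict:
    """POST one user message to /chat/completions (requires a running server)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def reply_text(completion: dict) -> str:
    """Return the assistant's text from an OpenAI-style completion."""
    return completion["choices"][0]["message"]["content"]


# Hand-written response in the assumed shape, so this runs without a server.
sample = {
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "print('hi')"},
         "finish_reason": "stop"}
    ]
}
print(reply_text(sample))  # prints print('hi')
```

With the server from this section running, reply_text(ask("http://localhost:8080/v1", "write a fizzbuzz in python")) returns the generated code as a string.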

Connect Codex CLI to Your Endpoint

With the server running, point Codex at it. Use whichever base URL matches your method (Ollama on 11434, llama.cpp on 8080).

Quick: Environment Variables

export OPENAI_BASE_URL="http://localhost:11434/v1"   # or :8080/v1 for llama.cpp
export OPENAI_API_KEY="local"                        # any non-empty string
codex --model qwen2.5-coder:14b "explain this function"

Persistent: A Custom Provider

Define a provider in ~/.codex/config.toml so you can switch on demand:

# ~/.codex/config.toml

[model_providers.qwen-local]
name = "Local Qwen-Coder"
base_url = "http://localhost:11434/v1"
env_key = "LOCAL_API_KEY"

[profiles.qwen]
model_provider = "qwen-local"
model = "qwen2.5-coder:14b"

Export the key the provider expects, then select the profile:

export LOCAL_API_KEY="local"
codex --profile qwen "add error handling to this file"

The same base URL works for Aider (--openai-api-base), Cline (custom base URL in settings), and the OpenAI SDK, because they all share the contract.
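
For the OpenAI SDK, a minimal Python sketch might look like the following. It assumes the openai package (v1 or later) is installed and reuses the environment variable names from the Codex example. The helper names client_settings and complete are hypothetical.

```python
import os


def client_settings() -> dict:
    """Read the base URL and key, defaulting to the local Ollama endpoint."""
    return {
        "base_url": os.environ.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("OPENAI_API_KEY", "local"),
    }


def complete(prompt: str, model: str = "qwen2.5-coder:14b") -> str:
    """Send one prompt through the OpenAI SDK (requires a running server)."""
    # Imported here so client_settings() works without the SDK installed.
    from openai import OpenAI  # pip install openai

    client = OpenAI(**client_settings())
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Calling complete("explain this function") then talks to whichever server the base URL points at. Switching between Ollama and llama.cpp is a matter of changing OPENAI_BASE_URL.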

Serving Beyond localhost

A model on localhost only helps the machine it runs on. Teams usually want one GPU to serve several developers, which means exposing the endpoint on the network. Do this carefully:

  • Bind to your LAN, not the public internet. Set --host to the interface you intend, and keep the server behind your firewall or VPN.
  • Add authentication. A raw inference server requires no credentials by default; put a reverse proxy or gateway in front to require a token (llama-server can also demand a key itself via its --api-key option).
  • Add TLS. Terminate HTTPS at a proxy so credentials are not sent in the clear.
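
As an illustration of the authentication bullet, the sketch below shows the kind of bearer-token check a proxy or gateway would run before forwarding a request. In practice you would configure this in your proxy rather than write it yourself. The function name is_authorized is hypothetical.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True only for a header of the form 'Bearer <expected_token>'."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```

A proxy using this check would answer 401 when it returns False and forward the request to the inference server otherwise.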

This is where an edge-first gateway earns its place. Instead of every developer hard-coding a LAN IP, they point Codex at one stable OpenAI-compatible URL. The gateway handles TLS and auth, caches repeated prompts at the edge, serves eligible requests from the Qwen-Coder GPU you own (free), and bursts to a cloud model only when the local box is busy, offline, or a task needs more capability. Wide Area AI provides exactly this layer, turning a single local Qwen server into a resilient, shareable endpoint without changing anything on the client side.

Troubleshooting

Codex Cannot Connect

  1. Confirm the server is up: curl http://localhost:11434/v1/models.
  2. Ensure your base URL includes the /v1 suffix.
  3. Match the port to your method (11434 for Ollama, your chosen port for llama.cpp).
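
Checks 2 and 3 are easy to automate. The sketch below normalizes a base URL so it ends in /v1 and compares its port with the ones used in this guide. The helper names are illustrative, and the URL is expected to include its scheme (http:// or https://).

```python
from urllib.parse import urlsplit, urlunsplit

# Ports used in this guide; 8080 is simply the value passed to --port earlier.
GUIDE_PORTS = {"ollama": 11434, "llama.cpp": 8080}


def normalize_base_url(url: str) -> str:
    """Ensure the URL path ends in /v1, with no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    if not path.endswith("/v1"):
        path += "/v1"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def port_matches(url: str, method: str) -> bool:
    """True if the URL's port is the one this guide uses for the method."""
    return urlsplit(url).port == GUIDE_PORTS[method]


print(normalize_base_url("http://localhost:11434"))  # prints http://localhost:11434/v1
```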

Model Not Found

  1. List served models with the /v1/models call.
  2. Use the exact ID, including the version tag (qwen2.5-coder:14b).
  3. For Ollama, pull the model first: ollama pull qwen2.5-coder:14b.

Responses Are Slow

  1. Offload more layers to the GPU (-ngl in llama.cpp) or confirm Ollama is using the GPU.
  2. Drop to a smaller variant (7B) or a more aggressive quantization (for example, Q4_K_M if you are running a larger quant such as Q8_0).
  3. Reduce the context window if you set it very large.

Out of Memory

  1. Use a smaller model or a more aggressive quantization.
  2. Lower the context size (-c in llama.cpp).
  3. Close other GPU-heavy applications.
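
To see why lowering the context size helps, here is a back-of-the-envelope estimate of the key/value cache, which grows linearly with context length. The architecture numbers (48 layers, 8 KV heads, head dimension 128) are assumptions for a 14B-class model with grouped-query attention, and the cache is assumed to be 16-bit. Check your model's config before relying on them.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB: keys + values for every layer and token."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


# Assumed architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

print(kv_cache_gib(8192, LAYERS, KV_HEADS, HEAD_DIM))   # prints 1.5
print(kv_cache_gib(32768, LAYERS, KV_HEADS, HEAD_DIM))  # prints 6.0
```

Under these assumptions, dropping from a 32K to an 8K context frees about 4.5 GiB, on top of whatever the weights themselves occupy.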

Frequently Asked Questions

Why serve Qwen-Coder behind an OpenAI-compatible endpoint?

Because tools like Codex CLI, Aider, and Cline already speak the OpenAI API. Exposing Qwen-Coder behind /v1/chat/completions lets those tools use a model running on your own GPU with no code changes, just a different base URL. You get privacy and zero per-token cost.

Which model size should I run?

qwen2.5-coder:7b runs on a GPU with 8GB VRAM or a 16GB Mac and handles routine edits. The 14B variant needs about 16GB and gives noticeably better results. The 32B variant approaches cloud quality but wants 24GB+ VRAM or a 64GB Apple Silicon machine.

Do Ollama and llama.cpp expose an OpenAI-compatible API?

Yes. Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 with no extra setup. llama.cpp's llama-server exposes /v1/chat/completions when you run it as a server. Both accept the standard OpenAI request format.

How do I share the endpoint with my team safely?

Bind the server to your LAN interface and put authentication in front of it, or front it with an edge gateway that handles TLS, auth, and failover. Never expose an unauthenticated inference server directly to the public internet.