Why serve Qwen-Coder as an OpenAI-compatible endpoint?

Because tools like Codex CLI, Aider, and Cline already speak the OpenAI API. Exposing Qwen-Coder behind /v1/chat/completions lets those tools use a model running on your own GPU with no code changes, just a different base URL. You get privacy and zero per-token cost.

Which Qwen-Coder size should I run?

qwen2.5-coder:7b runs on a GPU with 8GB VRAM or a 16GB Mac and handles routine edits. The 14B variant needs about 16GB and gives noticeably better results. The 32B variant approaches cloud quality but wants 24GB+ VRAM or a 64GB Apple Silicon machine.

Does llama.cpp or Ollama provide the OpenAI API automatically?

Yes. Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 with no extra setup. llama.cpp's llama-server binary exposes /v1/chat/completions as soon as it starts. Both accept the standard OpenAI request format.

How do I expose my local endpoint to other machines safely?

Bind the server to your LAN interface and put authentication in front of it, or front it with an edge gateway that handles TLS, auth, and failover. Never expose an unauthenticated inference server directly to the public internet.

How to Serve Qwen-Coder as an OpenAI-Compatible Endpoint

Qwen2.5-Coder is one of the strongest open-weight coding models, and it runs well on consumer hardware. To use it from your existing tools, serve it behind the OpenAI-compatible API. Once Qwen-Coder answers at /v1/chat/completions, any OpenAI-compatible client (Codex CLI, Aider, Cline, or the OpenAI SDK) can use it by changing a single base URL. This guide shows two ways to serve the model and how to connect your CLI.

Why an OpenAI-Compatible Endpoint

The OpenAI API has become the de facto contract for LLM tooling. Rather than rewrite your workflow around a new model, you wrap the model in that contract:

No tool changes. Codex CLI keeps working; only the base URL changes.
Private by default. Code never leaves your machine or network.
No per-token cost. After the hardware, routine inference is free.
Drop-in swapping. The same client can target local Qwen today and a gateway tomorrow.

Both Ollama and llama.cpp expose this contract for you, so most of the work is choosing a model size and starting a server.
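As a rough illustration of that contract, the sketch below builds the same chat request for two different servers using only the Python standard library. The helper name build_chat_request is hypothetical, and the URLs and model IDs are simply the defaults used later in this guide. Nothing is sent over the network until you pass a request to urllib.request.urlopen.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="local"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Only the base URL (and the model ID the server expects) differs.
ollama_req = build_chat_request("http://localhost:11434/v1", "qwen2.5-coder:14b", "hello")
llama_req = build_chat_request("http://localhost:8080/v1", "qwen2.5-coder", "hello")
print(ollama_req.full_url)
print(llama_req.full_url)
```

The request body and headers are identical in both cases, which is why a client only needs a new base URL to switch servers.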

Choose a Model Size

Pick the largest variant your hardware runs comfortably:

Model	Approx. size	VRAM / unified memory	Best for
qwen2.5-coder:7b	~4.5GB	8GB	Routine edits, explanations
qwen2.5-coder:14b	~9GB	16GB	Balanced quality and speed
qwen2.5-coder:32b	~20GB	24GB+	Complex tasks, near cloud quality

For most developers the 14B variant is the sweet spot: good code quality at interactive speed on a 16GB GPU or a 32GB Apple Silicon Mac.
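The table can be read as a simple lookup. The sketch below encodes its memory column and returns the largest variant that fits; pick_variant and VARIANTS are hypothetical names, and the thresholds are the table's rough figures rather than measured requirements.

```python
# (model tag, minimum GB of VRAM or unified memory) from the table above,
# ordered largest first so the first match is the biggest model that fits.
VARIANTS = [
    ("qwen2.5-coder:32b", 24),
    ("qwen2.5-coder:14b", 16),
    ("qwen2.5-coder:7b", 8),
]


def pick_variant(memory_gb):
    """Return the largest variant whose threshold fits, or None if none do."""
    for tag, min_gb in VARIANTS:
        if memory_gb >= min_gb:
            return tag
    return None


print(pick_variant(16))
```

On Apple Silicon the operating system shares the same memory pool, so leave headroom beyond these thresholds; that is why the guide pairs the 14B variant with a 32GB Mac.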

Method 1: Ollama (Simplest)

Ollama downloads a pre-quantized build of the model and serves it, and it exposes an OpenAI-compatible API automatically.

Install and Pull

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Pull a Qwen-Coder model:

ollama pull qwen2.5-coder:14b

Start the Server

ollama serve

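If Ollama is already running as a background service, this command reports that the address is in use and you can skip it. Either way, Ollama serves an OpenAI-compatible API at http://localhost:11434/v1. Confirm it: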

curl http://localhost:11434/v1/models

You should see qwen2.5-coder:14b in the list.
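If you want to script that check, a minimal sketch follows. It assumes the response uses the OpenAI list format (a top-level "data" array of objects with an "id" field); model_ids is a hypothetical helper, and the sample body is hand-written for illustration, not captured from a real server.

```python
import json


def model_ids(models_json):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


# Hand-written sample in the OpenAI list format; real responses carry more fields.
sample = '{"object": "list", "data": [{"id": "qwen2.5-coder:14b", "object": "model"}]}'
print("qwen2.5-coder:14b" in model_ids(sample))
```

To run it against a live server, read the body returned by the curl command above (or by urllib.request.urlopen) and pass it to model_ids.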

Method 2: llama.cpp (More Control)

llama.cpp gives you fine control over quantization, context length, and GPU offload. Its llama-server binary exposes the OpenAI-compatible routes.

Build or Install

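# macOS
brew install llama.cpp

# Or build from source for the latest features.
# For NVIDIA GPUs, add -DGGML_CUDA=ON to the first cmake call.
# The binaries, including llama-server, land in build/bin.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release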

Download a GGUF Model

Grab a quantized Qwen2.5-Coder GGUF (for example a Q4_K_M build) from Hugging Face, then start the server:

# --host 127.0.0.1 keeps the server local; bind to a LAN address only behind auth.
# -ngl 99 offloads all layers to the GPU if you have the VRAM.
llama-server \
  -m ./qwen2.5-coder-14b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -c 8192 \
  -ngl 99

This exposes an OpenAI-compatible endpoint at http://localhost:8080/v1. Verify:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder",
    "messages": [{"role": "user", "content": "write a fizzbuzz in python"}]
  }'
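The response is a JSON document in the OpenAI chat completion shape, with the generated text under choices[0].message.content. A small sketch of pulling that text out is shown here; first_reply is a hypothetical helper and the sample body is hand-written, so a real response will contain additional fields such as usage counts.

```python
import json


def first_reply(completion_json):
    """Return the assistant text from an OpenAI-style chat completion body."""
    payload = json.loads(completion_json)
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample; not output captured from a real model.
sample = json.dumps({
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "print('fizzbuzz')"}}
    ]
})
print(first_reply(sample))
```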

Connect Codex CLI to Your Endpoint

With the server running, point Codex at it. Use whichever base URL matches your method (Ollama on 11434, llama.cpp on 8080).

Quick: Environment Variables

export OPENAI_BASE_URL="http://localhost:11434/v1"   # or :8080/v1 for llama.cpp
export OPENAI_API_KEY="local"                        # any non-empty string
codex --model qwen2.5-coder:14b "explain this function"

Persistent: A Custom Provider

Define a provider in ~/.codex/config.toml so you can switch on demand:

# ~/.codex/config.toml

[model_providers.qwen-local]
name = "Local Qwen-Coder"
base_url = "http://localhost:11434/v1"
env_key = "LOCAL_API_KEY"

[profiles.qwen]
model_provider = "qwen-local"
model = "qwen2.5-coder:14b"

export LOCAL_API_KEY="local"
codex --profile qwen "add error handling to this file"

The same base URL works for Aider (--openai-api-base), Cline (custom base URL in settings), and the OpenAI SDK, because they all share the contract.

Serving Beyond localhost

A model on localhost only helps the machine it runs on. Teams usually want one GPU to serve several developers, which means exposing the endpoint on the network. Do this carefully:

Bind to your LAN, not the public internet. Set --host to the interface you intend, and keep the server behind your firewall or VPN.
Add authentication. A raw inference server runs with no auth by default (llama-server only checks a key if you start it with --api-key); put a reverse proxy or gateway in front to require a token.
Add TLS. Terminate HTTPS at a proxy so credentials are not sent in the clear.
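To make the authentication step concrete, here is a minimal sketch of the check a proxy might run before forwarding a request. It assumes clients send a shared token as an "Authorization: Bearer <token>" header, which is how OpenAI-style clients pass their API key; is_authorized is a hypothetical helper, not part of Ollama, llama.cpp, or any particular proxy.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Check an 'Authorization: Bearer <token>' header against a shared token."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest takes the same time whether or not the prefix matches,
    # which avoids leaking how much of the token was guessed correctly.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())


print(is_authorized("Bearer s3cret", "s3cret"))
```

In practice this logic usually lives in the reverse proxy or gateway configuration rather than in custom code, and the token only stays secret if the connection is also protected by TLS.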

This is where an edge-first gateway earns its place. Instead of hard-coding a LAN IP, every developer points Codex at one stable OpenAI-compatible URL. The gateway handles TLS and auth, caches repeated prompts at the edge, and serves eligible requests from the Qwen-Coder GPU you own at no per-token cost. It bursts to a cloud model only when the local box is busy or offline, or when a task needs more capability. Wide Area AI provides this layer, turning a single local Qwen server into a resilient, shareable endpoint without any client-side changes.
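The local-first, cloud-second behaviour can be sketched generically as an ordered fallback. This is only an illustration of the idea, not how any particular gateway is implemented; complete_with_fallback and the two stand-in backends are hypothetical, and a real setup would replace them with functions that call the actual endpoints.

```python
def complete_with_fallback(prompt, backends):
    """Try each (name, send) backend in order; return (name, reply) from the first that works."""
    errors = []
    for name, send in backends:
        try:
            return name, send(prompt)
        except OSError as exc:  # connection refused, timeouts, DNS failures
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def offline_local(prompt):
    """Stand-in for a local server that is currently unreachable."""
    raise ConnectionRefusedError("local GPU box is offline")


def fake_cloud(prompt):
    """Stand-in for a cloud backend; returns a canned reply."""
    return f"cloud answer to: {prompt}"


print(complete_with_fallback("hi", [("local", offline_local), ("cloud", fake_cloud)]))
```

Listing the local backend first means the cloud is only used when the local call fails, which mirrors the cost ordering described above.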

Troubleshooting

Codex Cannot Connect

Confirm the server is up: curl http://localhost:11434/v1/models.
Ensure your base URL includes the /v1 suffix.
Match the port to your method (11434 for Ollama, your chosen port for llama.cpp).
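A missing /v1 suffix is an easy mistake to guard against in scripts. The sketch below normalizes a base URL so it ends in exactly one /v1 segment; normalize_base_url is a hypothetical helper, and it assumes the server mounts its OpenAI-compatible routes under /v1 as both servers in this guide do.

```python
def normalize_base_url(url):
    """Return the URL with trailing slashes removed and a single /v1 suffix."""
    trimmed = url.rstrip("/")
    if trimmed.endswith("/v1"):
        return trimmed
    return trimmed + "/v1"


print(normalize_base_url("http://localhost:11434"))
```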

Model Not Found

List served models with the /v1/models call.
Use the exact ID, including the version tag (qwen2.5-coder:14b).
For Ollama, pull the model first: ollama pull qwen2.5-coder:14b.

Responses Are Slow

Offload more layers to the GPU (-ngl in llama.cpp) or confirm Ollama is using the GPU.
Drop to a smaller variant (7B) or a more aggressive quantization (for example, Q4_K_M instead of Q8_0).
Reduce the context window if you set it very large.

Out of Memory

Use a smaller model or a more aggressive quantization.
Lower the context size (-c in llama.cpp).
Close other GPU-heavy applications.

Next Steps

Point Codex CLI at a custom base URL
Run Codex CLI with local models (Ollama and LM Studio)
Reduce Codex and OpenAI API token costs
Review where Codex CLI stores its configuration files
