Artificial Intelligence

Giving Your Local LLM an OpenAI-Compatible Endpoint (So Your Apps Just Work)

Every major local runtime can expose an OpenAI-compatible API, which means your existing apps and SDKs can point at your own hardware by changing two arguments: base_url and api_key. Here's how, and how to add failover so you're never stuck.

By InventiveHQ Team

Why "OpenAI-compatible" is the key that unlocks everything

When OpenAI shipped the /v1/chat/completions schema, it became the de-facto wire format for talking to a language model: a JSON body with model, messages, and a handful of parameters, returning choices. Years of tools — the OpenAI Python and JS SDKs, LangChain, LlamaIndex, every chat UI, your internal services — were written against that shape.
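To make that shape concrete, here is a minimal sketch of the request body and the part of the response most clients read. The model name and the sample response are illustrative placeholders, not output captured from a real server, and the helper names are hypothetical.

```python
import json


def build_chat_request(model, user_text, temperature=0.7):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
    }


def first_reply(response):
    """Pull the assistant text out of a chat-completions response dict."""
    return response["choices"][0]["message"]["content"]


# Illustrative response, trimmed to the fields used here (not real server output).
sample_response = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "llama3.1:8b",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
}

body = build_chat_request("llama3.1:8b", "Say hello.")
print(json.dumps(body, indent=2))
print(first_reply(sample_response))
```

Any server that accepts this body and returns this shape can sit behind the same client code, which is the whole point of the compatibility layer.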

The practical consequence is enormous. You do not have to rewrite anything to move off the cloud. If your local runtime speaks the same dialect, your application points its base_url at your own hardware and everything downstream keeps working: streaming, tool calls, embeddings, the lot. The model behind the endpoint changes; your code does not.

Every serious local runtime now ships this compatibility layer. The differences are which port they bind, how they handle concurrency, and which model formats they load natively — not the API surface.

Exposing the endpoint in each runtime

Here is the lay of the land. All four expose /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models; the base URL is what changes.

| Runtime | Default base URL | Start command | Notes |
|---|---|---|---|
| Ollama | http://localhost:11434/v1 | ollama serve (auto-starts on install) | Easiest path. Native REST on :11434; /v1 is the OpenAI layer. Default context is 2048 tokens; raise it with OLLAMA_CONTEXT_LENGTH. |
| LM Studio | http://localhost:1234/v1 | Start the local server from the GUI (Developer tab) | Desktop GUI; port configurable. v0.4.1 added an Anthropic-compatible /v1/messages endpoint (works with Claude Code). MLX backend on Apple Silicon. |
| llama.cpp | http://localhost:8080/v1 | llama-server -m model.gguf --port 8080 | The foundational engine; llama-server is a single binary. No daemon, no model manager: you point it at one GGUF file. |
| vLLM | http://localhost:8000/v1 | vllm serve <hf-model-id> | GPU serving engine for concurrent/production load. Native path is safetensors/AWQ/GPTQ/FP8. GGUF support is experimental and single-file only; use llama.cpp if GGUF is all you have. |
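A quick way to see which of these endpoints is actually up is to ask each one for its model list. The sketch below uses only the standard library and assumes the default ports from the table; the function names are hypothetical, and the placeholder bearer token is only there because some clients and proxies expect the header.

```python
import http.client
import json
import urllib.error
import urllib.request

# Default base URLs from the table above; adjust if you changed a port.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "llama.cpp": "http://localhost:8080/v1",
    "vllm": "http://localhost:8000/v1",
}


def models_url(base_url):
    """Join a base URL and the /models path without doubling slashes."""
    return base_url.rstrip("/") + "/models"


def list_models(base_url, timeout=1.0):
    """Return model ids from GET {base_url}/models, or None if unreachable."""
    req = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": "Bearer not-needed"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        return None
    if not isinstance(payload, dict):
        return None
    data = payload.get("data") or []
    return [m["id"] for m in data if isinstance(m, dict) and "id" in m]


if __name__ == "__main__":
    for name, url in BASE_URLS.items():
        models = list_models(url)
        status = "down" if models is None else ", ".join(models) or "(no models)"
        print(f"{name:10s} {url:30s} {status}")
```

The ids this prints are the strings to pass as model in your requests, which avoids guessing at naming conventions that differ between runtimes.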

A few things worth knowing before you wire anything up:

  • Auth is a formality locally. None of these runtimes require a real key by default, but the SDKs insist on a non-empty api_key string. Pass "not-needed" or any placeholder. If you expose the endpoint beyond localhost, put a real auth layer in front of it.
  • Ollama's context default bites people. A 2048-token window silently truncates long prompts and makes a capable model look dumb. For coding or RAG, set OLLAMA_CONTEXT_LENGTH=8192 (or higher) before you benchmark anything.
  • vLLM wants the native formats. Its throughput advantages come from safetensors + PagedAttention. If you feed it GGUF you are on an experimental, under-optimized path — fine for shrinking memory footprint, wrong for production.
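To get a feel for how quickly a 2048-token window runs out, here is a rough sketch that estimates whether a prompt fits. The 4-characters-per-token ratio is a crude assumption for English text, not a real tokenizer, and the 512-token output reserve is an arbitrary placeholder; use an actual token counter before trusting the result.

```python
def rough_token_estimate(text, chars_per_token=4.0):
    """Crude estimate: about 4 characters per token for English (assumption)."""
    return int(len(text) / chars_per_token) + 1


def fits_context(prompt, context_length=2048, reserve_for_output=512):
    """True if the estimated prompt tokens still leave room for the reply."""
    return rough_token_estimate(prompt) + reserve_for_output <= context_length


# 15,000 characters, so roughly 3,750 tokens under the assumption above.
long_prompt = "word " * 3000

print(fits_context(long_prompt, context_length=2048))
print(fits_context(long_prompt, context_length=8192))
```

A RAG prompt with a few retrieved passages reaches this size easily, which is why the default window is worth raising before any benchmarking.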

If you are standing up Ollama specifically, the ollama-config-generator emits the Modelfile, server environment variables (including the context-length override), and a ready-to-paste OpenAI-compatible snippet. To build the ollama run / pull / create commands themselves, use the ollama-command-builder.

Pointing your code at it

This is the whole trick. Take working OpenAI code and change two arguments — base_url and api_key. Here is the Python SDK against a local Ollama instance:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # was https://api.openai.com/v1
    api_key="not-needed",                   # any non-empty string locally
)

resp = client.chat.completions.create(
    model="llama3.1:8b",                    # a model your runtime has pulled
    messages=[{"role": "user", "content": "Explain GQA in one sentence."}],
    stream=True,
)
for chunk in resp:
    if chunk.choices:  # some servers emit chunks with no choices (e.g. usage-only)
        print(chunk.choices[0].delta.content or "", end="", flush=True)

The JavaScript SDK is identical in spirit:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1", // LM Studio in this example
  apiKey: "not-needed",
});

const stream = await client.chat.completions.create({
  model: "qwen2.5-coder:7b", // use the identifier your runtime lists under /v1/models
  messages: [{ role: "user", content: "Write a haiku about KV cache." }],
  stream: true,
});

for await (const part of stream) {
  process.stdout.write(part.choices[0]?.delta?.content ?? "");
}

To switch runtimes, change one URL: :11434/v1 for Ollama, :1234/v1 for LM Studio, :8000/v1 for vLLM, :8080/v1 for llama.cpp. Frameworks follow the same pattern — LangChain's ChatOpenAI takes base_url/openai_api_base, LlamaIndex's OpenAILike takes api_base. Anything that lets you set the base URL will talk to your hardware.
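One way to keep that switch out of the code entirely is to read the runtime from the environment. This is a sketch, not a convention any SDK defines: the variable names LLM_RUNTIME, LLM_HOST, LLM_BASE_URL, and LLM_API_KEY are hypothetical, and the ports are the defaults listed above.

```python
import os

# Default ports for each runtime's OpenAI-compatible server.
RUNTIME_PORTS = {
    "ollama": 11434,
    "lmstudio": 1234,
    "vllm": 8000,
    "llama.cpp": 8080,
}


def base_url_for(runtime, host="localhost"):
    """Build the /v1 base URL for a runtime on a given host."""
    return f"http://{host}:{RUNTIME_PORTS[runtime]}/v1"


def client_settings(env=None):
    """Resolve base_url and api_key from (hypothetical) environment variables."""
    env = os.environ if env is None else env
    runtime = env.get("LLM_RUNTIME", "ollama")
    host = env.get("LLM_HOST", "localhost")
    # An explicit LLM_BASE_URL wins over the runtime/host lookup.
    base_url = env.get("LLM_BASE_URL") or base_url_for(runtime, host)
    return {"base_url": base_url, "api_key": env.get("LLM_API_KEY", "not-needed")}


print(client_settings({"LLM_RUNTIME": "vllm", "LLM_HOST": "gpu-box.local"}))
```

The returned dict can be unpacked straight into the SDK constructor, for example OpenAI(**client_settings()), so moving from a laptop to a shared GPU box becomes a deployment setting rather than a code change.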

The honest caveat: "compatible" is not "identical." Some runtimes ignore parameters they do not implement (seed, logprobs, certain response_format modes), and tool-calling fidelity varies by model and runtime. Test the specific features your app depends on against the specific model you plan to serve. Token accounting also differs — if you are comparing local cost against a cloud bill, run prompts through the llm-token-counter so you are measuring the same thing.
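A small check like the one below can turn "test the features you depend on" into something you run against every runtime and model pair. It is a sketch under assumptions: it works on plain request and response dicts, looks only at the first choice, and covers just three features; the function name is hypothetical.

```python
import json


def unsupported_features(request, response):
    """Name requested features the response shows no sign of honoring."""
    choice = response["choices"][0]
    message = choice.get("message", {})
    missing = []

    # logprobs requested but not returned
    if request.get("logprobs") and choice.get("logprobs") is None:
        missing.append("logprobs")

    # a tool call was required but the model answered in plain text
    if request.get("tool_choice") == "required" and not message.get("tool_calls"):
        missing.append("tool_calls")

    # JSON mode requested but the content does not parse as JSON
    fmt = request.get("response_format") or {}
    if fmt.get("type") == "json_object":
        try:
            json.loads(message.get("content") or "")
        except ValueError:
            missing.append("response_format")

    return missing


request = {"logprobs": True, "response_format": {"type": "json_object"}}
response = {"choices": [{"message": {"role": "assistant", "content": "not json"}}]}
print(unsupported_features(request, response))  # ['logprobs', 'response_format']
```

An empty list does not prove full fidelity; it only means none of these particular checks failed for that one response.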

Serving multiple apps and users

Once the model is a server, it is shared infrastructure. Point your chat UI, your background summarizer, and your IDE plugin at the same /v1 endpoint and they all draw from one loaded model. No per-app copies in memory.

The question that decides your runtime is concurrency. A single decode stream is memory-bandwidth-bound — roughly tok/s ≈ bandwidth ÷ bytes-read-per-token — so one request rarely saturates a modern GPU's compute. Ollama and llama.cpp largely process requests in sequence; fine for a handful of users, a bottleneck when ten people hit it at once.
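The bandwidth rule of thumb is easy to turn into arithmetic. The figures below are illustrative round numbers, not measurements: 5 GB read per token is roughly what an 8B-parameter model at 4 to 5 bits per weight would need, and the two bandwidth values stand in for CPU-class and GPU-class memory. Treat the result as an upper bound for a single stream.

```python
def decode_tokens_per_second(bandwidth_gb_s, gb_read_per_token):
    """Upper-bound estimate: tok/s ~= memory bandwidth / bytes read per token."""
    return bandwidth_gb_s / gb_read_per_token


GB_PER_TOKEN = 5.0  # assumption: the whole set of weights is read once per token

for label, bandwidth in [("CPU-class memory, ~100 GB/s", 100.0),
                         ("GPU-class memory, ~1000 GB/s", 1000.0)]:
    rate = decode_tokens_per_second(bandwidth, GB_PER_TOKEN)
    print(f"{label}: about {rate:.0f} tok/s for one stream")
```

Because each decoded token re-reads the weights, batching several requests through the same read is where the unused compute gets put to work.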

vLLM exists for exactly this. Its continuous (in-flight) batching packs new requests into the running batch as slots free up, and PagedAttention manages the KV cache as fixed-size non-contiguous pages — near-zero fragmentation, prefix sharing, and far higher batch occupancy. The result is dramatically higher aggregate throughput under load, even though any single request is no faster. The trade is operational weight: vLLM wants CUDA-class GPUs and native-format weights, where Ollama wants a one-line install.
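The scheduling difference can be seen in a toy model. This is not vLLM's scheduler: it is a simplified simulation that assumes every request needs a known number of decode steps, that each step advances every active request by one token, and that prefill is free. It compares fixed batches, which wait for their longest member, with a batch that refills a slot as soon as one frees up.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a waiting request joins as soon as a slot is free."""
    queue = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # advance every active request by one token and drop finished ones
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lengths = [2, 8, 3, 8]  # decode steps per request (illustrative)
print(static_batch_steps(lengths, batch_size=2))      # 16
print(continuous_batch_steps(lengths, batch_size=2))  # 13
```

No individual request finishes its own tokens any faster here; the saving comes from short requests no longer holding a slot idle while a long neighbour completes.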

A reasonable rule of thumb:

| Situation | Pick |
|---|---|
| One user, a few apps, easiest setup | Ollama |
| Single GGUF file, no daemon, embedded use | llama.cpp llama-server |
| Desktop, GUI-driven, Apple Silicon | LM Studio |
| Many concurrent users, production throughput | vLLM |

Before you provision anything, sanity-check the hardware. The what-llm-can-i-run tool detects your GPU and tells you which models fit; the llm-vram-calculator breaks down weights plus KV cache plus overhead; and the llm-inference-speed-calculator estimates tokens/sec from your bandwidth so you know whether one box covers your concurrency or you need to scale out.
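For a back-of-envelope version of the weights plus KV cache plus overhead breakdown, the sketch below may help. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption matching a typical 8B-class model with grouped-query attention, 4.5 bits per weight approximates a 4-bit quantization, and the 10% overhead factor is a placeholder rather than a measured value.

```python
GIB = 1024 ** 3


def weights_bytes(params, bits_per_weight):
    """Memory for the model weights at a given quantization level."""
    return params * bits_per_weight / 8


def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """KV cache for one sequence: a K and a V tensor per layer (hence the 2)."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem


def total_gib(params, bits_per_weight, layers, kv_heads, head_dim,
              context_tokens, overhead_fraction=0.10):
    """Weights + KV cache, inflated by an assumed overhead fraction."""
    base = weights_bytes(params, bits_per_weight) + kv_cache_bytes(
        layers, kv_heads, head_dim, context_tokens)
    return base * (1 + overhead_fraction) / GIB


# Assumed shape for an 8B-class model, fp16 KV cache, 8192-token context.
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=8192)
print(f"KV cache per sequence: {kv / GIB:.1f} GiB")  # 1.0 GiB
print(f"Estimated total: {total_gib(8e9, 4.5, 32, 8, 128, 8192):.1f} GiB")
```

The KV cache term scales with both context length and the number of concurrent sequences, so a box that fits one user at 8192 tokens may not fit ten.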

The missing piece: reliability and failover

A single self-hosted endpoint has one failure mode that the cloud doesn't: it can go away. The box reboots, the GPU is saturated by a batch job, you push a bad config, the office internet drops. When your apps address the hardware directly, every one of them breaks at once.

The fix is to stop pointing apps at hardware. Put a gateway in front that owns one stable endpoint and routes each request through a tiered fallback:

[Figure: Request routing through a gateway. Your apps (one base_url) → Gateway (route per request) → 1. Edge cache (repeat / identical prompts) → 2. Your hardware (markup-free baseline) → 3. Cloud burst (only when busy / offline).]

The order matters. An edge cache answers identical or near-identical prompts without touching a GPU at all — instant, free, and it absorbs the embarrassingly repetitive traffic most apps generate. A miss falls through to your own hardware, the markup-free baseline where you pay for electricity and amortized silicon but no per-token fees. Only when your node is saturated or unreachable does the request burst to a cloud provider — you eat the API cost for that slice of traffic instead of returning an error.
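The tier order can be sketched in a few lines. This is a toy illustration, not how any particular gateway is implemented: the class and backend names are hypothetical, the cache is an in-memory dict that matches only byte-identical requests (no near-duplicate matching, no expiry), and the backends are stub functions standing in for HTTP calls.

```python
import hashlib
import json


class TieredRouter:
    """Route each request: cache first, then local hardware, then cloud."""

    def __init__(self, local_backend, cloud_backend):
        self.cache = {}
        self.local = local_backend
        self.cloud = cloud_backend

    @staticmethod
    def cache_key(request):
        # Canonical JSON so key order in the request does not matter.
        canonical = json.dumps(request, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def route(self, request):
        key = self.cache_key(request)
        if key in self.cache:
            return "cache", self.cache[key]
        try:
            tier, response = "local", self.local(request)
        except Exception:  # broad on purpose for the sketch; narrow it in real code
            tier, response = "cloud", self.cloud(request)
        self.cache[key] = response
        return tier, response


def offline_local(request):
    raise ConnectionError("node offline")  # simulate an unreachable box


def stub_cloud(request):
    return {"choices": [{"message": {"role": "assistant", "content": "from cloud"}}]}


router = TieredRouter(offline_local, stub_cloud)
req = {"model": "example-model", "messages": [{"role": "user", "content": "hi"}]}
print(router.route(req)[0])  # cloud
print(router.route(req)[0])  # cache
```

A real gateway also has to decide what "saturated" means (queue depth, latency, health checks) and which responses are safe to cache, neither of which this sketch attempts.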

You can build this yourself. LiteLLM offers an OpenAI-compatible proxy with model lists and fallback ordering; nginx upstream blocks with health checks and proxy_next_upstream give you crude failover. Both work. Both also become yours to operate: health-checking, cache invalidation, key management, and the cloud-provider plumbing for the burst tier.

If you would rather not run that yourself, a managed AI gateway does the routing as a service. WideAreaAI is an edge-first AI gateway built on exactly this model: your apps get one OpenAI-compatible endpoint, and each request is routed edge cache → your own hardware (a llama.cpp node reached over a Cloudflare Tunnel) → cloud burst failover. The pitch is "own your baseline, burst to the cloud" — edge-first, cloud when you choose, with no per-token fees on your own hardware. It does request-level routing, failover, and caching across whole nodes; it is not splitting one model's tensors across machines (that is clustering — a different problem). It is the natural next step once you have hit a real wall: you need a stable endpoint, you need failover, or you need on-prem inference to stay on-prem.

Whether you build or buy, run the numbers first. The self-hosted-llm-cost-calculator compares cloud-API spend against owned-hardware break-even, which is what tells you how much traffic should live on your baseline versus how much is cheaper to burst.

Conclusion

OpenAI-compatibility is what turns a local model from a science project into infrastructure. Because Ollama, LM Studio, llama.cpp, and vLLM all speak the same /v1 dialect, your existing apps and SDKs adopt your hardware with a two-line change — base_url and api_key — and nothing else. Choose the runtime that matches your concurrency: Ollama or llama.cpp for simple single-user serving, vLLM when many users hit the model at once. Then close the one gap the cloud doesn't have by putting a gateway in front that routes cache → your hardware → cloud burst, so a busy or offline node degrades gracefully instead of taking every app down with it. Own the baseline; keep the cloud as a release valve, not a dependency.

Frequently Asked Questions


Yes. The official OpenAI SDKs let you override two things: base_url and api_key. Point base_url at your runtime (for example http://localhost:11434/v1 for Ollama) and pass any non-empty string as the key. Every call to /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models works the same way it does against api.openai.com. The only change to your code is those two lines.

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1. LM Studio's local server defaults to http://localhost:1234/v1 (and v0.4.1 added an Anthropic-compatible /v1/messages endpoint that works with Claude Code). vLLM's server defaults to http://your-host:8000/v1. llama.cpp's llama-server exposes the same /v1 surface on whatever port you start it with (commonly 8080). All four are drop-in OpenAI-compatible.

Put a gateway in front of your model so apps never address the hardware directly. The gateway holds one stable endpoint and routes each request: serve an edge cache hit if there is one, otherwise send it to your own hardware, and burst to a cloud provider only when your node is busy or offline. You can build this with LiteLLM or nginx upstream failover, or use a managed AI gateway. The key property is that your application's base_url never changes — the routing happens behind it.

Yes. Run the model as a server (Ollama or vLLM) and point every app at the same endpoint. For light, bursty load Ollama is fine. For real concurrency — many simultaneous users — use vLLM, whose continuous batching and PagedAttention keep the GPU busy across overlapping requests instead of processing them one at a time.

No. Ollama, LM Studio, and llama.cpp all run on CPU or Apple Silicon and still expose the same /v1 API; they are just slower. vLLM is the exception — it is a GPU serving engine and expects CUDA-class hardware. Use the what-llm-can-i-run tool to see which models your machine can actually serve before you commit.

The most common cause with Ollama is the default 2048-token context window, which silently truncates long prompts. Set OLLAMA_CONTEXT_LENGTH (for example 8192 or higher) to match your workload. Beyond that, check your quantization level — anything at or below Q3 degrades noticeably versus the Q4_K_M or Q8_0 weights a cloud host typically serves.
