If you run a terminal coding agent across more than one model, you hit the same wall everyone hits: one model is great for the hard architectural turn but burns money on a trivial subagent fan-out, another is fast but dumb, and your premium API key throws a 429 at the worst possible moment. The naive fix is to keep swapping --model flags by hand. The better fix is to declare your model topology once and let the agent route by intent.
Oh My Pi (the omp CLI) is built for exactly this. It's an open-source terminal coding agent — a fork of Mario Zechner's pi framework, written in TypeScript on the Bun runtime with Rust native addons. It bills itself as "a coding agent with the IDE wired in": LSP, a DAP debugger, persistent Python/JS cells, browser automation, subagents, and multi-provider model routing. That last piece is what this post is about. We'll cover the models.yml provider format, role-based routing, local models, and 429 fallback chains.
Install and config layout
Install with Bun (bun install -g @oh-my-pi/pi-coding-agent) or the shell installer (curl -fsSL https://omp.sh/install | sh; on Windows, irm https://omp.sh/install.ps1 | iex). You need Bun >= 1.3.14. The project ships extremely frequently — multiple releases per day is normal — so pin a version if you need reproducibility.
Two config surfaces matter here:
~/.omp/agent/models.yml— declares custom providers and model equivalence. Top-level keys areproviders:andequivalence:(withoverrides:andexclude:).- Settings (
settings.yml) — declares routing: which model serves which role, plus retry and fallback behavior.
Keeping providers and routing in separate files is deliberate. models.yml is "what can I talk to," settings is "what do I use, and when."
Declaring providers in models.yml
omp advertises 40+ built-in providers — frontier APIs (Anthropic, OpenAI, Google, xAI/Grok, Mistral, Groq, Cerebras, Fireworks, Together, DeepSeek), coding plans (Cursor, GitHub Copilot, GitLab Duo), and local runtimes (Ollama, LM Studio, llama.cpp). DeepSeek even publishes its own omp integration guide. You only touch models.yml when you need something custom or an override.
A provider entry needs a baseUrl, an api (endpoint type), and credentials. The supported api values are:
api: value | Use for |
|---|---|
openai-completions | Generic OpenAI-compatible servers (vLLM, most proxies) |
openai-responses | OpenAI Responses API |
openai-codex-responses | Codex-flavored Responses |
azure-openai-responses | Azure OpenAI |
anthropic-messages | Anthropic Messages API |
google-generative-ai | Gemini API |
google-vertex | Vertex AI |
A custom OpenAI-compatible provider looks like this:
# ~/.omp/agent/models.yml
providers:
my-gateway:
baseUrl: https://gateway.internal.example.com/v1
api: openai-completions
apiKey: "!op read op://dev/gateway/api-key"
discovery:
type: openai-models-list
That ! prefix is the secrets feature: prefix any value with ! and omp runs it as a command and uses stdout. "!op read op://dev/gateway/api-key" pulls from 1Password; "!bw get password omp-team-key" pulls from Bitwarden. apiKey can also be an env var name or a literal — but command-resolved secrets keep keys out of your dotfiles. Auth methods are apiKey, none, and oauth.
The my-gateway example points at a generic proxy, but the same shape works for a local-first AI gateway that itself routes inference to your own hardware first and fails over to a cloud provider when local nodes are unavailable. Because Wide Area AI speaks the OpenAI API, you set api: openai-completions and point baseUrl at it like any other custom provider — locally served requests carry zero per-token cost, and the gateway's own failover backs you up when your hardware is offline.
A full custom model entry supports more: id, name, api, reasoning (bool), input (capabilities like [text]), cost (input/output/cacheRead/cacheWrite), contextWindow, maxTokens, headers, and contextPromotionTarget. There's also a compat block for non-standard servers — supportsStore, supportsDeveloperRole, supportsReasoningEffort, maxTokensField (e.g. max_completion_tokens), and routing hints.
Overriding a built-in provider
You don't have to redefine models to point a built-in provider at a corporate proxy. Override just the transport:
providers:
openai:
baseUrl: https://llm-proxy.corp.example.com/v1
headers:
X-Org-Id: platform-team
If your proxy fronts Anthropic and chokes on the strict tool field, add disableStrictTools: true.
Role-based routing
This is the payoff. omp defines built-in roles and lets you bind each to a different model. Configure them in settings, not in models.yml:
# settings.yml
modelRoles:
default: anthropic/claude-sonnet-4-5
smol: openai/gpt-4.1-mini
slow: anthropic/claude-opus-4-5:high
vision: gemini/gemini-3-pro-preview
plan: anthropic/claude-opus-4-5
commit: openai/gpt-4.1-mini
The documented role purposes:
- default — standard conversation turns.
- smol — cost-efficient subagent fan-out and lightweight tasks. Point this at something cheap and fast (
gpt-4.1-mini, Groq, Cerebras) and you cut the bill on parallel work dramatically. - slow — deep reasoning. This is where a
:highsuffix earns its keep. - plan — planning mode.
- commit — changelog and commit-message generation.
The remaining roles — vision, designer, task — exist, but their purposes aren't documented in detail, so I'd leave them at sensible defaults until you have a reason to change them.
The
:high/:lowstyle suffix is a thinking selector, not a model name. Valid values are:minimal,:low,:medium,:high, and:xhigh. Soanthropic/claude-opus-4-5:highis the Opus model running at high reasoning effort.
You can override roles without editing config: --model (default), --smol/PI_SMOL_MODEL, --slow/PI_SLOW_MODEL, and --plan/PI_PLAN_MODEL. Mid-session, /model cycles models per role; modelProviderOrder and cycleOrder control the cycle ordering.
Local models
Local providers (Ollama, llama.cpp, LM Studio) get implicit auto-discovery — omp adds them automatically when it sees them running, and they all use auth: none, so no API key. Discovery types include ollama (via /api/tags and /api/show), llama.cpp, lm-studio (GET /models), plus openai-models-list and proxy.
Defaults:
| Runtime | Default baseUrl | Override env var |
|---|---|---|
| Ollama | http://127.0.0.1:11434 | OLLAMA_BASE_URL / OLLAMA_HOST |
| llama.cpp | http://127.0.0.1:8080 | LLAMA_CPP_BASE_URL |
| LM Studio | http://127.0.0.1:1234/v1 | LM_STUDIO_BASE_URL |
For Ollama, the context window comes from OLLAMA_CONTEXT_LENGTH or metadata (default 128000), and you can set an optional OLLAMA_API_KEY. If you want it explicit rather than discovered:
providers:
ollama:
baseUrl: http://127.0.0.1:11434
api: openai-responses
auth: none
discovery:
type: ollama
vLLM is the exception — there's no dedicated discovery type. Treat it as a generic OpenAI-compatible server:
providers:
vllm:
baseUrl: http://127.0.0.1:8000/v1
api: openai-completions
auth: none
discovery:
type: openai-models-list
compat:
supportsDeveloperRole: false
supportsReasoningEffort: false
The compat flags matter: a lot of vLLM builds reject the developer role or a reasoning_effort field, and turning those off avoids a wall of 400s.
Don't confuse these chat providers with omp's optional embedded "tiny models" (transformers.js / onnxruntime under Bun) used for session-title generation, Mnemopi memory extraction, and the
autothinking-difficulty classifier. Those default to "online" and download nothing unless you opt in — they're a separate concern from your Ollama/vLLM chat models.
Surviving 429s: retry and fallback chains
The retry engine is keyed by role. Define an ordered chain of fallbacks under retry.fallbackChains:
retry:
modelFallback: true
fallbackRevertPolicy: cooldown-expiry # or: never
fallbackChains:
default:
- anthropic/claude-sonnet-4-5
- openai/gpt-4.1
- groq/llama-3.3-70b
smol:
- openai/gpt-4.1-mini
- groq/llama-3.1-8b-instant
When the primary throws 429s or hits a quota wall, the next entry takes over the rest of the turn. With fallbackRevertPolicy: cooldown-expiry, omp reverts to the primary once the cooldown lapses; with never, it stays put.
The retry defaults: enabled: true, maxRetries: 10, baseDelayMs: 500, maxDelayMs: 300000 (5 min), modelFallback: true, fallbackChains: {}. omp retries on 429/500/502/503/504, overloaded/rate-limit/usage-limit/too-many-requests, network/socket/timeout failures, and classifier refusals. Backoff is exponential with jitter (75–100% of nominal): 500ms, 1s, 2s, 4s, then 8s capped — though a provider's retry-after header can extend the wait up to maxDelayMs. On a credential or model-fallback switch the delay is forced to 0 for an immediate retry.
One distinction worth burning into memory: fallback chains handle rate limits, not context overflow. Context-overflow errors are excluded from retry classification. When you blow past the window, omp recovers via context promotion (contextPromotionTarget, jumping to a larger-context sibling) or auto-compaction. If you're seeing context errors and expecting your fallback chain to catch them, it won't — that's by design.
Bottom line
Set up omp once and stop babysitting flags. Declare custom and proxied providers in ~/.omp/agent/models.yml with command-resolved secrets so no keys land in plaintext. Bind roles to models in settings — a cheap smol for fan-out, a :high slow for the hard turns, a frontier default. Let local runtimes auto-discover, and configure vLLM as a generic OpenAI-compatible provider with the right compat flags. Then add retry.fallbackChains per role so a 429 rotates to the next model instead of killing your turn — while remembering that context overflow is a separate path through promotion and compaction. Get that topology right and the agent routes by intent: the right model, at the right cost, even when your primary provider is having a bad day.