Skip to main content
Home/Blog/Setting Up Model Providers and Role-Based Routing in Oh My Pi
Developer Tools

Setting Up Model Providers and Role-Based Routing in Oh My Pi

A practical guide to configuring providers in models.yml, wiring role-based routing (default, smol, slow, plan, commit), running local models, and surviving 429s with fallback chains in Oh My Pi.

By Sean

If you run a terminal coding agent across more than one model, you hit the same wall everyone hits: one model is great for the hard architectural turn but burns money on a trivial subagent fan-out, another is fast but dumb, and your premium API key throws a 429 at the worst possible moment. The naive fix is to keep swapping --model flags by hand. The better fix is to declare your model topology once and let the agent route by intent.

Oh My Pi (the omp CLI) is built for exactly this. It's an open-source terminal coding agent — a fork of Mario Zechner's pi framework, written in TypeScript on the Bun runtime with Rust native addons. It bills itself as "a coding agent with the IDE wired in": LSP, a DAP debugger, persistent Python/JS cells, browser automation, subagents, and multi-provider model routing. That last piece is what this post is about. We'll cover the models.yml provider format, role-based routing, local models, and 429 fallback chains.

Install and config layout

Install with Bun (bun install -g @oh-my-pi/pi-coding-agent) or the shell installer (curl -fsSL https://omp.sh/install | sh; on Windows, irm https://omp.sh/install.ps1 | iex). You need Bun >= 1.3.14. The project ships extremely frequently — multiple releases per day is normal — so pin a version if you need reproducibility.

Two config surfaces matter here:

  • ~/.omp/agent/models.yml — declares custom providers and model equivalence. Top-level keys are providers: and equivalence: (with overrides: and exclude:).
  • Settings (settings.yml) — declares routing: which model serves which role, plus retry and fallback behavior.

Keeping providers and routing in separate files is deliberate. models.yml is "what can I talk to," settings is "what do I use, and when."

Declaring providers in models.yml

omp advertises 40+ built-in providers — frontier APIs (Anthropic, OpenAI, Google, xAI/Grok, Mistral, Groq, Cerebras, Fireworks, Together, DeepSeek), coding plans (Cursor, GitHub Copilot, GitLab Duo), and local runtimes (Ollama, LM Studio, llama.cpp). DeepSeek even publishes its own omp integration guide. You only touch models.yml when you need something custom or an override.

A provider entry needs a baseUrl, an api (endpoint type), and credentials. The supported api values are:

api: valueUse for
openai-completionsGeneric OpenAI-compatible servers (vLLM, most proxies)
openai-responsesOpenAI Responses API
openai-codex-responsesCodex-flavored Responses
azure-openai-responsesAzure OpenAI
anthropic-messagesAnthropic Messages API
google-generative-aiGemini API
google-vertexVertex AI

A custom OpenAI-compatible provider looks like this:

# ~/.omp/agent/models.yml
providers:
  my-gateway:
    baseUrl: https://gateway.internal.example.com/v1
    api: openai-completions
    apiKey: "!op read op://dev/gateway/api-key"
    discovery:
      type: openai-models-list

That ! prefix is the secrets feature: prefix any value with ! and omp runs it as a command and uses stdout. "!op read op://dev/gateway/api-key" pulls from 1Password; "!bw get password omp-team-key" pulls from Bitwarden. apiKey can also be an env var name or a literal — but command-resolved secrets keep keys out of your dotfiles. Auth methods are apiKey, none, and oauth.

The my-gateway example points at a generic proxy, but the same shape works for a local-first AI gateway that itself routes inference to your own hardware first and fails over to a cloud provider when local nodes are unavailable. Because Wide Area AI speaks the OpenAI API, you set api: openai-completions and point baseUrl at it like any other custom provider — locally served requests carry zero per-token cost, and the gateway's own failover backs you up when your hardware is offline.

A full custom model entry supports more: id, name, api, reasoning (bool), input (capabilities like [text]), cost (input/output/cacheRead/cacheWrite), contextWindow, maxTokens, headers, and contextPromotionTarget. There's also a compat block for non-standard servers — supportsStore, supportsDeveloperRole, supportsReasoningEffort, maxTokensField (e.g. max_completion_tokens), and routing hints.

Overriding a built-in provider

You don't have to redefine models to point a built-in provider at a corporate proxy. Override just the transport:

providers:
  openai:
    baseUrl: https://llm-proxy.corp.example.com/v1
    headers:
      X-Org-Id: platform-team

If your proxy fronts Anthropic and chokes on the strict tool field, add disableStrictTools: true.

Role-based routing

This is the payoff. omp defines built-in roles and lets you bind each to a different model. Configure them in settings, not in models.yml:

# settings.yml
modelRoles:
  default: anthropic/claude-sonnet-4-5
  smol: openai/gpt-4.1-mini
  slow: anthropic/claude-opus-4-5:high
  vision: gemini/gemini-3-pro-preview
  plan: anthropic/claude-opus-4-5
  commit: openai/gpt-4.1-mini

The documented role purposes:

  • default — standard conversation turns.
  • smol — cost-efficient subagent fan-out and lightweight tasks. Point this at something cheap and fast (gpt-4.1-mini, Groq, Cerebras) and you cut the bill on parallel work dramatically.
  • slow — deep reasoning. This is where a :high suffix earns its keep.
  • plan — planning mode.
  • commit — changelog and commit-message generation.

The remaining roles — vision, designer, task — exist, but their purposes aren't documented in detail, so I'd leave them at sensible defaults until you have a reason to change them.

The :high / :low style suffix is a thinking selector, not a model name. Valid values are :minimal, :low, :medium, :high, and :xhigh. So anthropic/claude-opus-4-5:high is the Opus model running at high reasoning effort.

You can override roles without editing config: --model (default), --smol/PI_SMOL_MODEL, --slow/PI_SLOW_MODEL, and --plan/PI_PLAN_MODEL. Mid-session, /model cycles models per role; modelProviderOrder and cycleOrder control the cycle ordering.

Local models

Local providers (Ollama, llama.cpp, LM Studio) get implicit auto-discovery — omp adds them automatically when it sees them running, and they all use auth: none, so no API key. Discovery types include ollama (via /api/tags and /api/show), llama.cpp, lm-studio (GET /models), plus openai-models-list and proxy.

Defaults:

RuntimeDefault baseUrlOverride env var
Ollamahttp://127.0.0.1:11434OLLAMA_BASE_URL / OLLAMA_HOST
llama.cpphttp://127.0.0.1:8080LLAMA_CPP_BASE_URL
LM Studiohttp://127.0.0.1:1234/v1LM_STUDIO_BASE_URL

For Ollama, the context window comes from OLLAMA_CONTEXT_LENGTH or metadata (default 128000), and you can set an optional OLLAMA_API_KEY. If you want it explicit rather than discovered:

providers:
  ollama:
    baseUrl: http://127.0.0.1:11434
    api: openai-responses
    auth: none
    discovery:
      type: ollama

vLLM is the exception — there's no dedicated discovery type. Treat it as a generic OpenAI-compatible server:

providers:
  vllm:
    baseUrl: http://127.0.0.1:8000/v1
    api: openai-completions
    auth: none
    discovery:
      type: openai-models-list
    compat:
      supportsDeveloperRole: false
      supportsReasoningEffort: false

The compat flags matter: a lot of vLLM builds reject the developer role or a reasoning_effort field, and turning those off avoids a wall of 400s.

Don't confuse these chat providers with omp's optional embedded "tiny models" (transformers.js / onnxruntime under Bun) used for session-title generation, Mnemopi memory extraction, and the auto thinking-difficulty classifier. Those default to "online" and download nothing unless you opt in — they're a separate concern from your Ollama/vLLM chat models.

Surviving 429s: retry and fallback chains

The retry engine is keyed by role. Define an ordered chain of fallbacks under retry.fallbackChains:

retry:
  modelFallback: true
  fallbackRevertPolicy: cooldown-expiry   # or: never
  fallbackChains:
    default:
      - anthropic/claude-sonnet-4-5
      - openai/gpt-4.1
      - groq/llama-3.3-70b
    smol:
      - openai/gpt-4.1-mini
      - groq/llama-3.1-8b-instant

When the primary throws 429s or hits a quota wall, the next entry takes over the rest of the turn. With fallbackRevertPolicy: cooldown-expiry, omp reverts to the primary once the cooldown lapses; with never, it stays put.

The retry defaults: enabled: true, maxRetries: 10, baseDelayMs: 500, maxDelayMs: 300000 (5 min), modelFallback: true, fallbackChains: {}. omp retries on 429/500/502/503/504, overloaded/rate-limit/usage-limit/too-many-requests, network/socket/timeout failures, and classifier refusals. Backoff is exponential with jitter (75–100% of nominal): 500ms, 1s, 2s, 4s, then 8s capped — though a provider's retry-after header can extend the wait up to maxDelayMs. On a credential or model-fallback switch the delay is forced to 0 for an immediate retry.

One distinction worth burning into memory: fallback chains handle rate limits, not context overflow. Context-overflow errors are excluded from retry classification. When you blow past the window, omp recovers via context promotion (contextPromotionTarget, jumping to a larger-context sibling) or auto-compaction. If you're seeing context errors and expecting your fallback chain to catch them, it won't — that's by design.

Bottom line

Set up omp once and stop babysitting flags. Declare custom and proxied providers in ~/.omp/agent/models.yml with command-resolved secrets so no keys land in plaintext. Bind roles to models in settings — a cheap smol for fan-out, a :high slow for the hard turns, a frontier default. Let local runtimes auto-discover, and configure vLLM as a generic OpenAI-compatible provider with the right compat flags. Then add retry.fallbackChains per role so a 429 rotates to the next model instead of killing your turn — while remembering that context overflow is a separate path through promotion and compaction. Get that topology right and the agent routes by intent: the right model, at the right cost, even when your primary provider is having a bad day.

Frequently Asked Questions

Find answers to common questions

Custom providers live in ~/.omp/agent/models.yml. The two top-level keys are providers: and equivalence: (the latter takes overrides: and exclude:). A custom provider needs at least baseUrl, api (the endpoint type), and apiKey unless you set auth: none. Role routing is configured separately in settings (settings.yml), not in models.yml.

Use the modelRoles block in settings. Built-in roles are default, smol, slow, vision, plan, designer, commit, and task. Each value is a provider/modelId string, optionally with a thinking suffix, for example default: anthropic/claude-sonnet-4-5 and slow: anthropic/claude-opus-4-5:high. You can also override per role with env vars or flags like PI_SMOL_MODEL/--smol and PI_SLOW_MODEL/--slow.

Add an entry under providers: in models.yml with baseUrl, the matching api type (openai-completions, openai-responses, anthropic-messages, google-generative-ai, etc.), and apiKey. For an OpenAI-compatible server you usually want api: openai-completions plus an optional discovery: { type: openai-models-list } to auto-list models.

Configure retry.fallbackChains, a record keyed by role name with an ordered array of fallback model ids. When the primary throws 429s or hits a quota wall, the next entry takes over the rest of the turn and is restored on cooldown. retry.modelFallback defaults to true, and retry.fallbackRevertPolicy defaults to cooldown-expiry.

They solve different failures. retry.fallbackChains handles rate-limit and quota errors (429s) by swapping to another model. Context overflow is handled separately through context promotion (contextPromotionTarget, switching to a larger-context sibling) or auto-compaction. Context-overflow errors are explicitly excluded from retry classification, so they never trigger the fallback chain.

omp auto-discovers local providers, so usually you just run the server. Ollama defaults to http://127.0.0.1:11434, llama.cpp to http://127.0.0.1:8080, and LM Studio to http://127.0.0.1:1234/v1. All use auth: none. You can override base URLs with env vars (OLLAMA_BASE_URL, LLAMA_CPP_BASE_URL, LM_STUDIO_BASE_URL) or declare them explicitly with a discovery block.

vLLM isn't a dedicated discovery type. Configure it as a generic OpenAI-compatible provider with api: openai-completions, a baseUrl like http://127.0.0.1:8000/v1, and auth: none. If the server rejects the developer role or reasoning_effort, set compat.supportsDeveloperRole: false and compat.supportsReasoningEffort: false.

Prefix the value with ! to resolve it from a command, for example apiKey: "!op read op://dev/openai/api-key" for 1Password or "!bw get password omp-team-key" for Bitwarden. apiKey can also reference an environment variable name or a literal value.

They select the thinking/reasoning effort for that model. Valid suffixes are :minimal, :low, :medium, :high, and :xhigh, appended to the model id, for example anthropic/claude-opus-4-5:high for deep reasoning on your slow role.

Building Something Great?

Our development team builds secure, scalable applications. From APIs to full platforms, we turn your ideas into production-ready software.