OpenAI Codex CLI and the OpenAI API bill per token, and a busy day of coding can add up faster than expected. The good news is that most spend is avoidable: it comes from re-sending context, using a premium model for trivial tasks, and paying full price for requests that could have been cached or routed elsewhere. This guide walks through the highest-impact ways to lower the bill without giving up the assistance you rely on.
Understand Where Tokens Go
You are billed for input tokens (everything you send) and output tokens (everything the model generates). In an agentic CLI like Codex, input usually dominates because every turn re-sends:
- The system prompt and tool definitions
- The running conversation history
- File context Codex has read into the session
A single long session over a large file can re-send tens of thousands of input tokens per turn. Output is comparatively small. That means your biggest savings come from controlling context, not from writing shorter questions.
Check your actual usage before optimizing:
# OpenAI usage dashboard breaks spend down by model and day
# platform.openai.com/usage
Tactic 1: Trim and Reset Context
Long-lived sessions accumulate history that you pay for on every turn.
- Start a fresh session for unrelated tasks. Do not ask Codex about your auth module in the same session you were debugging CSS.
- Avoid pulling huge files into context when a snippet will do. Reference the specific function rather than the whole 2,000-line file.
- Summarize and restart for very long sessions instead of letting history grow without bound.
This single habit often produces the largest reduction because it attacks the multiplier (history re-sent per turn), not a one-time cost.
Tactic 2: Lean on Prompt Caching
OpenAI automatically caches identical prompt prefixes longer than 1,024 tokens and bills the cached portion at a steep discount. You do not enable it, but you can structure prompts to benefit:
- Keep the stable content at the front. Put the system message, instructions, and unchanging file context first; put the variable part (your specific question) last. A stable prefix stays cacheable across calls.
- Reuse the same context shape. Re-reading the same file in the same order keeps the prefix identical, so repeat calls hit the cache.
For scripted OpenAI SDK usage, the same rule applies: identical leading messages across requests get the cached discount automatically.
Tactic 3: Match the Model to the Task
The most capable coding model is also the most expensive. Most day-to-day requests do not need it.
Set up profiles in ~/.codex/config.toml so switching is one flag:
# ~/.codex/config.toml
# Cheap default for routine work
model = "gpt-5.2-codex-mini"
[profiles.heavy]
model_provider = "openai"
model = "gpt-5.2-codex"
# Routine — cheap model
codex "add a docstring to this function"
# Hard — premium model, only when it earns its cost
codex --profile heavy "refactor this module to use dependency injection"
Reserving the premium model for genuinely hard problems is one of the easiest structural savings available.
Tactic 4: Batch Independent Requests
If you are scripting many small, independent completions (for example generating docstrings for 200 functions), do not send them as one giant interactive session. Two cheaper paths:
- The Batch API processes large request sets asynchronously at a significant discount versus synchronous calls, when you do not need an instant response.
- Group locally first. Combine several tiny related asks into one request rather than paying the fixed prompt overhead 200 times.
Tactic 5: Route Routine Work to a GPU You Own
The tactics above shrink the bill; local routing can remove it for a large share of requests. Because Codex CLI and the OpenAI SDK speak the OpenAI-compatible API, you can point them at a local coding model (Qwen-Coder, DeepSeek-Coder, CodeLlama) served as an OpenAI-compatible endpoint. Requests that model handles well, explanations, simple edits, docstrings, cost nothing per token.
# Route to a local OpenAI-compatible server
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="local"
codex --model qwen2.5-coder:14b "explain this function"
The catch with a hard split is that local hardware is not always the right place to run a request: a complex refactor still wants cloud quality, and your GPU is sometimes busy or offline. That is where an edge-first gateway helps. A gateway sits at the base URL Codex already uses, serves repeated prompts from an edge cache, runs eligible requests on the GPU you own for free, and bursts to the cloud only on failover or for tasks that need it. Wide Area AI implements exactly this routing, so you get the cost profile of local inference with the reliability of the cloud, without changing how Codex is configured.
Putting It Together
A practical low-cost setup combines all five tactics:
- Reset sessions between unrelated tasks and keep context tight.
- Order prompts so the stable prefix stays cacheable.
- Default to a cheap or local model; escalate to premium per command.
- Batch large independent jobs through the Batch API.
- Route routine requests to hardware you own and burst to cloud only when needed.
Each tactic is independent, so adopt them incrementally and watch the usage dashboard confirm the drop.
Troubleshooting
My Bill Did Not Drop After Switching Models
Confirm the cheap model is actually the default. Run codex --version and check that no environment variable or profile is forcing the premium model. Verify against the per-model breakdown in the usage dashboard.
Caching Does Not Seem to Apply
Cached discounts require an identical prefix over 1,024 tokens. If the variable part of your prompt is near the front, every call invalidates the cache. Move stable content forward.
Local Model Quality Is Too Low for Some Tasks
That is expected. Use local routing for routine work and escalate hard tasks to the cloud. A gateway automates this split so you do not have to decide per request.
Next Steps
- Point Codex CLI at a custom base URL to route requests through a gateway
- Run Codex CLI with local models
- Serve Qwen-Coder as an OpenAI-compatible endpoint
- Compare AI coding CLI pricing across tools