OpenAI Codex CLI can connect to any OpenAI-compatible API endpoint, including local inference servers like Ollama and LM Studio. This enables you to run coding tasks entirely on your own hardware, keeping sensitive code private and eliminating API costs.
Why Use Local Models
Running local models with Codex CLI offers several advantages:
- Privacy: Your code never leaves your machine. Ideal for proprietary codebases, client work, or sensitive projects
- Cost savings: No per-token charges after initial hardware investment
- Offline access: Work without internet connectivity
- No rate limits: Run as many requests as your hardware can handle
- Experimentation: Test different models without account restrictions
The tradeoff is that local models typically provide lower quality results than GPT-5-Codex, especially for complex multi-file refactoring. However, for routine tasks like code explanation, simple edits, and documentation, local models perform adequately.
Hardware Requirements
Local model performance depends heavily on your hardware. Here are the minimum and recommended specifications:
Minimum Requirements (7B Parameter Models)
| Component | Specification |
|---|---|
| RAM | 16GB |
| Storage | 20GB free space |
| CPU | Modern multi-core processor |
With these specs, you can run models like CodeLlama-7B, DeepSeek-Coder-6.7B, and similar lightweight coding models.
Recommended Requirements (13B-34B Parameter Models)
| Component | Specification |
|---|---|
| RAM | 32GB+ |
| GPU | NVIDIA with 8GB+ VRAM or Apple Silicon with 16GB+ unified memory |
| Storage | 100GB+ free space |
This configuration enables models like CodeLlama-34B, DeepSeek-Coder-33B, and Mixtral-8x7B, which provide significantly better coding assistance.
Optimal Setup (70B+ Parameter Models)
For the best local experience, you need one of the following:
- NVIDIA GPU with 24GB+ VRAM (RTX 4090, A6000)
- Apple Silicon Mac with 64GB+ unified memory (M2 Max, M3 Max, M4 Max)
- Multi-GPU setup with NVLink
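Not sure which tier your machine falls into? A few standard commands report the relevant numbers (this assumes an NVIDIA driver that ships `nvidia-smi`; the macOS value is reported in bytes):

```bash
# Total system RAM (Linux)
free -h

# Total system RAM (macOS, in bytes)
sysctl hw.memsize

# VRAM on NVIDIA GPUs
nvidia-smi --query-gpu=name,memory.total --format=csv
```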
Setting Up Ollama
Ollama is the simplest way to run local models. It handles model downloading, quantization, and provides an OpenAI-compatible API.
Installation
macOS:

```bash
brew install ollama
```

Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows: Download the installer from ollama.com.
Download a Coding Model
Pull a code-specialized model:
```bash
# Recommended for most users (6.7B parameters, ~4GB)
ollama pull deepseek-coder:6.7b

# Better quality if you have 16GB+ RAM
ollama pull codellama:13b-instruct

# Best local coding model if you have 32GB+ RAM or GPU
ollama pull deepseek-coder:33b
```
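After pulling, you can confirm what landed on disk and inspect a model's parameters with the standard Ollama commands:

```bash
# List downloaded models and their on-disk sizes
ollama list

# Show details (parameters, prompt template) for a specific model
ollama show deepseek-coder:6.7b
```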
Start the Ollama Server
```bash
ollama serve
```
By default, Ollama runs on http://localhost:11434.
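Before pointing Codex at it, a quick smoke test against the OpenAI-compatible endpoint confirms that the server and model respond; the model name must match one you pulled above:

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-coder:6.7b",
    "messages": [{"role": "user", "content": "Write a one-line hello world in Python"}]
  }'
```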
Configure Codex CLI for Ollama
Set environment variables to point Codex at your local Ollama instance:
```bash
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"  # Any non-empty string works
```
Now run Codex with your local model:
```bash
codex --model deepseek-coder:6.7b "explain this function"
```
Setting Up LM Studio
LM Studio provides a graphical interface for managing local models and includes an OpenAI-compatible server.
Installation
- Download LM Studio from lmstudio.ai
- Install and launch the application
- Search for and download a coding model (recommended: DeepSeek-Coder, CodeLlama, or Qwen2.5-Coder)
Start the Local Server
- Click the Local Server tab (left sidebar)
- Select your downloaded model
- Click Start Server
- Note the server URL (default: http://localhost:1234/v1)
Configure Codex CLI for LM Studio
```bash
export OPENAI_API_BASE="http://localhost:1234/v1"
export OPENAI_API_KEY="lm-studio"  # Any non-empty string works
```
Run Codex specifying the model name as shown in LM Studio:
codex --model "deepseek-coder-6.7b-instruct" "add error handling to this code"
Configuration Options
Permanent Configuration
Add local model settings to your Codex config file:
`~/.codex/config.toml`:

```toml
# Use local model by default
model_provider = "oss"
model = "deepseek-coder:6.7b"

# Or configure a custom provider
[model_providers.local]
base_url = "http://localhost:11434/v1"
api_key = "ollama"

# Create profiles for different setups
[profiles.local]
model_provider = "local"
model = "deepseek-coder:6.7b"

[profiles.cloud]
model_provider = "openai"
model = "gpt-5.2-codex"
```
Use profiles to switch between local and cloud:
```bash
codex --profile local "simple task"
codex --profile cloud "complex refactoring"
```
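If you switch back and forth a lot, a pair of shell aliases (the names here are arbitrary) keeps the distinction to a couple of keystrokes:

```bash
# In ~/.bashrc or ~/.zshrc
alias cxl='codex --profile local'   # routine, private tasks
alias cxc='codex --profile cloud'   # heavier refactoring work
```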
Environment Variables
For temporary configuration without modifying config files:
```bash
# Ollama
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"

# LM Studio
export OPENAI_API_BASE="http://localhost:1234/v1"
export OPENAI_API_KEY="lm-studio"
```
Recommended Models for Coding
Choose your model based on available hardware and task complexity:
| Model | Size | VRAM/RAM | Best For |
|---|---|---|---|
| deepseek-coder:6.7b | ~4GB | 8GB | Quick tasks, explanations |
| codellama:13b-instruct | ~8GB | 16GB | General coding assistance |
| qwen2.5-coder:14b | ~9GB | 16GB | Balanced quality and speed |
| deepseek-coder:33b | ~20GB | 32GB | Complex coding tasks |
| codellama:70b | ~40GB | 48GB+ | Approaching cloud quality |
For code-specific tasks, prioritize models with "coder" or "code" in the name. These are fine-tuned on programming data and significantly outperform general-purpose models at coding tasks.
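As a rough way to map the table onto your machine, a short script can suggest a tier from installed RAM (a sketch for Linux and macOS; the thresholds mirror the table above and are approximations, not hard limits):

```bash
#!/usr/bin/env bash
# Suggest a local coding model tier based on total system RAM.
if [[ "$(uname)" == "Darwin" ]]; then
  ram_gb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
else
  ram_gb=$(( $(grep MemTotal /proc/meminfo | awk '{print $2}') / 1024 / 1024 ))
fi

if   (( ram_gb >= 48 )); then echo "Try codellama:70b or deepseek-coder:33b"
elif (( ram_gb >= 32 )); then echo "Try deepseek-coder:33b"
elif (( ram_gb >= 16 )); then echo "Try codellama:13b-instruct or qwen2.5-coder:14b"
else                          echo "Stick with deepseek-coder:6.7b"
fi
```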
Performance Comparison
Local models are improving rapidly but still trail cloud models for complex tasks:
| Task Type | Local Model Quality | Cloud Model Quality |
|---|---|---|
| Code explanation | Good | Excellent |
| Simple bug fixes | Good | Excellent |
| Documentation | Good | Excellent |
| Multi-file refactoring | Fair | Excellent |
| Complex architecture | Fair | Excellent |
| Security analysis | Poor | Good |
Use local models for routine tasks and switch to cloud for complex work.
Troubleshooting
Connection Refused Error
If Codex cannot connect to your local server:
- Verify the server is running: `curl http://localhost:11434/v1/models`
- Check that the port is not blocked by a firewall
- Ensure OPENAI_API_BASE includes the `/v1` suffix
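If that curl fails, it also helps to confirm that anything is listening on the expected port at all (11434 for Ollama, 1234 for LM Studio):

```bash
# Linux
ss -ltn | grep 11434

# macOS
lsof -iTCP:11434 -sTCP:LISTEN
```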
Model Not Found
If the model is not recognized:
- List available models with `ollama list`, or check the LM Studio UI
- Use the exact model name, including the version tag
- Pull the model first: `ollama pull model-name`
Slow Response Times
If responses are too slow:
- Use a smaller quantized model (Q4 instead of Q8)
- Reduce context length in model settings (see the Modelfile sketch after this list)
- Ensure GPU acceleration is enabled if available
- Close other memory-intensive applications
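With Ollama, one way to shrink the context window is to derive a variant of the model from a Modelfile; the `num_ctx` value and the derived model name below are just examples:

```bash
# Write a Modelfile that shrinks the context window
cat > Modelfile <<'EOF'
FROM deepseek-coder:6.7b
PARAMETER num_ctx 2048
EOF

# Build the smaller-context variant and use it with Codex
ollama create deepseek-coder-small-ctx -f Modelfile
codex --model deepseek-coder-small-ctx "explain this function"
```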
Out of Memory Errors
If you see memory errors:
- Switch to a smaller model
- Use a more aggressively quantized version (Q4_K_M)
- Reduce the context window size
- Enable swap/virtual memory (slower but works)
Hybrid Approach
The most practical setup uses local models for simple tasks and cloud models for complex work:
```toml
# ~/.codex/config.toml
# Default to local for privacy
model_provider = "oss"
model = "deepseek-coder:6.7b"

[profiles.cloud]
model_provider = "openai"
model = "gpt-5.2-codex"
```
Daily workflow:
```bash
# Quick local tasks (free, private)
codex "explain this function"
codex "add a docstring"

# Complex tasks (cloud quality)
codex --profile cloud "refactor this module to use dependency injection"
```
This approach maximizes privacy and cost savings while maintaining access to cloud-quality assistance when needed.
Next Steps
- Explore the Ollama model library for more coding models
- Learn about quantization formats to optimize memory usage
- Compare with Claude Code, which requires cloud access but offers superior reasoning
- Read about Codex configuration options