Question 1

Is it cheaper to self-host an LLM or use an API?

Accepted Answer

It depends almost entirely on volume and which API you are replacing. **Low volume** (under ~50M tokens/month): APIs win — even budget hardware never pays for itself. **High volume against frontier APIs** (Claude Opus, GPT-5.4 Pro): self-hosting can pay for itself in weeks. **High volume against budget APIs** (Gemini Flash-Lite, Groq-hosted Llama): APIs usually still win, because providers run hardware at near-perfect utilization and you cannot. The honest comparison is against open-model hosting (Groq/Together), not against frontier models — a self-hosted Llama is not a GPT-5 replacement.
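
The volume-versus-price reasoning above can be sketched as a payback-period calculation. This is a minimal sketch, not the calculator's actual implementation: the function name `payback_months`, the per-million-token prices, the 200M tokens/month volume, and the 25 dollars/month running cost are all hypothetical placeholders chosen for illustration.

```python
from typing import Optional


def payback_months(hardware_cost: float,
                   tokens_per_month: float,
                   api_price_per_million: float,
                   monthly_running_cost: float) -> Optional[float]:
    """Months until the hardware pays for itself, or None if it never does."""
    monthly_api_cost = tokens_per_month / 1_000_000 * api_price_per_million
    monthly_saving = monthly_api_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None  # the API is cheaper every month, so there is no break-even
    return hardware_cost / monthly_saving


# All inputs below are hypothetical placeholders, not quoted provider prices.
HARDWARE = 4_700        # dollars, one-time purchase
VOLUME = 200_000_000    # tokens per month
RUNNING = 25            # dollars per month (electricity and upkeep)

vs_frontier = payback_months(HARDWARE, VOLUME, 10.00, RUNNING)
vs_budget = payback_months(HARDWARE, VOLUME, 0.10, RUNNING)

print(vs_frontier)  # a little over two months with these inputs
print(vs_budget)    # None: the budget API costs less than running the box
```

With the same hardware and volume, the result flips from a payback of a couple of months to no payback at all purely because of which API price is plugged in, which is why the choice of comparison matters so much.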

Question 2

What does it actually cost to run an LLM on my own hardware?

Accepted Answer

Three components: **hardware** (a used RTX 3090 at ~700 dollars to a Mac Studio at ~4,700 dollars, amortized over its useful life), **electricity** (a 350-450W GPU running a few hours a day costs 5-25 dollars/month at typical US rates), and **your time** (setup, updates, debugging — the hidden cost everyone forgets). For light personal use, electricity is nearly negligible; the hardware cost dominates.
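
The first two components are simple arithmetic, sketched below. The helper names, the 400 W draw, the 3 hours/day, the 0.16 dollars/kWh rate, and the 36-month useful life are assumptions for illustration; substitute your own figures. The cost of your time is deliberately left out because it has no general formula.

```python
def monthly_electricity_cost(watts: float, hours_per_day: float,
                             price_per_kwh: float, days: int = 30) -> float:
    """Electricity cost in dollars for one month of use."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


def monthly_hardware_cost(purchase_price: float, useful_life_months: int) -> float:
    """Straight-line amortization: the price spread evenly over the useful life."""
    return purchase_price / useful_life_months


# Assumed inputs: 400 W GPU, 3 hours/day, 0.16 dollars/kWh, 36-month life.
electricity = monthly_electricity_cost(400, 3, 0.16)
hardware = monthly_hardware_cost(700, 36)

print(f"electricity: {electricity:.2f} dollars/month")  # 5.76
print(f"hardware:    {hardware:.2f} dollars/month")     # 19.44
```

Under these assumptions the amortized hardware costs more than three times the electricity, which matches the point that hardware dominates for light personal use.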

Question 3

How many tokens per month can one GPU actually serve?

Accepted Answer

A single RTX 4090 cannot run Llama 3.3 70B at all: the model does not fit in the card's 24 GB of VRAM. With an 8B model at ~100 tokens/sec, one 4090 can theoretically generate ~260M tokens/month running 24/7 (100 tokens/sec × ~2.59M seconds in a 30-day month). At a realistic 30% utilization, that drops to ~80M tokens/month. Production serving frameworks such as vLLM batch requests and process many of them simultaneously, which multiplies aggregate throughput 5-10x. This calculator includes a capacity check that flags when your volume exceeds what the hardware can deliver.
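
The capacity arithmetic above can be reproduced in a few lines. This is an illustrative sketch only: the function names and the `batching_multiplier` parameter are hypothetical and are not taken from the calculator's own capacity check.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month = 2,592,000 seconds


def monthly_token_capacity(tokens_per_sec: float,
                           utilization: float = 1.0,
                           batching_multiplier: float = 1.0) -> float:
    """Tokens one GPU can generate per month under the given assumptions."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization * batching_multiplier


def exceeds_capacity(required_tokens: float, capacity_tokens: float) -> bool:
    """True when the requested monthly volume is more than the hardware can serve."""
    return required_tokens > capacity_tokens


theoretical = monthly_token_capacity(100)                # 24/7, single stream
realistic = monthly_token_capacity(100, utilization=0.3)

print(f"{theoretical / 1e6:.0f}M tokens/month")  # 259M
print(f"{realistic / 1e6:.0f}M tokens/month")    # 78M
print(exceeds_capacity(150_000_000, realistic))  # True
```

Passing a `batching_multiplier` between 5 and 10 models the batched-serving case described above.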

Question 4

Should I buy a GPU or rent cloud GPUs?

Accepted Answer

Rent if: your usage is bursty (training runs, batch jobs), you need datacenter GPUs (an H100 costs 25K+ dollars to buy but ~2.50 dollars/hr to rent), or you are validating an idea. Buy if: you have steady daily usage, consumer hardware covers your needs, and you will use it for 18+ months. The crossover is roughly 6-10 hours of daily use: below that, renting wins; above it, owning wins. Cloud spot and community pricing (RunPod, Vast.ai) has made renting much more competitive.
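
The crossover point can be estimated by asking how many hours of rental per day would cost as much as the amortized purchase. The sketch below makes that explicit. Every input is an assumption for illustration: a hypothetical 1,800 dollar consumer card kept for 18 months, a 0.45 dollars/hr community-cloud rental rate, a 0.4 kW power draw, and electricity at 0.16 dollars/kWh.

```python
import math


def crossover_hours_per_day(purchase_price: float,
                            ownership_months: int,
                            rental_rate_per_hour: float,
                            power_kw: float,
                            price_per_kwh: float,
                            days_per_month: int = 30) -> float:
    """Daily hours of use above which owning is cheaper than renting."""
    monthly_amortization = purchase_price / ownership_months
    # Each hour on owned hardware avoids the rental fee but still costs power.
    saving_per_hour = rental_rate_per_hour - power_kw * price_per_kwh
    if saving_per_hour <= 0:
        return math.inf  # renting costs less than the electricity alone
    return monthly_amortization / saving_per_hour / days_per_month


# Hypothetical inputs; replace them with your own quotes.
hours = crossover_hours_per_day(1_800, 18, 0.45, 0.4, 0.16)
print(f"{hours:.1f} hours/day")  # 8.6 hours/day
```

A result above 24 means owning cannot win within the chosen ownership period. Because the answer scales with both the rental rate and the purchase price, cheaper community-cloud pricing pushes the crossover toward more hours per day.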

Question 5

What about the quality difference between open models and APIs like GPT and Claude?

Accepted Answer

This is the elephant in the room: a self-hosted Llama 3.3 70B is roughly comparable to mid-tier API models, not to frontier models like Claude Opus or GPT-5.4 Pro. If your workload genuinely needs frontier capability, self-hosting is not an alternative — it is a different product. The fair comparisons are: self-hosting vs open-model hosting APIs (Groq, Together), or accepting the capability trade-off in exchange for privacy, control, and cost.

Question 6

What are the non-cost reasons to self-host?

Accepted Answer

**Privacy**: prompts and data never leave your infrastructure — relevant for healthcare, legal, and anything under NDA. **No rate limits**: your hardware, your queue. **Latency consistency**: no API outages or degraded performance during peak hours. **Compliance**: some regulations effectively require data to stay on-premises. **Predictable costs**: no surprise bills from a usage spike. For many organizations these matter more than the per-token math.

Question 7

Are these prices current?

Accepted Answer

API and cloud GPU prices in this calculator were verified in June 2026. LLM API prices have been falling roughly 80% year-over-year, so check provider pages for the latest. Hardware prices are street prices for new cards (used market is typically 30-50% less). We update this dataset periodically; the "as of" date is shown in the methodology note.

Self-Hosted LLM Cost Calculator

Compare Self-Hosting an LLM Against API Pricing

The Two Cost Models

What Drives the Break-Even

When to Use It

The Real Math of Self-Hosting

When Each Option Wins (June 2026 Pricing)

Frequently Asked Questions
