
AI Gateway Guide: What They Are, Why You Need One, and How to Choose

A comprehensive guide to AI gateways — the proxy layer between your app and LLM providers. Compare Cloudflare AI Gateway, Portkey, Helicone, LiteLLM, AWS Bedrock, Azure APIM, and more across pricing, features, and architecture.

By InventiveHQ Team

Introduction

If your application calls an LLM API, you have a problem — even if you don't know it yet.

When you call OpenAI, Anthropic, or Google directly, every request is a black box. You can't easily see how much each call costs, you can't cache identical prompts to save money, and when a provider goes down at 2 AM, your app goes down with it. Scale to multiple providers and the problem multiplies: different API formats, different authentication, different error handling, different billing.

An AI gateway sits between your application and your AI providers. It's a proxy layer that gives you a single integration point with unified observability, caching, fallbacks, security, and cost tracking — regardless of which models you use behind it. Think of it as a reverse proxy purpose-built for LLM traffic.

The category has exploded since 2024. There are now 12+ serious options ranging from free open-source libraries to enterprise cloud services. This guide explains what AI gateways do, why they matter, and how to evaluate the growing field of options. It's the first in a series — we'll follow up with in-depth head-to-head comparisons of individual gateways.


What Is an AI Gateway?

An AI gateway is a proxy layer that sits between your application and one or more AI model providers. Instead of calling api.openai.com directly, your application calls the gateway, which forwards the request to the provider while adding observability, caching, security, and routing logic.

┌──────────────────┐
│ Your Application │
└────────┬─────────┘
         │
         ▼
┌─────────────────────────────────┐
│           AI Gateway            │
│  ┌────────────┐ ┌────────────┐  │
│  │  Caching   │ │ Analytics  │  │
│  ├────────────┤ ├────────────┤  │
│  │ Rate Limit │ │ Guardrails │  │
│  ├────────────┤ ├────────────┤  │
│  │ Fallbacks  │ │ Cost Track │  │
│  └────────────┘ └────────────┘  │
└────────┬───────────────┬────────┘
         │               │
    ┌────▼────┐    ┌─────▼─────┐
    │ OpenAI  │    │ Anthropic │   ...
    └─────────┘    └───────────┘

The integration is usually minimal. Most gateways are OpenAI API-compatible, meaning you swap the base URL in your SDK client and everything else stays the same. A typical change looks like:

# Before: direct to OpenAI
from openai import OpenAI

client = OpenAI(api_key="sk-...")

# After: through an AI gateway -- only the base URL changes
client = OpenAI(
    api_key="sk-...",
    base_url="https://gateway.ai.cloudflare.com/v1/{account}/{gateway}/openai"
)

The gateway handles the rest — forwarding to the provider, applying your policies, logging the request, and returning the response.


Why Use an AI Gateway?

Unified API Across Providers

Every AI provider has its own API format, authentication scheme, and SDK. OpenAI uses a messages array, Anthropic structures messages and system prompts differently, and Google uses yet another format. An AI gateway normalizes these into a single API (typically the OpenAI chat completions format) so you can swap models and providers without rewriting application code.

This matters more than it seems. When a new model drops (and they drop weekly now), you want to test it without touching your codebase. A gateway turns model switching into a configuration change.
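In code terms, the win is that the model becomes configuration rather than a hard-coded string at every call site. A minimal sketch of the idea (the task names and model IDs here are illustrative, not any gateway's actual API):

```python
# Hypothetical sketch: behind a unified gateway API, swapping models
# is a config edit. Task names and model IDs are illustrative.
MODEL_CONFIG = {
    "summarize": "gpt-4o-mini",
    "generate": "claude-3-5-sonnet",
}

def model_for(task: str) -> str:
    """Resolve a task to a model ID from config, not hard-coded strings."""
    return MODEL_CONFIG[task]
```

Testing next week's model against this week's becomes a one-line change to `MODEL_CONFIG` (or, with gateways that support server-side routing, no code change at all).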

Cost Tracking and Budgets

AI API costs are notoriously hard to predict. A single GPT-4 call with a large context window can cost $0.50+. Without visibility, bills compound fast.

AI gateways track token usage, estimate costs per request, and aggregate spending by model, team, or application. Some (Portkey, LiteLLM, Cloudflare) support budget limits that reject requests once a threshold is reached — preventing a runaway loop from draining your API credits overnight.
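Conceptually, a budget cap is a pre-flight check on every request. A minimal sketch of the logic, assuming per-request cost estimates are available (names and numbers are illustrative, not any specific gateway's API):

```python
# Hypothetical sketch of a hard budget cap, as enforced by gateways
# like Portkey or LiteLLM. Names and thresholds are illustrative.
class BudgetGuard:
    def __init__(self, monthly_limit_usd: float):
        self.limit = monthly_limit_usd
        self.spent = 0.0

    def authorize(self, estimated_cost_usd: float) -> bool:
        """Reject the request if it would push spend past the cap."""
        if self.spent + estimated_cost_usd > self.limit:
            return False
        self.spent += estimated_cost_usd
        return True
```

In a real gateway the spend counter is shared state (keyed per team or API key) rather than an in-process variable, but the decision is the same: check before forwarding.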

Caching

Identical prompts return identical responses. If your app sends the same system prompt, the same FAQ question, or the same classification task repeatedly, a cache can return the previous response instantly — saving both money and latency.

Most gateways today support exact-match caching: the prompt must be byte-for-byte identical. Cache hits can reduce latency by 90%+ and eliminate the token cost entirely. Azure APIM goes further with semantic caching via Redis vector search, where semantically similar (but not identical) prompts can return cached results. Others have semantic caching on their roadmaps.
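Under the hood, exact-match caching amounts to hashing the full request payload and keying a store on it, which is why a single changed byte is a miss. A simplified in-memory sketch (real gateways back this with Redis or an edge KV store):

```python
# Sketch of exact-match response caching: the cache key is a hash of
# the canonicalized request, so any byte-level difference is a miss.
import hashlib
import json

cache: dict[str, str] = {}

def cache_key(model: str, messages: list[dict]) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model, messages, call_provider):
    key = cache_key(model, messages)
    if key in cache:
        return cache[key]          # hit: no tokens billed, near-zero latency
    response = call_provider(model, messages)
    cache[key] = response
    return response
```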

Fallbacks and Reliability

Provider outages happen. OpenAI has had multiple significant outages, Anthropic has rate limits that kick in under load, and regional providers can be intermittent. An AI gateway lets you define a fallback chain:

  1. Try Claude on Anthropic
  2. If that fails, try Claude on AWS Bedrock
  3. If that fails, try GPT-4o on OpenAI

The gateway handles retries and failover automatically. Your application sees a single successful response (or a final error) — never the intermediate failures. This is the single most compelling reason to adopt a gateway for production applications.
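Conceptually, the gateway runs a loop like this on your behalf. A simplified sketch mirroring the chain above (real implementations also distinguish retryable errors like timeouts and 429s from permanent ones):

```python
# Sketch of an automatic fallback chain: try each provider/model
# pair in order and return the first success.
def call_with_fallbacks(messages, chain, call):
    """chain: ordered (provider, model) pairs; call: a function that
    raises on failure. Returns the first successful response."""
    last_error = None
    for provider, model in chain:
        try:
            return call(provider, model, messages)
        except Exception as err:   # in practice: timeouts, 429s, 5xx
            last_error = err
    raise last_error               # every provider failed

CHAIN = [
    ("anthropic", "claude-3-5-sonnet"),
    ("bedrock", "claude-3-5-sonnet"),
    ("openai", "gpt-4o"),
]
```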

Rate Limiting and Abuse Prevention

If your AI features are user-facing, you need rate limiting. Without it, a single user (or bot) can exhaust your API quota and budget. AI gateways offer rate limiting at the gateway level — per API key, per user, or per endpoint — separate from the provider's own rate limits.
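Per-user limiting is typically a sliding-window or token-bucket check at the gateway, evaluated before the request ever reaches a provider. A minimal sliding-window sketch (in-memory; production gateways use a shared store so limits hold across instances):

```python
# Sketch of gateway-level per-user rate limiting (sliding window).
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max = max_requests
        self.window = window_seconds
        self.hits = defaultdict(list)   # user_id -> request timestamps

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        recent = [t for t in self.hits[user_id] if now - t < self.window]
        self.hits[user_id] = recent
        if len(recent) >= self.max:
            return False                # over the limit: reject
        recent.append(now)
        return True
```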

Observability and Logging

When something goes wrong with an AI call (wrong output, high latency, unexpected cost), you need logs. AI gateways automatically log requests and responses with metadata: latency, token counts, model used, cost estimate, and error codes. This gives you a searchable history of every AI interaction your application makes.

Security and Guardrails

AI gateways increasingly include content safety features: scanning prompts for PII before they reach the provider, filtering unsafe model outputs, and blocking prompt injection attempts. Cloudflare AI Gateway uses Llama Guard for content moderation. Azure APIM integrates with Azure AI Content Safety. Portkey offers 50+ built-in guardrails.

For regulated industries, guardrails aren't optional — they're a compliance requirement.

Model Routing and A/B Testing

Want to compare GPT-4o vs Claude 3.5 on real traffic? Gateways with dynamic routing let you split traffic by percentage, route different user segments to different models, or chain models in sequence (use a fast model for classification, then a powerful model for generation). Cloudflare, Portkey, and LiteLLM all support this.
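Percentage-based splitting reduces to weighted random selection per request. A minimal sketch of the routing decision (model IDs illustrative; real gateways configure this declaratively rather than in application code):

```python
# Sketch of percentage-based traffic splitting for A/B tests.
import random

def pick_model(weights: dict[str, float], rng=random.random) -> str:
    """weights: model -> fraction of traffic (fractions sum to 1.0)."""
    r = rng()
    cumulative = 0.0
    for model, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return model
    return model   # guard against float rounding on the last bucket

split = {"gpt-4o": 0.5, "claude-3-5-sonnet": 0.5}
```

Tag each response with the model that served it, and you can compare quality, latency, and cost across the two arms on real traffic.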


When You Don't Need an AI Gateway

Not every project needs a gateway. Skip it if:

  • You're prototyping or experimenting. Direct API calls are simpler. Don't add infrastructure before you've validated the product.
  • You use a single provider with low volume. If you're making 100 OpenAI calls per day for an internal tool, the overhead of a gateway isn't justified. Use the provider's built-in dashboard for analytics.
  • Your framework already handles it. Tools like Vercel AI SDK, LangChain, and LlamaIndex have their own provider abstraction and retry logic. Adding a gateway on top may be redundant.
  • You have strict data residency requirements and can't use any intermediary. Some compliance regimes require point-to-point communication with no proxies in between (though self-hosted gateways solve this).

The tipping point usually comes when you hit two or more of: multiple providers, production traffic, cost concerns, reliability requirements, or team-level usage tracking.


The AI Gateway Landscape

The market splits into three categories, each with different trade-offs.

Managed / SaaS Gateways

Hosted services where you sign up, get a URL, and start routing traffic. Zero infrastructure to manage.

  • Cloudflare AI Gateway — Free core, built on Cloudflare's edge network
  • Portkey — Enterprise-focused with the broadest model coverage
  • Helicone — Observability-first with open-source roots
  • Vercel AI Gateway — Tightly integrated with the Next.js/Vercel ecosystem
  • OpenRouter — Marketplace-style access to 290+ models

Cloud-Native Gateways

AI gateway capabilities built into major cloud platforms. Best if you're already invested in that ecosystem.

  • AWS Bedrock + AgentCore — Consumption-based, deep AWS integration
  • Azure API Management — AI policies added to Microsoft's API management platform
  • Google Apigee + Vertex AI — Enterprise API management with AI extensions

Open-Source / Self-Hosted Gateways

Deploy on your own infrastructure for maximum control and data sovereignty.

  • LiteLLM — The most popular open-source option (33K+ GitHub stars)
  • Bifrost — Go-based, built for extreme performance
  • Braintrust Proxy — Free proxy with eval platform integration
  • Kong AI Gateway — AI extensions on top of Kong's API management

High-Level Comparison

| Gateway | Type | Pricing | Providers | Open Source | Self-Host | Best For |
|---|---|---|---|---|---|---|
| Cloudflare AI Gateway | Managed | Free core | 24 native + custom | No | No | Teams on Cloudflare, free tier |
| Portkey | Managed | Free tier, Pro custom | 1,600+ LLMs | Gateway OSS | Yes (Enterprise) | Enterprise governance |
| Helicone | Managed | Free / $79 / $799 mo | 100+ | Yes (Apache 2.0) | Yes | Observability-focused |
| Vercel AI Gateway | Managed | $5 free credit/mo | 100+ models | No | No | Next.js / Vercel apps |
| OpenRouter | Managed | Pay-per-token | 290+ models | No | No | Model exploration |
| AWS Bedrock | Cloud-native | Consumption-based | 20+ Bedrock models | No | No | AWS-native enterprises |
| Azure APIM | Cloud-native | $48–$2,700+/mo | Azure OpenAI + external | No | No | Microsoft / Azure shops |
| Google Apigee | Cloud-native | Enterprise pricing | Vertex AI + external | No | No | Google Cloud enterprises |
| LiteLLM | Self-hosted | Free (MIT) / $250 mo | 100+ providers | Yes (MIT) | Yes | Max flexibility, self-host |
| Bifrost | Self-hosted | Free (Apache 2.0) | 1,000+ models | Yes (Apache 2.0) | Yes | Performance-critical |
| Braintrust Proxy | Self-hosted | Free | 100+ models | Yes | Yes | Dev/testing + evals |
| Kong AI Gateway | Self-hosted | ~$105/mo per service | 10+ major providers | Core OSS | Yes | Existing Kong users |

Managed Gateway Profiles

Cloudflare AI Gateway

Cloudflare's offering stands out for one reason: the core is genuinely free. Caching, rate limiting, analytics, fallbacks, and guardrails cost nothing on the free tier — you get 100,000 requests per day and 100,000 persistent logs per month. The $5/month Workers Paid plan raises those to 10 million requests and 1 million logs.

The gateway runs on Cloudflare's global edge network (310+ cities), so the proxy layer adds minimal latency. Setup is a URL swap — point your SDK's base URL to gateway.ai.cloudflare.com and you're running. It natively supports 24 providers including OpenAI, Anthropic, Google, AWS Bedrock, and Mistral, plus custom provider support for any HTTPS API.

Strengths: Free tier is unmatched. Edge-deployed globally. Guardrails powered by Llama Guard. Dynamic routing with visual UI. Unified billing (closed beta) to pay multiple providers through one Cloudflare invoice.

Limitations: Log storage caps (10M total per gateway, then logging stops). No semantic caching yet (exact match only). Analytics are aggregate-level — no deep tracing, team-level cost attribution, or RBAC. Observability is shallower than dedicated observability platforms like Helicone.

Best for: Teams already on Cloudflare, startups wanting a free production-ready gateway, and projects that need basic caching/fallback without operational overhead.

Portkey

Portkey is the most feature-rich dedicated AI gateway, targeting teams that need production governance at scale. It raised a $15M Series A in February 2026 and claims 1,600+ LLMs across its supported providers.

The platform includes everything: unified API, automatic fallbacks, semantic caching, 50+ guardrails, prompt management with versioning, observability with logs and traces, and a FinOps dashboard for cost tracking. Enterprise plans add RBAC, SSO, hierarchical budget controls, and SOC 2/HIPAA compliance.

Strengths: Broadest model coverage. Production-grade governance with per-team budgets and access controls. Semantic caching with unlimited TTL on Pro. Three deployment modes (SaaS, hybrid, air-gapped). Recently made core gateway functionality free.

Limitations: Pro plan pricing is custom (opaque). Log retention limited to 30 days on Pro. No hard budget caps — potential for bill shock. The breadth of features means a steeper learning curve.

Best for: Mid-to-large teams running AI in production that need governance, compliance, and multi-team cost controls.

Helicone

Helicone takes an observability-first approach. Built in Rust for performance (1-5ms P95 latency overhead), it automatically logs every AI request with cost tracking, latency metrics, and error monitoring. The gateway functionality — fallbacks, load balancing, caching — was added on top of the observability core.

It's open source under Apache 2.0 and can be self-hosted. The managed service starts free (10K requests/month) with Pro at $79/month and Team at $799/month. Special pricing exists for startups and educators.

Strengths: Rust-based performance. Observability depth (every request auto-logged and analyzed). Health-aware load balancing (Power of Two Choices with PeakEWMA). Open source with self-hosting option. Zero markup on provider pricing.

Limitations: Free tier has only 7-day data retention. Gateway features are still in beta. Primarily an observability platform — gateway capabilities are newer and less mature than dedicated gateways. Pro plan has 1-month retention only.

Best for: Teams that prioritize understanding their AI usage patterns and costs, and want open-source flexibility.

Vercel AI Gateway

Vercel's gateway is purpose-built for the Next.js and Vercel ecosystem. It offers a unified API for 100+ models with zero markup on token pricing, sub-20ms routing latency, and automatic failover based on real-time provider health.

Integration is seamless if you're already on Vercel — it works natively with the Vercel AI SDK and deploys alongside your application. Every Vercel account gets $5/month in free AI credits.

Strengths: Zero markup pricing. Dynamic provider routing based on real-time uptime and latency. Tightest integration with Next.js and the AI SDK. Simple DX for Vercel users.

Limitations: Payload limit of 4.5 MB. 5-minute function timeouts (13 min max with Fluid Compute). No semantic caching. No self-hosted option. Platform lock-in to Vercel. Not viable outside the Vercel ecosystem.

Best for: Teams building AI features in Next.js on Vercel. Not a general-purpose gateway.

OpenRouter

OpenRouter is less a gateway and more an AI model marketplace. It provides a unified OpenAI-compatible API to 290+ models from every major provider, including dozens of free models. There's no monthly fee — you pay per token at (approximately) provider rates.

It's popular for model exploration and rapid prototyping because you can access any model instantly without setting up individual provider accounts.

Strengths: Largest model catalog (290+). No account required for basic usage. Free model tier. Immediate access to new models. Pay-per-token with no commitments.

Limitations: Adds 25-40ms latency per request. Limited observability — no deep logging or cost analytics. No self-hosting. No guardrails or content moderation. Not designed for production governance. Some reports of 5% markup despite zero-markup claims.

Best for: Developers exploring models, hackathon projects, and applications that need access to a wide variety of models without individual provider accounts.


Cloud-Native Gateway Profiles

AWS Bedrock + AgentCore Gateway

AWS doesn't have a standalone "AI gateway" product. Instead, gateway capabilities are spread across Amazon Bedrock (model hosting and access), API Gateway (proxy and throttling), and the newer Bedrock AgentCore Gateway (tool/API discovery for AI agents).

Bedrock itself provides access to 20+ model families (Claude, Llama, Mistral, Nova, and more) with pay-per-token pricing and provisioned throughput options. AgentCore Gateway adds MCP proxy support, transforming existing REST APIs into agent-consumable endpoints. For multi-provider routing beyond Bedrock-hosted models, teams typically integrate LiteLLM.

Strengths: Deep AWS ecosystem integration (IAM, VPC, CloudWatch, Lambda). Enterprise security with managed identities and VPC isolation. Provisioned throughput for guaranteed capacity. 9 AWS regions. MCP proxy capability.

Limitations: Not a unified AI gateway product — requires assembling from multiple services. Complex pricing across Bedrock + API Gateway + CloudWatch + data transfer. Multi-provider routing requires additional tooling. Vendor lock-in to AWS.

Best for: Organizations deeply invested in AWS that want to manage AI model access through their existing IAM, networking, and compliance infrastructure.

Azure API Management (AI Gateway)

Microsoft's approach adds AI-specific policies to Azure API Management (APIM) — their existing enterprise API gateway. It's not a separate product but a set of LLM-aware capabilities: token-based rate limiting, semantic caching via Azure Managed Redis, load balancing with circuit breakers, and integration with Azure AI Content Safety.

The recently announced Microsoft Foundry integration provides centralized governance across models, agents, and tools. APIM also supports MCP server governance and A2A agent API management (preview).

Strengths: Semantic caching (the only major gateway with production-ready vector similarity caching). Deep Azure/Entra ID integration. Policy-based configuration for granular control. Enterprise networking (VNET injection, private endpoints). MCP and A2A governance.

Limitations: High fixed costs — Standard V2 starts at ~$700/month, Premium at $2,700+/month per unit. Not AI-native — requires learning APIM's XML policy language. Semantic caching needs a separate Redis Enterprise deployment. Steep learning curve. Primarily optimized for Azure OpenAI; other providers need more manual setup.

Best for: Microsoft/Azure enterprises that already use or plan to use Azure API Management. The semantic caching capability is a genuine differentiator for high-volume, high-repetition workloads.

Google Apigee + Vertex AI

Google's approach mirrors Azure's: add AI capabilities to Apigee, their existing enterprise API management platform. Model Armor (public preview) provides LLM governance with prompt validation and output filtering. Vertex AI serves as the model hosting layer, with the Model Garden offering third-party models alongside Google's Gemini family.

GKE Inference Gateway handles optimized load balancing for self-hosted inference workloads. Apigee adds semantic caching, rate limiting, and intelligent multi-model routing.

Strengths: Model Armor for built-in LLM safety governance. Apigee is a Gartner Leader for API management. Strong ML/AI ecosystem (Vertex AI, TPUs). GKE Inference Gateway for self-hosted models.

Limitations: Requires combining multiple products (Apigee + Vertex AI + GKE). Apigee is expensive enterprise software. Multi-provider routing beyond Google-hosted models requires significant configuration. Key features (Feature Templater, API spec boosting) still in preview.

Best for: Google Cloud enterprises using Vertex AI that need API-level governance for their AI traffic.


Open-Source Gateway Profiles

LiteLLM

The most popular open-source AI gateway with 33,000+ GitHub stars. LiteLLM is a Python SDK and proxy server that provides an OpenAI-compatible API for 100+ providers. It's the default choice for teams that want self-hosted model abstraction.

Features include cost tracking per project/team/key, load balancing, automatic fallbacks, guardrails, virtual key management, and MCP gateway support. The admin UI provides configuration and monitoring. Enterprise plans ($250/month) add Prometheus metrics, JWT auth, SSO, and audit logs.

Strengths: Widest open-source adoption. Deepest provider support. Full control over data. Plugin architecture for custom logic. MIT licensed.

Limitations: Python-based, so higher latency overhead than Rust or Go alternatives. Self-hosting requires operational effort (~$200-500/month infrastructure). Admin UI and documentation can be rough. Enterprise features require a paid license.

Best for: Teams that want maximum flexibility, need self-hosting for data sovereignty, or want to customize gateway behavior with plugins.

Bifrost

Built by Maxim AI in Go, Bifrost is designed for raw performance. It claims less than 11 microseconds of overhead at 5,000 requests per second — roughly 50x faster than Python-based alternatives. It supports 1,000+ models across 15+ providers.

Key features include adaptive load balancing, cluster mode for horizontal scaling, semantic caching, hierarchical cost controls, and MCP client/server support. Apache 2.0 licensed.

Strengths: Fastest measured latency of any gateway. Written in Go for minimal resource consumption. Fully open source with no commercial license. Semantic caching. MCP security model ("suggest, don't execute").

Limitations: Relatively new — less battle-tested than LiteLLM. Smaller community. Most performance benchmarks come from Maxim's own marketing. Fewer third-party integrations.

Best for: High-throughput, latency-sensitive applications where every microsecond matters. Teams with Go expertise that want a lightweight, fast gateway.

Braintrust Proxy

A free AI proxy that doubles as the integration point for Braintrust's evaluation platform. It provides an OpenAI-compatible API across 100+ models with automatic caching (AES-GCM encrypted, sub-100ms hits), load balancing, and observability through Braintrust's tracing.

Strengths: Completely free, even without a Braintrust account. Tightly integrated with evals and logging. WebSocket/Realtime API support. Unified reasoning model parameters across providers.

Limitations: Explicitly "intended for development and testing" — no production SLAs. Subject to rate limiting and service interruptions. Limited governance features. A complement to the eval platform, not a standalone gateway.

Best for: Development and testing environments. Teams already using or evaluating Braintrust for LLM evaluation.

Kong AI Gateway

Kong adds AI-specific capabilities on top of their mature API management platform (38,000+ GitHub stars). Features include LLM routing, semantic caching, RAG pipeline automation at the gateway layer, PII sanitization across 12 languages, and usage analytics.

Available as managed SaaS (Konnect) or self-hosted Enterprise. Pricing is consumption-based at approximately $105/month per Gateway Service.

Strengths: Built on a battle-tested API management platform. PII sanitization in 12 languages. RAG pipeline automation. Both SaaS and self-hosted options.

Limitations: Traditional API management pricing (not optimized for AI-native teams). Requires Kong expertise. Added AI features increase configuration complexity.

Best for: Teams already using Kong for API management that want to extend it to AI traffic.


How to Choose

There's no single best AI gateway — the right choice depends on your constraints. Here's a decision framework.

Start with your primary constraint

| If your priority is… | Consider |
|---|---|
| Free / lowest cost | Cloudflare AI Gateway (free core), LiteLLM (MIT), Bifrost (Apache 2.0) |
| Zero operational overhead | Cloudflare, Vercel, OpenRouter |
| Enterprise governance (RBAC, budgets, compliance) | Portkey, Azure APIM, AWS Bedrock |
| Deep observability | Helicone, Portkey |
| Maximum provider coverage | Portkey (1,600+ LLMs), OpenRouter (290+ models) |
| Performance / low latency | Bifrost (<11μs), Helicone (1-5ms), Cloudflare (edge) |
| Data sovereignty / self-hosting | LiteLLM, Bifrost, Kong, Helicone |
| Semantic caching | Azure APIM, Bifrost, Kong |
| Existing cloud ecosystem | AWS Bedrock (AWS), Azure APIM (Azure), Apigee (GCP) |
| Next.js / Vercel stack | Vercel AI Gateway |

Then narrow by team size

  • Solo developer / startup: Cloudflare AI Gateway (free) or OpenRouter (pay-per-token, no commitment). Add LiteLLM if you need self-hosting.
  • Small team (5-20 engineers): Cloudflare or Helicone for managed observability. LiteLLM if self-hosting matters.
  • Mid-size (20-100 engineers): Portkey for governance and multi-team cost tracking. Azure APIM if you're a Microsoft shop. Helicone Team for observability depth.
  • Enterprise (100+ engineers): Portkey Enterprise, Azure APIM Premium, or AWS Bedrock — depending on your cloud. Kong if you already run it for API management.

Questions to ask before committing

  1. How many providers do you use today? If just one, a gateway is a nice-to-have. If two or more, it's nearly essential.
  2. What's your monthly AI spend? Under $100/month, the free tier of Cloudflare or LiteLLM is plenty. Over $10K/month, invest in a gateway with real cost attribution.
  3. Where does your data live? If you can't send prompt data through a third party, self-hosted is your only option.
  4. Do you need to prove compliance? SOC 2, HIPAA, and similar requirements narrow the field to Portkey Enterprise, Azure APIM, AWS Bedrock, or self-hosted options.
  5. How fast are you growing? If you'll outgrow the free tier quickly, factor in the paid tier pricing now. Cloudflare's $5/month Workers Paid is cheap. Azure APIM's $700+/month Standard V2 is not.

What's Next in This Series

This guide gives you the landscape. In upcoming posts, we'll go deeper with head-to-head comparisons:

  • Cloudflare AI Gateway vs Portkey vs Helicone — Managed gateway showdown with feature-by-feature analysis
  • LiteLLM vs Bifrost vs Kong — Open-source gateway comparison with benchmarks
  • AI Gateway for Enterprise — AWS Bedrock vs Azure APIM vs Google Apigee for regulated industries
  • Building an AI Gateway Strategy — Architecture patterns for production AI applications

Each comparison will include real configuration examples, cost modeling, and decision frameworks specific to each matchup.

Is your cloud secure? Find out free.

Get a complimentary cloud security review. We'll identify misconfigurations, excess costs, and security gaps across AWS, GCP, or Azure.