Cloud Solutions | SaaS / DevOps

Building an Incident Management Platform on Cloudflare's Edge

How we designed and built Alert24 — an all-in-one monitoring, on-call scheduling, incident management, and status page platform — entirely on Cloudflare's edge infrastructure, replacing a $200+/month multi-vendor stack with a single $24/month platform.

Tools Replaced: 3
Starting Price: $24/mo
Cold Starts: Zero
Origin Servers: None

Overview

Every growing SaaS company and DevOps team hits the same wall: monitoring lives in one tool, on-call scheduling in another, and status pages in a third. You're paying three vendors, managing three sets of credentials, and stitching together three different data models with webhooks and duct tape. When AWS goes down, someone has to manually check the AWS status page, create an incident, update your own status page, and notify the on-call engineer, all while customers are already tweeting about it. We wanted a single platform that handled the entire incident lifecycle, from detection through resolution to public communication, with one bill and one login.

So we built it. Alert24 is an all-in-one monitoring, on-call scheduling, incident management, and status page platform built entirely on Cloudflare's edge infrastructure. It replaces the PagerDuty + Pingdom + Statuspage stack with a single application — and adds a feature no one else offers: automatic cloud provider outage detection that updates your status page before your customers even notice.

The Challenges

Building a platform that replaces three established products required solving interconnected technical and product challenges — all while keeping the platform fast, affordable, and simple enough that a three-person startup could adopt it in an afternoon.

Real-Time Monitoring Without Origin Servers

Uptime monitoring demands consistent, low-latency checks from multiple geographic locations. Traditional monitoring platforms run dedicated check servers in each region — expensive infrastructure that sits idle between checks. The platform needed to execute HTTP, TCP, DNS, and SSL certificate checks on configurable intervals (down to 30 seconds) without maintaining always-on servers in multiple regions. It also needed to detect subtle failures like keyword absence, unexpected status codes, and certificate expiration windows — not just binary up/down states.
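To make the "not just binary up/down" requirement concrete, here is a minimal sketch of how a single HTTP observation could be graded against a check's configuration. It is an illustration under stated assumptions, not Alert24's actual code: the type names, the `evaluateHttpCheck` function, and the three-level status are all hypothetical.

```typescript
// Hypothetical shapes for illustration; none of these names come from Alert24.
type CheckStatus = "up" | "degraded" | "down";

interface HttpCheckConfig {
  expectedStatusCodes: number[]; // e.g. [200, 204]
  requiredKeyword?: string;      // response body must contain this text
  certWarningDays?: number;      // warn when the certificate expires within this window
}

interface HttpObservation {
  statusCode: number;
  body: string;
  certExpiresAt?: Date; // certificate expiry, if the check collected it
}

interface CheckVerdict {
  status: CheckStatus;
  reasons: string[];
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function evaluateHttpCheck(
  config: HttpCheckConfig,
  observed: HttpObservation,
  now: Date,
): CheckVerdict {
  const reasons: string[] = [];
  let status: CheckStatus = "up";

  // Unexpected status code: hard failure.
  if (!config.expectedStatusCodes.includes(observed.statusCode)) {
    status = "down";
    reasons.push(`unexpected status code ${observed.statusCode}`);
  }

  // Keyword absence: the server answered, but not with the expected content.
  if (
    config.requiredKeyword !== undefined &&
    !observed.body.includes(config.requiredKeyword)
  ) {
    status = "down";
    reasons.push(`keyword "${config.requiredKeyword}" missing from response body`);
  }

  // Certificate expiration window: warn before it becomes an outage.
  if (observed.certExpiresAt !== undefined && config.certWarningDays !== undefined) {
    const daysLeft = Math.floor(
      (observed.certExpiresAt.getTime() - now.getTime()) / MS_PER_DAY,
    );
    if (daysLeft < 0) {
      status = "down";
      reasons.push("certificate has expired");
    } else if (daysLeft <= config.certWarningDays) {
      if (status === "up") status = "degraded";
      reasons.push(`certificate expires in ${daysLeft} day(s)`);
    }
  }

  return { status, reasons };
}
```

Keeping the grading logic free of network calls means the same function can be exercised with recorded observations, independent of where the check itself runs.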

Multi-Tier Escalation with Time-Sensitive Delivery

When a service goes down at 3 AM, the difference between a five-minute response and a thirty-minute response is the difference between a blip and an outage. The platform needed multi-tier escalation policies that automatically promote incidents through notification levels when the current on-call engineer doesn't acknowledge within a configurable timeout. This required coordinating three notification channels — email, SMS, and voice calls — with precise timing, retry logic, and graceful degradation if one channel fails. A missed SMS can't block the voice call escalation.
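The timeout-driven promotion described here can be reduced to a small pure function. The sketch below assumes each tier carries its own acknowledgment timeout and that tiers are tried in order; `EscalationTier`, `EscalationState`, and `currentEscalationState` are hypothetical names chosen for illustration.

```typescript
// Hypothetical policy shape; the real escalation policy format is not shown in this case study.
interface EscalationTier {
  notify: string[];       // who is notified at this tier
  timeoutMinutes: number; // how long to wait for an acknowledgment before promoting
}

interface EscalationState {
  tierIndex: number;  // tier that should currently be active
  exhausted: boolean; // true once the last tier has also timed out
}

// Returns the tier an unacknowledged incident should be on after a given
// number of minutes, or null when no escalation is needed.
function currentEscalationState(
  tiers: EscalationTier[],
  minutesSinceTriggered: number,
  acknowledged: boolean,
): EscalationState | null {
  if (acknowledged || tiers.length === 0) return null;

  let deadline = 0;
  let index = 0;
  for (const tier of tiers) {
    deadline += tier.timeoutMinutes; // cumulative deadline for this tier
    if (minutesSinceTriggered < deadline) {
      return { tierIndex: index, exhausted: false };
    }
    index++;
  }
  // Every tier has timed out; stay on the last tier and flag it.
  return { tierIndex: tiers.length - 1, exhausted: true };
}
```

With tiers of 5, 10, and 15 minutes, an unacknowledged incident sits on the first tier until minute 5, the second until minute 15, and the third until minute 30.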

Cloud Provider Auto-Sync: The Feature No One Else Builds

The most common root cause of customer-facing outages isn't your own code — it's your cloud provider. AWS, Azure, and GCP all publish status pages, but they update slowly and inconsistently. Most teams find out about a cloud provider incident from Twitter, not from their monitoring tool. We needed the platform to continuously monitor cloud provider status pages, automatically detect when services your application depends on are degraded, create incidents, and update your public status page — all without human intervention. This meant building a scraping and correlation engine that maps your infrastructure dependencies to specific cloud provider services and regions.

Multi-Tenant Isolation for Sensitive Incident Data

Incident management systems contain some of the most sensitive operational data in an organization: who was on call, how long it took to respond, what went wrong, and who was responsible. The platform needed strict multi-tenant isolation with role-based access control (Owner, Admin, Responder, Viewer), team-scoped escalation policies, and audit logging — all enforced at the query level in a shared database. A misconfigured API call from one organization must never leak incident data to another.
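One way to picture the combined tenant and role check is a guard that runs before any handler touches data. This is a sketch only: it assumes the four roles form a strict hierarchy (Owner above Admin above Responder above Viewer), which the case study does not confirm, and `Session`, `TenantResource`, and `canAccess` are hypothetical names.

```typescript
type Role = "owner" | "admin" | "responder" | "viewer";

// Assumption: roles are strictly ordered, so a higher rank implies every lower permission.
const ROLE_RANK: Record<Role, number> = {
  viewer: 0,
  responder: 1,
  admin: 2,
  owner: 3,
};

interface Session {
  userId: string;
  organizationId: string;
  role: Role;
}

interface TenantResource {
  organizationId: string;
}

function canAccess(
  session: Session,
  resource: TenantResource,
  minimumRole: Role,
): boolean {
  // Tenant boundary first: a matching role never grants cross-organization access.
  if (session.organizationId !== resource.organizationId) return false;
  return ROLE_RANK[session.role] >= ROLE_RANK[minimumRole];
}
```

Checking the organization before the role keeps the failure mode safe: even an Owner in one organization is rejected outright for another organization's incident.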

Public Status Pages with Zero Configuration

Status pages are trust infrastructure. When a customer checks your status page during an outage and sees "All Systems Operational" because your team hasn't manually updated it yet, you've lost credibility. The platform needed to generate public status pages that update automatically from incident state changes, support custom branding (logos, colors, custom CSS), custom domains via CNAME, email subscriptions for status updates, and component-level granularity — all without requiring customers to manage a separate deployment or CMS.

The Solution

We designed Alert24 as a Cloudflare-native application from the ground up, running the entire platform on Workers, D1, and the broader Cloudflare developer platform — no origin servers, no container orchestration, no database connection pools.

Next.js on Cloudflare Workers via Edge Runtime

The entire application — dashboard, API, public status pages, and marketing site — runs as a single Next.js 14 application deployed to Cloudflare Pages. Every API route uses the Edge Runtime, meaning requests are handled at the nearest Cloudflare data center with sub-10ms cold starts. Authentication uses NextAuth.js v5 (beta) with JWT-based sessions, chosen specifically because it doesn't require server-side session storage — a critical constraint for stateless edge execution. The migration from NextAuth v4 required building a custom SessionManager that extracts and validates JWTs from cookies without the full NextAuth middleware stack.
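The extraction half of such a session manager can be sketched without any framework code. The example below only pulls a token out of a `Cookie` header; decrypting or verifying the token is deliberately left out, and the helper names and the cookie names in the usage line are assumptions about the deployment rather than details taken from Alert24.

```typescript
// Returns the value of one cookie from a raw Cookie header, or null if absent.
function readCookie(cookieHeader: string | null, name: string): string | null {
  if (!cookieHeader) return null;
  for (const part of cookieHeader.split(";")) {
    const separator = part.indexOf("=");
    if (separator === -1) continue; // skip malformed fragments
    const key = part.slice(0, separator).trim();
    if (key !== name) continue;
    const raw = part.slice(separator + 1).trim();
    try {
      return decodeURIComponent(raw);
    } catch {
      return raw; // not URL-encoded in a decodable way; return as-is
    }
  }
  return null;
}

// Tries several candidate cookie names in order (hypothetical helper).
function readSessionToken(
  cookieHeader: string | null,
  candidateNames: string[],
): string | null {
  for (const name of candidateNames) {
    const value = readCookie(cookieHeader, name);
    if (value) return value;
  }
  return null;
}
```

In an edge route this might be called as `readSessionToken(request.headers.get("cookie"), ["__Secure-authjs.session-token", "authjs.session-token"])`, with the returned token then passed to whatever validation the session layer performs.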

Cloudflare D1 for Multi-Tenant Relational Data

All platform data lives in Cloudflare D1 — a SQLite-based edge database that co-locates with the Worker for sub-millisecond query latency. The schema spans 20+ tables covering organizations, users, team membership, monitoring checks, check results, incidents, incident updates, escalation policies, on-call schedules, notifications, status pages, webhooks, and audit logs. Multi-tenant isolation is enforced at the query level: every query includes an organization_id filter, and the database router resolves the correct organization context from the authenticated session before any data access occurs. Comprehensive indexing on foreign keys and query fields keeps read performance consistent as data grows.
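Query-level isolation is easiest to enforce when application code cannot build a query without an organization filter. The helper below is a sketch of that idea, not the platform's database router: the table names in `TENANT_TABLES` and the `scopedSelect` function are hypothetical, and only the `organization_id` column comes from the description above.

```typescript
interface ScopedQuery {
  sql: string;
  params: (string | number)[];
}

// Hypothetical allow-list of tenant-owned tables.
const TENANT_TABLES = new Set(["incidents", "monitoring_checks", "status_pages"]);

// Builds a SELECT that always starts its WHERE clause with organization_id = ?.
function scopedSelect(
  table: string,
  organizationId: string,
  filters: Record<string, string | number> = {},
): ScopedQuery {
  if (!TENANT_TABLES.has(table)) {
    throw new Error(`unknown tenant table: ${table}`);
  }
  const clauses = ["organization_id = ?"];
  const params: (string | number)[] = [organizationId];

  for (const [column, value] of Object.entries(filters)) {
    // Column names are interpolated, so restrict them; values are always bound.
    if (!/^[a-z_]+$/.test(column)) {
      throw new Error(`invalid column name: ${column}`);
    }
    clauses.push(`${column} = ?`);
    params.push(value);
  }

  return {
    sql: `SELECT * FROM ${table} WHERE ${clauses.join(" AND ")}`,
    params,
  };
}
```

With a D1 binding, the result would typically be executed as `env.DB.prepare(query.sql).bind(...query.params).all()`, where `organizationId` is taken from the authenticated session and never from request input.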

Cron-Driven Monitoring and Escalation Engine

Two scheduled Workers form the heartbeat of the platform. The monitoring cron executes every configured check — making HTTP requests, opening TCP connections, querying DNS records, and validating SSL certificates — then compares results against previous state. When a check transitions from healthy to failed (with configurable retry thresholds to prevent flapping), the engine automatically creates an incident, matches it to the relevant escalation policy, and triggers the first notification tier. A separate escalation cron runs every two minutes, scanning for incidents where the current escalation level has timed out without acknowledgment, and promoting them to the next tier. Both crons are idempotent — safe to run concurrently without duplicating incidents or notifications.
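The retry threshold that prevents flapping can be modelled as a small state transition applied after every check run. The sketch below is illustrative: `CheckState`, `Transition`, and `applyCheckResult` are hypothetical names, and signalling recovery on the first passing check is an assumption, since the case study does not describe how incidents are resolved.

```typescript
interface CheckState {
  healthy: boolean;
  consecutiveFailures: number;
}

interface Transition {
  next: CheckState;
  openIncident: boolean; // true only on the run that crosses the threshold
  recovered: boolean;    // true only on the run that returns to healthy
}

function applyCheckResult(
  previous: CheckState,
  passed: boolean,
  failureThreshold: number,
): Transition {
  if (passed) {
    return {
      next: { healthy: true, consecutiveFailures: 0 },
      openIncident: false,
      recovered: !previous.healthy,
    };
  }

  const consecutiveFailures = previous.consecutiveFailures + 1;
  // Only a healthy check can cross the threshold, so an already-failed
  // check never opens a second incident on later runs.
  const crossed = previous.healthy && consecutiveFailures >= failureThreshold;

  return {
    next: { healthy: previous.healthy && !crossed, consecutiveFailures },
    openIncident: crossed,
    recovered: false,
  };
}
```

Because `openIncident` is tied to the state transition rather than to the failure itself, replaying the same failed result against the already-failed state produces no second incident. Overlapping cron runs would still need a database-level guard, such as a uniqueness constraint on open incidents per check.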

Multi-Channel Notification Pipeline

The notification service coordinates delivery across three channels with independent failure handling. Email notifications route through PushMail (our own edge-native email platform, also built on Cloudflare) for sub-second delivery without SDK overhead. SMS and voice call notifications use the Twilio SDK, with each channel operating independently so a Twilio API failure doesn't block email delivery. The pipeline implements parallel dispatch with per-channel retry logic and exponential backoff. Notification state (sent, delivered, acknowledged) is tracked in D1 for audit compliance and escalation decision-making. Each notification includes a one-click acknowledgment link that halts further escalation.
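Independent per-channel failure handling with exponential backoff can be sketched as follows. The senders are injected stubs, so nothing here reflects the real PushMail or Twilio calls; `Sender`, `sendWithRetry`, `dispatchAll`, and the default attempt count and base delay are assumptions made for the example.

```typescript
type Channel = "email" | "sms" | "voice";
type Sender = (message: string) => Promise<void>;
type DeliveryStatus = "sent" | "failed";

interface ChannelSender {
  channel: Channel;
  send: Sender;
}

interface DeliveryResult {
  channel: Channel;
  status: DeliveryStatus;
}

// Exponential backoff: base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * Math.pow(2, attempt);
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

async function sendWithRetry(
  send: Sender,
  message: string,
  maxAttempts: number,
  baseMs: number,
): Promise<void> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await send(message);
      return;
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}

// Dispatches to every channel in parallel; one channel failing never
// rejects the whole batch or delays the others.
async function dispatchAll(
  senders: ChannelSender[],
  message: string,
  maxAttempts = 3,
  baseMs = 200,
): Promise<DeliveryResult[]> {
  return Promise.all(
    senders.map(async ({ channel, send }): Promise<DeliveryResult> => {
      try {
        await sendWithRetry(send, message, maxAttempts, baseMs);
        return { channel, status: "sent" };
      } catch {
        return { channel, status: "failed" };
      }
    }),
  );
}
```

Each channel's promise resolves to a status instead of rejecting, so the caller always receives one result per channel and can record it for the audit trail.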

Cloud Provider Status Monitoring

The auto-sync engine continuously polls the public status endpoints of AWS, Azure, and Google Cloud Platform. When a cloud provider reports degradation or an outage for a service and region that maps to a customer's configured dependencies, the platform automatically creates an incident with the cloud provider's status information, updates the customer's public status page, and notifies the on-call team — typically before the team has even checked Twitter. Customers configure their cloud dependencies through a simple UI that maps their services to specific cloud provider services and regions (e.g., "Our API depends on AWS us-east-1 EC2 and RDS"). The correlation engine handles the mapping between cloud provider service names and customer-defined services.
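At its simplest, the correlation step is a match between a provider status event and the dependencies each customer has declared. The sketch below assumes exact, case-insensitive matching on provider, service, and region; the real engine's handling of provider naming differences is not described here, and `ProviderEvent`, `Dependency`, and `affectedServices` are hypothetical names.

```typescript
type Provider = "aws" | "azure" | "gcp";

interface ProviderEvent {
  provider: Provider;
  service: string; // e.g. "EC2"
  region: string;  // e.g. "us-east-1"
  status: "degraded" | "outage";
}

interface Dependency {
  customerService: string; // customer-defined name, e.g. "API"
  provider: Provider;
  service: string;
  region: string;
}

function normalize(value: string): string {
  return value.trim().toLowerCase();
}

// Returns the customer-defined services affected by a provider event, without duplicates.
function affectedServices(
  event: ProviderEvent,
  dependencies: Dependency[],
): string[] {
  const affected: string[] = [];
  for (const dependency of dependencies) {
    const matches =
      dependency.provider === event.provider &&
      normalize(dependency.service) === normalize(event.service) &&
      normalize(dependency.region) === normalize(event.region);
    if (matches && affected.indexOf(dependency.customerService) === -1) {
      affected.push(dependency.customerService);
    }
  }
  return affected;
}
```

For the example in the text, an "API" service that declares AWS us-east-1 EC2 and RDS as dependencies would be returned for an EC2 event in us-east-1, and ignored for the same event in another region.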

Auto-Updating Public Status Pages

Status pages are served as statically generated pages with dynamic incident data. Each organization gets a status page at a subdomain (status.yourcompany.com) with custom branding — logo, brand colors, and optional custom CSS. The page displays real-time service health derived directly from monitoring check state and active incidents, eliminating the manual "update the status page" step that most teams forget during an actual incident. Email subscribers receive automated notifications when incident state changes. Status pages support component-level granularity, so customers can report that "API" is degraded while "Dashboard" remains operational — critical for accurate communication during partial outages.
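Deriving page state from check state and active incidents amounts to a "worst status wins" rollup per component. The following sketch assumes three status levels and hypothetical names (`ComponentInput`, `summarizeStatusPage`); the actual status vocabulary of the platform may differ.

```typescript
type ComponentStatus = "operational" | "degraded" | "outage";

const SEVERITY: Record<ComponentStatus, number> = {
  operational: 0,
  degraded: 1,
  outage: 2,
};

// Picks the most severe status; an empty list counts as operational.
function worstStatus(statuses: ComponentStatus[]): ComponentStatus {
  return statuses.reduce<ComponentStatus>(
    (worst, status) => (SEVERITY[status] > SEVERITY[worst] ? status : worst),
    "operational",
  );
}

interface ComponentInput {
  name: string;
  checkStatuses: ComponentStatus[];   // latest state of each check on the component
  incidentImpacts: ComponentStatus[]; // impact of each active incident on the component
}

interface StatusPageSummary {
  components: { name: string; status: ComponentStatus }[];
  overall: ComponentStatus;
}

function summarizeStatusPage(inputs: ComponentInput[]): StatusPageSummary {
  const components = inputs.map((input) => ({
    name: input.name,
    status: worstStatus(input.checkStatuses.concat(input.incidentImpacts)),
  }));
  return {
    components,
    overall: worstStatus(components.map((component) => component.status)),
  };
}
```

This is what lets a page show "API" as degraded while "Dashboard" stays operational: each component is computed on its own, and only the overall banner aggregates across them.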

Outbound Webhooks with Payload Templating

For teams that need to integrate Alert24 into existing workflows, the platform supports outbound webhooks with configurable event filters and payload templating. Webhooks can fire on incident creation, escalation, acknowledgment, or resolution — delivering customizable JSON payloads to any HTTP endpoint. Authentication supports basic auth, bearer tokens, and custom headers. Delivery attempts are tracked with retry logic, and the webhook log provides full request/response visibility for debugging integration issues.
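Payload templating can be illustrated with a small placeholder renderer. The `{{ path.to.value }}` syntax and the `renderTemplate` function are assumptions made for this sketch; the case study does not specify the template language Alert24 uses.

```typescript
// Replaces {{ dotted.path }} placeholders with JSON-encoded values from the
// context, so the rendered template is itself valid JSON.
function renderTemplate(
  template: string,
  context: Record<string, unknown>,
): string {
  return template.replace(
    /\{\{\s*([\w.]+)\s*\}\}/g,
    (_match: string, path: string) => {
      const value = path
        .split(".")
        .reduce<unknown>(
          (current, key) =>
            current !== null && typeof current === "object"
              ? (current as Record<string, unknown>)[key]
              : undefined,
          context,
        );
      // Missing values render as null rather than breaking the JSON.
      return JSON.stringify(value === undefined ? null : value);
    },
  );
}
```

A template such as `{"text": {{incident.title}}, "tier": {{incident.tier}}}` rendered with an incident titled `API down` on tier 2 yields `{"text": "API down", "tier": 2}`. Because values are JSON-encoded, quotes inside a title cannot corrupt the payload.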

The Results

The Cloudflare-native architecture delivered measurable advantages across performance, cost, and operational simplicity — both for us as the platform builders and for our customers.

Three Tools Replaced, One Bill

Alert24 replaces the monitoring + on-call + status page stack that typically costs $200+ per month with separate vendors. The Pro plan starts at $24 per month (3 units at $8 each), with each unit including a user seat, monitoring checks, status page capacity, and SMS/voice credits. A free tier with email-only notifications lets teams evaluate the full platform without a credit card. The unit-based pricing model means everything scales together: no surprise per-incident fees, no per-subscriber charges on status pages, no separate line items for each capability.

Zero Origin Servers, Global Edge Performance

The entire platform (API, dashboard, monitoring engine, escalation processor, status pages, and webhook dispatcher) runs without a single origin server. There are no EC2 instances to right-size, no Kubernetes pods to scale, no database connections to pool. Workers execute with sub-10ms cold starts, and D1 queries return in under 1ms. An engineer in Sydney gets the same dashboard response time as an engineer in New York. Monitoring checks execute from Cloudflare's edge network, providing geographic diversity without managing check infrastructure in multiple regions.

Automatic Cloud Outage Detection

The cloud provider auto-sync feature, which monitors AWS, Azure, and GCP status pages and automatically correlates outages with customer dependencies, is a genuine market differentiator. No other platform in the monitoring space offers automatic status page updates triggered by upstream cloud provider incidents. For customers running on public cloud, this eliminates the most common failure mode: finding out about a cloud outage from a customer complaint instead of from your monitoring tool. Status pages update within minutes of a cloud provider announcing degradation, often before the on-call team has even opened their laptop.

PushMail Integration: Eating Our Own Cooking

Alert24's email notification pipeline runs through PushMail, our own edge-native email platform (also a Cloudflare case study). This means incident notifications, escalation alerts, status page subscriptions, and team invitations all route through infrastructure we built and control, with no third-party email API dependency that could fail during the exact kind of incident Alert24 is designed to manage. The integration demonstrates both platforms working together: PushMail handles delivery reliability, and Alert24 handles incident orchestration.

10+ Cloudflare Products in Production

Alert24 leverages Cloudflare Workers, D1, Pages, Cron Triggers, Custom Domains, and the edge runtime, demonstrating that Cloudflare's developer platform can support complex, multi-service applications with real-time operational requirements. The monitoring engine, escalation processor, notification pipeline, and status page renderer all run as edge functions communicating through native D1 bindings. The entire infrastructure is defined in a single wrangler.toml file and deploys with one command. Combined with PushMail for email delivery, the total Cloudflare product footprint across both platforms exceeds 15 distinct services, all running without a single traditional server.
