High Availability (HA)

A system design approach that keeps a service operational at or above an agreed level of uptime, typically expressed in "nines" (for example, 99.9%).

Infrastructure

Also called: "ha", "fault tolerance", "uptime", "system availability"

High availability engineering eliminates single points of failure so that systems remain accessible even when individual components fail.

Why it matters

  • Modern businesses depend on 24/7 system availability.
  • Downtime can cost anywhere from thousands to millions per hour, depending on the business.
  • SLAs often guarantee 99.9% uptime or higher.
  • Customer experience suffers from even brief outages.

The "nines" of availability

  • 99% (two nines): 3.65 days downtime/year
  • 99.9% (three nines): 8.76 hours downtime/year
  • 99.99% (four nines): 52.6 minutes downtime/year
  • 99.999% (five nines): 5.26 minutes downtime/year
  • 99.9999% (six nines): 31.5 seconds downtime/year
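
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name `downtime_seconds_per_year` is illustrative, not part of any library):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in seconds per year, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * SECONDS_PER_YEAR


# Print the budget for each level listed above, in hours and in seconds.
for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    secs = downtime_seconds_per_year(pct)
    print(f"{pct}%: {secs / 3600:.2f} hours ({secs:.1f} seconds)")
```

Each additional nine divides the yearly downtime budget by ten, which is why the higher levels leave little room for manual recovery.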

HA design principles

  • Redundancy: Duplicate critical components (servers, storage, network paths).
  • Failover: Switch automatically to a standby system when the primary fails.
  • Load balancing: Distribute traffic across multiple instances.
  • Geographic distribution: Spread across data centers/regions.
  • Health monitoring: Detect failures quickly to trigger failover.
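
As an illustration of how health monitoring can trigger failover, the sketch below promotes a standby after several consecutive failed checks. The class `FailoverMonitor`, the `check_primary` probe, and the threshold of three are hypothetical choices for this example; real systems usually add timeouts, fencing, and a failback policy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Promotes the standby after `threshold` consecutive failed health checks."""

    check_primary: Callable[[], bool]  # hypothetical probe: True when the primary is healthy
    threshold: int = 3                 # assumed value; tune to tolerate brief blips
    active: str = "primary"
    consecutive_failures: int = 0

    def tick(self) -> str:
        """Run one health check and return the node that should serve traffic."""
        if self.active == "standby":
            return self.active  # no automatic failback in this sketch
        if self.check_primary():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.active = "standby"
        return self.active


# Simulated probe results: one healthy check, then the primary stops responding.
results = iter([True, False, False, False, True])
monitor = FailoverMonitor(check_primary=lambda: next(results))
history = [monitor.tick() for _ in range(5)]
print(history)
```

Requiring several consecutive failures trades detection speed for fewer false failovers: a lower threshold reacts faster but is more likely to fail over on a transient network glitch.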

Common HA patterns

  • Active-passive: Standby takes over only when primary fails.
  • Active-active: All nodes serve traffic simultaneously.
  • N+1 redundancy: One spare instance beyond the N required to carry the load.
  • 2N redundancy: Twice the N instances required, giving a complete duplicate set.
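
To make the N+1 and 2N sizing concrete, here is a small sketch that derives N from a peak load and a per-instance capacity, then applies each scheme. The function names and the load figures are made up for illustration.

```python
import math


def instances_required(peak_load: float, capacity_per_instance: float) -> int:
    """Return N, the minimum number of instances needed to carry the peak load."""
    return math.ceil(peak_load / capacity_per_instance)


def n_plus_1(n: int) -> int:
    """N+1: one spare instance on top of the minimum."""
    return n + 1


def two_n(n: int) -> int:
    """2N: a complete duplicate of the minimum set."""
    return 2 * n


# Hypothetical figures: 900 requests/s at peak, 200 requests/s per instance.
n = instances_required(peak_load=900, capacity_per_instance=200)
print(f"N = {n}, N+1 = {n_plus_1(n)}, 2N = {two_n(n)}")
```

N+1 tolerates the loss of a single instance at peak load, while 2N tolerates the loss of an entire set (for example, a whole site) at roughly double the cost.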

Implementation considerations

  • Database replication and clustering.
  • Stateless application design for easy scaling.
  • Session management across instances.
  • DNS failover or global load balancing.
  • Chaos engineering to test failure scenarios.
  • Monitoring and alerting for rapid incident response.
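
The stateless-design and session-management points can be sketched as follows: application instances keep no session data of their own and read it from a shared store, so any instance can serve any request. `SessionStore` here is an in-memory stand-in for an external store (such as a database or cache), and all names are hypothetical.

```python
class SessionStore:
    """In-memory stand-in for an external session store shared by all instances."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._data.get(session_id, {}))

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = dict(session)


class AppInstance:
    """A stateless application instance: all session state lives in the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def handle(self, session_id: str) -> str:
        session = self.store.get(session_id)
        session["visits"] = session.get("visits", 0) + 1
        self.store.put(session_id, session)
        return f"{self.name} served visit {session['visits']}"


store = SessionStore()
a, b = AppInstance("a", store), AppInstance("b", store)
first = a.handle("s1")   # request lands on instance a
second = b.handle("s1")  # same session, different instance
print(first, "|", second)
```

Because neither instance holds the session, a load balancer can route each request to any healthy instance, and losing one instance does not lose the user's session. The shared store then becomes a component that needs its own redundancy.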

Trade-offs

  • Higher complexity and operational overhead.
  • Increased infrastructure costs.
  • Potential for split-brain scenarios in distributed systems.
  • Need for thorough testing of failover mechanisms.
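
One common guard against split-brain is a quorum rule: a node acts as primary only while it can reach a strict majority of the cluster. The sketch below shows that check in isolation; `has_quorum` is an illustrative name, and real clusters pair this with leader election and fencing.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if this partition holds a strict majority of the cluster.

    `reachable_nodes` counts the local node plus every peer it can reach.
    """
    return reachable_nodes > cluster_size // 2


# A 5-node cluster split 3/2: only the larger side may keep acting as primary.
print(has_quorum(3, 5), has_quorum(2, 5))

# A 4-node cluster split 2/2: neither side has a majority, so both stop writing.
print(has_quorum(2, 4))
```

The even-split case is why clusters are often sized with an odd number of voting members: with an odd size, any two-way partition leaves exactly one side with a majority.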