
SLA vs SLO vs SLI: What's the Difference and Why It Matters

Learn the differences between SLA, SLO, and SLI with practical examples.

By InventiveHQ Team

Understanding the Service Level Hierarchy

If you've worked with SLAs, SLOs, and SLIs, you've probably encountered confusion about what each term means and how they relate. These acronyms sound similar but represent fundamentally different concepts in service reliability management.

This confusion has real consequences. Teams set unrealistic SLAs, over-commit to customers, burn out from chasing 100% uptime, or under-invest in reliability and face customer churn. Understanding the distinction between SLI, SLO, and SLA—and how to use them effectively—is essential for building sustainable, reliable services.

Let's break down each concept and learn how to apply Google's SRE (Site Reliability Engineering) best practices.

What is an SLI (Service Level Indicator)?

An SLI is a quantitative measurement of a specific aspect of service performance. It's what you actually measure.

Common SLIs

Availability:

SLI = (Successful Requests / Total Requests) × 100%

Latency (95th percentile):

SLI = "95% of requests complete in <200ms"

Error Rate:

SLI = (Failed Requests / Total Requests) × 100%

Throughput:

SLI = Requests per Second
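
To make the formulas concrete, here is a minimal Python sketch that turns raw request counters into availability and error-rate SLIs (the counter values are made up for illustration):

total_requests = 1_250_000
successful_requests = 1_248_600   # e.g., HTTP 2xx within the latency threshold
failed_requests = total_requests - successful_requests

availability_sli = successful_requests / total_requests * 100   # 99.888%
error_rate_sli = failed_requests / total_requests * 100          # 0.112%

print(f"Availability SLI: {availability_sli:.3f}%")
print(f"Error rate SLI:   {error_rate_sli:.3f}%")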

Good SLI Characteristics

A good SLI should be:

1. User-centric: Measures what users actually experience

✅ Good: "HTTP 200 response with valid data <500ms"
❌ Bad: "CPU utilization <70%"

2. Measurable: Can be calculated from real data

✅ Good: "Request success rate from load balancer logs"
❌ Bad: "Users are happy with performance"

3. Actionable: Points to specific improvements

✅ Good: "P95 API latency from distributed tracing"
❌ Bad: "Overall system health score"

4. Comprehensive: Covers the full user journey

✅ Good: "End-to-end transaction success (login → purchase → confirmation)"
❌ Bad: "Database query success rate"

Example SLI Definitions

E-commerce checkout flow:

SLI: Successful Checkout Rate
Definition: |
  (Successful checkouts / Total checkout attempts) × 100%

  Success criteria:
  - Payment processed
  - Confirmation email sent
  - Order appears in customer account
  - HTTP 200 response
  - Page load <3 seconds

Measurement:
  Source: Application logs + payment gateway webhooks
  Aggregation: Per-minute buckets, rolled up to hourly
  Exclusions: Admin test transactions, fraud-flagged attempts

API service:

SLI: Request Success Rate
Definition: |
  (Valid responses / Total requests) × 100%

  Valid response criteria:
  - HTTP 200-299 status
  - Response time <500ms
  - Response body validates against schema
  - No internal server errors in logs

Measurement:
  Source: Load balancer access logs
  Aggregation: 10-minute rolling windows
  Exclusions: Health checks, monitoring probes

What is an SLO (Service Level Objective)?

An SLO is your internal target for an SLI. It defines the threshold of acceptable performance.

SLO Format

SLO = Target Value + Time Window + Measurement Method

Examples

Availability SLO:

"99.9% of HTTP requests will return 2xx status codes,
measured over rolling 30-day windows"

Latency SLO:

"95% of API requests will complete in <200ms,
measured over rolling 7-day windows"

Error Budget SLO:

"We can tolerate 43.8 minutes of downtime per month
(100% - 99.9% = 0.1% error budget)"

Setting Realistic SLOs

Start with current performance:

Current performance (last 90 days): 99.85% availability

Initial SLO: 99.5% (buffer below current performance)
After 3 months of stability: Tighten to 99.7%
After 6 months: Tighten to 99.9%

Don't over-commit:

❌ Wrong: "We'll achieve 99.99% because competitors do"
✅ Right: "We'll target 99.9% based on our current 99.85% and business needs"

Balance cost and benefit:

Availability   Annual Downtime   Infrastructure Cost   Staffing Need
99%            3.65 days         Low                   Standard
99.9%          8.76 hours        Medium                On-call rotation
99.95%         4.38 hours        High                  24/7 on-call
99.99%         52.6 minutes      Very high             Multiple 24/7 teams
99.999%        5.26 minutes      Extreme               Dedicated SRE team

The Cost of Each Additional Nine

Going from three nines (99.9%) to four nines (99.99%) means:

Allowed downtime drops:

99.9% = 43.8 minutes/month
99.99% = 4.38 minutes/month
Reduction: 39.4 minutes (90% less downtime)

Infrastructure cost increases:

Estimated cost multiplier: 3-5×
- Multi-region deployment required
- Redundancy at every layer
- Automated failover systems
- Advanced monitoring and alerting

Operational complexity increases:

- 24/7 on-call rotation with <5 min response SLA
- Extensive runbook documentation
- Regular incident response drills
- Chaos engineering testing
- Post-incident review processes

What is an SLA (Service Level Agreement)?

An SLA is your contractual promise to customers, typically with financial consequences (service credits) for violations.

SLA Components

1. Service Level:

"99.9% monthly uptime, excluding scheduled maintenance"

2. Measurement Method:

"Uptime measured as successful health checks every 60 seconds
from 5 global monitoring locations"

3. Service Credits (penalties for violations; a code sketch of this tiering follows the list below):

If monthly uptime is:
- 99.0% to 99.9%: 10% service credit
- 95.0% to 99.0%: 25% service credit
- Below 95.0%: 100% service credit

4. Exclusions:

SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Customer-caused issues (invalid API calls, quota exceeded)
- Force majeure events (natural disasters, acts of war)
- Third-party service failures beyond our control

5. Credit Claiming Process:

- Customer must file claim within 30 days
- Credits applied to next invoice
- Credits are exclusive remedy (no refunds)
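
As mentioned under component 3, the tiered credit schedule is straightforward to encode. The following Python sketch is purely illustrative (the function name and tier boundaries mirror the example above, not any real contract):

def service_credit_percent(monthly_uptime: float, sla: float = 99.9) -> float:
    """Tiered service credit for a hypothetical 99.9% SLA (illustrative only)."""
    if monthly_uptime >= sla:
        return 0.0    # SLA met, no credit owed
    if monthly_uptime >= 99.0:
        return 10.0   # 99.0% to 99.9%
    if monthly_uptime >= 95.0:
        return 25.0   # 95.0% to 99.0%
    return 100.0      # below 95.0%

print(service_credit_percent(99.92))  # 0.0  (SLA met)
print(service_credit_percent(99.4))   # 10.0
print(service_credit_percent(97.2))   # 25.0
print(service_credit_percent(91.0))   # 100.0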

SLA Best Practices

Leave buffer between SLO and SLA:

Internal SLO: 99.95% availability
Customer SLA: 99.9% availability
Buffer: 0.05% (22 minutes/month safety margin)

This buffer allows:

  • Operational mistakes without SLA violations
  • Maintenance activities
  • Graceful degradation scenarios
  • Learning from incidents without financial penalties

Tier your SLAs:

Basic Plan:
  SLA: 99.0% uptime
  Support: Business hours
  Credits: 10% below SLA

Professional Plan:
  SLA: 99.9% uptime
  Support: 24/7 email
  Credits: 10% at 99-99.9%, 25% at 95-99%, 100% below 95%

Enterprise Plan:
  SLA: 99.95% uptime
  Support: 24/7 phone + dedicated account manager
  Credits: 25% at 99.9-99.95%, 50% at 99-99.9%, 100% below 99%
  Priority: Dedicated capacity, early access to features

Real-world SLA examples:

AWS EC2:

SLA: 99.99% monthly uptime (multi-AZ)
Credit: 10% at 99-99.99%, 30% at 95-99%, 100% below 95%
Measurement: Per-region basis

Google Cloud Compute Engine:

SLA: 99.99% monthly uptime (multi-zone)
Credit: 10% at 99-99.99%, 25% at 95-99%, 50% below 95%
Measurement: Per-zone basis

Stripe Payments API:

SLA: 99.99% uptime
Credit: Negotiated per enterprise agreement
Measurement: Successful API requests

The Relationship: SLI → SLO → SLA

Think of these as a hierarchy:

┌─────────────────────────────────────────┐
│ SLI (What you measure)                  │
│ • Request success rate                  │
│ • API latency (p95)                     │
│ • Error rate                            │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│ SLO (Your internal goal)                │
│ • 99.95% request success rate           │
│ • p95 latency <200ms                    │
│ • Error rate <0.1%                      │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│ SLA (Your customer promise)             │
│ • 99.9% uptime guarantee                │
│ • Service credits if violated           │
│ • Excludes scheduled maintenance        │
└─────────────────────────────────────────┘

Example Flow

Measurement (SLI):

Week 1: 99.97% request success rate
Week 2: 99.93% request success rate
Week 3: 99.88% request success rate ⚠️
Week 4: 99.91% request success rate

Monthly average: 99.92% ✅

Internal Target (SLO):

Target: 99.95% monthly
Actual: 99.92%
Status: MISSED (by 0.03%)

Action: Root cause analysis, improve monitoring

Customer Promise (SLA):

Guarantee: 99.9% monthly
Actual: 99.92%
Status: MET ✅

Result: No service credits owed

The 0.05% buffer (99.95% SLO vs 99.9% SLA) saved you from a customer-facing violation despite missing your internal target.

Error Budgets: The Key to Balancing Reliability and Velocity

An error budget is the amount of unreliability you're willing to accept. It's the inverse of your SLO.

Calculating Error Budget

Error Budget = 100% - SLO

For 99.9% SLO:

Error Budget = 100% - 99.9% = 0.1%

In a 30-day month:
0.1% × 30 days × 24 hours × 60 minutes = 43.2 minutes

For 99.99% SLO:

Error Budget = 100% - 99.99% = 0.01%

In a 30-day month:
0.01% × 30 days × 24 hours × 60 minutes = 4.32 minutes
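
The same arithmetic as a small Python helper, assuming a 30-day window:

def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) implied by an SLO over the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60

print(round(error_budget_minutes(99.9), 2))   # 43.2
print(round(error_budget_minutes(99.99), 2))  # 4.32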

Using Error Budgets for Decision Making

Healthy error budget (50%+ remaining):

✅ Deploy new features with normal risk
✅ Run experiments and A/B tests
✅ Perform non-critical maintenance
✅ Test chaos engineering scenarios

Moderate error budget (20-50% remaining):

⚠️ Increase testing rigor
⚠️ Require production readiness reviews
⚠️ Postpone risky changes to off-peak hours
⚠️ Add extra monitoring to new deployments

Depleted error budget (<20% remaining):

🚨 Freeze feature deployments
🚨 Focus on reliability improvements
🚨 Conduct incident reviews and fix root causes
🚨 Tighten change approval process

Error Budget Policy Example

Error Budget Policy:

SLO: 99.9% request success rate (monthly)
Error Budget: 0.1% (43.2 minutes downtime/month)

Thresholds:
  - 80-100% remaining: Normal operations
    Actions: Standard deployment cadence, experiment freely

  - 50-80% remaining: Caution
    Actions: Require production readiness review for risky changes

  - 20-50% remaining: Alert
    Actions:
      - Postpone non-critical deployments
      - Conduct root cause analysis on recent incidents
      - Add monitoring to detect issues earlier

  - 0-20% remaining: Lockdown
    Actions:
      - Freeze all feature deployments
      - All-hands focus on reliability
      - Daily incident reviews
      - Executive reporting

  - Exhausted (negative): Emergency
    Actions:
      - All engineering resources on reliability
      - Daily executive briefings
      - Customer communication plan
      - SLA credit calculations
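
A policy like this can also be encoded so that dashboards and automation agree on the current stage. Here is a minimal sketch that mirrors the thresholds above (the function name and messages are illustrative):

def policy_stage(budget_remaining_percent: float) -> str:
    """Map remaining error budget to the policy stage defined above (sketch)."""
    if budget_remaining_percent >= 80:
        return "Normal operations"
    if budget_remaining_percent >= 50:
        return "Caution: readiness reviews for risky changes"
    if budget_remaining_percent >= 20:
        return "Alert: postpone non-critical deployments"
    if budget_remaining_percent > 0:
        return "Lockdown: freeze feature deployments"
    return "Emergency: budget exhausted"

print(policy_stage(65))   # Caution
print(policy_stage(12))   # Lockdown
print(policy_stage(-5))   # Emergency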

Burn Rate: Measuring Error Budget Consumption

Burn rate measures how fast you're consuming your error budget.

Calculating Burn Rate

Burn Rate = (Actual Error Rate / Allowed Error Rate)

Example:

SLO: 99.9% (allowed error rate: 0.1%)
Current error rate: 0.2%

Burn Rate = 0.2% / 0.1% = 2

Interpretation: You're consuming error budget twice as fast as planned. At this rate, you'll exhaust your monthly budget in 15 days instead of 30.
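
Here is that worked example as a short Python sketch:

def burn_rate(actual_error_rate: float, slo_percent: float) -> float:
    """Burn rate = actual error rate / allowed error rate."""
    allowed_error_rate = 100.0 - slo_percent
    return actual_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the budget runs out if the current burn rate holds."""
    return window_days / rate

rate = burn_rate(actual_error_rate=0.2, slo_percent=99.9)
print(round(rate, 2))                      # 2.0
print(round(days_to_exhaustion(rate), 1))  # 15.0 days instead of 30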

Multi-Window Burn Rate Alerting

Google SRE recommends multi-window, multi-burn-rate alerting to balance speed and precision:

Fast burn alert (1-hour window):

Burn rate > 36 over 1 hour → Page on-call immediately

Why 36?
- Exhausts 30-day budget in 20 hours
- Indicates severe outage
- Requires immediate response

Slow burn alert (6-hour window):

Burn rate > 6 over 6 hours → Create ticket for investigation

Why 6?
- Exhausts 30-day budget in 5 days
- Indicates chronic issue
- Requires root cause analysis

Setting Up Burn Rate Alerts

Prometheus example:

# Fast burn: Page immediately
- alert: ErrorBudgetBurnRateCritical
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) / 0.001 > 36
  labels:
    severity: page
    error_budget: critical
  annotations:
    summary: "Critical: Error budget burning 36× faster than sustainable"
    description: "At this rate, monthly budget exhausted in 20 hours"

# Slow burn: Create ticket
- alert: ErrorBudgetBurnRateWarning
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) / 0.001 > 6
  labels:
    severity: ticket
    error_budget: warning
  annotations:
    summary: "Warning: Error budget burning 6× faster than sustainable"
    description: "At this rate, monthly budget exhausted in 5 days. Investigate chronic issues."

Multiple SLOs Per Service

Most services need 2-4 SLOs covering different reliability dimensions:

Example: E-Commerce Checkout Service

SLO 1: Availability

99.95% of checkout requests return valid responses
Measurement: Load balancer success rate
Time window: Rolling 30 days

SLO 2: Latency (p95)

95% of checkout requests complete in <500ms
Measurement: Application instrumentation
Time window: Rolling 7 days

SLO 3: Latency (p99)

99% of checkout requests complete in <2 seconds
Measurement: Application instrumentation
Time window: Rolling 7 days

SLO 4: Error Rate

<0.1% of checkouts result in payment errors
Measurement: Payment gateway webhook success
Time window: Rolling 30 days

Why Multiple SLOs?

Single SLO problems:

❌ Only measuring availability misses:
   - Slow responses (users abandon)
   - Partial failures (cart but no confirmation)
   - Specific error types (payment vs inventory)

Multiple SLOs capture full picture:

✅ Availability: "Is it working?"
✅ Latency: "Is it fast enough?"
✅ Correctness: "Is the result accurate?"
✅ Freshness: "Is the data current?"

Composite SLOs

For complex user journeys, measure end-to-end:

Composite SLO: Successful User Journey
Definition: User can complete full purchase flow

Steps:
  1. Browse products (latency <1s, 99.9% success)
  2. Add to cart (latency <200ms, 99.95% success)
  3. Checkout (latency <500ms, 99.95% success)
  4. Payment (latency <2s, 99.9% success)
  5. Confirmation (latency <500ms, 99.95% success)

Overall SLO: ≈99.65%
(Product of individual steps: 0.999 × 0.9995 × 0.9995 × 0.999 × 0.9995 ≈ 0.9965)
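
A quick Python sketch of the composite calculation, assuming step failures are independent (the best case):

from math import prod

# Steps: browse, add-to-cart, checkout, payment, confirmation
step_slos = [0.999, 0.9995, 0.9995, 0.999, 0.9995]

# End-to-end success target is the product of the step targets.
composite = prod(step_slos)

print(f"Composite SLO: {composite * 100:.2f}%")  # Composite SLO: 99.65%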

The Cost of Downtime

SLOs and SLAs directly impact revenue through downtime costs.

Industry Benchmarks (Gartner 2024)

Average cost by company size:

Small business: $137-$427 per minute
Medium business: $2,300 per minute
Large enterprise: $5,600+ per minute

By industry:

E-commerce: $17,000 per minute (peak season)
Financial services: $9,000 per minute
Healthcare: $8,000 per minute
Manufacturing: $6,000 per minute
SaaS: $5,000 per minute

Calculating Your Downtime Cost

Hourly Revenue = Annual Revenue / 8,760 hours
Cost per Minute = Hourly Revenue / 60
Monthly Downtime Cost = Allowed Downtime Minutes × Cost per Minute

Example: $10M ARR SaaS company:

Hourly Revenue = $10,000,000 / 8,760 = $1,141
Cost per Minute = $1,141 / 60 = $19

For 99.9% SLO (43.2 min/month allowed):
Monthly Downtime Cost = 43.2 × $19 = $821

For 99.99% SLO (4.32 min/month allowed):
Monthly Downtime Cost = 4.32 × $19 = $82

Difference: $739/month = $8,868/year
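
The same back-of-the-envelope math as a Python sketch; it assumes revenue is spread evenly across the year:

def monthly_downtime_cost(annual_revenue: float, slo_percent: float,
                          window_days: int = 30) -> float:
    """Revenue at risk from the allowed downtime in one window (rough sketch)."""
    cost_per_minute = annual_revenue / (365 * 24 * 60)
    allowed_minutes = (100.0 - slo_percent) / 100.0 * window_days * 24 * 60
    return allowed_minutes * cost_per_minute

print(round(monthly_downtime_cost(10_000_000, 99.9)))   # ~822 (the $821 above used a rounded $19/min)
print(round(monthly_downtime_cost(10_000_000, 99.99)))  # ~82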

Hidden Costs

Direct revenue is just the start:

Customer churn:

25% of customers won't return after poor experience
Customer Lifetime Value × Churn Rate = Lost revenue

Brand reputation:

Social media amplification
Long-term trust damage
PR crisis management costs

SLA penalties:

Service credits owed
Future business lost
Contractual disputes

Recovery costs:

All-hands incident response
Overtime and emergency support
Vendor emergency fees

Practical SLO Implementation Guide

Step 1: Choose Your SLIs

Start with 1-2 critical user-facing metrics:

Priority 1: Availability
"Can users access the service?"

Priority 2: Latency
"Is the service fast enough to be useful?"

Step 2: Gather Baseline Data

Measure current performance for 30-90 days:

# Query your monitoring system
# Calculate p50, p95, p99 latency
# Calculate success rates
# Identify patterns (time of day, day of week)
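
If your metrics live in Prometheus, a baseline pull might look like the sketch below. The server URL and metric names (http_requests_total, http_request_duration_seconds_bucket) are assumptions; substitute whatever your service actually exposes:

import requests

PROM_URL = "http://prometheus.example.internal:9090"  # assumption: your server's address

def instant_query(promql: str) -> float:
    """Run an instant query against the Prometheus HTTP API, return the first sample."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

# 90-day request success rate (2xx / all requests)
success_rate = instant_query(
    'sum(rate(http_requests_total{status=~"2.."}[90d]))'
    ' / sum(rate(http_requests_total[90d]))'
)

# 90-day p95 latency from a latency histogram
p95_latency_seconds = instant_query(
    'histogram_quantile(0.95, '
    'sum(rate(http_request_duration_seconds_bucket[90d])) by (le))'
)

print(f"90-day success rate: {success_rate * 100:.3f}%")
print(f"90-day p95 latency:  {p95_latency_seconds * 1000:.0f} ms")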

Step 3: Set Initial SLOs

Set targets below current performance:

Current: 99.85% availability (last 90 days)
Initial SLO: 99.5% (safe buffer)

Current: p95 latency 150ms
Initial SLO: p95 < 200ms (safe buffer)

Step 4: Define Error Budget Policy

Document what happens at each threshold:

## Error Budget Policy

SLO: 99.9% availability
Error Budget: 43.2 minutes/month

### 100-80% Budget Remaining
- Normal feature velocity
- Standard testing rigor

### 80-50% Budget Remaining
- Require production readiness reviews
- Increase monitoring coverage

### 50-20% Budget Remaining
- Defer risky changes
- Root cause analysis on incidents

### 20-0% Budget Remaining
- Feature freeze
- All-hands reliability focus

### Budget Exhausted
- Engineering lockdown
- Executive escalation

Step 5: Implement Monitoring and Alerting

Set up dashboards and alerts:

# SLO Dashboard
Panels:
  - Current SLI value
  - SLO target line
  - Error budget remaining (%)
  - Error budget remaining (time)
  - Burn rate (1h, 6h, 24h)
  - Historical trend (30 days)

# Alerts
Critical:
  - Burn rate > 36 (1h window) → Page
Warning:
  - Burn rate > 6 (6h window) → Ticket
  - Error budget < 20% → Slack notification

Step 6: Review and Iterate

Monthly review:

1. Did we meet SLO? ✅ or ❌
2. How much error budget used? X%
3. Were error budget policies followed?
4. Any SLO violations? Root causes?
5. Should we tighten/loosen SLO?

Quarterly adjustments:

If consistently exceeding SLO with budget to spare:
→ Tighten SLO (e.g., 99.5% → 99.7%)

If consistently missing SLO:
→ Loosen SLO or invest in reliability

Common Mistakes and How to Avoid Them

Mistake 1: Setting SLO = SLA

❌ Wrong:
Internal SLO: 99.9%
Customer SLA: 99.9%
Buffer: 0%

Problem: One incident = SLA violation = service credits

✅ Right:
Internal SLO: 99.95%
Customer SLA: 99.9%
Buffer: 0.05% (safety margin)

Mistake 2: Too Many SLOs

❌ Wrong: 15 SLOs per service
Result: Alert fatigue, unclear priorities

✅ Right: 2-4 SLOs per service
Focus: User-facing, actionable metrics

Mistake 3: Ignoring Error Budgets

❌ Wrong: Budget depleted, still shipping features
Result: SLA violations, customer churn

✅ Right: Enforce error budget policy
Pause features, focus on reliability

Mistake 4: Internal-Focused SLIs

❌ Wrong: "CPU utilization <70%"
Problem: Users don't care about CPU

✅ Right: "99.9% of requests succeed in <500ms"
Focus: User experience

Mistake 5: Unrealistic SLOs

❌ Wrong: "We'll achieve 99.99% because competitors do"
Problem: Can't deliver, sets up for failure

✅ Right: "We'll target 99.9% based on current 99.85% and business needs"
Realistic: Achievable with current resources

Conclusion

Understanding SLI, SLO, and SLA—and how they work together—is fundamental to building reliable services at scale.

Key principles:

  • SLI = What you measure: User-facing, quantitative metrics
  • SLO = Your internal goal: Realistic targets based on current performance and business needs
  • SLA = Your customer promise: Contractual commitments with financial consequences
  • Buffer = SLO - SLA: Safety margin for operational mistakes (typically 0.05-0.1%)
  • Error budget = 100% - SLO: Amount of unreliability you're willing to accept
  • Burn rate = Error rate / Allowed rate: How fast you're consuming error budget
  • 2-4 SLOs per service: Cover availability, latency, correctness, freshness
  • Multi-window alerts: Fast burn (1h) for critical, slow burn (6h) for chronic issues

Start simple: Pick 1-2 critical SLIs, measure baseline performance for 30-90 days, set conservative initial SLOs, implement error budget policy, and iterate based on data. Don't chase unrealistic SLOs (99.99%+) unless business truly requires it—the cost increases exponentially with each additional nine.

The goal isn't 100% uptime (impossible and unsustainable). The goal is the right balance between reliability and velocity—delivering value to customers while maintaining acceptable service levels.

Ready to calculate your SLOs, error budgets, and downtime costs? Try our SLA/SLO Calculator to model different availability targets, compare burn rates, and get personalized recommendations for your service level objectives.
