Availability, Reliability & SLAs
What "five nines" really means, and the promises behind SLA / SLO / SLI.
Availability is the percentage of time a system can serve successful requests. Reliability is whether it behaves correctly over time. SLAs, SLOs, and SLIs turn those ideas into measurable promises so teams know when the system is healthy, when to slow feature work, and when customers are owed remedies.
The problem: downtime is a product feature
Availability sounds like an operations detail, but it is part of the product contract. A payment API that is down blocks revenue. A chat app that is down loses trust. A background photo-enhancement feature might tolerate hours of downtime with little customer impact. The design must match the business consequence.
availability = successful time / total time
If a service is down for 45 minutes in a 30-day month:
total minutes = 30 × 24 × 60 = 43,200
available = 43,200 - 45 = 43,155
availability = 43,155 / 43,200
≈ 99.896%
Ignoring availability creates a familiar failure mode: one database, one region, one queue, one deploy pipeline, or one human approval becomes a single point of failure. The system works beautifully until that one dependency fails.
The nines: what availability targets allow
Availability targets are often described as "nines." Each additional nine divides the allowed downtime by ten.
| Availability | Common name | Max downtime / year | Max downtime / 30-day month |
|---|---|---|---|
| 99% | two nines | ~3.65 days | ~7.2 hours |
| 99.9% | three nines | ~8.76 hours | ~43.2 minutes |
| 99.99% | four nines | ~52.6 minutes | ~4.32 minutes |
| 99.999% | five nines | ~5.26 minutes | ~25.9 seconds |
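The figures in the table follow from multiplying the window length by the fraction of time the target leaves uncovered. A minimal Python sketch, assuming a 365-day year and a 30-day month as in the table; the function name `allowed_downtime_seconds` is illustrative and not taken from any library:

```python
def allowed_downtime_seconds(availability_pct: float, window_days: float) -> float:
    """Return the maximum downtime, in seconds, that a target permits in a window."""
    window_seconds = window_days * 24 * 60 * 60
    # The unavailable fraction is whatever the target leaves over.
    return window_seconds * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    per_year_min = allowed_downtime_seconds(target, 365) / 60
    per_month_min = allowed_downtime_seconds(target, 30) / 60
    print(f"{target}%: {per_year_min:.2f} min/year, {per_month_min:.2f} min/month")
```

Changing `window_days` to 7 or 90 gives the budget for a weekly or quarterly window, which is useful when an objective is not defined per month.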
What counts as down?
This is a product and contract question, not just a monitoring question. A service might be considered unavailable when requests return 5xx, when p99 latency exceeds a threshold, when writes fail but reads work, or when a regional subset of customers cannot access it.
good_request =
HTTP status is not 5xx
AND latency < 300 ms
AND response is not a known bad fallback
availability_sli =
count(good_request) / count(all_eligible_requests)
SLA vs SLO vs SLI
These three terms form a stack. Start with what you measure, then set an internal target, then make a customer-facing promise.
| Term | Meaning | Audience | Example |
|---|---|---|---|
| SLI | Service Level Indicator: the actual metric | Engineers and dashboards | 99.94% of API requests were successful this week |
| SLO | Service Level Objective: the internal target | Product and engineering | Keep weekly success rate ≥ 99.9% |
| SLA | Service Level Agreement: the external promise | Customers and legal contract | If monthly uptime < 99.5%, customer receives service credits |
Real companies use this stack heavily. Google SRE popularized SLOs and error budgets. Cloud providers such as AWS, Azure, and Google Cloud publish SLAs for services like object storage, virtual machines, and managed databases, often with service credits when promises are missed.
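To make the stack concrete, the sketch below computes an availability SLI from request records and compares it with the example SLO and SLA thresholds from the table. The `Request` record, its field names, and the sample data are hypothetical. The 300 ms bound comes from the good-request definition earlier. Treating an empty window as fully available is an assumption, and the sketch ignores that the table's SLO and SLA examples use different time windows.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int             # HTTP status code
    latency_ms: float       # observed latency in milliseconds
    is_bad_fallback: bool   # True if a known bad fallback response was served


def is_good(req: Request, latency_limit_ms: float = 300.0) -> bool:
    """A request is good if it is not a 5xx, is fast enough, and is not a bad fallback."""
    return (
        not (500 <= req.status <= 599)
        and req.latency_ms < latency_limit_ms
        and not req.is_bad_fallback
    )


def availability_sli(requests: list[Request]) -> float:
    """Fraction of eligible requests that were good (assumed 1.0 when there is no traffic)."""
    if not requests:
        return 1.0
    return sum(is_good(r) for r in requests) / len(requests)


SLO_TARGET = 0.999   # internal objective from the table
SLA_TARGET = 0.995   # external promise from the table

# Hypothetical sample: 9,992 good requests, 5 server errors, 3 slow responses.
sample = (
    [Request(200, 120.0, False)] * 9992
    + [Request(503, 80.0, False)] * 5
    + [Request(200, 450.0, False)] * 3
)

sli = availability_sli(sample)
print(f"SLI = {sli:.4%}")
print("SLO met" if sli >= SLO_TARGET else "SLO missed")
print("SLA met" if sli >= SLA_TARGET else "SLA breached: credits may be owed")
```

Keeping the SLO stricter than the SLA leaves a margin: the team sees an internal miss before the customer-facing promise is broken.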
Error budgets: spending unreliability on purpose
An error budget is the amount of failure your SLO permits during a window. If your SLO is 99.9% monthly availability, your allowed error budget is 0.1% of eligible requests or time. This converts reliability from a vague goal into a resource you can spend.
monthly SLO: 99.9% successful requests
monthly traffic: 200,000,000 requests
allowed failures = 0.1% × 200,000,000
= 200,000 failed or too-slow requests
If a bad deploy causes 150,000 failed requests,
the team has used 75% of the monthly error budget.
- If the team is comfortably within budget, it can ship features and take controlled risks.
- If the budget is nearly exhausted, reliability work should outrank risky launches until the service is healthy again.
- Error budgets create a shared language between product teams that want speed and operations teams that want stability.
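The budget arithmetic above can be sketched in a few lines of Python. The function names and the 90% freeze threshold are hypothetical illustrations of a policy, not a standard; the sketch also assumes the SLO is below 100%, since a 100% target leaves no budget to divide by.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of bad requests the SLO allows in the window."""
    return (1 - slo) * total_requests


def budget_consumed(bad_requests: int, slo: float, total_requests: int) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return bad_requests / error_budget(slo, total_requests)


budget = error_budget(0.999, 200_000_000)
spent = budget_consumed(150_000, 0.999, 200_000_000)
print(f"budget: {budget:,.0f} requests, spent: {spent:.0%}")

# Hypothetical policy: pause risky launches once 90% of the budget is spent.
FREEZE_THRESHOLD = 0.90
print("freeze risky launches" if spent >= FREEZE_THRESHOLD else "ok to ship")
```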
Availability math: series, parallel, MTBF, and MTTR
Availability of a full request path depends on how components combine. Components in series all have to work. Components in parallel or redundant groups can survive some failures.
request path:
client → load balancer → app service → database
If each required component is 99.9% available:
overall = 0.999 × 0.999 × 0.999
≈ 0.997
= 99.7%
Adding required dependencies lowers total availability.
two independent replicas, either one can serve:
replica availability = 99%
replica failure = 1% = 0.01
both fail = 0.01 × 0.01 = 0.0001
group availability = 1 - 0.0001
= 99.99%
| Shape | Formula intuition | Design implication |
|---|---|---|
| Series | A and B and C must all work | Every mandatory dependency reduces end-to-end availability |
| Parallel / redundant | A or B can work | Independent replicas, zones, or failover paths can raise availability |
| Shared dependency | Replicas depend on the same failing thing | Redundancy may be fake if all copies share one database, network, or region |
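The three shapes in the table can be expressed as small functions. This is a sketch under the independence assumption stated in the text; the function names are illustrative, and the shared-dependency case is modeled simply as the redundant group placed in series with the one thing all replicas rely on.

```python
from math import prod


def series(availabilities: list[float]) -> float:
    """All components must work: multiply their availabilities."""
    return prod(availabilities)


def parallel(availabilities: list[float]) -> float:
    """Any one component suffices: 1 minus the chance that all fail together.

    Assumes failures are independent.
    """
    return 1 - prod(1 - a for a in availabilities)


def parallel_with_shared_dependency(replicas: list[float], shared: float) -> float:
    """Redundant replicas that all sit behind one shared dependency."""
    return series([shared, parallel(replicas)])


print(f"series of three 99.9% components: {series([0.999] * 3):.4%}")
print(f"two independent 99% replicas:     {parallel([0.99, 0.99]):.4%}")
print(
    "same replicas, shared 99.9% database: "
    f"{parallel_with_shared_dependency([0.99, 0.99], 0.999):.4%}"
)
```

In the last case the group cannot exceed the availability of the shared database, however many replicas are added in front of it.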
Another common lens is MTBF and MTTR. Mean Time Between Failures estimates how often failures occur. Mean Time To Repair estimates how long recovery takes. Availability improves when failures happen less often or recovery gets faster.
availability ≈ MTBF / (MTBF + MTTR)
System A:
fails every 1,000 hours, recovers in 1 hour
availability ≈ 1000 / 1001 = 99.90%
System B:
fails every 1,000 hours, recovers in 5 minutes
availability ≈ 1000 / 1000.083 = 99.9917%
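The same comparison as a small Python sketch, using the approximation given above; the function name is illustrative:

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability approximated as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


system_a = availability_from_mtbf(1000, 1)        # recovers in 1 hour
system_b = availability_from_mtbf(1000, 5 / 60)   # recovers in 5 minutes
print(f"System A: {system_a:.4%}")
print(f"System B: {system_b:.4%}")

# Ratio of unavailable time: how much less downtime B has than A.
print(f"A has {(1 - system_a) / (1 - system_b):.1f}x the downtime of B")
```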
Fast detection and automated recovery can be as valuable as preventing failures.
Edge cases and gotchas
Availability engineering is full of traps because the math assumes independence and clear definitions. Real systems are messier.
- Correlated failures: two replicas in the same rack, region, or software deploy may fail together, so the parallel formula overstates availability.
- Brownouts: the system is not fully down, but it is so slow or degraded that users cannot complete tasks.
- Maintenance windows: contracts must specify whether planned maintenance counts against the SLA.
- Partial availability: reads may work while writes fail, one region may be down, or only large customers may be affected.
- Failover bugs: a passive standby is only useful if it is tested regularly. Untested disaster recovery is hope, not design.
Key takeaways
- Availability measures successful service over time; reliability means the system behaves correctly when it responds.
- The nines translate directly into downtime budgets, and each extra nine costs significant complexity.
- SLI is the metric, SLO is the internal target, and SLA is the external promise with consequences.
- Error budgets turn reliability into a spendable resource that balances feature velocity and operational risk.
- Series dependencies multiply downward; independent parallel redundancy can improve availability, but shared failures break the math.