Circuit Breaker
Fail fast when a dependency is sick, so one slow service doesn't drag down the whole system.
A circuit breaker protects your service from repeatedly calling a dependency that is already failing or dangerously slow. Instead of letting every request wait on the same sick payment gateway, search cluster, or shipping API, the breaker trips and fails fast. That sounds harsh, but fast failure is often what keeps the rest of the system alive.
The problem: cascading failure
Distributed systems fail by waiting. A dependency does not need to be fully down to hurt you; it only needs to become slow. Suppose your checkout API has 200 request threads and each payment call normally takes 100 ms: threads are released almost as fast as they are taken, and the system is healthy. If the payment provider starts taking 30 seconds, those 200 threads fill with waiting calls. Soon even requests that do not need payments cannot get a thread, and one slow dependency becomes a full checkout outage.
healthy:
    checkout thread -> payment API returns in 100 ms -> thread is free
payment API degraded:
    checkout thread -> waits 30 seconds
    more requests arrive -> more threads wait
    thread pool fills -> queue grows
    callers retry -> even more traffic
    checkout appears down even though its own code is healthy

- Resource exhaustion: threads, sockets, connection pools, and memory are held by doomed calls.
- Retry storms: clients retry slow calls, multiplying load on both your service and the failing dependency.
- Backpressure arrives too late: by the time queues are full, latency has already spread to unrelated features.
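The thread-pool arithmetic in the example can be made concrete with Little's law (concurrency = throughput × latency). The sketch below reuses the 200 threads, 100 ms, and 30 s figures from the text; the function name `max_throughput` is illustrative, and the model assumes each request holds exactly one thread for the duration of the payment call.

```python
def max_throughput(threads: int, latency_seconds: float) -> float:
    """Upper bound on requests per second when every request holds one
    thread for latency_seconds (Little's law: concurrency = rate * latency)."""
    return threads / latency_seconds


healthy = max_throughput(200, 0.1)    # payment call takes 100 ms
degraded = max_throughput(200, 30.0)  # payment call takes 30 s

print(f"healthy:  about {healthy:.0f} requests/second")
print(f"degraded: about {degraded:.1f} requests/second")
```

Under these assumptions the same pool goes from roughly 2,000 requests per second to fewer than 7, which is why a merely slow dependency can look like a total outage.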
The state machine: closed, open, half-open
A breaker is a small state machine wrapped around a remote call. It is normally closed, meaning traffic flows through. After too many failures, it becomes open, meaning calls fail immediately. After a reset timeout, it becomes half-open, allowing a limited number of probe calls to test recovery.
state = "closed"        # breaker state persists between calls
opened_at = None

def call_through_breaker():
    global state, opened_at
    now = current_time()
    if state == "open":
        if now < opened_at + reset_timeout:
            return fallback_or_fast_error()
        state = "half_open"              # reset timeout elapsed: allow a probe
    try:
        response = call_dependency_with_timeout()
    except TimeoutOrError:
        record_failure()
        # a failed probe reopens at once; otherwise consult the window
        if state == "half_open" or failure_rate_over_window() > threshold:
            state = "open"
            opened_at = now
        return fallback_or_fast_error()
    record_success()
    if state == "half_open":
        state = "closed"
    return response

Thresholds, windows, and reset timeouts
A breaker needs a memory of recent calls. Most production libraries use a sliding window: the last N calls or the last T seconds. The breaker opens when enough calls have occurred and the failure percentage crosses a configured threshold. Requiring a minimum number of calls keeps the breaker from tripping on a single dropped packet.
| Setting | Purpose | Typical question |
|---|---|---|
| Minimum calls | Avoids decisions on tiny samples | Have we seen at least 20 calls? |
| Failure threshold | Defines when the dependency is unhealthy | Are more than 50% failing? |
| Slow-call threshold | Treats dangerous latency as failure | Are calls taking over 2 seconds? |
| Reset timeout | Waits before probing recovery | Should we try again after 30 seconds? |
| Half-open permits | Limits recovery traffic | Allow 1 to 10 probes, not full traffic |
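One way the settings in the table could fit together is sketched below. It is a minimal, single-threaded illustration and not how any particular library implements it: the class, its parameter names, and the count-based window are assumptions made for this example, and the clock is injectable only so the behavior can be tested without sleeping. Slow calls are classified after the fact; the sketch does not interrupt them.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, window_size=50, min_calls=20, failure_threshold=0.5,
                 slow_call_seconds=2.0, reset_timeout=30.0,
                 half_open_permits=3, clock=time.monotonic):
        self.outcomes = deque(maxlen=window_size)  # True = failed or slow
        self.min_calls = min_calls
        self.failure_threshold = failure_threshold
        self.slow_call_seconds = slow_call_seconds
        self.reset_timeout = reset_timeout
        self.half_open_permits = half_open_permits
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.probes_left = 0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency calls are suspended")
            self.state = "half_open"
            self.probes_left = self.half_open_permits
        if self.state == "half_open":
            # Guard for overlapping calls; a real breaker also needs a lock.
            if self.probes_left == 0:
                raise CircuitOpenError("all probe permits are in use")
            self.probes_left -= 1
        started = self.clock()
        try:
            result = fn()
        except Exception:
            self._record(failed=True)
            raise
        # A call that succeeds but takes too long still counts as a failure.
        self._record(failed=self.clock() - started > self.slow_call_seconds)
        return result

    def _record(self, failed):
        if self.state == "half_open":
            if failed:
                self._open()                 # one bad probe reopens the breaker
            elif self.probes_left == 0:      # every permitted probe succeeded
                self.state = "closed"
                self.outcomes.clear()
            return
        self.outcomes.append(failed)
        enough_calls = len(self.outcomes) >= self.min_calls
        failure_rate = sum(self.outcomes) / len(self.outcomes)
        if enough_calls and failure_rate > self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
```

A caller would wrap each remote call, for example `breaker.call(lambda: client.charge(order))` with a hypothetical `client`, and treat `CircuitOpenError` as the signal to fail fast or fall back.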
Failures include latency
Count timeouts and slow calls, not only HTTP 500 responses. A dependency that returns after 60 seconds is effectively failing if your user-facing timeout is 2 seconds. Breakers work best when every dependency call has a strict timeout.
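As a rough illustration of counting timeouts alongside errors, the sketch below bounds a call with the standard library's `concurrent.futures` and reduces every outcome to a label the breaker can count. The helper names, the shared pool, and the three labels are assumptions for this example. Note that the abandoned call keeps running on its worker thread, so real clients should also set connection-level and read-level timeouts on the underlying request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool for this sketch; size it per dependency in practice.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_seconds=2.0):
    """Run fn and return (label, value), where label is "ok", "timeout",
    or "error". The caller stops waiting after timeout_seconds."""
    future = _pool.submit(fn)
    try:
        return ("ok", future.result(timeout=timeout_seconds))
    except FutureTimeout:
        return ("timeout", None)   # too slow for the user: count as failure
    except Exception as exc:
        return ("error", exc)      # e.g. an HTTP 500 surfaced as an exception


def is_failure(outcome):
    """Both timeouts and errors should feed the breaker's failure count."""
    return outcome[0] != "ok"
```

For instance, `is_failure(call_with_timeout(lambda: time.sleep(5), 2.0))` is true even though the call raised no error.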
Fast-fail and fallback behavior
When a breaker is open, you have two choices: return a fast error or return a degraded response. The important part is that you do not keep making the failing call. A fallback should be honest, cheap, and safe.
| Dependency | Possible fallback | Risk |
|---|---|---|
| Recommendation API | Show popular items from cache | Less personalized |
| Inventory estimate | Show unavailable or ask user to retry | May reduce conversions |
| Payment gateway | Fail checkout with a clear retry message | User cannot complete purchase |
| Fraud scoring | Route to manual review | More operational work |
- Fail closed: block the action when safety matters, such as payment, authorization, or fraud checks.
- Fail open: degrade gracefully when the feature is optional, such as recommendations, avatars, or analytics.
- Cached fallback: serve a stale but acceptable value if users benefit more from partial data than from an error.
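The three policies above could be expressed as small wrappers around the protected call. This is a sketch under assumptions: `DependencyUnavailable` is a hypothetical exception standing in for whatever your breaker raises when it is open or the call fails, and the plain `dict` stands in for a real cache with expiry.

```python
class DependencyUnavailable(Exception):
    """Hypothetical error: the breaker is open or the call failed."""


def fail_open(call, default):
    """Optional feature: swallow the failure and return a cheap default."""
    try:
        return call()
    except DependencyUnavailable:
        return default


def fail_closed(call):
    """Safety-critical step: let the error propagate so the action is blocked."""
    return call()


def cached_fallback(call, cache, key):
    """Serve fresh data when possible, otherwise the last known value."""
    try:
        cache[key] = call()
    except DependencyUnavailable:
        if key not in cache:
            raise  # nothing stale to serve, so fail honestly
    return cache[key]
```

Recommendations might use `fail_open(fetch_recommendations, popular_items)`, while a payment step would use `fail_closed` and surface a clear retry message to the user; both function names in that example are placeholders.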
Pairing breakers with timeouts, retries, and bulkheads
Circuit breakers are one piece of a resilience toolkit. They should almost never be the only protection around a remote call. Use timeouts to bound waiting, retries to handle small transient failures, backoff and jitter to avoid synchronized retry storms, and bulkheads to isolate resources per dependency.
user request
  → bulkhead: acquire a small dependency-specific slot
  → circuit breaker: is this dependency currently allowed?
  → timeout: cap each attempt
  → retry: only safe errors, exponential backoff + jitter
  → dependency call
  → fallback or response

| Pattern | Protects against | How it pairs with breaker |
|---|---|---|
| Timeout | Unbounded waiting | Turns slow calls into counted failures |
| Retry with backoff+jitter | Brief network blips | Must stop retrying when breaker opens |
| Bulkhead | One dependency consuming all resources | Limits damage before breaker trips |
| Rate limit | Too much incoming or outgoing traffic | Reduces pressure on a recovering service |
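The layering described in this section might look roughly like the sketch below. Everything here is an assumption made for illustration: the exception types, the `breaker_allows` callable standing in for a real breaker, and the expectation that `call` enforces its own per-attempt timeout. The bulkhead is a plain `threading.BoundedSemaphore`, and the delays use full-jitter exponential backoff.

```python
import random
import threading
import time


class BulkheadFull(Exception):
    """Hypothetical error: no free slot for this dependency."""


class BreakerOpen(Exception):
    """Hypothetical error: the breaker currently rejects calls."""


class TransientError(Exception):
    """Hypothetical error that is safe to retry, such as a connection reset."""


def backoff_delays(attempts, base=0.1, cap=2.0, rng=random.random):
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


def call_with_protection(call, bulkhead, breaker_allows,
                         retries=3, sleep=time.sleep):
    # Bulkhead first: shed load immediately instead of queueing.
    if not bulkhead.acquire(blocking=False):
        raise BulkheadFull("no free slot for this dependency")
    try:
        delays = backoff_delays(retries)
        for attempt in range(retries + 1):
            # Re-check the breaker before every attempt so retries stop
            # as soon as it opens.
            if not breaker_allows():
                raise BreakerOpen("dependency calls are suspended")
            try:
                return call()  # assumed to apply its own per-attempt timeout
            except TransientError:
                if attempt == retries:
                    raise
                sleep(delays[attempt])
    finally:
        bulkhead.release()
```

Only `TransientError` is retried; any other exception, including `BreakerOpen`, propagates at once so that retries never add load to a dependency that is already struggling.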
In complex workflows, a breaker may cause one saga step to fail and trigger compensation. That is better than letting the workflow hang forever. See the Saga Pattern lesson for how long-running business flows recover from partial failure.
Libraries, examples, and gotchas
Most teams should use a battle-tested library instead of writing a breaker from scratch. Java services commonly use Resilience4j. Older Netflix systems used Hystrix, which popularized circuit breakers and bulkheads but is now in maintenance mode. .NET teams often use Polly, and many service meshes provide outlier detection at the proxy layer.
- Per-dependency breakers: do not share one breaker across unrelated APIs. The shipping provider should not trip the payment provider.
- Per-operation breakers: a cheap read endpoint and an expensive write endpoint may need different thresholds.
- Observability: emit breaker state, open duration, failure rate, slow-call rate, fallback count, and half-open probe outcomes.
- Cold starts: avoid opening on the first failure after a quiet period; require a minimum number of calls in the window.
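The first three points above can be sketched as a registry that keeps one breaker record per (dependency, operation) pair and counts events for export. The class and field names are hypothetical, and a real service would publish these counters through its metrics library rather than read them from a dictionary.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BreakerStats:
    """Hypothetical per-breaker record: current state plus event counts."""
    state: str = "closed"
    events: Counter = field(default_factory=Counter)


class BreakerRegistry:
    def __init__(self):
        self._breakers = {}  # keyed by (dependency, operation)

    def get(self, dependency, operation):
        """Return the record for this pair, creating it on first use."""
        key = (dependency, operation)
        return self._breakers.setdefault(key, BreakerStats())

    def record(self, dependency, operation, event):
        """Count an event such as "failure", "slow_call", or "fallback"."""
        self.get(dependency, operation).events[event] += 1

    def snapshot(self):
        """Flat view suitable for logging or a metrics exporter."""
        return {
            f"{dep}.{op}": (stats.state, dict(stats.events))
            for (dep, op), stats in sorted(self._breakers.items())
        }
```

Because the key includes the operation, a failing `("payments", "charge")` call never affects `("shipping", "quote")` or even `("payments", "lookup")`.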
Key takeaways
- Circuit breakers stop cascading failure by failing fast when a dependency is repeatedly failing or too slow.
- The state machine is closed (normal calls), open (fast fail), and half-open (limited recovery probes).
- Configure failure thresholds, minimum call windows, slow-call detection, reset timeout, and half-open probe limits.
- Breakers work best with timeouts, retries using backoff and jitter, bulkheads, rate limits, and honest fallbacks.
- Use a mature library such as Resilience4j or Polly (or Hystrix in legacy systems), and monitor breaker state as a first-class signal.