
Circuit Breaker

Fail fast when a dependency is sick, so one slow service doesn't drag down the whole system.

A circuit breaker protects your service from repeatedly calling a dependency that is already failing or dangerously slow. Instead of letting every request wait on the same sick payment gateway, search cluster, or shipping API, the breaker trips and fails fast. That sounds harsh, but fast failure is often what keeps the rest of the system alive.

🔭Think of it like…
The electrical breaker in a house does not fix a short circuit. It cuts power so the fault does not overheat wires and burn the house down. A software circuit breaker does the same for remote calls: it stops feeding traffic into a failure until there is evidence the dependency can handle traffic again.

The problem: cascading failure

Distributed systems fail by waiting. A dependency does not need to be fully down to hurt you; it can simply become slow. If your checkout API has 200 request threads and each payment call normally takes 100 ms, the system is healthy. If the payment provider starts taking 30 seconds, those 200 threads fill with waiting calls. Soon even requests that do not need payments cannot get a thread, and one slow dependency becomes a full checkout outage.

how slowness cascades
healthy:
  checkout thread -> payment API returns in 100 ms -> thread is free

payment API degraded:
  checkout thread -> waits 30 seconds
  more requests arrive -> more threads wait
  thread pool fills -> queue grows
  callers retry -> even more traffic
  checkout appears down even though its own code is healthy
Three effects compound the damage:
  • Resource exhaustion: threads, sockets, connection pools, and memory are held by doomed calls.
  • Retry storms: clients retry slow calls, multiplying load on both your service and the failing dependency.
  • Backpressure arrives too late: by the time queues are full, latency has already spread to unrelated features.
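The arithmetic behind the 200-thread example can be checked directly. The sketch below is an illustration rather than a capacity model: `max_throughput` and `seconds_to_saturate` are hypothetical helper names, and the arrival rate of 500 requests per second is an assumed figure that does not come from the lesson.

```python
def max_throughput(threads: int, latency_ms: float) -> float:
    """Upper bound on requests/second if each request blocks one thread for latency_ms."""
    return threads * 1000 / latency_ms


def seconds_to_saturate(threads: int, arrival_rate: float) -> float:
    """Time until every thread is busy, assuming no call finishes in the meantime."""
    return threads / arrival_rate


healthy = max_throughput(threads=200, latency_ms=100)        # 2000.0 requests/second
degraded = max_throughput(threads=200, latency_ms=30_000)    # roughly 6.7 requests/second
# Assumed load of 500 requests/second (hypothetical, for illustration only).
fill_time = seconds_to_saturate(threads=200, arrival_rate=500)
```

Under that assumed load the pool is fully occupied in well under a second, long before the first 30-second call returns.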
The core idea
Track recent failures. When the dependency crosses a threshold, stop calling it for a reset window. Return a fallback or a fast error until a small probe succeeds.

The state machine: closed, open, half-open

A breaker is a small state machine wrapped around a remote call. It is normally closed, meaning traffic flows through. After too many failures, it becomes open, meaning calls fail immediately. After a reset timeout, it becomes half-open, allowing a limited number of probe calls to test recovery.

Circuit breaker state machine
  CLOSED (calls pass through) → OPEN (fail fast): failure threshold crossed
  OPEN → HALF-OPEN (probe recovery): reset timeout elapses
  HALF-OPEN → CLOSED: probe succeeds
  HALF-OPEN → OPEN: probe fails
minimal breaker logic
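# Breaker state persists across calls; it is not reset on each request.
state = "closed"
opened_at = None

def call_through_breaker():
    global state, opened_at
    now = current_time()

    if state == "open":
        if now < opened_at + reset_timeout:
            # Still cooling down: do not touch the dependency.
            return fallback_or_fast_error()
        # Reset timeout elapsed: let this call through as a probe.
        state = "half_open"

    try:
        response = call_dependency_with_timeout()
    except TimeoutOrError:
        record_failure()
        # A failed probe re-opens immediately; otherwise consult the window.
        if state == "half_open" or failure_rate_over_window() > threshold:
            state = "open"
            opened_at = now
        return fallback_or_fast_error()

    record_success()
    if state == "half_open":
        state = "closed"
    return response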
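For readers who want something they can run, here is a minimal sketch of the same three-state logic. It makes several simplifying assumptions that are not part of the lesson: it counts consecutive failures instead of a failure rate over a window, it is not thread-safe, it lets every call through in the half-open state instead of limiting probes, and the names `CircuitBreaker` and `CircuitOpenError` are hypothetical. The clock is injectable so the state transitions can be tested without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before probing
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is cooling down")
            self.state = "half_open"  # reset timeout elapsed: allow a probe
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a successful probe) closes the breaker.
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        # A failed probe re-opens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would wrap each dependency call as `breaker.call(lambda: client.charge(order))`, where `client.charge` stands in for whatever remote call is being protected, and treat `CircuitOpenError` as the signal to use a fallback.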

Thresholds, windows, and reset timeouts

A breaker needs a memory of recent calls. Most production libraries use a sliding window: the last N calls or the last T seconds. The breaker opens only when the window holds enough calls and the failure percentage crosses a configured threshold, so a single dropped packet does not trip it.

Setting | Purpose | Typical question
Minimum calls | Avoids decisions on tiny samples | Have we seen at least 20 calls?
Failure threshold | Defines when the dependency is unhealthy | Are more than 50% failing?
Slow-call threshold | Treats dangerous latency as failure | Are calls taking over 2 seconds?
Reset timeout | Waits before probing recovery | Should we try again after 30 seconds?
Half-open permits | Limits recovery traffic | Allow 1 to 10 probes, not full traffic
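The interaction between the window, the minimum-call guard, and the failure threshold can be sketched with a count-based window. This is one possible shape, assuming a fixed-size window of the last N outcomes; `SlidingWindow` and its parameter names are hypothetical and do not mirror any particular library's configuration keys.

```python
from collections import deque


class SlidingWindow:
    def __init__(self, size=20, minimum_calls=20, failure_threshold=0.5):
        self.outcomes = deque(maxlen=size)  # True means the call failed
        self.minimum_calls = minimum_calls
        self.failure_threshold = failure_threshold

    def record(self, failed: bool) -> None:
        # The deque drops the oldest outcome once the window is full.
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_open(self) -> bool:
        # Never decide on a tiny sample, however bad it looks.
        if len(self.outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() > self.failure_threshold
```

With the defaults above, a single failure after a quiet period cannot open the breaker, because the window has not yet seen 20 calls.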

Failures include latency

Count timeouts and slow calls, not only HTTP 500 responses. A dependency that returns after 60 seconds is effectively failing if your user-facing timeout is 2 seconds. Breakers work best when every dependency call has a strict timeout.
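One way to count slow calls as failures is to time each call and classify the outcome before reporting it to the breaker. The sketch below assumes the wrapped call already enforces its own timeout; it only measures and classifies. The function name `classify_call` and the 2-second default are illustrative choices, with the default borrowed from the example in the text.

```python
import time


def classify_call(fn, slow_call_threshold=2.0, clock=time.monotonic):
    """Run fn and return (result, failed).

    failed is True when fn raises (timeouts included) or when it succeeds
    but takes longer than slow_call_threshold seconds.
    """
    started = clock()
    try:
        result = fn()
    except Exception:
        # Errors and timeouts raised by the call count as failures.
        return None, True
    elapsed = clock() - started
    return result, elapsed > slow_call_threshold
```

The `failed` flag can then feed whatever window or counter the breaker uses, so a dependency that answers correctly but too slowly still pushes the breaker toward opening.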

Tune for the user journey
A search suggestion service can trip aggressively and show fewer suggestions. A payment authorization service may need a more cautious threshold and a clear error path. The breaker policy belongs to the business impact of the call, not to a generic global default.

Fast-fail and fallback behavior

When a breaker is open, you have two choices: return a fast error or return a degraded response. The important part is that you do not keep making the failing call. A fallback should be honest, cheap, and safe.

Dependency | Possible fallback | Risk
Recommendation API | Show popular items from cache | Less personalized
Inventory estimate | Show unavailable or ask user to retry | May reduce conversions
Payment gateway | Fail checkout with a clear retry message | User cannot complete purchase
Fraud scoring | Route to manual review | More operational work
  • Fail closed: block the action when safety matters, such as payment, authorization, or fraud checks.
  • Fail open: let the action proceed in degraded form when the feature is optional, such as recommendations, avatars, or analytics. (Here "open" describes the fallback policy, not the breaker's open state.)
  • Cached fallback: serve a stale but acceptable value if users benefit more from partial data than from an error.
Fallbacks can become hidden dependencies
A fallback that calls another overloaded service can create a second cascade. Keep fallbacks local when possible: cached data, static defaults, queued work, or a clear fast error.
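A local fallback chain for the recommendation example might look like the sketch below. Everything here is hypothetical: the function name, the `POPULAR_ITEMS` default, and the plain dict used as a cache are stand-ins, and a real service would also report the failure to its breaker instead of silently swallowing it.

```python
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # hypothetical static default


def get_recommendations(user_id, fetch_personalized, breaker_open, cache):
    """Return (items, source), where source says how honest the answer is."""
    if not breaker_open:
        try:
            items = fetch_personalized(user_id)
            cache[user_id] = items  # remember the last good answer
            return items, "live"
        except Exception:
            # Sketch only: a real caller would record this failure on the breaker.
            pass
    # Fallbacks stay local: no second remote call that could cascade.
    if user_id in cache:
        return cache[user_id], "stale_cache"
    return POPULAR_ITEMS, "static_default"
```

Returning the source alongside the data keeps the fallback honest: the caller can label stale results or log how often the degraded path is served.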

Pairing breakers with timeouts, retries, and bulkheads

Circuit breakers are one piece of a resilience toolkit. They should almost never be the only protection around a remote call. Use timeouts to bound waiting, retries to handle small transient failures, backoff and jitter to avoid synchronized retry storms, and bulkheads to isolate resources per dependency.

resilience wrapper order
user request
  → bulkhead: acquire a small dependency-specific slot
  → retry: only safe errors, exponential backoff + jitter, stop when the breaker opens
  → circuit breaker: is this dependency currently allowed?
  → timeout: cap each attempt
  → dependency call
  → fallback or response

Pattern | Protects against | How it pairs with breaker
Timeout | Unbounded waiting | Turns slow calls into counted failures
Retry with backoff+jitter | Brief network blips | Must stop retrying when breaker opens
Bulkhead | One dependency consuming all resources | Limits damage before breaker trips
Rate limit | Too much incoming or outgoing traffic | Reduces pressure on a recovering service
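The retry row is the one most often implemented incorrectly, so here is a hedged sketch of a retry loop with exponential backoff and full jitter that checks the breaker before every attempt. The names `call_with_retry`, `breaker_allows`, and `CircuitOpenError` are hypothetical, treating only `ConnectionError` as a safe, retryable error is an assumption made for illustration, and `max_attempts` is assumed to be at least 1.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when retries stop because the breaker no longer allows calls."""


def call_with_retry(fn, breaker_allows, max_attempts=3, base=0.1, cap=5.0,
                    sleep=time.sleep, rng=random.random):
    last_error = None
    for attempt in range(max_attempts):
        # Re-check the breaker before every attempt, not only the first.
        if not breaker_allows():
            raise CircuitOpenError("breaker is open; giving up") from last_error
        try:
            return fn()
        except ConnectionError as err:  # assumed to be the only retryable error
            last_error = err
            if attempt < max_attempts - 1:
                # Full jitter: a random delay up to the capped exponential bound.
                sleep(rng() * min(cap, base * 2 ** attempt))
    raise last_error
```

Because the delay is drawn uniformly from zero up to the exponential bound, clients that failed at the same moment spread their retries out instead of hitting the recovering dependency in lockstep.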

In complex workflows, a breaker may cause one saga step to fail and trigger compensation. That is better than letting the workflow hang forever. See the Saga Pattern lesson for how long-running business flows recover from partial failure.

Libraries, examples, and gotchas

Most teams should use a battle-tested library instead of writing a breaker from scratch. Java services commonly use Resilience4j. Older Netflix systems used Hystrix, which popularized circuit breakers and bulkheads but is now in maintenance mode. .NET teams often use Polly, and many service meshes provide outlier detection at the proxy layer.

  • Per-dependency breakers: do not share one breaker across unrelated APIs. The shipping provider should not trip the payment provider.
  • Per-operation breakers: a cheap read endpoint and an expensive write endpoint may need different thresholds.
  • Observability: emit breaker state, open duration, failure rate, slow-call rate, fallback count, and half-open probe outcomes.
  • Cold starts: avoid opening on the first failure after a quiet period; require a minimum number of calls in the window.
Real-world mental model
A breaker is not an error handler. It is an adaptive traffic controller for one dependency. Its job is to preserve capacity for work that can still succeed while the failing dependency recovers.
Key takeaways
  • Circuit breakers stop cascading failure by failing fast when a dependency is repeatedly failing or too slow.
  • The state machine is closed (normal calls), open (fast fail), and half-open (limited recovery probes).
  • Configure failure thresholds, minimum call counts, slow-call detection, reset timeout, and half-open probe limits.
  • Breakers work best with timeouts, retries using backoff and jitter, bulkheads, rate limits, and honest fallbacks.
  • Use a mature library such as Resilience4j or Polly (or Hystrix in legacy systems), and monitor breaker state as a first-class signal.
Slow calls hold threads, sockets, and connection-pool entries while they wait. As traffic continues, those resources fill up, queues grow, and unrelated requests cannot be served. A breaker turns repeated slowness into fast failure so the service keeps capacity for other work.
After the reset timeout, the breaker allows a small number of probe calls through. If probes succeed, the breaker closes and normal traffic resumes. If a probe fails or is too slow, the breaker re-opens and waits for another reset window.
Backoff spaces retries out, and jitter randomizes them so thousands of clients do not retry at the same instant. Without those controls, retries can amplify load on the failing dependency and make the breaker open more often.