Why Resilience Matters
Why the Circuit Breaker Matters
The Problem: In a monolith, a slow dependency hurts one process. In microservices, a slow dependency can cascade — Service A’s threads queue waiting on Service B, then Service C’s threads queue waiting on A, until the entire mesh is wedged.
The Solution: A circuit breaker watches calls to a dependency. After enough failures it fails fast instead of waiting, so callers free their threads, recover, and (importantly) your monitoring still works.
Real Impact: Netflix runs thousands of services. Circuit breakers are a large part of the reason a bad recommendation service doesn’t take down playback.
Real-World Analogy
Think of an electrical panel in your house:
- Closed breaker = electricity flows normally to the outlet
- Short circuit = the breaker trips OPEN to stop current and prevent fire
- You reset it after fixing the problem — it goes back to closed
- Without a breaker, a short anywhere in the house can burn the whole place down
A software circuit breaker does the same thing for service calls — it stops the flow when the downstream is on fire so your service doesn’t catch fire too.
Failures in distributed systems are not exceptional events. They are the steady state. Network partitions, GC pauses, deploy rollouts, full disks, exhausted thread pools — every day, somewhere in your stack, something is degraded. The job of resilience patterns is to make sure those failures stay localized instead of becoming systemic.
The Cascading Failure Problem
Cascades follow a predictable shape:
1. A downstream dependency slows from 50 ms to 5 s, for any reason: GC, load spike, sick host, dependent-of-dependent failure.
2. Callers don’t notice. They keep sending requests, each one now occupying a thread for 5 s instead of 50 ms.
3. The caller’s thread pool fills up. New requests pile up in the queue.
4. The caller starts rejecting requests, returning 503 to its callers.
5. Those callers retry, adding more load to a system that’s already underwater.
6. The blast radius grows with every hop. Within minutes, the whole mesh can be down.
The dangerous step is #2: callers don’t notice. By the time a human sees the alert, the damage has already spread across the system. A circuit breaker breaks step #2 by making the caller notice and fail fast.
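To put rough numbers on the second step, Little's law says the average number of busy threads is about request rate times latency. The sketch below uses made-up values for the request rate and pool size (both are assumptions, not measurements) to show how the same traffic that needs a handful of threads at 50 ms needs far more than the pool holds at 5 s.

```python
def threads_in_use(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate * time spent per call."""
    return requests_per_second * latency_seconds


RATE = 100        # requests/second sent to the dependency (assumed)
POOL_SIZE = 200   # size of the caller's thread pool (assumed)

healthy = threads_in_use(RATE, 0.050)   # 50 ms per call
degraded = threads_in_use(RATE, 5.0)    # 5 s per call

print(f"healthy:  about {healthy:.0f} threads busy")
print(f"degraded: about {degraded:.0f} threads needed, pool has {POOL_SIZE}")
print("pool exhausted" if degraded > POOL_SIZE else "pool ok")
```

The traffic did not change at all; only the latency did, and that alone is enough to exhaust the pool.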
Failure Modes to Defend Against
| Failure Mode | What It Looks Like | Pattern That Helps |
|---|---|---|
| Slow dependency | P99 latency climbs from 50 ms to 5 s | Timeout + circuit breaker |
| Failed dependency | Errors above some threshold | Circuit breaker + fallback |
| Transient flake | 1 in 100 calls fails for no reason | Retry with backoff + jitter |
| Resource exhaustion | One bad caller eats all DB connections | Bulkhead |
| Retry storm | Everyone retries at once when service recovers | Jittered backoff + rate limiting |
| Thundering herd | Cache expires, traffic floods origin | Request coalescing + jitter |
The Circuit Breaker Pattern
Why a State Machine
The Problem: “Just stop calling the broken service” sounds simple, but you also need to know when it’s healthy again without DDoSing it during recovery.
The Solution: A three-state machine — Closed, Open, Half-Open — that lets traffic through normally, blocks it during failure, and carefully probes during recovery.
Every circuit breaker library — Resilience4j, Hystrix, Polly, gobreaker — implements the same three states:
State definitions
- Closed: The default. Calls pass through and the breaker counts successes and failures over a sliding window. If the failure rate exceeds the threshold, it trips OPEN.
- Open: Calls do not reach the dependency. They fail immediately (often with a fallback). The breaker stays open for a cooldown period — long enough for the dependency to recover, short enough that you’re not blind to recovery.
- Half-Open: After cooldown, a small number of probe calls are allowed through. If they succeed, the breaker returns to Closed. If they fail, it goes straight back to Open and resets the cooldown.
What to count, and over what window
Naive implementations count consecutive failures. Production implementations count failure rate over a sliding window. The difference matters: 5 failures in a row out of 5 calls is unambiguous; 5 failures out of 10,000 calls is noise.
- Sliding window size: “the last N calls” or “the last N seconds”. 10–100 calls is typical.
- Minimum number of calls: Don’t trip on a sample of 2. Wait until you have at least 5–10 data points.
- Failure rate threshold: 50% is a common default. Lower for critical paths, higher for best-effort calls.
- Slow-call detection: A call that times out is a failure. A call that takes 5x longer than P99 is also a failure even if it eventually succeeds.
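The four knobs above can be sketched as a single decision function. This is a minimal illustration, not any library's algorithm: `CallOutcome` and `should_trip` are hypothetical names, and folding slow calls and failures into one rate is a simplification (Resilience4j, for example, tracks the two rates separately).

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CallOutcome:
    ok: bool            # did the call succeed?
    duration_s: float   # how long it took, in seconds


def should_trip(
    window: Iterable[CallOutcome],
    minimum_calls: int = 10,
    failure_rate_threshold: float = 0.5,
    slow_call_threshold_s: float = 2.0,
) -> bool:
    """Trip when enough of the recent calls failed or were too slow."""
    calls = list(window)
    if len(calls) < minimum_calls:
        return False  # too few samples to judge
    bad = sum(1 for c in calls if not c.ok or c.duration_s >= slow_call_threshold_s)
    return bad / len(calls) >= failure_rate_threshold


window = deque(maxlen=20)  # "the last N calls"
for _ in range(5):
    window.append(CallOutcome(ok=False, duration_s=0.1))
print(should_trip(window))  # False: only 5 samples, below minimum_calls

for _ in range(5):
    window.append(CallOutcome(ok=True, duration_s=3.0))  # slow successes
print(should_trip(window))  # True: all 10 calls were failed or slow
```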
Building a Circuit Breaker
Why Implement, Not Just Use a Library
The Problem: Libraries give you the right defaults, but you’ll only trust the breaker in production if you understand the state machine well enough to debug it at 2 AM.
The Solution: Build a from-scratch breaker once to internalize the model. Then use a library — Resilience4j, gobreaker, Polly — for everything real.
From scratch in Python
import time
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpen(Exception):
    pass


@dataclass
class CircuitBreaker:
    failure_rate_threshold: float = 0.5   # 50% failure rate
    minimum_calls: int = 10               # before evaluating
    window_size: int = 20                 # sliding window
    cooldown_seconds: float = 30.0        # OPEN -> HALF_OPEN delay
    half_open_max_calls: int = 3          # probes before deciding
    state: State = State.CLOSED
    opened_at: float = 0.0
    half_open_calls: int = 0
    history: deque = field(init=False)

    def __post_init__(self) -> None:
        # Size the window from window_size rather than a hardcoded 20.
        self.history = deque(maxlen=self.window_size)

    def call(self, fn: Callable[[], T]) -> T:
        self._maybe_transition_to_half_open()
        if self.state == State.OPEN:
            raise CircuitBreakerOpen("breaker is open")
        if self.state == State.HALF_OPEN and self.half_open_calls >= self.half_open_max_calls:
            raise CircuitBreakerOpen("half-open quota exhausted")
        try:
            result = fn()
            self._record(True)
            return result
        except Exception:
            self._record(False)
            raise

    def _record(self, success: bool) -> None:
        self.history.append(success)
        if self.state == State.HALF_OPEN:
            self.half_open_calls += 1
            if not success:
                self._open()      # any failed probe reopens the breaker
            elif self.half_open_calls >= self.half_open_max_calls:
                self._close()     # all probes succeeded
        elif self.state == State.CLOSED and self._should_open():
            self._open()

    def _should_open(self) -> bool:
        if len(self.history) < self.minimum_calls:
            return False
        failures = sum(1 for ok in self.history if not ok)
        return failures / len(self.history) >= self.failure_rate_threshold

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = time.monotonic()
        self.half_open_calls = 0

    def _close(self) -> None:
        self.state = State.CLOSED
        self.history.clear()
        self.half_open_calls = 0

    def _maybe_transition_to_half_open(self) -> None:
        if self.state == State.OPEN and time.monotonic() - self.opened_at >= self.cooldown_seconds:
            self.state = State.HALF_OPEN
            self.half_open_calls = 0
About 60 lines of straightforward Python and you have most of what a real breaker does. Notice what we deliberately did not do: this implementation is not thread-safe. In production, every state transition needs a lock or atomic operations, because callers will share the same breaker across threads.
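One possible shape for the locking is sketched below: hold a lock only for the bookkeeping, and run the protected call outside it so a slow dependency never blocks other callers on the lock. To stay short, this hypothetical `LockedBreaker` trips on consecutive failures instead of a sliding-window rate; that is a simplification for illustration, not a recommendation.

```python
import threading
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BreakerOpen(Exception):
    pass


class LockedBreaker:
    """Simplified breaker that trips after N consecutive failures (assumed rule)."""

    def __init__(self, max_failures: int = 5, cooldown_seconds: float = 30.0):
        self._lock = threading.Lock()
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def _before_call(self) -> None:
        with self._lock:
            if self._opened_at is None:
                return
            if time.monotonic() - self._opened_at < self._cooldown:
                raise BreakerOpen("breaker is open")
            # Cooldown elapsed: let this caller probe, and restart the timer
            # so concurrent callers keep failing fast in the meantime.
            self._opened_at = time.monotonic()

    def _after_call(self, success: bool) -> None:
        with self._lock:
            if success:
                self._failures = 0
                self._opened_at = None
            else:
                self._failures += 1
                if self._failures >= self._max_failures:
                    self._opened_at = time.monotonic()

    def call(self, fn: Callable[[], T]) -> T:
        self._before_call()       # lock held only for bookkeeping
        try:
            result = fn()         # the slow call runs outside the lock
        except Exception:
            self._after_call(False)
            raise
        self._after_call(True)
        return result
```

The key design choice is that the check and the state update are each atomic, while the dependency call itself is never made with the lock held.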
Production setup with Resilience4j (Java)
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // trip at 50% failure rate
    .slowCallRateThreshold(50)                         // also trip on too many slow calls
    .slowCallDurationThreshold(Duration.ofSeconds(2))  // what counts as "slow"
    .waitDurationInOpenState(Duration.ofSeconds(30))   // cooldown before half-open
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)                             // last 20 calls
    .minimumNumberOfCalls(10)                          // don't evaluate until 10 samples
    .permittedNumberOfCallsInHalfOpenState(3)          // probe with 3 calls
    .recordExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(BusinessException.class)         // app-defined; 4xx-style errors aren't infra failures
    .build();

CircuitBreaker breaker = CircuitBreaker.of("payment-service", config);

// "protected" is a reserved word in Java, so the decorated supplier needs another name.
Supplier<PaymentResult> guardedCharge = CircuitBreaker
    .decorateSupplier(breaker, () -> paymentClient.charge(req));

// Inside the method that handles req:
try {
    return guardedCharge.get();
} catch (CallNotPermittedException open) {
    // breaker is OPEN: return cached price, queue, or 503 fast
    return fallback.handle(req);
}
Don’t catch the wrong things
recordExceptions and ignoreExceptions are critical and easy to get wrong. A 401 from a downstream is not an infrastructure failure — it’s a business outcome. If you record it, the breaker will trip on a wave of unauthorized requests and you’ll cut off paying customers. Only record exceptions that mean “the dependency is sick.”
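The same classification rule can be written as a small predicate. This is a sketch under assumptions: `DownstreamHTTPError` is a hypothetical exception type that carries the status code, and treating unknown exceptions as "not recorded" is a choice you may well want to flip.

```python
class DownstreamHTTPError(Exception):
    """Hypothetical exception carrying the downstream's HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"downstream returned {status}")
        self.status = status


def counts_as_breaker_failure(exc: Exception) -> bool:
    """Return True only for errors that mean 'the dependency is sick'."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True                   # network-level trouble
    if isinstance(exc, DownstreamHTTPError):
        return exc.status >= 500      # 5xx is the server's fault; 4xx is the caller's
    return False                      # assumption: anything else is a business outcome


print(counts_as_breaker_failure(DownstreamHTTPError(401)))  # False
print(counts_as_breaker_failure(DownstreamHTTPError(503)))  # True
print(counts_as_breaker_failure(TimeoutError()))            # True
```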
Retry with Exponential Backoff
Why Retries Need Math
The Problem: A naive policy of three immediate attempts means that when 1,000 callers hit a flaky service, you can turn 1,000 requests into 3,000, all arriving at nearly the same moment.
The Solution: Exponential backoff with jitter spreads retries out in time so they don’t synchronize into a thundering herd.
Three things make a retry policy safe:
- Exponential backoff — wait 1s, then 2s, then 4s, then 8s. Gives the dependency time to breathe.
- Jitter — randomize each delay so 1,000 callers don’t all retry at exactly the same instant.
- A budget — cap retries (3–5 attempts), cap total wait time, and only retry idempotent operations.
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,   # seconds
    max_delay: float = 30.0,
    multiplier: float = 2.0,
    retryable: tuple = (TimeoutError, ConnectionError),
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable as e:
            last_exc = e
            if attempt == max_attempts - 1:
                break  # out of attempts: fall through and re-raise
            # Full jitter: pick a random delay in [0, backoff]
            backoff = min(max_delay, base_delay * (multiplier ** attempt))
            sleep = random.uniform(0, backoff)
            time.sleep(sleep)
    raise last_exc
Three flavors of jitter
- Full jitter: sleep = random(0, backoff). Maximally spreads retries; AWS recommends this.
- Equal jitter: sleep = backoff/2 + random(0, backoff/2). Compromise between predictable and spread.
- Decorrelated jitter: uses the previous sleep to compute the next; smoother under sustained load.
The wrong answer is no jitter.
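The three flavors can be sketched as one-liners. The function names are illustrative, and the decorrelated variant follows the commonly cited form `min(cap, random(base, previous_sleep * 3))`; treat the factor of 3 as a conventional choice, not a law.

```python
import random


def full_jitter(backoff: float) -> float:
    """Anywhere between 0 and the full backoff."""
    return random.uniform(0, backoff)


def equal_jitter(backoff: float) -> float:
    """Half the backoff is guaranteed, the other half is random."""
    return backoff / 2 + random.uniform(0, backoff / 2)


def decorrelated_jitter(previous_sleep: float, base: float, cap: float) -> float:
    """Next sleep depends on the previous sleep, not on the attempt number."""
    return min(cap, random.uniform(base, previous_sleep * 3))


# Compare the schedules for exponential backoffs of 1, 2, 4 and 8 seconds.
sleep = 1.0
for backoff in (1.0, 2.0, 4.0, 8.0):
    sleep = decorrelated_jitter(sleep, base=1.0, cap=30.0)
    print(f"backoff={backoff:>4}: full={full_jitter(backoff):.2f} "
          f"equal={equal_jitter(backoff):.2f} decorrelated={sleep:.2f}")
```

The output differs on every run, which is the point: no two callers end up on the same schedule.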
Only retry what is safe to retry
A retried POST /charge can charge a customer twice. Either: (a) only retry idempotent verbs (GET, PUT, DELETE), (b) require an idempotency key on every write so the server can deduplicate, or (c) accept eventual consistency and reconcile later. Retries without idempotency are how outages turn into incidents.
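Option (b) can be illustrated with a toy server that remembers results by idempotency key. `ChargeService` and its in-memory dictionary are hypothetical; a real service would need a durable store, key expiry, and protection against two identical requests arriving concurrently.

```python
import uuid
from typing import Dict


class ChargeService:
    """Hypothetical server that deduplicates writes by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # A retry of a request we already processed: replay the stored
            # result instead of charging the customer a second time.
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": str(uuid.uuid4()), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = ChargeService()
key = str(uuid.uuid4())            # the client generates one key per logical operation
first = service.charge(key, 1999)
retry = service.charge(key, 1999)  # e.g. the first response was lost in transit
print(first == retry, service.charges_made)  # True 1
```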
The Bulkhead Pattern
Why Isolate Resources
The Problem: If every dependency shares the same thread pool, the slowest one starves all the others. One sick service drags everyone down.
The Solution: A bulkhead — named after a ship’s watertight compartments — gives each dependency its own pool. When one floods, the others stay dry.
Real-World Analogy
A ship doesn’t have one giant hull. It has watertight compartments. If the hull is breached in section 3, water fills section 3 and stops. The ship limps along; it doesn’t sink. Microservice bulkheads do the same with thread pools, connection pools, and queues.
| Bulkhead Type | What It Limits | When to Use |
|---|---|---|
| Thread pool | Concurrent threads per dependency | Blocking I/O, library calls you can’t make async |
| Semaphore | Concurrent in-flight calls | Async / non-blocking I/O |
| Connection pool | Concurrent DB or HTTP connections | Anywhere connection limits matter |
| Queue | Work waiting to run, with shed-load | Async pipelines, message handlers |
Semaphore bulkhead in Node.js
class Bulkhead {
  constructor(maxConcurrent = 10, maxQueue = 50) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.inFlight = 0;
    this.queue = [];
  }

  async execute(task) {
    if (this.inFlight >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        throw new Error('bulkhead full: shedding load');
      }
      // Wait until a finishing call hands its slot to us.
      await new Promise(resolve => this.queue.push(resolve));
    } else {
      this.inFlight++;
    }
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) {
        next();           // pass the slot straight to the next waiter
      } else {
        this.inFlight--;  // nobody waiting: free the slot
      }
    }
  }
}

// Each dependency gets its own bulkhead
const payments = new Bulkhead(5, 10);   // max 5 in-flight, 10 queued
const shipping = new Bulkhead(20, 50);  // shipping gets more headroom

// Top-level await requires an ES module; otherwise wrap these in an async function.
await payments.execute(() => chargeCard(order));
await shipping.execute(() => createLabel(order));
Notice the load-shedding behavior: when both maxConcurrent and maxQueue are full, the bulkhead throws immediately. That is correct. A bulkhead that quietly accepts unlimited work is not a bulkhead; it’s a memory leak with extra steps.
Combining Patterns: Defense in Depth
None of these patterns is sufficient on its own. In production you compose them. The order matters — what wraps what determines what fails first.
The decoration order I recommend
- Fallback (outermost): last-ditch alternative response when everything inside it has given up.
- Bulkhead: first line of defense; reject if no capacity.
- Circuit breaker: if the dependency is sick, fail fast.
- Retry: only retry transient failures.
- Timeout (innermost): cap how long a single attempt can run.
A retry inside the circuit breaker is good: an operation that still fails after its retries counts toward tripping the breaker. A retry outside the breaker keeps firing at a breaker that is already open, which is almost always wrong.
// Resilience4j composition (Java). Each with...() call wraps everything added
// before it, so the first decorator is innermost and the last is outermost.
Supplier<Result> guarded = Decorators.ofSupplier(() -> service.call(req))
    // Timeout (innermost): Decorators.ofSupplier(...) exposes no withTimeLimiter;
    // TimeLimiter decorates futures and CompletionStages. This sketch assumes
    // service.call enforces its own per-attempt timeout.
    .withRetry(retry)                       // retry transients
    .withCircuitBreaker(breaker)            // fail fast if sick
    .withBulkhead(bulkhead)                 // shed load
    .withFallback(List.of(CallNotPermittedException.class),
        ex -> cache.getLastKnownGood(req))  // fallback (outermost)
    .decorate();
// "protected" is a reserved word in Java, hence the name "guarded".
Result r = guarded.get();
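Builder-style composition like the one above applies decorators in call order, so the last one added ends up outermost and runs first. A tiny tracing sketch makes that concrete; `layer` and `compose` are hypothetical helpers and the layer names are only labels, not real resilience logic.

```python
from typing import Callable, List


def layer(name: str, log: List[str]) -> Callable[[Callable[[], str]], Callable[[], str]]:
    """Build a decorator that records when its layer is entered."""
    def decorate(inner: Callable[[], str]) -> Callable[[], str]:
        def wrapped() -> str:
            log.append(name)
            return inner()
        return wrapped
    return decorate


def compose(fn: Callable[[], str], layers) -> Callable[[], str]:
    """Apply layers in order: the first wraps fn directly (innermost),
    the last one applied ends up outermost."""
    for decorate in layers:
        fn = decorate(fn)
    return fn


log: List[str] = []
call = compose(
    lambda: "ok",
    [layer(n, log) for n in ("timeout", "retry", "circuit_breaker", "bulkhead", "fallback")],
)
call()
print(log)  # ['fallback', 'bulkhead', 'circuit_breaker', 'retry', 'timeout']
```

Reading the log top to bottom gives the order in which a request meets each layer, which is the order that decides what fails first.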
The fallback is part of your contract
A fallback that returns stale cache, an empty list, or a default value is visible to callers. If your “graceful degradation” is showing the user yesterday’s prices, customers need to know that’s a possibility (or you need to cap how stale you’ll go). A silent fallback is a footgun. Log every fallback fire and alert when the rate spikes.
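One way to make a fallback explicit is to cap staleness, count every fire, and tell the caller the response is degraded. The class below is a sketch under assumptions: `StaleCacheFallback` and its fields are hypothetical, and the in-process counter stands in for a real metric.

```python
import logging
import time
from typing import Callable, Dict, Tuple

logger = logging.getLogger("fallback")


class StaleCacheFallback:
    def __init__(self, max_stale_seconds: float):
        self.max_stale_seconds = max_stale_seconds
        self._cache: Dict[str, Tuple[float, object]] = {}  # key -> (stored_at, value)
        self.fallback_count = 0                            # stand-in for a metric

    def get(self, key: str, fetch: Callable[[], object]) -> Tuple[object, bool]:
        """Return (value, degraded); degraded=True means the value came from cache."""
        try:
            value = fetch()
        except Exception:
            cached = self._cache.get(key)
            if cached is None or time.monotonic() - cached[0] > self.max_stale_seconds:
                raise  # nothing fresh enough to serve: surface the failure
            self.fallback_count += 1
            logger.warning("fallback fired for %s", key)
            return cached[1], True
        self._cache[key] = (time.monotonic(), value)
        return value, False
```

Returning the `degraded` flag alongside the value keeps the fallback part of the contract: callers can label the data as possibly out of date instead of presenting it as fresh.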
Chaos Engineering
Why Break Things on Purpose
The Problem: Every resilience pattern in this tutorial is untested until it has actually fired in production. The day you find out your circuit breaker is misconfigured should not be the day a real outage happens.
The Solution: Inject controlled failures — latency, errors, instance kills — in test and (eventually) production, with monitoring, in business hours, with a rollback plan.
Netflix pioneered this discipline with Chaos Monkey (kills random instances) and the Simian Army (latency, region outages, etc.). Today the common tools are LitmusChaos and Chaos Mesh on Kubernetes, Gremlin as a SaaS, and AWS Fault Injection Service (formerly Fault Injection Simulator) for AWS workloads.
| Experiment | What It Tests | Typical Setup |
|---|---|---|
| Latency injection | Slow dependency handling | Add 2–5 s delay to a % of calls |
| Error injection | Circuit breaker tuning | Return 5xx for 5–20% of calls |
| Instance termination | Failover, rebalancing | SIGKILL random pods |
| Network partition | Split-brain, quorum | iptables-drop traffic between AZs |
| CPU/memory pressure | Autoscaling, eviction | Run a stressor next to the workload |
| DNS failure | Caching, fallback hosts | Block resolver for 30 s |
The hypothesis-driven shape of an experiment
# A real chaos experiment is a written hypothesis, not a button you press.
name: "Payment service: 2s latency on 50% of calls"
hypothesis: |
  When the payment service experiences 2 seconds of added latency on
  half of its requests, the order service’s circuit breaker trips
  within 60 seconds, fallbacks engage, and overall checkout success
  rate stays above 95%.
steady_state:
  - metric: checkout_success_rate
    min: 95
  - metric: order_p99_latency_ms
    max: 3000
attack:
  target: payment-service
  type: latency
  magnitude: 2000ms
  affected_percentage: 50
  duration: 10m
rollback:
  on_steady_state_violation: true
  notify: ["#oncall-payments", "pagerduty:payments"]
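The steady-state section is what turns the experiment into something a machine can abort. A minimal sketch of that check follows, assuming the observed metrics arrive as a plain dictionary; `violations` is a hypothetical helper, not part of any chaos tool.

```python
from typing import Dict, List

# Mirrors the steady_state section of the experiment above.
STEADY_STATE = [
    {"metric": "checkout_success_rate", "min": 95},
    {"metric": "order_p99_latency_ms", "max": 3000},
]


def violations(observed: Dict[str, float], steady_state=STEADY_STATE) -> List[str]:
    """List every steady-state bound the observed metrics break."""
    problems = []
    for bound in steady_state:
        name = bound["metric"]
        value = observed.get(name)
        if value is None:
            problems.append(f"{name}: no data")  # missing data counts as a violation
        elif "min" in bound and value < bound["min"]:
            problems.append(f"{name}: {value} is below min {bound['min']}")
        elif "max" in bound and value > bound["max"]:
            problems.append(f"{name}: {value} is above max {bound['max']}")
    return problems


observed = {"checkout_success_rate": 91.5, "order_p99_latency_ms": 2400}
problems = violations(observed)
if problems:
    print("rolling back:", problems)
```

Treating missing data as a violation is deliberate: an experiment that blinds its own monitoring should stop too.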
Run experiments in this order
- Local dev with mocks — cheapest. Validates the code paths exist.
- Staging with synthetic traffic — validates configuration and dashboards.
- Production, business hours, small blast radius — validates the real system.
- Production, broader scope, with explicit hypothesis and rollback — only after the previous three pass cleanly.
Observability
Why Metrics Matter Here
The Problem: A circuit breaker that’s working correctly is invisible — calls just succeed. The day it stops working you find out from your customers.
The Solution: Treat every breaker, retry budget, and bulkhead as a metric source. Alert on state, not just on outcomes.
Most resilience libraries expose a similar shape of metrics. The non-negotiable ones:
- circuit_breaker_state: gauge per service (0=closed, 1=open, 2=half-open). Alert on any breaker open longer than N minutes.
- circuit_breaker_calls_total: counter labeled by outcome (success / failed / not_permitted / slow).
- retry_calls_total: counter labeled by attempt number. A spike in attempt=4 means the dependency is shaky.
- bulkhead_available_concurrent_calls: gauge. If it sits at 0, the bulkhead is saturated and you’re queueing or shedding load.
- fallback_invocations_total: counter. The hidden cost of resilience: if this number is non-zero, customers are seeing degraded responses.
# Prometheus exposition for a Python circuit breaker
# (State is the enum from the from-scratch breaker earlier in this tutorial.)
from prometheus_client import Counter, Gauge, Histogram

cb_state = Gauge(
    "circuit_breaker_state",
    "0=closed, 1=open, 2=half_open",
    ["service"],
)
cb_calls = Counter(
    "circuit_breaker_calls_total",
    "calls observed by the breaker",
    ["service", "outcome"],  # success | failed | not_permitted | slow
)
cb_open_seconds = Histogram(
    "circuit_breaker_open_duration_seconds",
    "how long each open episode lasts",
    ["service"],
    buckets=[1, 5, 15, 30, 60, 300, 900],
)


def on_state_change(service: str, new_state: State, episode_seconds: float | None):
    # Publish the new state as a number so it can be graphed and alerted on.
    cb_state.labels(service=service).set(
        {State.CLOSED: 0, State.OPEN: 1, State.HALF_OPEN: 2}[new_state]
    )
    # When an open episode ends, record how long it lasted.
    if new_state == State.CLOSED and episode_seconds is not None:
        cb_open_seconds.labels(service=service).observe(episode_seconds)
Three alerts every breaker should have
- Breaker open > 5 min: page someone — the dependency hasn’t recovered.
- Fallback rate > 1%: page someone — customers are seeing degraded responses.
- Breaker flapping (state changes > N/min): ticket someone — thresholds are misconfigured.
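The flapping rule is the least obvious of the three, so here is the logic as a sketch. `FlapDetector` is a hypothetical in-process helper that counts state changes in a sliding time window; in practice you would more likely express the same condition as an alert rule in your monitoring system.

```python
from collections import deque


class FlapDetector:
    def __init__(self, max_changes: int, window_seconds: float = 60.0):
        self.max_changes = max_changes
        self.window_seconds = window_seconds
        self._changes = deque()  # timestamps of recent state changes

    def record_change(self, now: float) -> bool:
        """Record a state change at time `now` (seconds); True means flapping."""
        self._changes.append(now)
        # Drop changes that have fallen out of the window.
        while self._changes and now - self._changes[0] > self.window_seconds:
            self._changes.popleft()
        return len(self._changes) > self.max_changes


detector = FlapDetector(max_changes=3, window_seconds=60.0)
for t in (0.0, 10.0, 20.0, 30.0):
    flapping = detector.record_change(t)
print(flapping)  # True: four changes within 60 seconds
```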
Best Practices
The short list
- One breaker per dependency. Not per endpoint, not per pod — per logical downstream service.
- Tune from production data. Default thresholds will be wrong. Read the dashboard for two weeks before tightening.
- Failures must be classified. Network errors and 5xx count; 4xx and business exceptions don’t.
- Always pair with timeouts. A breaker with no timeout still hangs threads in the slow case.
- Always pair with bulkheads. A breaker stops new traffic; the bulkhead protects you while it trips.
- Test fallbacks like real code. They run during outages — which is the worst time to discover a NullPointerException in your “safe” path.
- Run game days. Quarterly is plenty. The team that has already practiced the failure recovers faster than the team that hasn’t.
How the giants do it
Netflix built Hystrix (now in maintenance; the patterns live on in Resilience4j and Polly) and pioneered chaos engineering with the Simian Army. Every Netflix service-to-service call goes through a circuit breaker by default. The blast radius of a single bad service is bounded by design.
Amazon uses cell-based architecture: each service is split into independent cells, customers are assigned to cells, and cells fail independently. A bug in one cell affects ~1/N of users instead of 100%. Combined with shuffle sharding (each customer is assigned a randomly chosen subset of the available resources), the probability that two customers share exactly the same set of failure dependencies becomes very small.
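The shuffle-sharding claim is simple combinatorics. As a sketch, assume each customer gets a shard chosen uniformly at random from all possible subsets of workers; the worker counts below are illustrative, not Amazon's numbers.

```python
from math import comb


def full_overlap_probability(total_workers: int, shard_size: int) -> float:
    """Chance that a second customer draws exactly the same shard as the first,
    assuming shards are chosen uniformly at random."""
    return 1 / comb(total_workers, shard_size)


print(comb(8, 2))                        # 28 distinct shards from 8 workers
print(comb(100, 5))                      # 75287520 distinct shards from 100 workers
print(full_overlap_probability(100, 5))  # roughly 1.3e-08
```

With plain sharding into 4 fixed groups of 2 workers, a poisonous customer takes down a quarter of all customers; with 28 possible pairs, only customers on the identical pair lose everything.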
Google’s SRE practice codifies a philosophy that Amazon’s Werner Vogels put bluntly: everything fails, all the time. Build for it. The error budget you set at the SLO level is the budget that resilience patterns are built to spend.
The single most useful sentence about resilience
Whenever you find yourself writing “but the downstream shouldn’t fail like that” — stop. The downstream will fail like that. The point of these patterns is that you don’t need to predict how it will fail to handle it gracefully when it does.