Circuit Breaker & Resilience

When dependencies fail, the question is not if your service should respond — it’s how fast it should fail. Circuit breakers, retries, and bulkheads are how production microservices stay alive.

Difficulty: Medium · 25 min read

Why Resilience Matters

Why Circuit Breaker Matters

The Problem: In a monolith, a slow dependency hurts one process. In microservices, a slow dependency can cascade — Service A’s threads queue waiting on Service B, then Service C’s threads queue waiting on A, until the entire mesh is wedged.

The Solution: A circuit breaker watches calls to a dependency. After enough failures it fails fast instead of waiting, so callers free their threads, recover, and (importantly) your monitoring still works.

Real Impact: Netflix runs thousands of services. The circuit breaker is the reason a bad recommendation service doesn’t take down playback.

Real-World Analogy

Think of an electrical panel in your house:

  • Closed breaker = electricity flows normally to the outlet
  • Short circuit = the breaker trips OPEN to stop current and prevent fire
  • You reset it after fixing the problem — it goes back to closed
  • Without a breaker, a short anywhere in the house can burn the whole place down

A software circuit breaker does the same thing for service calls — it stops the flow when the downstream is on fire so your service doesn’t catch fire too.

Failures in distributed systems are not exceptional events. They are the steady state. Network partitions, GC pauses, deploy rollouts, full disks, exhausted thread pools — every day, somewhere in your stack, something is degraded. The job of resilience patterns is to make sure those failures stay localized instead of becoming systemic.

The Cascading Failure Problem

Cascades follow a predictable shape:

  1. A downstream dependency slows from 50 ms to 5 s — for any reason: GC, load spike, sick host, dependent-of-dependent failure.
  2. Callers don’t notice. They keep sending requests, each one now occupying a thread for 5 s instead of 50 ms.
  3. The caller’s thread pool fills up. New requests pile up in the queue.
  4. The caller starts rejecting requests, returning 503 to its callers.
  5. Those callers retry — adding more load to a system that’s already underwater.
  6. The blast radius grows with every hop. Within minutes, the whole mesh can be down.

The dangerous step is #2: callers don’t notice. By the time a human sees the alert, the damage has already spread across the system. A circuit breaker breaks step #2.
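The arithmetic behind steps 2 and 3 is Little's Law: threads in use is roughly request rate times latency. Here is a minimal sketch, assuming a hypothetical caller that receives 100 requests per second and owns a 200-thread pool (both numbers are illustrative, not measurements from a real system):

```python
def threads_needed(requests_per_second: float, latency_seconds: float) -> float:
    """Little's Law: average in-flight calls = arrival rate x time per call."""
    return requests_per_second * latency_seconds


RATE = 100.0      # requests per second hitting the caller (assumed)
POOL_SIZE = 200   # worker threads in the caller (assumed)

healthy = threads_needed(RATE, 0.050)   # dependency answers in 50 ms
degraded = threads_needed(RATE, 5.0)    # dependency answers in 5 s

# Once the slowdown starts, almost nothing completes for 5 s, so the free
# threads are consumed at roughly RATE per second.
seconds_to_exhaust = (POOL_SIZE - healthy) / RATE

print(f"healthy:  {healthy:.0f} of {POOL_SIZE} threads busy")
print(f"degraded: {degraded:.0f} threads wanted, only {POOL_SIZE} exist")
print(f"pool is full after about {seconds_to_exhaust:.1f} s")
```

With these assumed numbers the pool is gone in about two seconds, long before any alert reaches a human.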

Failure Modes to Defend Against

| Failure Mode | What It Looks Like | Pattern That Helps |
| --- | --- | --- |
| Slow dependency | P99 latency climbs from 50 ms to 5 s | Timeout + circuit breaker |
| Failed dependency | Errors above some threshold | Circuit breaker + fallback |
| Transient flake | 1 in 100 calls fails for no reason | Retry with backoff + jitter |
| Resource exhaustion | One bad caller eats all DB connections | Bulkhead |
| Retry storm | Everyone retries at once when service recovers | Jittered backoff + rate limiting |
| Thundering herd | Cache expires, traffic floods origin | Request coalescing + jitter |

The Circuit Breaker Pattern

Why a State Machine

The Problem: “Just stop calling the broken service” sounds simple, but you also need to know when it’s healthy again without DDoSing it during recovery.

The Solution: A three-state machine — Closed, Open, Half-Open — that lets traffic through normally, blocks it during failure, and carefully probes during recovery.

Every circuit breaker library — Resilience4j, Hystrix, Polly, gobreaker — implements the same three states:

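[Figure: Circuit Breaker State Machine. CLOSED (traffic flows) goes to OPEN (fail fast) when failures exceed the threshold; OPEN goes to HALF-OPEN (probe) when the cooldown has elapsed; HALF-OPEN returns to CLOSED when the probes succeed, or to OPEN when a probe fails.]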

State definitions

  • Closed: The default. Calls pass through and the breaker counts successes and failures over a sliding window. If the failure rate exceeds the threshold, it trips OPEN.
  • Open: Calls do not reach the dependency. They fail immediately (often with a fallback). The breaker stays open for a cooldown period — long enough for the dependency to recover, short enough that you’re not blind to recovery.
  • Half-Open: After cooldown, a small number of probe calls are allowed through. If they succeed, the breaker returns to Closed. If they fail, it goes straight back to Open and resets the cooldown.

What to count, and over what window

Naive implementations count consecutive failures. Production implementations count failure rate over a sliding window. The difference matters: 5 failures in a row out of 5 calls is unambiguous; 5 failures out of 10,000 calls is noise.
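To make the difference concrete, here is a small sketch that feeds the same traffic to both counting strategies. The class names and thresholds are hypothetical and chosen only for illustration:

```python
from collections import deque


class ConsecutiveCounter:
    """Trips after `limit` failures in a row; any success resets the streak."""

    def __init__(self, limit: int = 5) -> None:
        self.limit = limit
        self.streak = 0

    def record(self, success: bool) -> bool:
        self.streak = 0 if success else self.streak + 1
        return self.streak >= self.limit


class FailureRateWindow:
    """Trips when the failure rate over the last `size` calls reaches `threshold`."""

    def __init__(self, size: int = 20, threshold: float = 0.5,
                 minimum_calls: int = 10) -> None:
        self.calls = deque(maxlen=size)
        self.threshold = threshold
        self.minimum_calls = minimum_calls

    def record(self, success: bool) -> bool:
        self.calls.append(success)
        if len(self.calls) < self.minimum_calls:
            return False  # too few samples to judge
        failures = self.calls.count(False)
        return failures / len(self.calls) >= self.threshold


# A dependency that fails every other call: half the traffic is broken,
# yet there are never two failures in a row.
outcomes = [i % 2 == 0 for i in range(40)]  # True, False, True, False, ...

consecutive = ConsecutiveCounter(limit=5)
windowed = FailureRateWindow(size=20, threshold=0.5, minimum_calls=10)

consecutive_tripped = any([consecutive.record(ok) for ok in outcomes])
windowed_tripped = any([windowed.record(ok) for ok in outcomes])

print(consecutive_tripped)  # False
print(windowed_tripped)     # True
```

The consecutive counter never notices a dependency that is failing half of its calls, while the windowed rate trips as soon as it has enough samples.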

Building a Circuit Breaker

Why Implement, Not Just Use a Library

The Problem: Libraries give you the right defaults, but you’ll only trust the breaker in production if you understand the state machine well enough to debug it at 2 AM.

The Solution: Build a from-scratch breaker once to internalize the model. Then use a library — Resilience4j, gobreaker, Polly — for everything real.

From scratch in Python

import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, TypeVar
from collections import deque

T = TypeVar("T")

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreakerOpen(Exception): pass

@dataclass
class CircuitBreaker:
    failure_rate_threshold: float = 0.5     # 50% failure rate
    minimum_calls: int = 10                 # before evaluating
    window_size: int = 20                   # sliding window
    cooldown_seconds: float = 30            # OPEN -> HALF_OPEN delay
    half_open_max_calls: int = 3            # probes before deciding

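    state: State = State.CLOSED
    opened_at: float = 0.0
    half_open_calls: int = 0
    history: deque = field(init=False)      # created in __post_init__

    def __post_init__(self) -> None:
        # Size the sliding window from window_size instead of a hard-coded 20.
        self.history = deque(maxlen=self.window_size)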

    def call(self, fn: Callable[[], T]) -> T:
        self._maybe_transition_to_half_open()

        if self.state == State.OPEN:
            raise CircuitBreakerOpen("breaker is open")

        if self.state == State.HALF_OPEN and self.half_open_calls >= self.half_open_max_calls:
            raise CircuitBreakerOpen("half-open quota exhausted")

        try:
            result = fn()
            self._record(True)
            return result
        except Exception:
            self._record(False)
            raise

    def _record(self, success: bool) -> None:
        self.history.append(success)
        if self.state == State.HALF_OPEN:
            self.half_open_calls += 1
            if not success:
                self._open()
            elif self.half_open_calls >= self.half_open_max_calls:
                self._close()
        elif self.state == State.CLOSED and self._should_open():
            self._open()

    def _should_open(self) -> bool:
        if len(self.history) < self.minimum_calls:
            return False
        failures = sum(1 for ok in self.history if not ok)
        return failures / len(self.history) >= self.failure_rate_threshold

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = time.monotonic()
        self.half_open_calls = 0

    def _close(self) -> None:
        self.state = State.CLOSED
        self.history.clear()
        self.half_open_calls = 0

    def _maybe_transition_to_half_open(self) -> None:
        if self.state == State.OPEN and time.monotonic() - self.opened_at >= self.cooldown_seconds:
            self.state = State.HALF_OPEN
            self.half_open_calls = 0

About 60 lines of straightforward Python and you have most of what a real breaker does. Notice what we deliberately did not do: make it thread-safe. This implementation assumes a single caller thread. In production, every state transition needs a lock or atomic operations, because callers will share the same breaker across threads.
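One way to add that safety is to guard every state read and write with a lock while keeping the slow downstream call outside the critical section. The sketch below is a simplified, hypothetical variant rather than a drop-in replacement for the class above: it admits a single half-open probe instead of three, and all names are illustrative.

```python
import threading
import time
from collections import deque
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BreakerOpen(Exception):
    pass


class ThreadSafeBreaker:
    """State is only read or changed while holding the lock."""

    def __init__(self, threshold: float = 0.5, minimum_calls: int = 10,
                 window_size: int = 20, cooldown_seconds: float = 30.0) -> None:
        self._lock = threading.Lock()
        self._history = deque(maxlen=window_size)
        self._threshold = threshold
        self._minimum_calls = minimum_calls
        self._cooldown = cooldown_seconds
        self._opened_at: Optional[float] = None  # None means CLOSED
        self._probing = False                    # True while a probe is in flight

    def call(self, fn: Callable[[], T]) -> T:
        probe = self._admit()       # short critical section
        try:
            result = fn()           # the slow call runs WITHOUT the lock held
        except Exception:
            self._record(False, probe)
            raise
        self._record(True, probe)
        return result

    def _admit(self) -> bool:
        with self._lock:
            if self._opened_at is None:
                return False        # CLOSED: ordinary call
            cooled = time.monotonic() - self._opened_at >= self._cooldown
            if cooled and not self._probing:
                self._probing = True
                return True         # HALF-OPEN: admit exactly one probe
            raise BreakerOpen("breaker is open")

    def _record(self, success: bool, probe: bool) -> None:
        with self._lock:
            if probe:
                self._probing = False
                if success:
                    self._opened_at = None               # probe passed: close
                    self._history.clear()
                else:
                    self._opened_at = time.monotonic()   # restart the cooldown
                return
            if self._opened_at is not None:
                return  # straggler that finished after the breaker tripped
            self._history.append(success)
            failures = self._history.count(False)
            if (len(self._history) >= self._minimum_calls
                    and failures / len(self._history) >= self._threshold):
                self._opened_at = time.monotonic()


def always_down() -> None:
    raise ConnectionError("dependency down")


breaker = ThreadSafeBreaker(minimum_calls=4, window_size=4, cooldown_seconds=0.05)


def hammer() -> None:
    for _ in range(10):
        try:
            breaker.call(always_down)
        except (ConnectionError, BreakerOpen):
            pass


threads = [threading.Thread(target=hammer) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding the lock only around bookkeeping matters: if the lock were held during fn(), the breaker itself would serialize every call to the dependency.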

Production setup with Resilience4j (Java)

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // trip at 50% failure rate
    .slowCallRateThreshold(50)                         // also trip on too many slow calls
    .slowCallDurationThreshold(Duration.ofSeconds(2))   // what counts as “slow”
    .waitDurationInOpenState(Duration.ofSeconds(30))    // cooldown before half-open
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)                            // last 20 calls
    .minimumNumberOfCalls(10)                         // don’t evaluate until 10 samples
    .permittedNumberOfCallsInHalfOpenState(3)         // probe with 3 calls
    .recordExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(BusinessException.class)        // 4xx-style errors aren’t infra failures
    .build();

CircuitBreaker breaker = CircuitBreaker.of("payment-service", config);

// Named "guarded" because "protected" is a reserved word in Java.
Supplier<PaymentResult> guarded = CircuitBreaker
    .decorateSupplier(breaker, () -> paymentClient.charge(req));

try {
    PaymentResult r = guarded.get();
} catch (CallNotPermittedException open) {
    // breaker is OPEN: return cached price, queue, or 503 fast
    return fallback.handle(req);
}

Don’t catch the wrong things

recordExceptions and ignoreExceptions are critical and easy to get wrong. A 401 from a downstream is not an infrastructure failure — it’s a business outcome. If you record it, the breaker will trip on a wave of unauthorized requests and you’ll cut off paying customers. Only record exceptions that mean “the dependency is sick.”
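The same classification can be written as a plain predicate. This is a sketch under assumptions: DownstreamError is a hypothetical exception that carries the HTTP status of a failed call, and treating unknown exception types as "not the dependency's fault" is a choice you may want to reverse.

```python
class DownstreamError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed call."""

    def __init__(self, status: int) -> None:
        super().__init__(f"downstream returned {status}")
        self.status = status


def counts_as_failure(exc: BaseException) -> bool:
    """True only for errors that suggest the dependency itself is sick."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True                        # network-level trouble
    if isinstance(exc, DownstreamError):
        return 500 <= exc.status <= 599    # 5xx counts, 4xx does not
    return False                           # assumption: anything else is our own bug


for exc in (DownstreamError(401), DownstreamError(503), TimeoutError(), ValueError("bad input")):
    print(type(exc).__name__, counts_as_failure(exc))
```

A breaker would call a predicate like this before recording an outcome, so a wave of 401s leaves the failure rate untouched.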

Retry with Exponential Backoff

Why Retries Need Math

The Problem: A naive retry: 3 (three attempts per call) means that when 1,000 callers hit a failing service, 1,000 requests can immediately become 3,000.

The Solution: Exponential backoff with jitter spreads retries out in time so they don’t synchronize into a thundering herd.

Three things make a retry policy safe:

  1. Exponential backoff — wait 1s, then 2s, then 4s, then 8s. Gives the dependency time to breathe.
  2. Jitter — randomize each delay so 1,000 callers don’t all retry at exactly the same instant.
  3. A budget — cap retries (3–5 attempts), cap total wait time, and only retry idempotent operations.

import random, time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,         # seconds
    max_delay: float = 30.0,
    multiplier: float = 2.0,
    retryable: tuple = (TimeoutError, ConnectionError),
) -> T:
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable as e:
            last_exc = e
            if attempt == max_attempts - 1:
                break
            # Full jitter: pick a random delay in [0, exp_backoff)
            backoff = min(max_delay, base_delay * (multiplier ** attempt))
            sleep = random.uniform(0, backoff)
            time.sleep(sleep)
    raise last_exc

Three flavors of jitter

  • Full jitter: sleep = random(0, backoff). Maximally spreads retries; AWS recommends this.
  • Equal jitter: sleep = backoff/2 + random(0, backoff/2). Compromise between predictable and spread.
  • Decorrelated jitter: uses the previous sleep to compute the next; smoother under sustained load.

The wrong answer is no jitter.
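The three flavors fit in a few lines each. The sketch below follows the formulas as they are commonly stated; the function names are illustrative, and the decorrelated variant assumes the usual form min(cap, random(base, 3 × previous sleep)).

```python
import random


def full_jitter(backoff: float) -> float:
    """Anywhere between 0 and the full exponential backoff."""
    return random.uniform(0, backoff)


def equal_jitter(backoff: float) -> float:
    """Half the backoff is guaranteed, the other half is randomized."""
    return backoff / 2 + random.uniform(0, backoff / 2)


def decorrelated_jitter(previous_sleep: float, base_delay: float,
                        max_delay: float) -> float:
    """Next sleep is drawn from [base, 3 * previous sleep], capped at max_delay."""
    return min(max_delay, random.uniform(base_delay, previous_sleep * 3))


base, cap = 1.0, 30.0
previous = base  # decorrelated jitter starts from the base delay
for attempt in range(5):
    backoff = min(cap, base * 2 ** attempt)
    previous = decorrelated_jitter(previous, base, cap)
    print(
        f"attempt {attempt}: backoff={backoff:>4.1f}s "
        f"full={full_jitter(backoff):.2f}s "
        f"equal={equal_jitter(backoff):.2f}s "
        f"decorrelated={previous:.2f}s"
    )
```

Run it a few times: the full-jitter column occasionally sleeps close to zero, which is the price of maximum spread, while equal jitter never drops below half the backoff.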

Only retry what is safe to retry

A retried POST /charge can charge a customer twice. Either: (a) only retry idempotent verbs (GET, PUT, DELETE), (b) require an idempotency key on every write so the server can deduplicate, or (c) accept eventual consistency and reconcile later. Retries without idempotency are how outages turn into incidents.
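Option (b) is worth seeing in miniature. The sketch below uses a hypothetical in-memory ChargeService; a real server would persist keys with an expiry and handle concurrent requests, which this illustration ignores.

```python
import uuid


class ChargeService:
    """Hypothetical server side: remembers the result for each idempotency key."""

    def __init__(self) -> None:
        self._results = {}       # idempotency key -> stored response
        self.charges_made = 0    # how many times money actually moved

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Replay of a request we already processed: no second charge.
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": str(uuid.uuid4()), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = ChargeService()

# The client creates ONE key per logical payment and reuses it on every retry.
key = str(uuid.uuid4())
first = service.charge(key, 1999)
retried = service.charge(key, 1999)   # e.g. the first response was lost in a timeout

print(first == retried)       # True
print(service.charges_made)   # 1
```

The important detail is on the client: the key is generated once, outside the retry loop. A fresh key per attempt would defeat the deduplication.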

The Bulkhead Pattern

Why Isolate Resources

The Problem: If every dependency shares the same thread pool, the slowest one starves all the others. One sick service drags everyone down.

The Solution: A bulkhead — named after a ship’s watertight compartments — gives each dependency its own pool. When one floods, the others stay dry.

Real-World Analogy

A ship doesn’t have one giant hull. It has watertight compartments. If the hull is breached in section 3, water fills section 3 and stops. The ship limps along; it doesn’t sink. Microservice bulkheads do the same with thread pools, connection pools, and queues.

| Bulkhead Type | What It Limits | When to Use |
| --- | --- | --- |
| Thread pool | Concurrent threads per dependency | Blocking I/O, library calls you can’t make async |
| Semaphore | Concurrent in-flight calls | Async / non-blocking I/O |
| Connection pool | Concurrent DB or HTTP connections | Anywhere connection limits matter |
| Queue | Work waiting to run, with shed-load | Async pipelines, message handlers |

Semaphore bulkhead in Node.js

class Bulkhead {
    constructor(maxConcurrent = 10, maxQueue = 50) {
        this.maxConcurrent = maxConcurrent;
        this.maxQueue = maxQueue;
        this.inFlight = 0;
        this.queue = [];
    }

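    async execute(task) {
        if (this.inFlight >= this.maxConcurrent) {
            if (this.queue.length >= this.maxQueue) {
                throw new Error('bulkhead full: shedding load');
            }
            // Wait until a finishing task hands its slot over to us.
            await new Promise(resolve => this.queue.push(resolve));
        } else {
            this.inFlight++;
        }
        try {
            return await task();
        } finally {
            const next = this.queue.shift();
            if (next) {
                // Transfer the slot directly so a newcomer can't grab it
                // before the waiter resumes; inFlight stays unchanged.
                next();
            } else {
                this.inFlight--;
            }
        }
    }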
}

// Each dependency gets its own bulkhead
const payments = new Bulkhead(5, 10);   // max 5 in-flight, 10 queued
const shipping = new Bulkhead(20, 50);  // shipping gets more headroom

await payments.execute(() => chargeCard(order));
await shipping.execute(() => createLabel(order));

Notice the load-shedding behavior: when both maxConcurrent and maxQueue are full, the bulkhead throws immediately. That is correct. A bulkhead that quietly accepts unlimited work is not a bulkhead; it’s a memory leak with extra steps.

Combining Patterns: Defense in Depth

None of these patterns is sufficient on its own. In production you compose them. The order matters — what wraps what determines what fails first.

The decoration order I recommend

  1. Fallback (outermost): last-ditch alternative response when anything inside it fails or rejects.
  2. Bulkhead: first line of defense; reject if no capacity.
  3. Circuit breaker: if the dependency is sick, fail fast.
  4. Retry: only retry transient failures.
  5. Timeout (innermost): cap how long a single attempt can run.

A retry inside the circuit breaker is good: a call that still fails after its retries counts toward tripping the breaker, and an open breaker stops the retries as well. A retry outside the breaker keeps hammering it with attempts that are rejected immediately, which is almost always wrong.

// Resilience4j composition (Java)
// Decorators apply inside-out: the first .with* call is the innermost wrapper.
// Check your Resilience4j version: TimeLimiter normally decorates async
// (CompletionStage / Future) calls rather than a plain Supplier.
Supplier<Result> guarded = Decorators.ofSupplier(() -> service.call(req))
    .withTimeLimiter(timeLimiter)        // 5. one-attempt timeout (innermost)
    .withRetry(retry)                    // 4. retry transients
    .withCircuitBreaker(breaker)         // 3. fail fast if sick
    .withBulkhead(bulkhead)              // 2. shed load
    .withFallback(List.of(CallNotPermittedException.class),
                  ex -> cache.getLastKnownGood(req))  // 1. fallback (outermost)
    .decorate();

Result r = guarded.get();   // "protected" is a reserved word in Java

The fallback is part of your contract

A fallback that returns stale cache, an empty list, or a default value is visible to callers. If your “graceful degradation” is showing the user yesterday’s prices, customers need to know that’s a possibility (or you need to cap how stale you’ll go). A silent fallback is a footgun. Log every fallback fire and alert when the rate spikes.
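A fallback that caps staleness and announces itself could look like the sketch below. LastKnownGood and FallbackTooStale are hypothetical names, and the 300-second cap is an arbitrary example value.

```python
import logging
import time
from typing import Any, Callable, Optional

log = logging.getLogger("resilience.fallback")


class FallbackTooStale(Exception):
    """The cached value is missing or older than we are willing to serve."""


class LastKnownGood:
    """Hypothetical helper: serves a cached value, within a staleness cap."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age_seconds
        self._clock = clock
        self._value: Any = None
        self._stored_at: Optional[float] = None
        self.fallback_count = 0   # export as a metric and alert on its rate

    def remember(self, value: Any) -> None:
        """Call after every successful response from the real dependency."""
        self._value = value
        self._stored_at = self._clock()

    def fallback(self, reason: str) -> Any:
        """Call when the primary path failed. Never silent: logs every use."""
        if self._stored_at is None:
            raise FallbackTooStale("no cached value yet")
        age = self._clock() - self._stored_at
        if age > self._max_age:
            raise FallbackTooStale(f"cached value is {age:.0f}s old")
        self.fallback_count += 1
        log.warning("fallback served: reason=%s age_seconds=%.0f", reason, age)
        return self._value


# Demonstration with a fake clock so the example runs instantly.
now = [0.0]
prices = LastKnownGood(max_age_seconds=300, clock=lambda: now[0])
prices.remember({"sku-1": 1999})

now[0] = 120.0
print(prices.fallback("breaker open"))   # {'sku-1': 1999}

now[0] = 600.0
try:
    prices.fallback("breaker open")
except FallbackTooStale as exc:
    print("refusing to serve:", exc)
```

Raising once the cap is exceeded turns "silently wrong" into "visibly unavailable", which is usually the easier failure to explain to customers.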

Chaos Engineering

Why Break Things on Purpose

The Problem: Every resilience pattern in this tutorial is untested until it has actually fired in production. The day you find out your circuit breaker is misconfigured should not be the day a real outage happens.

The Solution: Inject controlled failures — latency, errors, instance kills — in test and (eventually) production, with monitoring, in business hours, with a rollback plan.

Netflix invented this discipline with Chaos Monkey (kills random instances) and the Simian Army (latency, region outages, etc.). Today the standard tools are LitmusChaos and Chaos Mesh on Kubernetes, Gremlin as a SaaS, and AWS Fault Injection Simulator for AWS workloads.

| Experiment | What It Tests | Typical Setup |
| --- | --- | --- |
| Latency injection | Slow dependency handling | Add 2–5 s delay to a % of calls |
| Error injection | Circuit breaker tuning | Return 5xx for 5–20% of calls |
| Instance termination | Failover, rebalancing | SIGKILL random pods |
| Network partition | Split-brain, quorum | iptables-drop traffic between AZs |
| CPU/memory pressure | Autoscaling, eviction | Run a stressor next to the workload |
| DNS failure | Caching, fallback hosts | Block resolver for 30 s |
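For the first rung of the ladder (local dev with mocks), the first two experiments in the table can be approximated with a small wrapper. This is a sketch, not how any of the tools above work internally; with_chaos and InjectedFault are hypothetical names, and the demo uses a 1 ms delay only so that it finishes quickly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Stands in for a 5xx from the dependency during an experiment."""


def with_chaos(
    fn: Callable[[], T],
    *,
    error_rate: float = 0.0,        # fraction of calls that fail outright
    latency_rate: float = 0.0,      # fraction of calls that are delayed
    latency_seconds: float = 0.0,   # how long a delayed call waits
    rng: Callable[[], float] = random.random,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Wrap fn so a fraction of calls fail or stall before reaching it."""

    def wrapped() -> T:
        if rng() < error_rate:
            raise InjectedFault("injected error")
        if rng() < latency_rate:
            sleep(latency_seconds)
        return fn()

    return wrapped


# 10% injected errors, 50% of the remaining calls delayed by 1 ms.
chaotic_lookup = with_chaos(
    lambda: "ok", error_rate=0.10, latency_rate=0.50, latency_seconds=0.001
)

outcomes = {"ok": 0, "error": 0}
for _ in range(200):
    try:
        chaotic_lookup()
        outcomes["ok"] += 1
    except InjectedFault:
        outcomes["error"] += 1
print(outcomes)
```

Passing rng and sleep as parameters keeps the wrapper testable: a unit test can force every call down the failure path without waiting or relying on luck.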

The hypothesis-driven shape of an experiment

# A real chaos experiment is a written hypothesis, not a button you press.
name: "Payment service: 2s latency on 50% of calls"
hypothesis: |
  When the payment service experiences 2 seconds of added latency on
  half of its requests, the order service’s circuit breaker trips
  within 60 seconds, fallbacks engage, and overall checkout success
  rate stays above 95%.

steady_state:
  - metric: checkout_success_rate
    min: 95
  - metric: order_p99_latency_ms
    max: 3000

attack:
  target: payment-service
  type: latency
  magnitude: 2000ms
  affected_percentage: 50
  duration: 10m

rollback:
  on_steady_state_violation: true
  notify: ["#oncall-payments", "pagerduty:payments"]

Run experiments in this order

  1. Local dev with mocks — cheapest. Validates the code paths exist.
  2. Staging with synthetic traffic — validates configuration and dashboards.
  3. Production, business hours, small blast radius — validates the real system.
  4. Production, broader scope, with explicit hypothesis and rollback — only after the previous three pass cleanly.

Observability

Why Metrics Matter Here

The Problem: A circuit breaker that’s working correctly is invisible — calls just succeed. The day it stops working you find out from your customers.

The Solution: Treat every breaker, retry budget, and bulkhead as a metric source. Alert on state, not just on outcomes.

Every resilience library exposes the same shape of metrics. The non-negotiable ones:

# Prometheus exposition for a Python circuit breaker
# (State is the enum from the from-scratch breaker above; "float | None" needs Python 3.10+)
from prometheus_client import Counter, Gauge, Histogram

cb_state = Gauge(
    "circuit_breaker_state",
    "0=closed, 1=open, 2=half_open",
    ["service"],
)
cb_calls = Counter(
    "circuit_breaker_calls_total",
    "calls observed by the breaker",
    ["service", "outcome"],   # success | failed | not_permitted | slow
)
cb_open_seconds = Histogram(
    "circuit_breaker_open_duration_seconds",
    "how long each open episode lasts",
    ["service"],
    buckets=[1, 5, 15, 30, 60, 300, 900],
)

def on_state_change(service: str, new_state: State, episode_seconds: float | None):
    cb_state.labels(service=service).set({State.CLOSED: 0, State.OPEN: 1, State.HALF_OPEN: 2}[new_state])
    if new_state == State.CLOSED and episode_seconds is not None:
        cb_open_seconds.labels(service=service).observe(episode_seconds)

Three alerts every breaker should have

  1. Breaker open > 5 min: page someone — the dependency hasn’t recovered.
  2. Fallback rate > 1%: page someone — customers are seeing degraded responses.
  3. Breaker flapping (state changes > N/min): ticket someone — thresholds are misconfigured.
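The third alert is the least obvious to compute, so here is a minimal sketch of the idea in application code. In practice this usually lives in your alerting system rather than in the service; FlapDetector and the limit of 4 changes per 60 seconds are assumptions for illustration.

```python
from collections import deque


class FlapDetector:
    """Flags a breaker whose state changes more than max_changes times per window."""

    def __init__(self, max_changes: int, window_seconds: float) -> None:
        self._max_changes = max_changes
        self._window = window_seconds
        self._changes = deque()   # timestamps of recent state transitions

    def record_change(self, now: float) -> bool:
        """Record one state transition at time `now`; return True if flapping."""
        self._changes.append(now)
        # Drop transitions that fell out of the window. The newest entry is
        # always inside it, so the deque never becomes empty here.
        while now - self._changes[0] > self._window:
            self._changes.popleft()
        return len(self._changes) > self._max_changes


detector = FlapDetector(max_changes=4, window_seconds=60)

# Five transitions within 40 seconds: the fifth one crosses the limit.
flapping = [detector.record_change(t) for t in (0, 10, 20, 30, 40)]
print(flapping)   # [False, False, False, False, True]

# Transitions spread far apart never accumulate.
calm = FlapDetector(max_changes=4, window_seconds=60)
print([calm.record_change(t) for t in (0, 100, 200)])   # [False, False, False]
```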

Best Practices

The short list

  • One breaker per dependency. Not per endpoint, not per pod — per logical downstream service.
  • Tune from production data. Default thresholds will be wrong. Read the dashboard for two weeks before tightening.
  • Failures must be classified. Network errors and 5xx count; 4xx and business exceptions don’t.
  • Always pair with timeouts. A breaker with no timeout still hangs threads in the slow case.
  • Always pair with bulkheads. A breaker stops new traffic; the bulkhead protects you while it trips.
  • Test fallbacks like real code. They run during outages — which is the worst time to discover a NullPointerException in your “safe” path.
  • Run game days. Quarterly is plenty. The team that has already practiced the failure recovers faster than the team that hasn’t.

How the giants do it

Netflix built Hystrix (now in maintenance; the patterns live on in Resilience4j and Polly) and pioneered chaos engineering with the Simian Army. Every Netflix service-to-service call goes through a circuit breaker by default. The blast radius of a single bad service is bounded by design.

Amazon uses cell-based architecture: each service is split into independent cells, customers are assigned to cells, and cells fail independently. A bug in one cell affects ~1/N of users instead of 100%. Combined with shuffle sharding (each customer’s cell membership is randomized) the probability of any two customers sharing all the same failure dependencies approaches zero.
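The shuffle-sharding claim is combinatorics, and a few lines make it tangible. The sketch assumes a simple model in which each customer is assigned a uniformly random set of shard_size workers out of the fleet; the fleet sizes are illustrative, not Amazon's.

```python
from math import comb


def distinct_shards(workers: int, shard_size: int) -> int:
    """Number of different worker combinations a customer can be assigned."""
    return comb(workers, shard_size)


def full_overlap_probability(workers: int, shard_size: int) -> float:
    """Chance that a second customer lands on exactly the same workers."""
    return 1 / comb(workers, shard_size)


# Plain cells: 8 cells, one cell per customer.
print(f"8 cells:              {full_overlap_probability(8, 1):.2%} share everything")

# Shuffle sharding: the same 8 workers, 2 per customer.
print(f"8 workers, pick 2:    {full_overlap_probability(8, 2):.2%} share everything")

# A larger fleet: 100 workers, 5 per customer.
print(f"100 workers, pick 5:  {distinct_shards(100, 5):,} possible shards")
```

Going from one-cell-per-customer to two-of-eight already cuts the chance of a complete overlap from 1 in 8 to 1 in 28, and the number of combinations grows very quickly with fleet size.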

Google’s SRE practice codifies the same philosophy that Amazon CTO Werner Vogels summed up as “everything fails, all the time”: build for it. The error budget you set at the SLO level is the budget that resilience patterns are built to spend.

The single most useful sentence about resilience

Whenever you find yourself writing “but the downstream shouldn’t fail like that” — stop. The downstream will fail like that. The point of these patterns is that you don’t need to predict how it will fail to handle it gracefully when it does.