Circuit Breaker & Resilience | LIZIU Microservices

Why Resilience Matters

Why Circuit Breaker Matters

The Problem: In a monolith, a slow dependency hurts one process. In microservices, a slow dependency can cascade — Service A’s threads queue waiting on Service B, then Service C’s threads queue waiting on A, until the entire mesh is wedged.

The Solution: A circuit breaker watches calls to a dependency. After enough failures it fails fast instead of waiting, so callers free their threads, recover, and (importantly) your monitoring still works.

Real Impact: Netflix runs thousands of services. Circuit breakers are a large part of why a bad recommendation service doesn’t take down playback.

Real-World Analogy

Think of an electrical panel in your house:

Closed breaker = electricity flows normally to the outlet
Short circuit = the breaker trips OPEN to stop current and prevent fire
You reset it after fixing the problem — it goes back to closed
Without a breaker, a short anywhere in the house can burn the whole place down

A software circuit breaker does the same thing for service calls — it stops the flow when the downstream is on fire so your service doesn’t catch fire too.

Failures in distributed systems are not exceptional events. They are the steady state. Network partitions, GC pauses, deploy rollouts, full disks, exhausted thread pools — every day, somewhere in your stack, something is degraded. The job of resilience patterns is to make sure those failures stay localized instead of becoming systemic.

The Cascading Failure Problem

Cascades follow a predictable shape:

1. A downstream dependency slows from 50 ms to 5 s, for any reason: GC, load spike, sick host, dependent-of-dependent failure.
2. Callers don’t notice. They keep sending requests, each one now occupying a thread for 5 s instead of 50 ms.
3. The caller’s thread pool fills up. New requests pile up in the queue.
4. The caller starts rejecting requests, returning 503 to its callers.
5. Those callers retry, adding more load to a system that’s already underwater.
6. The blast radius grows with every hop. Within minutes, the whole mesh can be down.

The dangerous step is #2: callers don’t notice. By the time a human sees the alert, the damage has already spread across the system. A circuit breaker breaks step #2.
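The arithmetic behind the first three steps is Little's law: average busy threads equal arrival rate times latency. Here is a rough sketch, assuming a hypothetical caller that receives 100 requests per second and owns a 200-thread pool. Both figures and both function names are illustrative, not measurements from any real system.

```python
from typing import Optional


def threads_needed(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate x time spent per call."""
    return requests_per_second * latency_seconds


def seconds_until_pool_full(pool_size: int, requests_per_second: float,
                            latency_seconds: float) -> Optional[float]:
    """Rough time for an idle pool to fill once latency jumps.

    Returns None when the pool is large enough to absorb the load.
    """
    if threads_needed(requests_per_second, latency_seconds) <= pool_size:
        return None
    # No call finishes before the pool is exhausted, so it fills at arrival rate.
    return pool_size / requests_per_second


healthy = threads_needed(100, 0.05)    # about 5 threads busy at 50 ms
degraded = threads_needed(100, 5.0)    # 500 threads wanted at 5 s
```

With these assumed numbers the healthy service keeps about 5 threads busy, while the degraded one wants 500 and exhausts a 200-thread pool in roughly two seconds, well before any alert reaches a human.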

Failure Modes to Defend Against

Failure Mode	What It Looks Like	Pattern That Helps
Slow dependency	P99 latency climbs from 50 ms to 5 s	Timeout + circuit breaker
Failed dependency	Errors above some threshold	Circuit breaker + fallback
Transient flake	1 in 100 calls fails for no reason	Retry with backoff + jitter
Resource exhaustion	One bad caller eats all DB connections	Bulkhead
Retry storm	Everyone retries at once when service recovers	Jittered backoff + rate limiting
Thundering herd	Cache expires, traffic floods origin	Request coalescing + jitter

The Circuit Breaker Pattern

Why a State Machine

The Problem: “Just stop calling the broken service” sounds simple, but you also need to know when it’s healthy again without DDoSing it during recovery.

The Solution: A three-state machine — Closed, Open, Half-Open — that lets traffic through normally, blocks it during failure, and carefully probes during recovery.

Every circuit breaker library — Resilience4j, Hystrix, Polly, gobreaker — implements the same three states:

State definitions

Closed: The default. Calls pass through and the breaker counts successes and failures over a sliding window. If the failure rate exceeds the threshold, it trips OPEN.
Open: Calls do not reach the dependency. They fail immediately (often with a fallback). The breaker stays open for a cooldown period — long enough for the dependency to recover, short enough that you’re not blind to recovery.
Half-Open: After cooldown, a small number of probe calls are allowed through. If they succeed, the breaker returns to Closed. If they fail, it goes straight back to Open and resets the cooldown.

What to count, and over what window

Naive implementations count consecutive failures. Production implementations count failure rate over a sliding window. The difference matters: 5 failures in a row out of 5 calls is unambiguous; 5 failures out of 10,000 calls is noise.

Sliding window size: “the last N calls” or “the last N seconds”. 10–100 calls is typical.
Minimum number of calls: Don’t trip on a sample of 2. Wait until you have at least 5–10 data points.
Failure rate threshold: 50% is a common default. Lower for critical paths, higher for best-effort calls.
Slow-call detection: A call that times out is a failure. A call that takes 5x longer than P99 is also a failure even if it eventually succeeds.
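The four settings above can be sketched as a small time-based window. This is a minimal illustration, not any library's implementation: the class name, the parameter names, and the fixed `slow_call_seconds` cutoff (used here instead of a multiple of P99) are all assumptions.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class FailureRateWindow:
    """Tracks the failure rate over the last `window_seconds` of calls."""

    def __init__(self, window_seconds: float = 10.0, minimum_calls: int = 10,
                 failure_rate_threshold: float = 0.5,
                 slow_call_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold
        self.slow_call_seconds = slow_call_seconds
        self._clock = clock
        self._samples: Deque[Tuple[float, bool]] = deque()  # (timestamp, failed)

    def record(self, succeeded: bool, duration_seconds: float) -> None:
        # A slow success is counted as a failure, like a timeout would be.
        failed = (not succeeded) or duration_seconds >= self.slow_call_seconds
        self._samples.append((self._clock(), failed))
        self._evict()

    def should_trip(self) -> bool:
        self._evict()
        total = len(self._samples)
        if total < self.minimum_calls:
            return False  # not enough data to judge
        failures = sum(1 for _, failed in self._samples if failed)
        return failures / total >= self.failure_rate_threshold

    def _evict(self) -> None:
        # Drop samples that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
```

The injectable `clock` exists only so the window can be tested without sleeping.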

Building a Circuit Breaker

Why Implement, Not Just Use a Library

The Problem: Libraries give you the right defaults, but you’ll only trust the breaker in production if you understand the state machine well enough to debug it at 2 AM.

The Solution: Build a from-scratch breaker once to internalize the model. Then use a library — Resilience4j, gobreaker, Polly — for everything real.

From scratch in Python

import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, TypeVar
from collections import deque

T = TypeVar("T")

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreakerOpen(Exception): pass

@dataclass
class CircuitBreaker:
    failure_rate_threshold: float = 0.5     # 50% failure rate
    minimum_calls: int = 10                 # before evaluating
    window_size: int = 20                   # sliding window
    cooldown_seconds: float = 30            # OPEN -> HALF_OPEN delay
    half_open_max_calls: int = 3            # probes before deciding

    state: State = State.CLOSED
    opened_at: float = 0.0
    half_open_calls: int = 0
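    history: deque = field(init=False)

    def __post_init__(self) -> None:
        # Size the sliding window from window_size instead of a hard-coded 20.
        self.history = deque(maxlen=self.window_size)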

    def call(self, fn: Callable[[], T]) -> T:
        self._maybe_transition_to_half_open()

        if self.state == State.OPEN:
            raise CircuitBreakerOpen("breaker is open")

        if self.state == State.HALF_OPEN and self.half_open_calls >= self.half_open_max_calls:
            raise CircuitBreakerOpen("half-open quota exhausted")

        try:
            result = fn()
            self._record(True)
            return result
        except Exception:
            self._record(False)
            raise

    def _record(self, success: bool) -> None:
        self.history.append(success)
        if self.state == State.HALF_OPEN:
            self.half_open_calls += 1
            if not success:
                self._open()
            elif self.half_open_calls >= self.half_open_max_calls:
                self._close()
        elif self.state == State.CLOSED and self._should_open():
            self._open()

    def _should_open(self) -> bool:
        if len(self.history) < self.minimum_calls:
            return False
        failures = sum(1 for ok in self.history if not ok)
        return failures / len(self.history) >= self.failure_rate_threshold

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = time.monotonic()
        self.half_open_calls = 0

    def _close(self) -> None:
        self.state = State.CLOSED
        self.history.clear()
        self.half_open_calls = 0

    def _maybe_transition_to_half_open(self) -> None:
        if self.state == State.OPEN and time.monotonic() - self.opened_at >= self.cooldown_seconds:
            self.state = State.HALF_OPEN
            self.half_open_calls = 0

About 80 lines of straightforward Python and you have most of what a real breaker does. Notice what we deliberately did not do: this implementation is single-threaded, and it counts every exception as a failure. In production, every state transition needs a lock or atomic operations, because callers will share the same breaker across threads.
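One way to add the missing locking is sketched below. It is a simplified, self-contained variant rather than a drop-in patch for the class above: `LockedBreaker`, `BreakerOpen`, the single-probe half-open rule, and the injectable `clock` are all assumptions made for illustration. The lock guards only the bookkeeping; the protected call itself runs outside it so that one slow call cannot block every other caller.

```python
import threading
import time
from collections import deque
from typing import Callable, TypeVar

T = TypeVar("T")


class BreakerOpen(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class LockedBreaker:
    def __init__(self, threshold: float = 0.5, minimum_calls: int = 10,
                 window_size: int = 20, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lock = threading.Lock()
        self._threshold = threshold
        self._minimum_calls = minimum_calls
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._history = deque(maxlen=window_size)
        self._state = "closed"
        self._opened_at = 0.0
        self._probe_in_flight = False

    @property
    def state(self) -> str:
        with self._lock:
            return self._state

    def call(self, fn: Callable[[], T]) -> T:
        is_probe = self._admit()       # short critical section
        try:
            result = fn()              # the slow part runs without the lock
        except Exception:
            self._record(False, is_probe)
            raise
        self._record(True, is_probe)
        return result

    def _admit(self) -> bool:
        """Decide whether the call may proceed; True means it is the probe."""
        with self._lock:
            if self._state == "open":
                if self._clock() - self._opened_at < self._cooldown:
                    raise BreakerOpen("breaker is open")
                self._state = "half_open"
            if self._state == "half_open":
                if self._probe_in_flight:
                    raise BreakerOpen("probe already in flight")
                self._probe_in_flight = True
                return True
            return False

    def _record(self, success: bool, is_probe: bool) -> None:
        with self._lock:
            if is_probe:
                self._probe_in_flight = False
                if success:
                    self._state = "closed"
                    self._history.clear()
                else:
                    self._trip()
            elif self._state == "closed":
                self._history.append(success)
                failed = self._history.count(False)
                if (len(self._history) >= self._minimum_calls
                        and failed / len(self._history) >= self._threshold):
                    self._trip()

    def _trip(self) -> None:
        # Caller must already hold the lock.
        self._state = "open"
        self._opened_at = self._clock()
```

Tagging each call as probe or non-probe at admission time keeps a slow call that started while the breaker was closed from being mistaken for the probe result when it finally returns.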

Production setup with Resilience4j (Java)

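import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;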

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // trip at 50% failure rate
    .slowCallRateThreshold(50)                         // also trip on too many slow calls
    .slowCallDurationThreshold(Duration.ofSeconds(2))   // what counts as “slow”
    .waitDurationInOpenState(Duration.ofSeconds(30))    // cooldown before half-open
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)                            // last 20 calls
    .minimumNumberOfCalls(10)                         // don’t evaluate until 10 samples
    .permittedNumberOfCallsInHalfOpenState(3)         // probe with 3 calls
    .recordExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(BusinessException.class)        // 4xx-style errors aren’t infra failures
    .build();

CircuitBreaker breaker = CircuitBreaker.of("payment-service", config);

Supplier<PaymentResult> guarded = CircuitBreaker
    .decorateSupplier(breaker, () -> paymentClient.charge(req));

try {
    PaymentResult r = guarded.get();
} catch (CallNotPermittedException open) {
    // breaker is OPEN — return cached price, queue, or 503 fast
    return fallback.handle(req);
}

Don’t catch the wrong things

recordExceptions and ignoreExceptions are critical and easy to get wrong. A 401 from a downstream is not an infrastructure failure — it’s a business outcome. If you record it, the breaker will trip on a wave of unauthorized requests and you’ll cut off paying customers. Only record exceptions that mean “the dependency is sick.”
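The same classification rule can be written as a small predicate. This is a sketch that assumes an HTTP-style client reporting either a status code or a raised exception; the function name and the choice of network error types are hypothetical.

```python
from typing import Optional

# Exceptions that suggest the dependency, or the path to it, is unhealthy.
INFRASTRUCTURE_ERRORS = (TimeoutError, ConnectionError)


def counts_against_breaker(status_code: Optional[int] = None,
                           error: Optional[BaseException] = None) -> bool:
    """Return True only for outcomes that mean "the dependency is sick"."""
    if error is not None:
        return isinstance(error, INFRASTRUCTURE_ERRORS)
    if status_code is not None:
        # 5xx is the server failing; 4xx (401, 404, 422, ...) is a valid answer.
        return 500 <= status_code <= 599
    return False
```

A breaker would call this before recording an outcome and treat everything that returns False as a success for its own bookkeeping.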

Retry with Exponential Backoff

Why Retries Need Math

The Problem: A naive retry: 3 means that when 1,000 callers hit a flaky service, you immediately turn 1,000 requests into 3,000.

The Solution: Exponential backoff with jitter spreads retries out in time so they don’t synchronize into a thundering herd.

Three things make a retry policy safe:

Exponential backoff — wait 1s, then 2s, then 4s, then 8s. Gives the dependency time to breathe.
Jitter — randomize each delay so 1,000 callers don’t all retry at exactly the same instant.
A budget — cap retries (3–5 attempts), cap total wait time, and only retry idempotent operations.

import random, time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,         # seconds
    max_delay: float = 30.0,
    multiplier: float = 2.0,
    retryable: tuple = (TimeoutError, ConnectionError),
) -> T:
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable as e:
            last_exc = e
            if attempt == max_attempts - 1:
                break
            # Full jitter: pick a random delay in [0, exp_backoff)
            backoff = min(max_delay, base_delay * (multiplier ** attempt))
            sleep = random.uniform(0, backoff)
            time.sleep(sleep)
    raise last_exc

Three flavors of jitter

Full jitter — sleep = random(0, backoff). Maximally spreads retries; AWS recommends this.
Equal jitter — sleep = backoff/2 + random(0, backoff/2). Compromise between predictable and spread.
Decorrelated jitter — uses the previous sleep to compute the next; smoother under sustained load.

The wrong answer is no jitter.
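The three flavors differ by only a line each. The sketch below follows the formulas as commonly described in AWS's writing on backoff; the function names and default values are assumptions chosen to match the retry example above.

```python
import random


def exponential_backoff(attempt: int, base: float = 1.0, cap: float = 30.0,
                        multiplier: float = 2.0) -> float:
    """Un-jittered delay for a zero-based attempt number, capped at `cap`."""
    return min(cap, base * multiplier ** attempt)


def full_jitter(attempt: int, base: float = 1.0, cap: float = 30.0,
                rng=random) -> float:
    # Anywhere between 0 and the exponential delay.
    return rng.uniform(0, exponential_backoff(attempt, base, cap))


def equal_jitter(attempt: int, base: float = 1.0, cap: float = 30.0,
                 rng=random) -> float:
    # Half the delay is fixed, the other half is random.
    half = exponential_backoff(attempt, base, cap) / 2
    return half + rng.uniform(0, half)


def decorrelated_jitter(previous_sleep: float, base: float = 1.0,
                        cap: float = 30.0, rng=random) -> float:
    # The next delay depends on the previous one, not on the attempt count.
    return min(cap, rng.uniform(base, previous_sleep * 3))
```

For decorrelated jitter the caller keeps the last sleep value between attempts and starts with `previous_sleep = base`.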

Only retry what is safe to retry

A retried POST /charge can charge a customer twice. Either: (a) only retry idempotent verbs (GET, PUT, DELETE), (b) require an idempotency key on every write so the server can deduplicate, or (c) accept eventual consistency and reconcile later. Retries without idempotency are how outages turn into incidents.
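Option (b) can be sketched as a server-side lookup keyed by the idempotency key. This is an in-memory illustration only: `IdempotencyStore` and `run_once` are hypothetical names, and a real service would more likely persist keys in a shared store with an expiry.

```python
import threading
from typing import Any, Callable, Dict


class IdempotencyStore:
    """Runs an operation at most once per key and replays the stored result."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, operation: Callable[[], Any]) -> Any:
        # Holding the lock during the operation keeps the sketch simple and
        # prevents two concurrent retries with the same key from both running.
        with self._lock:
            if key in self._results:
                return self._results[key]
            result = operation()      # a raised exception stores nothing,
            self._results[key] = result   # so the client may retry the key
            return result
```

The client generates the key once per logical operation and sends the same key on every retry, so a duplicate `POST /charge` returns the original receipt instead of charging again.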

The Bulkhead Pattern

Why Isolate Resources

The Problem: If every dependency shares the same thread pool, the slowest one starves all the others. One sick service drags everyone down.

The Solution: A bulkhead — named after a ship’s watertight compartments — gives each dependency its own pool. When one floods, the others stay dry.

Real-World Analogy

A ship doesn’t have one giant hull. It has watertight compartments. If the hull is breached in section 3, water fills section 3 and stops. The ship limps along; it doesn’t sink. Microservice bulkheads do the same with thread pools, connection pools, and queues.

Bulkhead Type	What It Limits	When to Use
Thread pool	Concurrent threads per dependency	Blocking I/O, library calls you can’t make async
Semaphore	Concurrent in-flight calls	Async / non-blocking I/O
Connection pool	Concurrent DB or HTTP connections	Anywhere connection limits matter
Queue	Work waiting to run, with shed-load	Async pipelines, message handlers

Semaphore bulkhead in Node.js

class Bulkhead {
    constructor(maxConcurrent = 10, maxQueue = 50) {
        this.maxConcurrent = maxConcurrent;
        this.maxQueue = maxQueue;
        this.inFlight = 0;
        this.queue = [];
    }

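    async execute(task) {
        if (this.inFlight >= this.maxConcurrent) {
            if (this.queue.length >= this.maxQueue) {
                throw new Error('bulkhead full: shedding load');
            }
            // Wait for a finishing call to hand over its slot. inFlight is not
            // touched here: the slot is transferred, not released and re-taken,
            // so a newly arriving call cannot slip in ahead of the waiter.
            await new Promise(resolve => this.queue.push(resolve));
        } else {
            this.inFlight++;
        }
        try {
            return await task();
        } finally {
            const next = this.queue.shift();
            if (next) {
                next();           // hand the slot straight to the next waiter
            } else {
                this.inFlight--;  // nobody waiting: free the slot
            }
        }
    }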
}

// Each dependency gets its own bulkhead
const payments = new Bulkhead(5, 10);   // max 5 in-flight, 10 queued
const shipping = new Bulkhead(20, 50);  // shipping gets more headroom

await payments.execute(() => chargeCard(order));
await shipping.execute(() => createLabel(order));

Notice the load-shedding behavior: when both maxConcurrent and maxQueue are full, the bulkhead throws immediately. That is correct. A bulkhead that quietly accepts unlimited work is not a bulkhead; it’s a memory leak with extra steps.
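For blocking code, the thread-pool row of the table, the same idea fits in a few lines of Python. This sketch assumes no queue at all: a call that cannot get a slot is shed immediately. `ThreadBulkhead` and `BulkheadFull` are hypothetical names.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFull(Exception):
    """Raised when the bulkhead has no free slot and sheds the call."""


class ThreadBulkhead:
    def __init__(self, max_concurrent: int = 10) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def execute(self, task: Callable[[], T]) -> T:
        # Non-blocking acquire: never wait for a slot, reject instead.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("bulkhead full: shedding load")
        try:
            return task()
        finally:
            self._slots.release()


# One bulkhead per dependency, sized independently.
payments = ThreadBulkhead(max_concurrent=5)
shipping = ThreadBulkhead(max_concurrent=20)
```

Because each dependency owns its own semaphore, a stalled payments client can exhaust at most five threads, and calls through `shipping` are unaffected.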

Combining Patterns: Defense in Depth

None of these patterns is sufficient on its own. In production you compose them. The order matters — what wraps what determines what fails first.

The decoration order I recommend

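From outermost to innermost:

Fallback (outermost): last-ditch alternative response. It has to wrap everything else, or it never sees the rejections thrown by the bulkhead and the breaker.
Bulkhead: first line of defense; reject if no capacity.
Circuit breaker: if the dependency is sick, fail fast.
Retry: only retry transient failures.
Timeout (innermost): cap how long a single attempt can run.

Retry inside the circuit breaker is the safer default: the breaker records one outcome per logical call, so a flake that a retry absorbs never counts against the dependency, while a call that still fails after its last retry does. With retry outside the breaker, every attempt is counted separately, and the retry keeps firing at a breaker that is already open unless you exclude CallNotPermittedException from the retryable set.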

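// Resilience4j composition (Java). Each with* call wraps everything above it,
// so the chain reads innermost first, outermost last.
// Caveat: in the Resilience4j versions I know, withTimeLimiter is offered only on
// the async (CompletionStage) chain, not on a plain Supplier. It is kept here to
// show where the per-attempt timeout belongs; check the API of your version.
Supplier<Result> guarded = Decorators.ofSupplier(() -> service.call(req))
    .withTimeLimiter(timeLimiter)        // innermost: one-attempt timeout
    .withRetry(retry)                    // retry transients
    .withCircuitBreaker(breaker)         // fail fast if sick
    .withBulkhead(bulkhead)              // shed load
    .withFallback(List.of(CallNotPermittedException.class),
                  ex -> cache.getLastKnownGood(req))  // outermost: fallback
    .decorate();

Result r = guarded.get();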
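What the wrapping order changes is easiest to see by counting what each layer observes. In the sketch below, a list that records outcomes stands in for the breaker and a bare loop stands in for the retry policy; every name in it is hypothetical.

```python
from typing import Callable, List


def compose(fn: Callable[[], str], layers_outer_to_inner: list) -> Callable[[], str]:
    """Wrap fn so that the first layer in the list ends up outermost."""
    wrapped = fn
    for layer in reversed(layers_outer_to_inner):
        wrapped = layer(wrapped)
    return wrapped


def retry_layer(max_attempts: int):
    def layer(inner):
        def run():
            for attempt in range(max_attempts):
                try:
                    return inner()
                except ConnectionError:
                    if attempt == max_attempts - 1:
                        raise
        return run
    return layer


def observing_layer(outcomes: List[bool]):
    """Stand-in for a breaker: records each outcome it sees, nothing more."""
    def layer(inner):
        def run():
            try:
                result = inner()
            except Exception:
                outcomes.append(False)
                raise
            outcomes.append(True)
            return result
        return run
    return layer


def make_flaky(failures_before_success: int) -> Callable[[], str]:
    calls = {"n": 0}

    def fn() -> str:
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise ConnectionError("transient")
        return "ok"
    return fn


seen_inside: List[bool] = []    # retry inside the observer
compose(make_flaky(2), [observing_layer(seen_inside), retry_layer(3)])()

seen_outside: List[bool] = []   # retry outside the observer
compose(make_flaky(2), [retry_layer(3), observing_layer(seen_outside)])()
```

With the retry inside, the observer records a single success for a call that needed three attempts. With the retry outside, it records two failures and then a success for the same logical call.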

The fallback is part of your contract

A fallback that returns stale cache, an empty list, or a default value is visible to callers. If your “graceful degradation” is showing the user yesterday’s prices, customers need to know that’s a possibility (or you need to cap how stale you’ll go). A silent fallback is a footgun. Log every fallback fire and alert when the rate spikes.
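A fallback that respects this contract might look like the sketch below: it caps staleness, marks degraded responses so callers can tell, and counts and logs every fire. The class, field names, and logger name are assumptions for illustration.

```python
import logging
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional

log = logging.getLogger("fallback")


@dataclass(frozen=True)
class Response:
    value: Any
    degraded: bool            # True when served from the last-known-good copy
    age_seconds: float = 0.0  # how old the served copy is


class LastKnownGood:
    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age_seconds
        self._clock = clock
        self._value: Any = None
        self._stored_at: Optional[float] = None
        self.fallback_count = 0   # export this as a metric

    def call(self, fn: Callable[[], Any]) -> Response:
        try:
            value = fn()
        except Exception:
            if self._stored_at is None:
                raise             # nothing cached yet: surface the failure
            age = self._clock() - self._stored_at
            if age > self._max_age:
                raise             # too stale to serve: surface the failure
            self.fallback_count += 1
            log.warning("fallback fired: serving a value %.0f s old", age)
            return Response(self._value, degraded=True, age_seconds=age)
        self._value, self._stored_at = value, self._clock()
        return Response(value, degraded=False)
```

Returning a `Response` with a `degraded` flag, instead of the bare value, is what makes the fallback visible to the caller rather than silent.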

Chaos Engineering

Why Break Things on Purpose

The Problem: Every resilience pattern in this tutorial is untested until it has actually fired in production. The day you find out your circuit breaker is misconfigured should not be the day a real outage happens.

The Solution: Inject controlled failures — latency, errors, instance kills — in test and (eventually) production, with monitoring, in business hours, with a rollback plan.

Netflix invented this discipline with Chaos Monkey (kills random instances) and the Simian Army (latency, region outages, etc.). Today the standard tools are LitmusChaos and Chaos Mesh on Kubernetes, Gremlin as a SaaS, and AWS Fault Injection Simulator for AWS workloads.

Experiment	What It Tests	Typical Setup
Latency injection	Slow dependency handling	Add 2–5 s delay to a % of calls
Error injection	Circuit breaker tuning	Return 5xx for 5–20% of calls
Instance termination	Failover, rebalancing	SIGKILL random pods
Network partition	Split-brain, quorum	Drop traffic between AZs (for example with iptables rules)
CPU/memory pressure	Autoscaling, eviction	Run a stressor next to the workload
DNS failure	Caching, fallback hosts	Block resolver for 30 s
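At the local-dev end of the spectrum, the first two rows of the table can be approximated with a wrapper around the client call. This is a sketch for tests only, not a substitute for the tools named above; `with_faults` and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, Optional


def with_faults(fn: Callable, latency_seconds: float = 0.0,
                latency_rate: float = 0.0, error_rate: float = 0.0,
                rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable:
    """Wrap fn so a fraction of calls fail or are delayed."""
    if rng is None:
        rng = random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < error_rate:
            raise ConnectionError("injected fault")
        if rng.random() < latency_rate:
            sleep(latency_seconds)   # injected delay before the real call
        return fn(*args, **kwargs)

    return wrapped
```

Passing a seeded `random.Random` makes a run reproducible, and passing a fake `sleep` lets unit tests assert on the injected delay without actually waiting.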

The hypothesis-driven shape of an experiment

# A real chaos experiment is a written hypothesis, not a button you press.
name: "Payment service: 2s latency on 50% of calls"
hypothesis: |
  When the payment service experiences 2 seconds of added latency on
  half of its requests, the order service’s circuit breaker trips
  within 60 seconds, fallbacks engage, and overall checkout success
  rate stays above 95%.

steady_state:
  - metric: checkout_success_rate
    min: 95
  - metric: order_p99_latency_ms
    max: 3000

attack:
  target: payment-service
  type: latency
  magnitude: 2000ms
  affected_percentage: 50
  duration: 10m

rollback:
  on_steady_state_violation: true
  notify: ["#oncall-payments", "pagerduty:payments"]

Run experiments in this order

Local dev with mocks — cheapest. Validates the code paths exist.
Staging with synthetic traffic — validates configuration and dashboards.
Production, business hours, small blast radius — validates the real system.
Production, broader scope, with explicit hypothesis and rollback — only after the previous three pass cleanly.

Observability

Why Metrics Matter Here

The Problem: A circuit breaker that’s working correctly is invisible — calls just succeed. The day it stops working you find out from your customers.

The Solution: Treat every breaker, retry budget, and bulkhead as a metric source. Alert on state, not just on outcomes.

Every resilience library exposes the same shape of metrics. The non-negotiable ones:

circuit_breaker_state — gauge per service (0=closed, 1=open, 2=half-open). Alert on any breaker open longer than N minutes.
circuit_breaker_calls_total — counter labeled by outcome (success / failed / not_permitted / slow).
retry_calls_total — counter labeled by attempt number. Spike in attempt=4 means the dependency is shaky.
bulkhead_available_concurrent_calls — gauge. If it sits at 0, you’re shedding load.
fallback_invocations_total — counter. The hidden cost of resilience — if this number is non-zero, customers are seeing degraded responses.

# Prometheus exposition for a Python circuit breaker.
# State is the enum defined in the from-scratch breaker earlier in this tutorial.
from prometheus_client import Counter, Gauge, Histogram

cb_state = Gauge(
    "circuit_breaker_state",
    "0=closed, 1=open, 2=half_open",
    ["service"],
)
cb_calls = Counter(
    "circuit_breaker_calls_total",
    "calls observed by the breaker",
    ["service", "outcome"],   # success | failed | not_permitted | slow
)
cb_open_seconds = Histogram(
    "circuit_breaker_open_duration_seconds",
    "how long each open episode lasts",
    ["service"],
    buckets=[1, 5, 15, 30, 60, 300, 900],
)

def on_state_change(service: str, new_state: State, episode_seconds: float | None):
    cb_state.labels(service=service).set({State.CLOSED: 0, State.OPEN: 1, State.HALF_OPEN: 2}[new_state])
    if new_state == State.CLOSED and episode_seconds is not None:
        cb_open_seconds.labels(service=service).observe(episode_seconds)

Three alerts every breaker should have

Breaker open > 5 min: page someone — the dependency hasn’t recovered.
Fallback rate > 1%: page someone — customers are seeing degraded responses.
Breaker flapping (state changes > N/min): ticket someone — thresholds are misconfigured.
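The flapping alert is the least obvious of the three to compute. In practice it would usually be an alerting rule over the state metric; the sketch below only illustrates the logic in-process, and the class name and default limits are assumptions.

```python
import time
from collections import deque
from typing import Callable, Deque


class FlapDetector:
    """Flags a breaker that changes state too often within a time window."""

    def __init__(self, max_transitions: int = 6, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_transitions
        self._window = window_seconds
        self._clock = clock
        self._times: Deque[float] = deque()

    def record_transition(self) -> bool:
        """Call on every state change; returns True while the breaker is flapping."""
        now = self._clock()
        self._times.append(now)
        # Forget transitions that are older than the window.
        while self._times and now - self._times[0] > self._window:
            self._times.popleft()
        return len(self._times) > self._max
```

A True result is a reason to open a ticket and revisit thresholds, not to page anyone.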

Best Practices

The short list

One breaker per dependency. Not per endpoint, not per pod — per logical downstream service.
Tune from production data. Default thresholds will be wrong. Read the dashboard for two weeks before tightening.
Failures must be classified. Network errors and 5xx count; 4xx and business exceptions don’t.
Always pair with timeouts. A breaker with no timeout still hangs threads in the slow case.
Always pair with bulkheads. A breaker stops new traffic; the bulkhead protects you while it trips.
Test fallbacks like real code. They run during outages — which is the worst time to discover a NullPointerException in your “safe” path.
Run game days. Quarterly is plenty. The team that has already practiced the failure recovers faster than the team that hasn’t.

How the giants do it

Netflix built Hystrix (now in maintenance; the patterns live on in Resilience4j and Polly) and pioneered chaos engineering with the Simian Army. Every Netflix service-to-service call goes through a circuit breaker by default. The blast radius of a single bad service is bounded by design.

Amazon uses cell-based architecture: each service is split into independent cells, customers are assigned to cells, and cells fail independently. A bug in one cell affects ~1/N of users instead of 100%. Combined with shuffle sharding (each customer is assigned its own random subset of the underlying resources), the probability that two customers share exactly the same set of failure dependencies becomes very small.
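The shuffle-sharding claim is a counting argument. As a sketch, assume each customer's shard is drawn uniformly and independently from all possible subsets of the workers; the worker counts below are illustrative, not Amazon's.

```python
from math import comb


def shard_combinations(workers: int, shard_size: int) -> int:
    """Number of distinct shards of `shard_size` that `workers` can form."""
    return comb(workers, shard_size)


def full_overlap_probability(workers: int, shard_size: int) -> float:
    """Chance that a second customer draws exactly the same shard as the first."""
    return 1 / shard_combinations(workers, shard_size)


small = shard_combinations(8, 2)     # 28 possible shards
large = shard_combinations(100, 5)   # 75,287,520 possible shards
```

With 8 workers split into 4 fixed cells of 2, a customer shares a cell with a quarter of all customers. With shuffle shards of 2 drawn from the same 8 workers, only about 1 in 28 customers share both workers, and the ratio improves quickly as the fleet grows.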

Google’s SRE practice codifies the philosophy that Amazon’s Werner Vogels summed up as “everything fails, all the time”: build for it. The error budget you set at the SLO level is the budget that resilience patterns are built to spend.

The single most useful sentence about resilience

Whenever you find yourself writing “but the downstream shouldn’t fail like that” — stop. The downstream will fail like that. The point of these patterns is that you don’t need to predict how it will fail to handle it gracefully when it does.