Design Patterns

Decomposition, integration, resilience, data, migration, observability — the catalog of moves you reach for when building or breaking up a distributed system. Patterns are not the goal; solving the problem in front of you is. Pick deliberately.

Medium · 30 min read

Why Patterns Matter for Microservices

Why Design Patterns Matter

The Problem: Distributed systems break in ways monoliths never did — partial failure, network latency, eventual consistency, deployment skew. Inventing a fix from first principles every time is how teams ship distributed monoliths.

The Solution: A shared vocabulary of patterns — Strangler Fig, BFF, Saga, Sidecar — that have been beaten on by Netflix, Amazon, Uber, and Spotify. Patterns are not blueprints; they are named trade-offs. Knowing the name lets you reach for the right one and skip the ones that don’t fit.

Real Impact: Sam Newman’s Building Microservices, Chris Richardson’s Microservices Patterns and his site microservices.io, and Martin Fowler’s articles on martinfowler.com codify the catalog. Engineers who can talk in patterns argue faster and ship safer.

Real-World Analogy

Patterns are like cooking techniques. Braising, searing, emulsifying — none of them is a recipe. They’re named ways of solving a class of problem. A good cook reaches for the right technique without thinking; a bad cook follows recipes literally and is helpless when something is missing.

The same is true here. “Use the Saga pattern” is not a recipe — it’s a technique. The recipe is your specific business workflow. The pattern just tells you which moves are safe.

Microservices design patterns are organized into seven families — one per problem domain you encounter in production. The rest of this tutorial walks each family in the order you typically meet them: split the monolith, glue it back together, share infrastructure concerns, survive failure, manage data, migrate from legacy, see what is happening.

| Family | Problem It Solves | Headline Patterns |
| --- | --- | --- |
| Decomposition | How to draw service boundaries | Decompose by Business Capability, by Subdomain (DDD) |
| Integration | How clients reach many services | API Gateway, BFF, Aggregator |
| Cross-Cutting | Shared concerns without shared code | Sidecar, Ambassador, Adapter |
| Resilience | Surviving partial failure | Circuit Breaker, Bulkhead, Retry, Timeout |
| Data | Consistency across service boundaries | Database-per-Service, Saga, CQRS, Event Sourcing |
| Migration | Moving off a monolith without a rewrite | Strangler Fig, Anti-Corruption Layer, Branch by Abstraction |
| Observability | Knowing what just happened, across N services | Distributed Tracing, Health Check API, Log Aggregation |

Decomposition Patterns

Before you can build a microservice, you have to decide what it is. Decomposition patterns are how you draw the lines. Get this wrong and every other pattern in this tutorial will fight you.

Decompose by Business Capability

A business capability is something the organization does: Catalog, Pricing, Checkout, Fulfillment. Each becomes a service. The advantage is stability: capabilities change much more slowly than technologies or org charts. Amazon’s decomposition into Cart, Checkout, Payments, Recommendations is the canonical example.

Decompose by Subdomain (DDD)

Eric Evans’ Domain-Driven Design gives the more rigorous version: identify bounded contexts, each with its own ubiquitous language, and let each context become (at most) one service. The strength of DDD is precision — it gives you a way to argue about boundaries instead of just guessing.

Three kinds of subdomain

  • Core: Where you make money. Build it yourself, invest deeply. (Netflix recommendation engine.)
  • Supporting: Necessary, not differentiating. Build minimally or buy. (Internal admin tooling.)
  • Generic: Solved problems. Buy or use SaaS. (Auth, billing, email delivery.)

Spending core energy on generic subdomains is the most common decomposition mistake.

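# A bounded context defines a model that’s consistent inside, fuzzy outside.
# “Customer” means different things to different services. Each class below
# lives in its own service’s codebase; they share a name, not a definition.
from dataclasses import dataclass

# Sales context
@dataclass
class Customer:
    id: str
    lifetime_value: float
    pipeline_stage: str

# Support context (Ticket is a type owned by the Support service)
@dataclass
class Customer:
    id: str
    open_tickets: list["Ticket"]
    csat_score: float

# Billing context (PaymentMethod and Invoice are owned by the Billing service)
@dataclass
class Customer:
    id: str
    payment_method: "PaymentMethod"
    invoice_history: list["Invoice"]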

Three classes, same name, three services. Each is internally complete and externally minimal. That is the shape of a clean decomposition.

Self-Contained Service

The acid test from Newman: can the team that owns this service ship a feature without coordinating with another team? If the answer is “only on Tuesdays,” the service is not self-contained — you have a distributed monolith. The fix is almost always to move data and behavior to the service that owns the decision, not to add another sync API.

Integration Patterns

Once services exist, clients have to reach them. The naive answer — let every client call every service directly — falls apart the moment you have a mobile app, a web app, and a public API with different needs. Integration patterns put a thin shim in between.

API Gateway

A single entry point that routes, authenticates, rate-limits, and translates. Most public-facing microservices systems have one, whether self-hosted (Kong, Envoy, Spring Cloud Gateway) or managed (AWS API Gateway, Apigee).

| Responsibility | Why It Belongs at the Edge |
| --- | --- |
| Routing | Clients shouldn’t know about service topology |
| Authentication | Verify tokens once, propagate identity inward |
| Rate limiting | Shed abusive traffic before it hits any service |
| Request/response transformation | Smooth over service version skew |
| TLS termination | Centralize cert management |
| Observability | Single place to log every external request |

For a deep dive on building one, see the dedicated API Gateway tutorial. The high-order rule is: the gateway is dumb infrastructure, not a place for business logic. Every line of business code that creeps into the gateway makes deploys slower and outages worse.
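To make two of the edge responsibilities concrete, here is a minimal Python sketch of prefix routing and per-client rate limiting. It is an illustration under stated assumptions, not how any particular gateway product works: the route table, the upstream hostnames, and the `TokenBucket` class are all hypothetical names.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical route table: path prefix -> internal upstream base URL.
ROUTES = {
    "/orders": "http://orders.internal:8080",
    "/orders/returns": "http://returns.internal:8080",
    "/catalog": "http://catalog.internal:8080",
}


def resolve_upstream(path: str) -> Optional[str]:
    """Return the upstream for the longest matching prefix, or None."""
    matches = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    if not matches:
        return None
    return ROUTES[max(matches, key=len)]


@dataclass
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""
    capacity: float
    rate: float
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note that neither function knows anything about orders or catalogs beyond a URL prefix. That is the "dumb infrastructure" rule in practice: the gateway decides where a request goes and whether it is allowed in, never what it means.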

Backend for Frontend (BFF)

Sam Newman’s BFF pattern says: don’t make one API serve every client. iOS, Android, and Web have different screen sizes, latency budgets, and feature sets. Give each a tailored backend that calls the underlying services and shapes the response.

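// Mobile BFF: aggressive aggregation, payload tuned for cellular.
// Assumes an Express app, auth middleware that sets req.userId, and thin
// client modules (userService, feedService, notificationService).
app.get('/mobile/home', async (req, res) => {
    try {
        const [user, feed, notifications] = await Promise.all([
            userService.getProfile(req.userId),
            feedService.getRecent(req.userId, { limit: 10 }),
            notificationService.getUnread(req.userId, { limit: 5 }),
        ]);
        // One round trip; thumbnail URLs only; no full-resolution images
        res.json({ user: trimForMobile(user), feed: thumbnailize(feed), notifications });
    } catch (err) {
        // Promise.all rejects as soon as any downstream call fails
        res.status(502).json({ error: 'upstream failure' });
    }
});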

// Web BFF: richer payload, more pre-fetched data, server-side rendering helpers
app.get('/web/home', async (req, res) => {
    // Different aggregation; same downstream services
});

BFF is not a free lunch

You now own N edge services that all do similar-but-different things. Drift is inevitable. Mitigations: shared client SDK for downstream services, shared contract test suite, ruthless code review against feature creep. If your BFFs start asking each other questions, you have invented a new monolith.

Aggregator

A specialized BFF: a service whose only job is to fan out a client request to several backend services and combine the result. Useful when the same composition is needed by several callers (mobile + web + partners), so you don’t duplicate it in three BFFs.
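A minimal sketch of the fan-out-and-combine step, using Python's `asyncio`. The three `fetch_*` coroutines are hypothetical stand-ins for real HTTP or gRPC clients, and the choice of which sections are required versus optional is an assumption made for the example.

```python
import asyncio


# Hypothetical downstream calls; real ones would be HTTP or gRPC clients.
async def fetch_profile(user_id: str) -> dict:
    return {"id": user_id, "name": "Ada"}


async def fetch_orders(user_id: str) -> list:
    return [{"order_id": "o-1", "total": 42.0}]


async def fetch_recommendations(user_id: str) -> list:
    raise TimeoutError("recommendations service is slow")


async def account_overview(user_id: str, timeout_s: float = 0.5) -> dict:
    """Fan out concurrently; degrade optional sections instead of failing."""
    calls = {
        "profile": fetch_profile(user_id),
        "orders": fetch_orders(user_id),
        "recommendations": fetch_recommendations(user_id),
    }
    results = await asyncio.gather(
        *(asyncio.wait_for(call, timeout_s) for call in calls.values()),
        return_exceptions=True,   # collect failures instead of raising the first one
    )
    combined = dict(zip(calls.keys(), results))
    if isinstance(combined["profile"], BaseException):
        raise combined["profile"]            # the profile is required
    for optional in ("orders", "recommendations"):
        if isinstance(combined[optional], BaseException):
            combined[optional] = []          # optional sections fall back to empty
    return combined


overview = asyncio.run(account_overview("u-1"))
```

The design choice worth copying is the explicit split between required and optional sections: an aggregator that fails the whole response whenever any one backend is slow has made itself the weakest link in the chain.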

Cross-Cutting Patterns

TLS, retries, metrics, log shipping, secret rotation — every service needs them, and you do not want to reimplement them per language. Cross-cutting patterns let infrastructure ride alongside the application without being inside it.

Sidecar

A helper process that runs in the same pod (or VM) as your service and handles a cross-cutting concern. The application talks to localhost; the sidecar handles the network. This is how sidecar-based service meshes work: Istio injects an Envoy proxy and Linkerd injects its own linkerd2-proxy into every pod, intercepting all inbound and outbound traffic.

| Sidecar Job | Example |
| --- | --- |
| mTLS, retries, circuit breaking | Envoy, linkerd2-proxy |
| Log shipping | Fluent Bit, Vector |
| Secret fetch and rotation | Vault Agent |
| Local config / feature flags | Consul Agent |
| Metrics scraping/proxying | OpenTelemetry Collector |

The win is enormous: a Java service, a Python service, and a Go service get the same mTLS, retries, and observability without any of them shipping a new library. The cost is real: every pod now runs an extra container with its own CPU and memory footprint, and your control plane is doing a lot of work.

Ambassador

A flavor of sidecar focused on outbound traffic: the application calls the ambassador on localhost, and the ambassador handles the messy parts of talking to the outside world — retries, circuit breaking, request signing, monitoring. Useful when your application can’t (legacy code) or shouldn’t (security boundary) handle that itself.

Adapter

A sidecar that translates between your service’s native interface and the rest of the system — for instance, exposing a Prometheus /metrics endpoint for a service that only emits StatsD. Most useful when wrapping legacy or third-party software you can’t modify.
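The StatsD-to-Prometheus case can be sketched in a few lines of Python. This is a simplified illustration, not a replacement for a real exporter: the class name is hypothetical, and it handles only plain counters and gauges (no timers, sets, sample rates, or signed gauge deltas).

```python
import re
from collections import defaultdict


class StatsdToPrometheusAdapter:
    """Accumulates StatsD counters and gauges, renders Prometheus text format."""

    def __init__(self) -> None:
        self.counters = defaultdict(float)
        self.gauges = {}

    def ingest(self, line: str) -> None:
        # StatsD wire format: <name>:<value>|<type>, e.g. "api.requests:1|c"
        name, rest = line.strip().split(":", 1)
        value, kind = rest.split("|")[:2]
        metric = re.sub(r"[^a-zA-Z0-9_]", "_", name)   # Prometheus-safe name
        if kind == "c":
            self.counters[metric] += float(value)
        elif kind == "g":
            self.gauges[metric] = float(value)
        # Other metric types are ignored in this sketch.

    def render(self) -> str:
        """Body that a /metrics endpoint would return."""
        lines = []
        for name, value in sorted(self.counters.items()):
            lines.append(f"# TYPE {name}_total counter")
            lines.append(f"{name}_total {value}")
        for name, value in sorted(self.gauges.items()):
            lines.append(f"# TYPE {name} gauge")
            lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"
```

In a real deployment the adapter would listen on a local UDP port for StatsD packets and serve `render()` over HTTP; the wrapped service keeps emitting StatsD and never learns that Prometheus exists.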

Resilience Patterns

Why Resilience Patterns Matter

The Problem: A monolith fails as one unit; a distributed system fails partially, all the time. Network blips, GC pauses, slow dependencies, retry storms.

The Solution: Compose a defense-in-depth stack — timeouts, retries with jitter, circuit breakers, bulkheads, fallbacks — so that a single sick downstream cannot take everything else with it.

This is the deepest family in the catalog and gets a tutorial of its own. Read the full treatment in Circuit Breaker & Resilience. Here is the short version.

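| Pattern | What It Does | Compose Order |
| --- | --- | --- |
| Timeout | Cap how long a single attempt can run | Innermost; everything else assumes timeouts exist |
| Retry (with jittered backoff) | Recover from transient failure without storming | Inside the breaker, never outside |
| Circuit Breaker | Fail fast when a dependency is sick | Wraps timeouts and retries |
| Bulkhead | Isolate resource pools per dependency | Outside the breaker; first line of defense |
| Fallback | Last-ditch alternative response | Final wrap; visible to the user |

// Resilience4j composition (Java): decorators wrap inside-out, so order matters.
// TimeLimiter only decorates async calls, so the call is wrapped in a CompletionStage.
// `scheduler` is a ScheduledExecutorService you provide; `protected` is a reserved
// word in Java, so the decorated supplier is named `guarded`.
Supplier<CompletionStage<Result>> guarded = Decorators
    .ofCompletionStage(() -> CompletableFuture.supplyAsync(() -> service.call(req)))
    .withTimeLimiter(timeLimiter, scheduler)   // per-attempt timeout
    .withRetry(retry, scheduler)               // retry transients
    .withCircuitBreaker(breaker)               // fail fast if sick
    .withBulkhead(bulkhead)                    // shed load
    .withFallback(List.of(CallNotPermittedException.class),
                  ex -> cache.getLastKnownGood(req))
    .decorate();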
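If the breaker itself feels like a black box, the state machine behind it is small. The following Python sketch shows the CLOSED, OPEN, and HALF_OPEN transitions under simplifying assumptions: it counts consecutive failures rather than a sliding-window failure rate, it is single-threaded, and the class and parameter names are hypothetical. It is not how Resilience4j implements its breaker.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means CLOSED

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "HALF_OPEN"      # cool-down elapsed; allow a trial call
        return "OPEN"

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn(*args, **kwargs)    # in HALF_OPEN this is the trial call
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip, or re-open after a failed trial
            raise
        self.failures = 0           # any success closes the breaker
        self.opened_at = None
        return result
```

The part that matters for composition is the first line of `call`: while the breaker is open, the wrapped function is never invoked, so whatever sits inside it (timeouts, retries) does no work against a dependency already known to be sick.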

Resilience without idempotency is not resilience

A retried POST /charge can charge a customer twice. Either: (a) only retry idempotent verbs, (b) require an idempotency key on every write, or (c) accept eventual consistency and reconcile. Retrying writes without idempotency is how outages turn into incidents and how incidents turn into post-mortems.
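Option (b) can be sketched as follows. This is a minimal in-memory illustration, assuming the client sends the same idempotency key on every retry of the same logical request; the class name and the `charge_fn` callback are hypothetical, and a production version would keep the key-to-response map in a durable store shared by all instances.

```python
import threading


class IdempotentChargeHandler:
    """Replays the stored response when the same idempotency key is seen again."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn
        self._responses = {}            # key -> response; durable storage in real life
        self._lock = threading.Lock()

    def handle(self, idempotency_key: str, customer_id: str, amount_cents: int) -> dict:
        with self._lock:                # serialize so concurrent retries cannot both charge
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            response = self._charge_fn(customer_id, amount_cents)
            self._responses[idempotency_key] = response
            return response
```

A retry with the same key gets the original response back and the customer is charged once; a genuinely new purchase uses a new key and goes through normally.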

Data Patterns

Data is the hardest part of microservices. The full treatment lives in Data Management; this section gives you the named patterns and when to reach for them.

Database per Service

The non-negotiable rule: each service owns its own data store, and no other service touches it directly. Sharing a database silently couples services together — one team’s schema migration becomes another team’s 3 AM page.

Saga

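Two-phase commit (2PC) across services is rarely an option: every participant has to support the same transaction protocol and hold locks while the coordinator decides, which ties the availability of each service to all the others. The Saga pattern is the alternative: a sequence of local transactions, each with a compensating action that undoes it if a later step fails. Two flavors: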

| Flavor | Coordination | When to Use |
| --- | --- | --- |
| Choreography | Each service listens for events and reacts | 2–3 steps, simple flow, loose coupling |
| Orchestration | A central orchestrator (e.g. Temporal) issues commands | 4+ steps, complex compensation, need to see the workflow |

# Orchestrated saga skeleton: the orchestrator is just a state machine.
import logging

log = logging.getLogger(__name__)

class SagaOrchestrator:
    def __init__(self):
        self.steps = []

    def add_step(self, action, compensation):
        self.steps.append((action, compensation))

    def execute(self):
        completed = []   # compensations for steps that succeeded in this run
        try:
            for action, compensation in self.steps:
                action()
                completed.append(compensation)
            return {"status": "success"}
        except Exception:
            # Compensate in reverse order: undo what we did, not what we hoped to do.
            for compensate in reversed(completed):
                try:
                    compensate()
                except Exception as ce:
                    # Keep going; a failed compensation needs manual follow-up.
                    log.error("compensation failed: %s", ce)
            raise

Uber runs ride-booking sagas on Cadence, the workflow engine it built in-house (Temporal is a fork started by Cadence’s original authors). Spotify uses event-driven choreography for playlist creation across a dozen services. Both have learned the same lesson: idempotent steps are not optional. Events get redelivered, retries happen, and a step that runs twice without complaint is the only kind of step that survives.
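For contrast with the orchestrated skeleton, here is a minimal sketch of the choreographed flavor, including a redelivered event. Everything in it is illustrative: the `EventBus` is an in-memory stand-in for a real broker, and the event names, handlers, and order fields are hypothetical.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory stand-in for a broker; delivers events in publish order."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []                       # every event delivered, for inspection

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.append((event_type, payload))

    def run(self):
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.log.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)


bus = EventBus()
reserved = set()    # inventory service state: order IDs with stock reserved


def inventory_on_order_placed(order):
    if order["order_id"] in reserved:       # idempotent: a redelivery is a no-op
        return
    reserved.add(order["order_id"])
    bus.publish("StockReserved", order)


def payment_on_stock_reserved(order):
    if order["total"] > order["credit_limit"]:
        bus.publish("PaymentFailed", order)  # triggers compensation upstream
    else:
        bus.publish("PaymentCaptured", order)


def inventory_on_payment_failed(order):
    reserved.discard(order["order_id"])      # compensating action; safe to repeat
    bus.publish("StockReleased", order)


bus.subscribe("OrderPlaced", inventory_on_order_placed)
bus.subscribe("StockReserved", payment_on_stock_reserved)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)

order = {"order_id": "o-1", "total": 120.0, "credit_limit": 100.0}
bus.publish("OrderPlaced", order)
bus.publish("OrderPlaced", order)            # simulated broker redelivery
bus.run()
```

No component here knows the whole workflow; it exists only as the chain of subscriptions. That is the loose coupling choreography buys, and also why it gets hard to reason about once the chain grows past a few steps.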

CQRS

Command Query Responsibility Segregation: split the write model from the read model. Writes go to a normalized store optimized for consistency; reads are served from one or more denormalized projections optimized for the queries you actually run.

# Write side
class OrderCommandHandler:
    def handle(self, cmd: CreateOrderCommand):
        order = Order.create(cmd)
        write_db.orders.insert(order)
        event_bus.publish(OrderCreated(order_id=order.id, total=order.total))
        return order.id

# Read side — fed by events, denormalized for the query it serves
class OrderProjection:
    def on_order_created(self, e: OrderCreated):
        read_db.order_summaries.insert({
            "order_id": e.order_id,
            "total": e.total,
            "status": "PENDING",
        })

class OrderQueryHandler:
    def get_summary(self, order_id):
        return read_db.order_summaries.find_one({"order_id": order_id})

Use CQRS when read and write workloads have fundamentally different shapes — e.g. you ingest a few thousand transactions/sec but serve millions of dashboard queries off them. Don’t use it when a single normalized table would be fine; the eventual consistency window is real and complicates everything.

Event Sourcing

Instead of storing current state, store the sequence of events that produced it. Current state is a fold over the event log. Capital One uses this for transaction processing — every debit, credit, and adjustment is an event in Kafka, and the account balance is a projection.

| Traditional | Event Sourcing |
| --- | --- |
| Store current state, mutate in place | Append events, derive state on read |
| History is whatever audit log you remembered to add | History is the data |
| Bug in computation = corrupted data forever | Bug in computation = fix code, replay events |
| Schema changes are migrations | Schema changes are new projections |

The cost of event sourcing

Storage grows monotonically (snapshots help). Reconstructing state is more expensive than reading a row. Schema-evolution of events is its own discipline. Use it where audit, replay, or temporal queries are first-class requirements — banking, ledgers, anything regulated. Skip it for “normal” CRUD.
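"State is a fold over the event log" and "snapshots help" can both be shown in a few lines of Python. This is a toy sketch: the `Event` type, the two event kinds, and the snapshot shape (a balance plus a count of events already applied) are assumptions made for the example.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    kind: str            # "credited" or "debited"
    amount_cents: int


def apply(balance: int, event: Event) -> int:
    """Pure transition function: old state + event -> new state."""
    if event.kind == "credited":
        return balance + event.amount_cents
    if event.kind == "debited":
        return balance - event.amount_cents
    raise ValueError(f"unknown event kind: {event.kind}")


def balance_from(events, snapshot=(0, 0)) -> int:
    """Fold events over a (balance, events_already_applied) snapshot."""
    start_balance, applied = snapshot
    return reduce(apply, events[applied:], start_balance)


log = [Event("credited", 10_000), Event("debited", 2_500), Event("credited", 300)]

full_replay = balance_from(log)                 # replays all three events
snapshot = (balance_from(log[:2]), 2)           # persisted after two events
from_snapshot = balance_from(log, snapshot)     # replays only the tail
```

Both paths arrive at the same balance; the snapshot only changes how much of the log has to be read. It is also why fixing a bug in `apply` and replaying is possible at all: the events are the source of truth and the balance is disposable.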

Outbox

The dual-write problem: you need to both update the database and publish an event, atomically, and no transaction spans the database and the broker. The Outbox pattern solves it by writing the event to an outbox table in the same transaction as the state change; a separate relay process tails the outbox and publishes to the broker. Debezium, which runs on Kafka Connect, can act as that relay by streaming the outbox table through change data capture. Delivery is at-least-once, so consumers still have to be idempotent.
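The mechanics fit in a short sketch using Python's built-in `sqlite3`. The table layout, the function names, and the polling relay are assumptions for illustration; a CDC-based relay reads the database log instead of polling, but the transactional write on the producer side looks the same.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: str, total_cents: int) -> None:
    # One local transaction: both rows commit, or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )


def relay_once(publish) -> int:
    """Publish pending rows in order; mark each only after publish succeeds."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))    # may raise; the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


sent = []
create_order("o-1", 4200)
relay_once(lambda event_type, payload: sent.append((event_type, payload)))
```

If the relay crashes after publishing but before marking the row, the event is published again on the next pass. The pattern trades "exactly once" for "never lost", which is the right trade as long as consumers deduplicate.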

Migration Patterns

Most microservices systems are not greenfield — they’re carved out of monoliths. Migration patterns are how you do that without a multi-year “big bang” that ships nothing and risks everything. The full treatment is in Migration Strategies.

Strangler Fig

Martin Fowler’s name for the pattern, after the strangler fig tree that grows around a host tree until the host is gone. You put the monolith behind a routing layer (often the API gateway), peel off one capability at a time into a new service, route traffic to the new service, and decommission the old code path when it’s quiet.

  1. Identify a bounded context inside the monolith.
  2. Build a new service that owns that context, with its own database.
  3. Route reads to both, compare results (dark traffic).
  4. Cut writes to the new service, dual-write back to the monolith if needed.
  5. Stop calling the monolith path, then delete the dead code.
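The routing layer behind those steps can be sketched as a small Python class. It is a simplified illustration with hypothetical names: `monolith` and `new_service` are any callables that take a capability and a request, and the shadow comparison is only safe for reads, since replaying a write against both systems would apply it twice.

```python
class StranglerRouter:
    """Routes each capability to the monolith or the new service, one at a time."""

    def __init__(self, monolith, new_service):
        self.monolith = monolith
        self.new_service = new_service
        self.migrated = set()       # capabilities fully cut over
        self.shadowed = set()       # capabilities in dark-traffic comparison
        self.mismatches = []

    def handle(self, capability: str, request: dict):
        if capability in self.migrated:
            return self.new_service(capability, request)
        response = self.monolith(capability, request)     # monolith stays authoritative
        if capability in self.shadowed:
            try:
                candidate = self.new_service(capability, request)
                if candidate != response:
                    self.mismatches.append((capability, request, response, candidate))
            except Exception as exc:
                # Shadow failures are recorded but never reach the client.
                self.mismatches.append((capability, request, response, repr(exc)))
        return response
```

Moving a capability from `shadowed` to `migrated` is the cutover, and moving it back is the rollback. An empty `mismatches` list over a meaningful traffic window is the evidence that the cutover is safe.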

Amazon’s migration off its 2000-era monolith to hundreds of services was a multi-year strangler. So was Etsy’s. So is most of yours, if you’re honest. The pattern is correct; the rate of progress is what teams underestimate.

Anti-Corruption Layer

When the new service has to talk to the old monolith (and it will), don’t leak the monolith’s domain model into the new service. Build a thin anti-corruption layer — a translation shim that maps between the two models. Without it, the new service inherits the monolith’s assumptions and you get a microservice with all the constraints of the monolith and none of the upside.
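As a sketch of what that translation shim looks like in Python: the legacy column names (`CUST_NO`, `STAT_CD`, and so on), the status codes, and the `Subscriber` model below are all invented for the example, standing in for whatever the real monolith and the real new service use.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Subscriber:
    """The new service's own model; nothing here mirrors the legacy schema."""
    subscriber_id: str
    email: str
    active: bool
    joined_on: date


class LegacyCustomerTranslator:
    """Anti-corruption layer: the only code that knows the monolith's field names."""

    _ACTIVE_CODES = {"A", "T"}      # hypothetical legacy codes: Active, Trial

    def to_subscriber(self, row: dict) -> Subscriber:
        return Subscriber(
            subscriber_id=str(row["CUST_NO"]),
            email=row["EMAIL_ADDR"].strip().lower(),
            active=row["STAT_CD"] in self._ACTIVE_CODES,
            joined_on=datetime.strptime(row["CRT_DT"], "%Y%m%d").date(),
        )
```

When the monolith is finally retired, the translator is the only file that has to be deleted. If legacy field names show up anywhere else in the new service, the layer has leaked.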

Branch by Abstraction

For changes too large to do behind a single feature flag: introduce an interface in front of the existing implementation, build a new implementation behind the same interface, switch traffic, retire the old implementation. Used heavily by Trunk-Based Development teams to avoid long-lived branches.
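The four moves map directly onto code. In this Python sketch the tax domain, the class names, and the rates are hypothetical; amounts are integer cents and rates are basis points so the arithmetic stays exact.

```python
from abc import ABC, abstractmethod


class TaxCalculator(ABC):
    """Step 1: the abstraction that callers depend on."""

    @abstractmethod
    def tax_cents(self, subtotal_cents: int, region: str) -> int: ...


class LegacyTaxCalculator(TaxCalculator):
    """Step 2: the existing behavior, moved behind the interface unchanged."""

    def tax_cents(self, subtotal_cents: int, region: str) -> int:
        return subtotal_cents * 800 // 10_000          # flat 8%


class RegionalTaxCalculator(TaxCalculator):
    """Step 3: the replacement, built on trunk while the legacy one serves traffic."""

    RATES_BPS = {"CA": 725, "NY": 400}                 # hypothetical rates, basis points

    def tax_cents(self, subtotal_cents: int, region: str) -> int:
        return subtotal_cents * self.RATES_BPS.get(region, 0) // 10_000


def make_calculator(use_new: bool) -> TaxCalculator:
    """Step 4: a single switch point; flip it, then delete the legacy class."""
    return RegionalTaxCalculator() if use_new else LegacyTaxCalculator()


def checkout_total(subtotal_cents: int, region: str, calculator: TaxCalculator) -> int:
    return subtotal_cents + calculator.tax_cents(subtotal_cents, region)
```

`checkout_total` never changes during the migration, which is the point: the unfinished replacement ships to trunk continuously but stays dark until the switch is flipped.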

Observability Patterns

You cannot operate what you cannot see. In a monolith you grep one log file; in a microservice mesh, that log is scattered across N services and 10*N pods. Observability patterns put it back together.

Distributed Tracing

Every inbound request gets a trace ID. Every outbound call propagates it. A trace collector (Jaeger, Tempo, Honeycomb, Lightstep) reassembles the spans into a waterfall view of one request across all services it touched. OpenTelemetry is the cross-language standard.

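# OpenTelemetry instrumentation in a Python FastAPI service.
# Spans are only exported once a TracerProvider and exporter are configured
# (SDK setup not shown).
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = FastAPI()

FastAPIInstrumentor.instrument_app(app)        # inbound spans
RequestsInstrumentor().instrument()             # outbound HTTP spans

tracer = trace.get_tracer(__name__)

def checkout(order_id: str):
    # charge() and ship() are this service's own functions (not shown); HTTP calls
    # they make through `requests` become child spans of "checkout".
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        charge(order_id)
        ship(order_id)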

Health Check API

Every service exposes /health (process is alive) and /ready (process is alive and can serve traffic — DB reachable, dependencies healthy). Kubernetes uses these as liveness and readiness probes; the load balancer uses them to drain unhealthy instances.
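The difference between the two endpoints is easiest to see as two functions. This Python sketch is framework-agnostic and uses hypothetical names: each function returns a status code and a JSON body that a route handler would send, and `checks` maps a dependency name to a zero-argument callable that returns truthy when the dependency is usable.

```python
import json


def liveness() -> tuple:
    """/health: answers as long as the process can run code at all."""
    return 200, json.dumps({"status": "alive"})


def readiness(checks: dict) -> tuple:
    """/ready: 200 only if every dependency check passes, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False       # a failing check means "not ready", never a crash
    ok = all(results.values())
    return (200 if ok else 503), json.dumps({"ready": ok, "checks": results})
```

Liveness deliberately checks nothing external: restarting a healthy process because its database is down only makes the outage worse. Readiness is where dependency state belongs, because failing it takes the instance out of rotation without killing it.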

Log Aggregation

Ship structured (JSON) logs from every container to a central store — Loki, Elastic, Splunk, Datadog. Every log line carries the trace ID. Now you can pivot from a span in Jaeger to the exact log line that wrote it. The combination of traces, metrics, and logs — the “three pillars” — is the minimum viable production observability stack.
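A minimal sketch of the producing side, using only Python's standard `logging` module. The field names and the hard-coded trace ID are assumptions for illustration; in a traced service the ID would be read from the active span's context rather than passed by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),   # supplied via `extra=`
        })


stream = io.StringIO()          # stands in for stdout, which the log shipper tails
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False        # keep the example's output in one place

logger.info("payment captured", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
line = json.loads(stream.getvalue())
```

Because every line is a self-describing JSON object, the central store can index `trace_id` as a field, and the pivot from a trace to its logs becomes a lookup rather than a text search.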

Pattern Selection Heuristics

Patterns are not free. Each one adds moving parts, vocabulary, and a class of failure modes. The job is to apply the smallest set of patterns that solves the problem in front of you.

| If your problem is… | Reach for… | Don’t reach for |
| --- | --- | --- |
| Multi-step workflow across services | Saga (orchestrated if 4+ steps) | 2PC, distributed transactions |
| Read load >> write load (10x+) | CQRS with denormalized projections | Event Sourcing, unless you also need audit |
| Audit trail / temporal queries / replay | Event Sourcing + CQRS | Cron jobs that copy DB snapshots |
| One sick downstream taking the rest with it | Circuit Breaker + Bulkhead + Timeout | “Just add more retries” |
| Mobile and web have different needs | BFF per client class | One God-API trying to please everyone |
| Replacing a monolith | Strangler Fig + Anti-Corruption Layer | Big-bang rewrite |
| Cross-cutting infra concerns in N languages | Sidecar (probably a service mesh) | A Java library that nobody else can use |
| Public API platform | API Gateway + Rate Limiting + Circuit Breaker | Direct service exposure |

The first three patterns to adopt

  1. Database per Service — the foundation; everything else assumes it.
  2. Circuit Breaker + Timeout on every cross-service call — cheap, library-provided, prevents your first cascading failure.
  3. Distributed Tracing — you will need it the first time something is “slow” and nobody knows where.

If you have those three and a CI/CD pipeline that can deploy services independently, you have a real microservices system. Everything else can be added when you actually feel the pain.

Anti-patterns to actively avoid

Distributed Monolith

Services with synchronous chains, shared schemas, or coordinated deploys. Symptoms: “we have to deploy A and B together,” “a column rename broke three services.” Fix: own the data, communicate via async events where possible, never share a database.

Chatty Services

One user request triggers 30 sync calls. Each adds latency and a failure mode. Fix: API composition, denormalized projections, event-driven updates, or just merge the services if the boundary was wrong.

Microservices for every feature

Twelve services for a CRUD app run by a four-person team. The operational tax exceeds any architectural benefit. Fix: start with a modular monolith and extract services only when you feel a specific pain (scaling, team autonomy, polyglot tech).

Best Practices

The short list

  • Patterns are tools, not goals. “We use CQRS” is not an architecture; “we use CQRS for the catalog read path because reads are 50x writes” is.
  • Decompose by capability, then check with DDD. If two engineers can’t agree on a service’s name, the boundary is wrong.
  • Database per service is non-negotiable. Sharing the DB is how all the other patterns silently fail.
  • Idempotency is the price of admission. Every saga step, every event handler, every retried call has to be safe to run twice.
  • Async by default, sync by exception. Sync coupling is what creates cascading failure; events let services move at their own pace.
  • Compose resilience patterns deliberately. From the inside out: timeout, retry, circuit breaker, bulkhead, fallback. With retries inside the breaker, an open breaker rejects the call before the retry loop even starts.
  • Strangle, don’t rewrite. Big-bang rewrites have a track record. It is bad.
  • Observe everything. Traces with IDs, logs with the same IDs, metrics on every breaker/bulkhead/queue. You will need them at 2 AM.

How the giants compose patterns

Netflix is the textbook resilience case study: every service-to-service call is wrapped in a circuit breaker (originally Hystrix, which is now in maintenance mode and whose README points new projects to Resilience4j), every dependency lives in a bulkhead, and fallbacks are exercised by chaos experiments such as Chaos Monkey before they have to run for real. The blast radius of any one bad service is bounded by design.

Amazon combines decomposition (~hundreds of services), API Gateway (Coral, then API Gateway), and cell-based architecture — a structural pattern where each service is split into independent cells with shuffle-sharded customer assignment, so the probability of any two customers sharing all the same failure modes is near zero. Combined with strangler-fig migration over years, this is how Amazon got from monolith to “deploy every 11.6 seconds.”

Uber runs ride booking as an orchestrated saga on Cadence, its in-house workflow engine (Temporal is a later fork of it): validate rider, match driver, charge payment, start trip, with compensations at every step. The lesson Uber learned (and published) is that orchestration scales for complex workflows where choreography becomes impossible to reason about.

Spotify built its playlist system as event-driven choreography across a dozen-plus services, processing 100k+ events/second. The headline lesson Spotify keeps re-learning: idempotency is not optional. Events get replayed, services restart mid-stream, and the only handlers that survive are the ones safe to run twice.

The single most useful sentence about patterns

A pattern is a name for a trade-off, not a name for a solution. If you can’t state the trade-off you’re accepting when you adopt one — the eventual-consistency window of a saga, the storage tax of event sourcing, the operational tax of a service mesh — you’re not using the pattern. You’re cosplaying it.