Data Management | LIZIU Microservices

Why Data Management is the Hardest Part

Why Data Matters Most

The Problem: Splitting a monolith into services is a refactor. Splitting the database is a migration that touches every team, every query, every report, and every ad-hoc script anyone has ever written. Most failed microservices migrations failed at the data layer, not the code layer.

The Solution: Accept that you cannot have ACID across services. Then pick the right consistency pattern — database-per-service, saga, outbox, CQRS — for each business flow. The patterns are not interchangeable; choosing the wrong one is how teams end up with distributed monoliths.

Real Impact: eBay, Amazon, LinkedIn, and Netflix all spent years getting their data architecture right after the initial service split. Code is rewritable; data architectures calcify.

Real-World Analogy

A monolith with a shared database is like a shared kitchen in a co-living house:

Everyone has access to every pan, every spice, every fridge shelf.
One bad cook leaves the stove a mess and now nobody can cook.
You can’t change the layout without consulting the whole house.
Conflict is constant. Coordination is exhausting.

Microservices with database-per-service are like separate apartments:

Each tenant owns their own kitchen, fridge, and pantry.
If one apartment floods, the others stay dry.
You want sugar from your neighbor? You knock and ask — you don’t walk in.
The cost is more plumbing. The benefit is independence at scale.

The whole rest of this tutorial is about that “more plumbing.”

Most engineers underestimate this. They will spend three months extracting a payment service from a monolith, deploy it triumphantly, and then discover that the new service still JOINs six tables in the legacy database because nobody wanted to fight that battle. That is not a microservice. That is a deployment unit pretending to be a microservice while the data layer is still a monolith. The pattern has a name: distributed monolith. It has all the operational pain of microservices and none of the independence benefits.

Real microservices means real data ownership. Which means giving up the conveniences of a single relational database: foreign keys across boundaries, multi-table transactions, ad-hoc reporting joins, “just look it up” debugging. In exchange you get independent scaling, independent deploys, blast-radius containment, and the freedom to pick the right database for each workload. The trade is real and it is permanent.

Database-per-Service

Why One Database Per Service

The Problem: A shared database creates invisible coupling. Service A’s schema migration breaks Service B’s queries. Service C’s sloppy SELECT * blocks Service D’s writes. Nothing can be deployed independently.

The Solution: Each service owns its data exclusively. The database is an implementation detail, not an integration surface. Other services interact through the owning service’s API or its published events — never the database directly.

Database-per-service is the foundational rule. Break it and most of the other patterns in this tutorial stop working. The rule is strict: no other service connects to your database. Period. Not for “just a quick read.” Not for the analytics team. Not for the BI dashboard. Other consumers either go through your service’s API or subscribe to events you publish.

Never share a database between services

The single fastest way to ruin a microservices migration is to let two services write to the same table. Once that happens, you cannot change the schema without coordinating both teams. You cannot scale either service independently. You cannot reason about consistency. You have a distributed monolith with extra latency.

This includes “read-only” access. Today’s read-only query becomes tomorrow’s “just one little update” because nobody enforces it.

Polyglot Persistence

Database-per-service unlocks something the monolith couldn’t do: pick the right database engine for each service’s data shape. The order service has transactional, relational data — Postgres. The product catalog needs full-text search and faceted filtering — Elasticsearch or OpenSearch. The cart is ephemeral key-value — Redis or DynamoDB. The recommendation graph is, well, a graph — Neo4j. Each service picks what fits.

Service	Data Shape	Good Fit	Why
Orders	Transactional, relational	Postgres, MySQL	ACID per order, joins on customer + line items
Catalog	Document, searchable	MongoDB, Elasticsearch	Flexible schema per product type, faceted search
Cart	Ephemeral key-value	Redis, DynamoDB	High write rate, TTL-based eviction, low latency
Inventory	Counter-heavy, hot keys	DynamoDB, Cassandra	Write throughput, partition tolerance
Analytics events	Append-only, time-series	ClickHouse, Druid	Bulk ingest, columnar aggregation
Recommendations	Graph relationships	Neo4j, Neptune	Multi-hop traversals are awkward in SQL
User sessions	Short-lived key-value	Redis, Memcached	Sub-millisecond reads, TTL eviction

Polyglot is a tax, not a free lunch

Every new database engine costs you operational expertise — backups, monitoring, upgrades, on-call runbooks, security hardening, capacity planning. A team running five different databases pays five times the operational cost. The right number is “as few as you can get away with.” Most companies should standardize on Postgres + Redis + an object store and only add a fourth when the workload demands it.

Drawing the boundaries

The hard part of database-per-service is figuring out where to draw the lines. Service boundaries should align with bounded contexts (in DDD terms). The signs you drew the line wrong:

Two services frequently call each other in the same request — you split a single transaction across services.
You find yourself doing distributed joins — the data should be co-located.
You need a saga for a flow that feels like one operation — you over-decomposed.
One service’s deploy keeps breaking another — the contract is too thin or wrong.

If any of these are happening, the right answer is usually merge the services back together, not add more coordination patterns. Microservices are a deployment optimization, not a moral imperative.

The Distributed Transaction Problem

Why ACID Doesn’t Cross the Wire

The Problem: “Place an order” touches the order service, payment service, and inventory service. In a monolith this is one database transaction. Across services it is three independent commits. If payment succeeds and inventory fails, you have charged a customer for an out-of-stock item.

The Solution: Stop pretending you can have ACID across services. Embrace eventual consistency, design explicit compensating actions, and choose the saga pattern for multi-service flows.

Two-phase commit and why nobody uses it

The classic textbook answer is two-phase commit (2PC). A coordinator asks every participant “can you commit?”, and only if everyone says yes does it tell them to actually commit. In theory this gives you ACID across services. In practice 2PC is dead for three reasons:

It blocks. Participants hold locks during the whole protocol. A slow participant freezes everyone else. At microservice scale this is unworkable.
It assumes the coordinator never dies. If the coordinator crashes between phase 1 and phase 2, participants are stuck holding locks waiting for an answer that never comes. Recovery is fragile and operationally painful.
Most modern databases don’t support it cleanly across vendors. XA exists but is poorly supported, especially for NoSQL engines, message brokers, and SaaS APIs. You can’t enlist Stripe or DynamoDB in a 2PC.

The serious database literature is brutal on 2PC: Pat Helland’s Life Beyond Distributed Transactions is the canonical “just stop” paper. The industry has spent twenty years agreeing with him.

Embracing eventual consistency

Once 2PC is off the table, you have to accept that your system will pass through temporarily inconsistent states. Order created, payment not yet processed. Payment processed, inventory not yet decremented. The question is no longer “how do I prevent inconsistency” but “how do I bound it, detect it, and recover from it.”

Consistency Model	Latency Window	Right For	Wrong For
Strong (single-DB ACID)	0 ms	Money inside one service’s ledger	Cross-service flows
Read-your-writes	Bounded by session	UX after a write (“I just edited my profile”)	Anything multi-user
Eventual	Seconds to minutes	Most cross-service flows	Real-time pricing, fraud holds
Causal	Bounded by causality chain	Comments-on-posts, threaded UX	Aggregations, counters

The hardest engineering work in microservices is naming the consistency budget for each business flow. “Inventory must be decremented within 30 seconds of payment success” is a budget. “Don’t worry, it’ll be fine” is not.

The Saga Pattern

Why Sagas

The Problem: A multi-service business operation needs all-or-nothing semantics, but you can’t use a transaction. If step 3 fails, steps 1 and 2 are already committed.

The Solution: A saga is a sequence of local transactions, each with a defined compensating action. If a step fails, you run the compensations for the steps that already succeeded — an explicit, business-aware rollback.

The original saga paper is from 1987 (Garcia-Molina and Salem). The idea predates microservices by decades; microservices just made it suddenly relevant to everyone.

Choreography vs Orchestration

Sagas come in two flavors. Both are real; neither is universally better.

	Choreography	Orchestration
How it works	Each service publishes events; other services subscribe and react	A central orchestrator tells each service what to do
Coupling	Loose — services know nothing about each other	Centralized — orchestrator knows the whole flow
Visibility	Hard to see the flow; logic is scattered	One place to read and trace
Error handling	Each service must know its compensation triggers	Orchestrator decides when to compensate
Best at	Simple 2–3 step flows, naturally event-driven	Complex flows, long-running, branching logic
Tools	Kafka, RabbitMQ, EventBridge	Temporal, AWS Step Functions, Camunda, Conductor

Rule of thumb: anything beyond three steps or with conditional branches gets an orchestrator. Choreography sounds elegant on a whiteboard and turns into an unreadable rats-nest of subscribers in production. The Confluent and Uber engineering blogs have both written extensively about migrating off pure choreography toward orchestration as flows grew.

Saga orchestration in Temporal-style pseudocode

# Temporal-style workflow. Workflow code is deterministic and durable.
# Activities are the calls to actual services; they retry automatically.
from temporalio import workflow, activity
from dataclasses import dataclass

@dataclass
class OrderInput:
    customer_id: str
    items: list
    payment_method_id: str

@workflow.defn
class PlaceOrderSaga:
    @workflow.run
    async def run(self, input: OrderInput) -> str:
        compensations = []          # LIFO stack of cleanups

        try:
            # 1. Create order in PENDING state
            order_id = await workflow.execute_activity(
                create_order, input, start_to_close_timeout="5s",
            )
            compensations.append((cancel_order, order_id))

            # 2. Reserve inventory
            reservation_id = await workflow.execute_activity(
                reserve_stock, order_id, input.items, start_to_close_timeout="5s",
            )
            compensations.append((release_stock, reservation_id))

            # 3. Charge the card (idempotency key = order_id)
            charge_id = await workflow.execute_activity(
                charge_card, order_id, input.payment_method_id,
                start_to_close_timeout="10s",
            )
            compensations.append((refund_card, charge_id))

            # 4. Schedule shipment
            await workflow.execute_activity(
                schedule_shipment, order_id, start_to_close_timeout="5s",
            )

            # 5. Mark order CONFIRMED
            await workflow.execute_activity(
                confirm_order, order_id, start_to_close_timeout="5s",
            )
            return order_id

        except Exception as e:
            # Run compensations in reverse order. Each must be idempotent.
            for compensate, arg in reversed(compensations):
                try:
                    await workflow.execute_activity(
                        compensate, arg, start_to_close_timeout="30s",
                    )
                except Exception as comp_err:
                    workflow.logger.error(f"compensation failed: {comp_err}")
                    # Don't stop — keep trying the rest. Surface to ops if any fail.
            raise e

Compensations are not database rollbacks

A database rollback erases history. A compensation is a new transaction that undoes the business effect of an earlier one. refund_card is not the inverse of charge_card — it is a separate, recorded event with its own audit trail. Compensations must be idempotent (safe to retry) and they must be defined in business terms, not technical ones. “What does the customer see if this step is undone?” is the right question.

Some operations cannot be compensated

Sending an email. Charging a non-refundable fee. Triggering a physical action like “ship the package.” Once these happen, you can’t undo them — you can only do something else that addresses the customer impact. Push these steps to the very end of the saga (so all the recoverable work is done first), or design them to be deferrable until the saga commits. The pattern is sometimes called pivot transactions — the last reversible step before the irreversible ones.

The Outbox Pattern

Why You Need an Outbox

The Problem: Inside a service, you commit a row to your database and publish an event to Kafka. These are two separate systems — the dual-write problem. If the DB commit succeeds and the Kafka publish fails, the rest of the world never hears about the change. You have an orphaned record.

The Solution: Don’t publish to Kafka inside your service at all. Instead, write the event into an outbox table inside the same database transaction as the business change. A separate worker (or change-data-capture tool like Debezium) reads the outbox and publishes to Kafka. One transaction, one source of truth.

The dual-write problem is the most common bug in event-driven microservices. It is also one of the easiest to write and one of the hardest to detect — the missing event just looks like “the downstream system was a little behind today.” The outbox pattern eliminates the entire class of bug.

The outbox table

-- Postgres outbox schema
CREATE TABLE outbox (
    id            UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id  UUID        NOT NULL,
    aggregate_type TEXT       NOT NULL,        -- "Order", "Payment", etc.
    event_type    TEXT        NOT NULL,        -- "OrderCreated", "OrderShipped"
    payload       JSONB       NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at  TIMESTAMPTZ                            -- NULL until shipped to Kafka
);

CREATE INDEX outbox_unpublished_idx
    ON outbox (created_at)
    WHERE published_at IS NULL;

Inside the service, the business write and the event are in one transaction:

async def create_order(cmd: CreateOrderCommand) -> str:
    async with db.transaction():
        order_id = await db.execute(
            "INSERT INTO orders (id, customer_id, total) "
            "VALUES ($1, $2, $3) RETURNING id",
            uuid.uuid4(), cmd.customer_id, cmd.total,
        )
        await db.execute(
            "INSERT INTO outbox (aggregate_id, aggregate_type, event_type, payload) "
            "VALUES ($1, $2, $3, $4)",
            order_id, "Order", "OrderCreated",
            json.dumps({"order_id": str(order_id), "customer_id": cmd.customer_id, "total": cmd.total}),
        )
        return order_id
    # Both rows commit together or neither does. No event is ever orphaned.

Two ways to ship the outbox to Kafka

Option A: a polling worker. A background process polls the outbox for unpublished rows, publishes them to Kafka, then marks them published_at = now(). Simple, ugly, works.

async def outbox_publisher():
    while True:
        async with db.transaction():
            rows = await db.fetch(
                "SELECT id, aggregate_type, event_type, payload "
                "FROM outbox WHERE published_at IS NULL "
                "ORDER BY created_at LIMIT 100 FOR UPDATE SKIP LOCKED"
            )
            for r in rows:
                await kafka.send(
                    topic=f"{r['aggregate_type'].lower()}-events",
                    key=str(r["aggregate_id"]),       # preserves per-aggregate ordering
                    value=r["payload"],
                    headers={"event_type": r["event_type"]},
                )
                await db.execute(
                    "UPDATE outbox SET published_at = now() WHERE id = $1", r["id"],
                )
        await asyncio.sleep(0.5)

Option B: change-data-capture (CDC). Tools like Debezium tail the database’s write-ahead log (WAL on Postgres, binlog on MySQL) and emit a Kafka message for every row change. No polling, sub-second latency, scales to massive throughput. This is what LinkedIn, Wix, and Shopify run in production.

// Debezium connector for the outbox table
{
  "name": "orders-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "orders",
    "plugin.name": "pgoutput",
    "slot.name": "orders_outbox_slot",
    "publication.name": "orders_outbox_pub",
    "table.include.list": "public.outbox",
    "tombstones.on.delete": "false",

    // Use the EventRouter SMT to turn one outbox row into a properly-routed event
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.route.topic.replacement": "${routedByValue}-events",
    "transforms.outbox.table.field.event.id": "id",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload"
  }
}

Outbox is at-least-once. Consumers must be idempotent.

The outbox guarantees an event is published at least once. The publisher can crash after Kafka acknowledges but before it updates published_at — the next worker run will republish that event. Every consumer downstream of an outbox must handle duplicate events safely. The standard tactic: include an event_id in every message and have consumers track the IDs they have processed (in a small processed_events table or a Redis set with TTL).

CQRS (Command Query Responsibility Segregation)

Why Separate Reads from Writes

The Problem: The data shape that’s great for writing (normalized, ACID, validated) is rarely the shape that’s great for reading (denormalized, fast, search-friendly). Forcing both onto one model means both suffer.

The Solution: Build two models. The write model handles commands and enforces invariants. The read model is a denormalized view tuned for query patterns. Keep them in sync via events.

CQRS is one of the most over-applied patterns in this entire space. It is genuinely powerful, and it is also genuinely overkill for 80% of the services that adopt it. The right test is asymmetry: does this service have very different read and write workloads?

CQRS Pays Off When	CQRS Is Overkill When
Reads vastly outnumber writes (100:1 or more)	Read and write rates are similar
Read queries need a different shape (search index, cache, projection)	A single normalized schema serves both fine
Reads can tolerate seconds of staleness	Reads must reflect writes immediately
You want to scale reads independently from writes	Same scaling profile is fine
Different teams own the read and write paths	One team, one service, one model

A pragmatic CQRS setup

Most production CQRS does not look like the textbook. It looks like: writes go to Postgres, an outbox publishes events, and a worker projects those events into a denormalized read store (Elasticsearch for search, Redis for hot reads, materialized views in Postgres itself for reporting).

# Read-model projector. Subscribes to order events and maintains a denormalized view.
async def project_order_events():
    async for msg in kafka_consumer.subscribe("order-events"):
        evt = msg.value
        event_id = msg.headers["event_id"]

        # Idempotency: skip if we've already processed this event_id
        if await read_db.fetchval(
            "SELECT 1 FROM processed_events WHERE event_id = $1", event_id,
        ):
            await msg.ack()
            continue

        async with read_db.transaction():
            if evt["event_type"] == "OrderCreated":
                await read_db.execute(
                    "INSERT INTO order_summary "
                    "(order_id, customer_id, customer_name, total, status, created_at) "
                    "VALUES ($1, $2, $3, $4, 'PENDING', $5)",
                    evt["order_id"],
                    evt["customer_id"],
                    await lookup_customer_name(evt["customer_id"]),  # denormalize on write
                    evt["total"],
                    evt["created_at"],
                )
            elif evt["event_type"] == "OrderShipped":
                await read_db.execute(
                    "UPDATE order_summary SET status = 'SHIPPED', shipped_at = $2 "
                    "WHERE order_id = $1",
                    evt["order_id"], evt["shipped_at"],
                )
            # ... handle other event types

            await read_db.execute(
                "INSERT INTO processed_events (event_id, processed_at) VALUES ($1, now())",
                event_id,
            )
        await msg.ack()

The read model is now a single, fast, denormalized table optimized for the queries the API actually serves. No joins, no aggregations at read time, sub-millisecond response.

CQRS does not give you stronger consistency — it gives you weaker

Many engineers reach for CQRS hoping it solves a consistency problem. It does the opposite. The read model is always behind the write model by some replication lag — usually milliseconds, sometimes seconds, occasionally minutes if the projector falls behind. If your UX shows “you just placed an order” immediately after the write, you have to either (a) read from the write model for that one query, or (b) accept that the user might briefly see no orders. Pick one consciously.

Event Sourcing

Why Events as the Source of Truth

The Problem: A traditional database stores the current state. You lose all history of how it got there. “Why is this account balance $42.17?” becomes archaeology.

The Solution: Store the events — the immutable facts of what happened — and derive current state by replaying them. Current state becomes a projection of an append-only log.

Event sourcing is the CQRS-on-steroids version of data management. It is genuinely beautiful when it fits. It is also operationally heavy, requires schema discipline, and is wrong for most services. The famous warning from Greg Young (who coined the term): “don’t do event sourcing unless you have a really good reason.”

How it works

Commands express intent: PlaceOrder, CancelOrder, ApplyDiscount.
Events are the immutable result: OrderPlaced, OrderCancelled, DiscountApplied.
Events are appended to an event store (EventStoreDB, Kafka with infinite retention, a custom Postgres table).
Current state is rebuilt by folding events: state = events.reduce(apply_event, initial_state).
Projections derive read models from the same event stream.
Snapshots save intermediate state so you don’t replay 10 million events on every read.

Event Sourcing Wins	Event Sourcing Hurts
Audit-heavy domains (banking, healthcare, regulated industries)	CRUD apps that just want to store the current value
Domains where “how did we get here” is a real question	Domains where current state is all anyone cares about
Time-travel queries, replays, what-if analyses	Teams unfamiliar with the pattern (the learning curve is real)
You can fully replay history to fix a projection bug	Schema migration on past events is genuinely painful

Why event sourcing stays rare

Three honest reasons most teams should not adopt it: (1) the cognitive load is enormous — you can’t just SELECT the current state; (2) evolving event schemas after they’re in production is hard, because the events are now part of your history forever; (3) most engineers have only seen CRUD and will fight the model. The right answer for almost every service is “use an outbox + CQRS read models” instead, which gets you 80% of the benefits with 20% of the cost.

Data Replication and Read Models

Why Replicate Data Between Services

The Problem: The order service needs the customer’s shipping address to render an order detail page. Customer data lives in another service. Calling that service on every read is slow, fragile, and creates a hard runtime dependency.

The Solution: Replicate the data you need, not the data you might want. Subscribe to the customer service’s events and maintain a local read model with just the fields you actually use.

This is the workhorse pattern of mature microservices. Most production systems are mostly this: one service publishes events, several other services maintain their own denormalized projections of those events. Each downstream owns the shape of its read model. No one calls anyone else on the synchronous path for read data.

Replicate the fields, not the table

Don’t copy the customer service’s entire schema into the order service. Copy only the fields the order service needs — usually customer_id, display_name, email, maybe a default address. If you need more later, extend the projection. Aggressive replication makes services brittle in exactly the way database-per-service was supposed to fix.

The eventual consistency budget

For every replicated read model, you need to know — explicitly — how stale is acceptable. This is the consistency budget. It belongs in your service’s SLO, your monitoring, and your runbook.

P50 lag: half the events are projected within X ms. Usually 50–500 ms.
P99 lag: 99% within Y seconds. Usually 1–10 s.
Max acceptable lag: beyond this, page someone. The projector is broken.
Lag at restart: after a deploy, how long to catch up? Bound this with parallel consumers or partitioned topics.

Without these numbers, “eventual consistency” is a hand-wave. With them, it’s an SLO you can measure, alert on, and improve.

Dealing with stale reads in the UI

The classic gotcha: user submits a form, gets redirected to the “your changes” page, and sees stale data because the read model hasn’t caught up yet. Three working tactics:

Read-your-writes via session affinity. Route the user’s next read to the write model for a few seconds after a write.
Optimistic update on the client. Show what the user just submitted, then reconcile with the server’s view when it arrives.
Wait-for-projection. The write API returns a version token; the read API blocks (briefly) until the projector has caught up to that version. LinkedIn calls this pattern read-your-writes consistency in their Kafka-based architecture.

Real-World Examples

Amazon: sagas and cell-based isolation

Amazon famously sits on top of thousands of services, each owning its data. Their order pipeline is a textbook saga: place order, reserve inventory, charge card, allocate shipping — with explicit compensations for each step. Internally Amazon also uses cell-based architecture where each service is split into independent cells with their own data; a bug in one cell affects ~1/N of customers instead of 100%. The data layer is what makes that isolation real.

eBay: the eventual consistency story

eBay was one of the earliest large-scale microservices adopters, and their engineering blog from the late 2000s is one of the canonical “here’s why we gave up on distributed transactions” documents. They embraced eventual consistency across their auction, inventory, and payment services and built compensation flows for every cross-service operation. Their core lesson: the cost of strong consistency at their scale was unsustainable; eventual consistency with explicit reconciliation was the only viable path.

LinkedIn: Kafka as the data backbone

LinkedIn invented Kafka largely to solve the data replication problem at scale. The pattern they pioneered: every service publishes its changes as events; downstream services maintain their own read-optimized projections. Confluent later commercialized this as the canonical event-streaming architecture for microservices. LinkedIn also originated much of the “read-your-writes” tooling for projections, because their feed and profile services had to handle “I just edited my profile, where are my changes” queries at billions per day.

Confluent and the data-mesh framing

Confluent has been pushing the framing of data products — each service exposes its data as a versioned, schema-stable, well-documented stream that downstream consumers can subscribe to. This is data mesh applied to microservices: the service team owns not just their code and database, but also the public event stream they publish. The contract is the events, not the database schema.

Uber and Netflix: polyglot persistence at scale

Uber operates on Postgres, MySQL, Cassandra, Redis, DynamoDB, and several internal systems — chosen per workload. Netflix runs DynamoDB, Cassandra, EVCache (their Memcached fork), and Elasticsearch across thousands of services. Both companies will tell you the polyglot story is true but understated — the operational cost of running so many engines is enormous, and they get away with it only because they have entire dedicated teams per database technology. For most companies, the right number of database engines is two or three.

Best Practices

The data-management short list

One database per service. No exceptions. Not even “read-only.” Other services use your API or your events.
Never use distributed transactions across services. 2PC is dead. Use sagas with explicit compensations.
Use the outbox pattern for every event you publish. The dual-write problem is silent and devastating. Outbox eliminates it entirely.
Make every consumer idempotent. At-least-once delivery means duplicates. Track event IDs you’ve processed.
Pick orchestration for sagas with more than 3 steps. Choreography is elegant on a whiteboard and unmaintainable in production. Reach for Temporal or Step Functions.
Adopt CQRS only when reads and writes have asymmetric needs. Otherwise it’s extra moving parts for no gain.
Avoid event sourcing unless you genuinely need full history. The 80/20 alternative is outbox + CQRS read models.
Replicate fields, not tables. Each service’s read model owns only what it actually queries.
Define an explicit consistency budget per replicated read model. “Eventual” without numbers is a bug waiting to happen.
Standardize on as few database engines as you can stand. Polyglot persistence is real; it is also expensive. Default to Postgres + Redis + an object store.
Treat published events as a public API. Version them, document them, never break them. Consumers depend on them like REST callers depend on URLs.
Test the failure paths. Compensations only run during outages — which is the worst time to discover a bug in your “safe” rollback.

The single most useful sentence about data in microservices

If your service can’t answer a request without synchronously calling another service’s database or API, you don’t have microservices — you have a distributed monolith with extra latency. The whole point of these patterns is that each service can keep working when its neighbors are down. Get that right and the rest of microservices starts to pay off.