Event-Driven Architecture

Synchronous RPC between every pair of services is how a distributed system becomes a distributed monolith. Events let producers stop caring who’s listening — and that single shift is what makes a microservice estate scale.

Medium25 min read

Why Event-Driven Matters

Why Event-Driven Matters

The Problem: Synchronous RPC creates a graph where every service knows every other service it depends on. Add a new consumer of “a user signed up” and you have to modify the producer, redeploy it, coordinate releases, and pray that the new HTTP call doesn’t add latency that breaks the producer’s SLO.

The Solution: Publish a fact — UserSignedUp — to a broker. Anyone who cares subscribes. The producer keeps doing its one job; consumers come and go without the producer ever knowing or caring.

Real Impact: LinkedIn moved from a tangled web of point-to-point integrations to Kafka in 2010 and now processes trillions of events per day. Same pattern at Uber, Netflix, Shopify, Airbnb — once you cross a few dozen services, the synchronous version stops working.

Real-World Analogy

Imagine an office where the only way to share information is by walking up to a colleague and asking them in person. Want to tell ten people something? Ten conversations. New hire? You also have to teach everyone who they need to phone for what.

Now replace that with a bulletin board. You pin a notice. Whoever cares reads it on their schedule. New hires walk up to the same board and learn the same things. You can hire a hundred more people without changing how you communicate.

RPC is the phone calls. Events are the bulletin board. The producer pins the notice once and never has to know who’s reading it.

Tight RPC coupling is invisible at five services and existential at fifty. Every synchronous dependency is a runtime dependency: if it’s down, you’re down. Every new consumer is a producer change. Every release becomes a coordinated dance because two services can never be deployed independently if one calls the other in the request path.

Event-driven architecture inverts the dependency. The producer publishes a fact about its own domain — it has no opinion about who reads it. Consumers subscribe at their own pace, scale independently, fail independently, and can be added or removed without the producer noticing.

What you trade for that decoupling

You GainYou Pay With
Producers and consumers deploy independentlyEventual consistency — the read side lags the write side
New consumers without producer changesSchema evolution becomes a contract you must manage
Natural backpressure and load smoothingDebugging is harder — no stack trace crosses the broker
Replayable history of what happenedStorage and operational cost of running a broker
Failure isolation between servicesIdempotency and ordering become your problem

If you don’t need any of the things in the left column, don’t pay for the things in the right column. Event-driven is not a virtue — it’s a tool. Apply it where loose coupling and asynchronous processing actually buy you something.

Events vs Commands vs Messages

Why the Distinction Matters

The Problem: Teams use “event,” “message,” and “command” interchangeably and end up with a broker full of disguised RPC calls. The result behaves like RPC but with worse error handling.

The Solution: Pick the right primitive. Commands and events have different semantics, different ownership, and different failure modes.

The vocabulary is small but precise:

ConceptTenseWho Owns ItCardinalityExample
EventPast tense — immutable factThe producer0..N consumersOrderPlaced, PaymentCaptured
CommandImperative — a request to do somethingThe sender, addressed to one receiver1 consumerChargeCard, ShipOrder
QueryQuestion — expects a synchronous answerThe caller1 responderGetCustomerById
MessageGeneric envelopen/a — it’s the wrappern/aThe Kafka record / SQS body itself

An event says “this happened.” It cannot be rejected and cannot be wrong — the past is the past. The producer publishes it because it changed its own state and is informing the world. Other services consume it and decide what to do.

A command says “please do this.” It can be rejected. It is addressed to one specific receiver. The receiver owns the decision about whether the operation succeeds. Commands look like async RPC and are often the right primitive when you’re asking another service to take an action with side effects.

The naming smell test

If your “event” reads as a command in disguise — SendEmail, ChargeCustomer, RebuildIndex — you have a command, not an event. Rename to past tense (EmailRequested, ChargeAuthorized, IndexInvalidated) and the design problem usually surfaces: you were trying to tell one specific service to do one specific thing through a topic that has multiple subscribers.

You can use the same broker for both — Kafka, RabbitMQ, NATS, AWS SNS+SQS, GCP Pub/Sub all transport bytes — but the architectural shape differs. Commands typically use a queue with one consumer group; events typically use a topic with many consumer groups, each independently positioned.

Choreography vs Orchestration

Two Ways to Coordinate

The Problem: A business workflow — place order, charge card, reserve inventory, ship — spans services. Where does the “who runs next” logic live?

The Solution: Either it lives nowhere — each service reacts to events and emits its own (choreography) — or it lives in a dedicated coordinator (orchestration). Both are legitimate. Pick deliberately.

In choreography, each service subscribes to events it cares about and publishes events of its own. There is no central conductor; the workflow emerges from local reactions. This is event-driven in its purest form.

In orchestration, a coordinator service (or workflow engine like Temporal, AWS Step Functions, Camunda) holds the state machine. It calls services, waits for replies, handles retries and compensation, and is the one place in the system where the entire flow is visible.

Choreography vs Orchestration Choreography Order Payment Inventory Shipping Event Bus (Kafka / Pub-Sub) no central coordinator each service reacts independently Orchestration Orchestrator (Temporal / Step Fn) Order Payment Shipping workflow lives in one place retries, compensation, timeouts centralized
TraitChoreographyOrchestration
CouplingLoose — services know events, not each otherTight to the orchestrator; loose between workers
Workflow visibilityDistributed across services and dashboardsOne state machine, one place to look
Failure handlingEach service handles its own retries / DLQsCentralized retries, timeouts, compensation
Adding a new stepNew consumer subscribes — no producer changeUpdate the workflow definition, redeploy it
DebuggingTrace events across many topicsInspect one workflow execution
Best fitNotifications, fan-out, derived dataSagas, money flows, multi-step approvals

Choose orchestration when the failure modes need to be reasoned about in one place — payments, refunds, KYC pipelines, anything you’ll get audited on. Choose choreography when the producer genuinely doesn’t care who’s downstream — analytics fan-out, search index updates, cache invalidations.

Most mature systems use both: choreography for the broad event spine, orchestration for the few critical workflows that span services and need explicit compensation.

An orchestrated workflow in Temporal

from datetime import timedelta
from temporalio import workflow, activity

@activity.defn
async def charge_card(order_id: str, amount: int) -> str:
    return await payment_client.charge(order_id, amount)

@activity.defn
async def reserve_inventory(order_id: str) -> None:
    await inventory_client.reserve(order_id)

@activity.defn
async def refund(order_id: str) -> None:
    await payment_client.refund(order_id)

@workflow.defn
class PlaceOrderWorkflow:
    @workflow.run
    async def run(self, order_id: str, amount: int) -> str:
        # Step 1: take the money. Retries, timeouts, and durable
        # state are handled by the workflow engine, not by us.
        auth = await workflow.execute_activity(
            charge_card, order_id, amount,
            start_to_close_timeout=timedelta(seconds=30),
        )
        try:
            await workflow.execute_activity(
                reserve_inventory, order_id,
                start_to_close_timeout=timedelta(seconds=10),
            )
        except Exception:
            # Compensation: the saga’s undo step
            await workflow.execute_activity(refund, order_id)
            raise
        return auth

Notice what the orchestrator buys you: the entire happy path and the compensation are readable in one file. A choreographed equivalent has the same logic spread across three services and a documentation page nobody updates.

Event Schemas and Evolution

Why Schemas Are Non-Negotiable

The Problem: An event is a contract. The day after you publish “just JSON, we’ll figure it out,” a downstream team writes a parser against today’s shape. Six months and 50 consumers later, renaming a field is a coordinated outage.

The Solution: Define event schemas explicitly, register them, and enforce compatibility on every change. The schema registry exists because human discipline doesn’t scale to dozens of teams.

The two relevant standards in production today:

A CloudEvent in JSON

{
  "specversion": "1.0",
  "type": "com.acme.orders.placed.v1",
  "source": "/orders-service",
  "id": "01HRB6ZQX9N7K8M2J4F5G6H8YT",
  "time": "2026-05-12T14:22:31.412Z",
  "subject": "order/9821",
  "datacontenttype": "application/json",
  "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
  "data": {
    "orderId": "9821",
    "customerId": "c-44012",
    "totalCents": 12999,
    "currency": "USD",
    "items": [
      { "sku": "BK-001", "qty": 2, "priceCents": 5499 },
      { "sku": "BK-014", "qty": 1, "priceCents": 2001 }
    ]
  }
}

The version is in the type name (...placed.v1). When the schema breaks compatibility, you publish to ...placed.v2 and run both topics in parallel until consumers migrate. The version is never silently overloaded.

Compatibility modes

ModeWhat It AllowsWhen to Use
BackwardNew schema can read old dataConsumers upgrade first — the common case
ForwardOld schema can read new dataProducers upgrade first; consumers lag
FullBoth backward and forwardStrict environments — safest, most restrictive
NoneAnything goesDevelopment only — never production

In Avro and Protobuf, the rules are simple and enforced by the registry: adding an optional field with a default is fine; removing a required field breaks; renaming breaks; changing a type breaks. The registry rejects the publish before a single consumer ever sees the new shape.

Never publish events without a schema registry once you have more than two consumers

The day someone “just adds a field” without checking compatibility is the day a downstream parser explodes in production at 3 AM. The cost of a registry is one component to operate. The cost of not having one is a rollback every quarter and a paragraph in your incident reviews titled “Events as Untyped JSON.” Avro or Protobuf with Confluent Schema Registry, or JSON Schema with AWS Glue — pick one and enforce it on the publish path.

The cost of breaking changes

Once an event has 50 consumers, a breaking change is a project, not a commit. You publish v2 alongside v1, write a deprecation notice, chase every team to migrate, run both topics in parallel for months, and only then turn off v1. If you skip the schema registry, the “migration” happens during an outage instead of a project plan.

Outbox Pattern and Reliable Publishing

The Dual-Write Problem

The Problem: Your service does two things in one request: write to the database, publish to Kafka. The database commit succeeds, the Kafka publish fails. You now have a row that the rest of the system will never hear about. Or vice versa: publish succeeds, DB commit fails, and consumers act on something that didn’t happen.

The Solution: The transactional outbox. Write the event to an outbox table in the same DB transaction as the business change. A separate worker (or change-data-capture process like Debezium) reads the outbox and publishes to the broker. One transaction, one source of truth, no dual write.

Two writes to two systems in one request is impossible to make atomic without distributed transactions, and distributed transactions are not what you want. The outbox sidesteps the problem by making the broker publish a downstream consequence of a single local transaction.

The outbox table

CREATE TABLE outbox (
    id           UUID        PRIMARY KEY,
    aggregate    TEXT        NOT NULL,         -- e.g. 'order'
    aggregate_id TEXT        NOT NULL,         -- partition key for ordering
    event_type   TEXT        NOT NULL,         -- 'OrderPlaced.v1'
    payload      JSONB       NOT NULL,
    headers      JSONB       NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at TIMESTAMPTZ
);

CREATE INDEX outbox_unpublished_idx
    ON outbox (created_at)
    WHERE published_at IS NULL;

The producing transaction

def place_order(req: PlaceOrderRequest) -> OrderId:
    with db.transaction() as tx:
        order = orders.insert(tx, req)            # business write
        outbox.insert(                            # event write — same tx
            tx,
            aggregate="order",
            aggregate_id=order.id,
            event_type="OrderPlaced.v1",
            payload=order.to_event_payload(),
            headers={"traceparent": tracing.current_traceparent()},
        )
    # Either both rows commit, or neither does. No dual write.
    return order.id

The relay worker

def relay_loop():
    while True:
        batch = db.query("""
            SELECT id, aggregate_id, event_type, payload, headers
            FROM outbox
            WHERE published_at IS NULL
            ORDER BY created_at
            LIMIT 500
            FOR UPDATE SKIP LOCKED
        """)
        if not batch:
            time.sleep(0.1)
            continue

        for row in batch:
            producer.send(
                topic="orders.events",
                key=row.aggregate_id.encode(),     # preserves per-order ordering
                value=cloudevent_envelope(row),
                headers=[(k, v.encode()) for k, v in row.headers.items()],
            )
        producer.flush()
        db.execute(
            "UPDATE outbox SET published_at = now() WHERE id = ANY(%s)",
            [r.id for r in batch],
        )

The same row may be published twice if the worker crashes after send but before the UPDATE. That is fine, and it is why consumers must be idempotent. The contract is at-least-once delivery; consumers turn that into exactly-once effectively by deduplicating on the event id.

Idempotent consumers, in one shape

The simplest approach: a processed_events table keyed on (consumer_name, event_id) with the business mutation in the same transaction. If the event has already been processed, the insert fails with a unique-key violation and the handler returns. If it hasn’t, both rows commit together. Now retries and duplicates collapse to no-ops, and your “at-least-once” broker becomes “exactly-once” from the consumer’s point of view.

CDC instead of a relay worker

Debezium reads your database’s write-ahead log (Postgres logical decoding, MySQL binlog) and emits a Kafka record for every row change. You can point Debezium at the outbox table and skip the polling worker entirely. The broker stays in sync with the database with no extra moving parts on the application side.

Event Sourcing

When the Events Are the State

The Problem: A normal database stores the current shape of the world. The history of how it got there — who changed what when — is implicit at best, lost at worst.

The Solution: Event sourcing flips the model. The append-only log of events is the source of truth. Current state is derived by replaying events. Snapshots and projections exist for performance, not correctness.

Event sourcing is a much stronger commitment than “we publish events.” In an event-sourced system, the events are stored permanently and any read model is reconstructable by replay. You get a perfect audit log, the ability to rebuild caches and projections from scratch, and the ability to ask questions about the past you didn’t know to ask at the time.

You also pay for it. Every read either replays from the log or hits a projection that is itself eventually consistent. Schema changes apply to events written years ago, so you need an upcasting strategy. Snapshots help with replay performance but add another moving piece to maintain.

Use Event Sourcing WhenDon’t Use It When
Audit and history are first-class requirements (finance, compliance)You just want to publish events — outbox is enough
You need to answer historical questions you didn’t ask in advanceDomain is mostly CRUD over current state
Domain reasoning is naturally about events (banking, inventory)Team has no prior experience with the pattern
You actively use replay to rebuild read modelsYou’ll never replay, so the cost buys you nothing

The honest take: event-driven architecture and event sourcing are different things, and most teams should have the first without the second. Publish events from your services using outbox + a normal DB. Reach for full event sourcing only when you have a domain that genuinely benefits from it, and a team that has done it before.

Operational Concerns

Once events are in production, the failure modes are different from RPC and you need a different operational vocabulary.

Ordering and partitioning

Kafka and Pub/Sub guarantee ordering within a partition, not across the topic. The partition is selected by the message key. If you want all events for a given order to arrive in order, key them by orderId — they hash to the same partition and a single consumer thread processes them sequentially.

Choose the partition key from the unit of consistency you actually need. customerId if customer-level ordering matters. tenantId for multi-tenant isolation. Random if you genuinely don’t care — which is rare.

Hot partitions are a real failure mode

If 80% of your traffic is for one customer (a noisy enterprise tenant, a viral product), keying by customerId sends 80% of your messages to one partition handled by one consumer. Kafka cannot rebalance hot partitions. Either pick a finer-grained key, salt the key with a low-cardinality suffix and accept the looser ordering, or run separate topics per tier.

Dead-letter queues and replay

A consumer that cannot process a message after N retries should not block the partition. Move it to a dead-letter queue (DLQ) — a separate topic for poison messages — and continue with the next message. A human or automation later inspects the DLQ, fixes the underlying issue, and replays the messages.

Replay is the underrated superpower of an event-driven system. Bug in your projection? Reset the consumer offset and rebuild from the start of the topic. New analytics use case? Spin up a consumer at offset zero and let it catch up. None of this is possible if you treated the broker as fire-and-forget.

Schema-version skew

Producers and consumers will run different schema versions in production at the same time, every single day. You don’t coordinate releases across teams — that’s the whole point of decoupling. So consumers must tolerate v1 and v2 events simultaneously during the migration window, and the schema registry’s compatibility rules are what makes that safe.

Idempotency keys end-to-end

The event id from the CloudEvents envelope is your idempotency key. Persist it on the consumer side in the same transaction as the business mutation. Combined with at-least-once delivery, this gives you the “exactly-once effectively” semantics that production systems actually run on. Anyone selling you literal exactly-once across heterogeneous systems is selling you a marketing slide.

The Kafka consumer that handles all of this

from confluent_kafka import Consumer, KafkaError, KafkaException
import json

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "shipping-projector",
    "enable.auto.commit": False,    # commit only after we persist
    "auto.offset.reset": "earliest",
    "isolation.level": "read_committed",
})
consumer.subscribe(["orders.events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue
        raise KafkaException(msg.error())

    event = json.loads(msg.value())
    try:
        with db.transaction() as tx:
            # Idempotency: insert into processed_events first.
            # Unique-key violation => we’ve seen this event; skip.
            tx.execute(
                "INSERT INTO processed_events (consumer, event_id) VALUES (%s, %s)",
                ("shipping-projector", event["id"]),
            )
            apply_to_projection(tx, event)
        consumer.commit(msg)
    except UniqueViolation:
        consumer.commit(msg)              # duplicate; safe to skip
    except PoisonMessageError:
        send_to_dlq(event)                # move on; don’t block partition
        consumer.commit(msg)

Real-World Examples

LinkedIn (Kafka’s origin). By 2010, LinkedIn had a tangle of point-to-point integrations between dozens of systems — activity, search, recommendations, analytics. Each new pipeline meant new integrations and brittle ETL. They built Kafka as a unified, replayable log so producers wrote once and any downstream system could subscribe. Today LinkedIn processes trillions of events per day on Kafka and donated it to the Apache Software Foundation in 2011.

Uber’s dispatch. Driver locations, ride requests, ETA updates, surge pricing — all flow as events through Kafka topics keyed by city, driver, or trip. The dispatch decision is a stream-processing job. The synchronous version of that system would have collapsed under traffic; the event-driven version horizontally scales by adding partitions and consumer instances.

Netflix Keystone. Netflix runs one of the largest event pipelines in the world — trillions of events per day flowing into S3, Elasticsearch, Druid, and downstream services. The Keystone platform standardized event publishing across thousands of microservices so any team could declare a stream and let the platform handle delivery, retention, and routing.

Shopify. Shopify runs an internal event bus on Kafka with strict CloudEvents-style schemas and a registry. Every domain event — orders, products, customers — is published once and consumed by analytics, search, fulfillment, fraud detection, and partner webhooks. New consumers attach without producer changes; the producer team owns the schema and the broker, and that’s the whole interface.

Best Practices

The short list

  • Name events in past tense. If it sounds like a command, it is one — don’t hide it in a topic.
  • Version events in the type, not silently. OrderPlaced.v1, then OrderPlaced.v2. Run both in parallel during migrations.
  • Use a schema registry. Avro or Protobuf with Confluent Schema Registry, or JSON Schema with AWS Glue. Enforce backward compatibility on the publish path.
  • Adopt CloudEvents for the envelope. Even on Kafka, the standard metadata pays back the moment you add tracing, routing, or DLQs.
  • Outbox, always. Anytime a service writes to a database and publishes an event, use an outbox table. No exceptions.
  • Consumers must be idempotent. Persist the event id alongside the business mutation. At-least-once is the only delivery contract worth depending on.
  • Key by the unit of consistency. Whatever set of events must be processed in order, use that as the partition key — and watch for hot partitions.
  • Build the DLQ before you need it. A consumer with no poison-message escape hatch will eventually wedge a partition during an incident.
  • Pick orchestration for money flows. Use Temporal or Step Functions for sagas where compensation must be auditable. Use choreography for fan-out.
  • Don’t event-source by default. Event-driven and event-sourced are different. Most teams need the first, not the second.

The single most useful sentence about event-driven systems

An event is a fact your service is publishing about its own domain — not a request, not an instruction, and not addressed to anyone in particular. The day your team internalizes that, the rest of the patterns — choreography, schemas, outbox, idempotency — stop feeling like ceremony and start feeling like the obvious consequences of getting that one definition right.