Why Event-Driven Matters
Why Event-Driven Matters
The Problem: Synchronous RPC creates a graph where every service knows every other service it depends on. Add a new consumer of “a user signed up” and you have to modify the producer, redeploy it, coordinate releases, and pray that the new HTTP call doesn’t add latency that breaks the producer’s SLO.
The Solution: Publish a fact — UserSignedUp — to a broker. Anyone who cares subscribes. The producer keeps doing its one job; consumers come and go without the producer ever knowing or caring.
Real Impact: LinkedIn moved from a tangled web of point-to-point integrations to Kafka in 2010 and now processes trillions of events per day. Same pattern at Uber, Netflix, Shopify, Airbnb — once you cross a few dozen services, the synchronous version stops working.
Real-World Analogy
Imagine an office where the only way to share information is by walking up to a colleague and asking them in person. Want to tell ten people something? Ten conversations. New hire? You also have to teach everyone who they need to phone for what.
Now replace that with a bulletin board. You pin a notice. Whoever cares reads it on their schedule. New hires walk up to the same board and learn the same things. You can hire a hundred more people without changing how you communicate.
RPC is the phone calls. Events are the bulletin board. The producer pins the notice once and never has to know who’s reading it.
Tight RPC coupling is invisible at five services and existential at fifty. Every synchronous dependency is a runtime dependency: if it’s down, you’re down. Every new consumer is a producer change. Every release becomes a coordinated dance because two services can never be deployed independently if one calls the other in the request path.
Event-driven architecture inverts the dependency. The producer publishes a fact about its own domain — it has no opinion about who reads it. Consumers subscribe at their own pace, scale independently, fail independently, and can be added or removed without the producer noticing.
What you trade for that decoupling
| You Gain | You Pay With |
|---|---|
| Producers and consumers deploy independently | Eventual consistency — the read side lags the write side |
| New consumers without producer changes | Schema evolution becomes a contract you must manage |
| Natural backpressure and load smoothing | Debugging is harder — no stack trace crosses the broker |
| Replayable history of what happened | Storage and operational cost of running a broker |
| Failure isolation between services | Idempotency and ordering become your problem |
If you don’t need any of the things in the left column, don’t pay for the things in the right column. Event-driven is not a virtue — it’s a tool. Apply it where loose coupling and asynchronous processing actually buy you something.
Events vs Commands vs Messages
Why the Distinction Matters
The Problem: Teams use “event,” “message,” and “command” interchangeably and end up with a broker full of disguised RPC calls. The result behaves like RPC but with worse error handling.
The Solution: Pick the right primitive. Commands and events have different semantics, different ownership, and different failure modes.
The vocabulary is small but precise:
| Concept | Tense | Who Owns It | Cardinality | Example |
|---|---|---|---|---|
| Event | Past tense — immutable fact | The producer | 0..N consumers | OrderPlaced, PaymentCaptured |
| Command | Imperative — a request to do something | The sender, addressed to one receiver | 1 consumer | ChargeCard, ShipOrder |
| Query | Question — expects a synchronous answer | The caller | 1 responder | GetCustomerById |
| Message | Generic envelope | n/a — it’s the wrapper | n/a | The Kafka record / SQS body itself |
An event says “this happened.” It cannot be rejected and cannot be wrong — the past is the past. The producer publishes it because it changed its own state and is informing the world. Other services consume it and decide what to do.
A command says “please do this.” It can be rejected. It is addressed to one specific receiver. The receiver owns the decision about whether the operation succeeds. Commands look like async RPC and are often the right primitive when you’re asking another service to take an action with side effects.
The naming smell test
If your “event” reads as a command in disguise — SendEmail, ChargeCustomer, RebuildIndex — you have a command, not an event. Rename to past tense (EmailRequested, ChargeAuthorized, IndexInvalidated) and the design problem usually surfaces: you were trying to tell one specific service to do one specific thing through a topic that has multiple subscribers.
You can use the same broker for both — Kafka, RabbitMQ, NATS, AWS SNS+SQS, GCP Pub/Sub all transport bytes — but the architectural shape differs. Commands typically use a queue with one consumer group; events typically use a topic with many consumer groups, each independently positioned.
Choreography vs Orchestration
Two Ways to Coordinate
The Problem: A business workflow — place order, charge card, reserve inventory, ship — spans services. Where does the “who runs next” logic live?
The Solution: Either it lives nowhere — each service reacts to events and emits its own (choreography) — or it lives in a dedicated coordinator (orchestration). Both are legitimate. Pick deliberately.
In choreography, each service subscribes to events it cares about and publishes events of its own. There is no central conductor; the workflow emerges from local reactions. This is event-driven in its purest form.
In orchestration, a coordinator service (or workflow engine like Temporal, AWS Step Functions, Camunda) holds the state machine. It calls services, waits for replies, handles retries and compensation, and is the one place in the system where the entire flow is visible.
| Trait | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose — services know events, not each other | Tight to the orchestrator; loose between workers |
| Workflow visibility | Distributed across services and dashboards | One state machine, one place to look |
| Failure handling | Each service handles its own retries / DLQs | Centralized retries, timeouts, compensation |
| Adding a new step | New consumer subscribes — no producer change | Update the workflow definition, redeploy it |
| Debugging | Trace events across many topics | Inspect one workflow execution |
| Best fit | Notifications, fan-out, derived data | Sagas, money flows, multi-step approvals |
Choose orchestration when the failure modes need to be reasoned about in one place — payments, refunds, KYC pipelines, anything you’ll get audited on. Choose choreography when the producer genuinely doesn’t care who’s downstream — analytics fan-out, search index updates, cache invalidations.
Most mature systems use both: choreography for the broad event spine, orchestration for the few critical workflows that span services and need explicit compensation.
An orchestrated workflow in Temporal
from datetime import timedelta
from temporalio import workflow, activity
@activity.defn
async def charge_card(order_id: str, amount: int) -> str:
return await payment_client.charge(order_id, amount)
@activity.defn
async def reserve_inventory(order_id: str) -> None:
await inventory_client.reserve(order_id)
@activity.defn
async def refund(order_id: str) -> None:
await payment_client.refund(order_id)
@workflow.defn
class PlaceOrderWorkflow:
@workflow.run
async def run(self, order_id: str, amount: int) -> str:
# Step 1: take the money. Retries, timeouts, and durable
# state are handled by the workflow engine, not by us.
auth = await workflow.execute_activity(
charge_card, order_id, amount,
start_to_close_timeout=timedelta(seconds=30),
)
try:
await workflow.execute_activity(
reserve_inventory, order_id,
start_to_close_timeout=timedelta(seconds=10),
)
except Exception:
# Compensation: the saga’s undo step
await workflow.execute_activity(refund, order_id)
raise
return auth
Notice what the orchestrator buys you: the entire happy path and the compensation are readable in one file. A choreographed equivalent has the same logic spread across three services and a documentation page nobody updates.
Event Schemas and Evolution
Why Schemas Are Non-Negotiable
The Problem: An event is a contract. The day after you publish “just JSON, we’ll figure it out,” a downstream team writes a parser against today’s shape. Six months and 50 consumers later, renaming a field is a coordinated outage.
The Solution: Define event schemas explicitly, register them, and enforce compatibility on every change. The schema registry exists because human discipline doesn’t scale to dozens of teams.
The two relevant standards in production today:
- CloudEvents (CNCF spec) — a vendor-neutral envelope:
id,source,type,specversion,time,data. Standardizes the metadata so cross-cutting tooling (tracing, routing, dead-letter handling) doesn’t need to know your payload. - Schema registry (Confluent Schema Registry, AWS Glue Schema Registry, Apicurio) — stores Avro / Protobuf / JSON Schema definitions, assigns versions, and enforces compatibility rules at publish time.
A CloudEvent in JSON
{
"specversion": "1.0",
"type": "com.acme.orders.placed.v1",
"source": "/orders-service",
"id": "01HRB6ZQX9N7K8M2J4F5G6H8YT",
"time": "2026-05-12T14:22:31.412Z",
"subject": "order/9821",
"datacontenttype": "application/json",
"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
"data": {
"orderId": "9821",
"customerId": "c-44012",
"totalCents": 12999,
"currency": "USD",
"items": [
{ "sku": "BK-001", "qty": 2, "priceCents": 5499 },
{ "sku": "BK-014", "qty": 1, "priceCents": 2001 }
]
}
}
The version is in the type name (...placed.v1). When the schema breaks compatibility, you publish to ...placed.v2 and run both topics in parallel until consumers migrate. The version is never silently overloaded.
Compatibility modes
| Mode | What It Allows | When to Use |
|---|---|---|
| Backward | New schema can read old data | Consumers upgrade first — the common case |
| Forward | Old schema can read new data | Producers upgrade first; consumers lag |
| Full | Both backward and forward | Strict environments — safest, most restrictive |
| None | Anything goes | Development only — never production |
In Avro and Protobuf, the rules are simple and enforced by the registry: adding an optional field with a default is fine; removing a required field breaks; renaming breaks; changing a type breaks. The registry rejects the publish before a single consumer ever sees the new shape.
Never publish events without a schema registry once you have more than two consumers
The day someone “just adds a field” without checking compatibility is the day a downstream parser explodes in production at 3 AM. The cost of a registry is one component to operate. The cost of not having one is a rollback every quarter and a paragraph in your incident reviews titled “Events as Untyped JSON.” Avro or Protobuf with Confluent Schema Registry, or JSON Schema with AWS Glue — pick one and enforce it on the publish path.
The cost of breaking changes
Once an event has 50 consumers, a breaking change is a project, not a commit. You publish v2 alongside v1, write a deprecation notice, chase every team to migrate, run both topics in parallel for months, and only then turn off v1. If you skip the schema registry, the “migration” happens during an outage instead of a project plan.
Outbox Pattern and Reliable Publishing
The Dual-Write Problem
The Problem: Your service does two things in one request: write to the database, publish to Kafka. The database commit succeeds, the Kafka publish fails. You now have a row that the rest of the system will never hear about. Or vice versa: publish succeeds, DB commit fails, and consumers act on something that didn’t happen.
The Solution: The transactional outbox. Write the event to an outbox table in the same DB transaction as the business change. A separate worker (or change-data-capture process like Debezium) reads the outbox and publishes to the broker. One transaction, one source of truth, no dual write.
Two writes to two systems in one request is impossible to make atomic without distributed transactions, and distributed transactions are not what you want. The outbox sidesteps the problem by making the broker publish a downstream consequence of a single local transaction.
The outbox table
CREATE TABLE outbox (
id UUID PRIMARY KEY,
aggregate TEXT NOT NULL, -- e.g. 'order'
aggregate_id TEXT NOT NULL, -- partition key for ordering
event_type TEXT NOT NULL, -- 'OrderPlaced.v1'
payload JSONB NOT NULL,
headers JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
published_at TIMESTAMPTZ
);
CREATE INDEX outbox_unpublished_idx
ON outbox (created_at)
WHERE published_at IS NULL;
The producing transaction
def place_order(req: PlaceOrderRequest) -> OrderId:
with db.transaction() as tx:
order = orders.insert(tx, req) # business write
outbox.insert( # event write — same tx
tx,
aggregate="order",
aggregate_id=order.id,
event_type="OrderPlaced.v1",
payload=order.to_event_payload(),
headers={"traceparent": tracing.current_traceparent()},
)
# Either both rows commit, or neither does. No dual write.
return order.id
The relay worker
def relay_loop():
while True:
batch = db.query("""
SELECT id, aggregate_id, event_type, payload, headers
FROM outbox
WHERE published_at IS NULL
ORDER BY created_at
LIMIT 500
FOR UPDATE SKIP LOCKED
""")
if not batch:
time.sleep(0.1)
continue
for row in batch:
producer.send(
topic="orders.events",
key=row.aggregate_id.encode(), # preserves per-order ordering
value=cloudevent_envelope(row),
headers=[(k, v.encode()) for k, v in row.headers.items()],
)
producer.flush()
db.execute(
"UPDATE outbox SET published_at = now() WHERE id = ANY(%s)",
[r.id for r in batch],
)
The same row may be published twice if the worker crashes after send but before the UPDATE. That is fine, and it is why consumers must be idempotent. The contract is at-least-once delivery; consumers turn that into exactly-once effectively by deduplicating on the event id.
Idempotent consumers, in one shape
The simplest approach: a processed_events table keyed on (consumer_name, event_id) with the business mutation in the same transaction. If the event has already been processed, the insert fails with a unique-key violation and the handler returns. If it hasn’t, both rows commit together. Now retries and duplicates collapse to no-ops, and your “at-least-once” broker becomes “exactly-once” from the consumer’s point of view.
CDC instead of a relay worker
Debezium reads your database’s write-ahead log (Postgres logical decoding, MySQL binlog) and emits a Kafka record for every row change. You can point Debezium at the outbox table and skip the polling worker entirely. The broker stays in sync with the database with no extra moving parts on the application side.
Event Sourcing
When the Events Are the State
The Problem: A normal database stores the current shape of the world. The history of how it got there — who changed what when — is implicit at best, lost at worst.
The Solution: Event sourcing flips the model. The append-only log of events is the source of truth. Current state is derived by replaying events. Snapshots and projections exist for performance, not correctness.
Event sourcing is a much stronger commitment than “we publish events.” In an event-sourced system, the events are stored permanently and any read model is reconstructable by replay. You get a perfect audit log, the ability to rebuild caches and projections from scratch, and the ability to ask questions about the past you didn’t know to ask at the time.
You also pay for it. Every read either replays from the log or hits a projection that is itself eventually consistent. Schema changes apply to events written years ago, so you need an upcasting strategy. Snapshots help with replay performance but add another moving piece to maintain.
| Use Event Sourcing When | Don’t Use It When |
|---|---|
| Audit and history are first-class requirements (finance, compliance) | You just want to publish events — outbox is enough |
| You need to answer historical questions you didn’t ask in advance | Domain is mostly CRUD over current state |
| Domain reasoning is naturally about events (banking, inventory) | Team has no prior experience with the pattern |
| You actively use replay to rebuild read models | You’ll never replay, so the cost buys you nothing |
The honest take: event-driven architecture and event sourcing are different things, and most teams should have the first without the second. Publish events from your services using outbox + a normal DB. Reach for full event sourcing only when you have a domain that genuinely benefits from it, and a team that has done it before.
Operational Concerns
Once events are in production, the failure modes are different from RPC and you need a different operational vocabulary.
Ordering and partitioning
Kafka and Pub/Sub guarantee ordering within a partition, not across the topic. The partition is selected by the message key. If you want all events for a given order to arrive in order, key them by orderId — they hash to the same partition and a single consumer thread processes them sequentially.
Choose the partition key from the unit of consistency you actually need. customerId if customer-level ordering matters. tenantId for multi-tenant isolation. Random if you genuinely don’t care — which is rare.
Hot partitions are a real failure mode
If 80% of your traffic is for one customer (a noisy enterprise tenant, a viral product), keying by customerId sends 80% of your messages to one partition handled by one consumer. Kafka cannot rebalance hot partitions. Either pick a finer-grained key, salt the key with a low-cardinality suffix and accept the looser ordering, or run separate topics per tier.
Dead-letter queues and replay
A consumer that cannot process a message after N retries should not block the partition. Move it to a dead-letter queue (DLQ) — a separate topic for poison messages — and continue with the next message. A human or automation later inspects the DLQ, fixes the underlying issue, and replays the messages.
Replay is the underrated superpower of an event-driven system. Bug in your projection? Reset the consumer offset and rebuild from the start of the topic. New analytics use case? Spin up a consumer at offset zero and let it catch up. None of this is possible if you treated the broker as fire-and-forget.
Schema-version skew
Producers and consumers will run different schema versions in production at the same time, every single day. You don’t coordinate releases across teams — that’s the whole point of decoupling. So consumers must tolerate v1 and v2 events simultaneously during the migration window, and the schema registry’s compatibility rules are what makes that safe.
Idempotency keys end-to-end
The event id from the CloudEvents envelope is your idempotency key. Persist it on the consumer side in the same transaction as the business mutation. Combined with at-least-once delivery, this gives you the “exactly-once effectively” semantics that production systems actually run on. Anyone selling you literal exactly-once across heterogeneous systems is selling you a marketing slide.
The Kafka consumer that handles all of this
from confluent_kafka import Consumer, KafkaError, KafkaException
import json
consumer = Consumer({
"bootstrap.servers": "kafka:9092",
"group.id": "shipping-projector",
"enable.auto.commit": False, # commit only after we persist
"auto.offset.reset": "earliest",
"isolation.level": "read_committed",
})
consumer.subscribe(["orders.events"])
while True:
msg = consumer.poll(1.0)
if msg is None:
continue
if msg.error():
if msg.error().code() == KafkaError._PARTITION_EOF:
continue
raise KafkaException(msg.error())
event = json.loads(msg.value())
try:
with db.transaction() as tx:
# Idempotency: insert into processed_events first.
# Unique-key violation => we’ve seen this event; skip.
tx.execute(
"INSERT INTO processed_events (consumer, event_id) VALUES (%s, %s)",
("shipping-projector", event["id"]),
)
apply_to_projection(tx, event)
consumer.commit(msg)
except UniqueViolation:
consumer.commit(msg) # duplicate; safe to skip
except PoisonMessageError:
send_to_dlq(event) # move on; don’t block partition
consumer.commit(msg)
Real-World Examples
LinkedIn (Kafka’s origin). By 2010, LinkedIn had a tangle of point-to-point integrations between dozens of systems — activity, search, recommendations, analytics. Each new pipeline meant new integrations and brittle ETL. They built Kafka as a unified, replayable log so producers wrote once and any downstream system could subscribe. Today LinkedIn processes trillions of events per day on Kafka and donated it to the Apache Software Foundation in 2011.
Uber’s dispatch. Driver locations, ride requests, ETA updates, surge pricing — all flow as events through Kafka topics keyed by city, driver, or trip. The dispatch decision is a stream-processing job. The synchronous version of that system would have collapsed under traffic; the event-driven version horizontally scales by adding partitions and consumer instances.
Netflix Keystone. Netflix runs one of the largest event pipelines in the world — trillions of events per day flowing into S3, Elasticsearch, Druid, and downstream services. The Keystone platform standardized event publishing across thousands of microservices so any team could declare a stream and let the platform handle delivery, retention, and routing.
Shopify. Shopify runs an internal event bus on Kafka with strict CloudEvents-style schemas and a registry. Every domain event — orders, products, customers — is published once and consumed by analytics, search, fulfillment, fraud detection, and partner webhooks. New consumers attach without producer changes; the producer team owns the schema and the broker, and that’s the whole interface.
Best Practices
The short list
- Name events in past tense. If it sounds like a command, it is one — don’t hide it in a topic.
- Version events in the type, not silently.
OrderPlaced.v1, thenOrderPlaced.v2. Run both in parallel during migrations. - Use a schema registry. Avro or Protobuf with Confluent Schema Registry, or JSON Schema with AWS Glue. Enforce backward compatibility on the publish path.
- Adopt CloudEvents for the envelope. Even on Kafka, the standard metadata pays back the moment you add tracing, routing, or DLQs.
- Outbox, always. Anytime a service writes to a database and publishes an event, use an outbox table. No exceptions.
- Consumers must be idempotent. Persist the event id alongside the business mutation. At-least-once is the only delivery contract worth depending on.
- Key by the unit of consistency. Whatever set of events must be processed in order, use that as the partition key — and watch for hot partitions.
- Build the DLQ before you need it. A consumer with no poison-message escape hatch will eventually wedge a partition during an incident.
- Pick orchestration for money flows. Use Temporal or Step Functions for sagas where compensation must be auditable. Use choreography for fan-out.
- Don’t event-source by default. Event-driven and event-sourced are different. Most teams need the first, not the second.
The single most useful sentence about event-driven systems
An event is a fact your service is publishing about its own domain — not a request, not an instruction, and not addressed to anyone in particular. The day your team internalizes that, the rest of the patterns — choreography, schemas, outbox, idempotency — stop feeling like ceremony and start feeling like the obvious consequences of getting that one definition right.