Why Sagas Matter
Why The Saga Pattern Matters
The Problem: A user clicks “Place Order.” You need to charge the card, reserve inventory, allocate a shipping slot, and award loyalty points — across four services with four databases. If the shipping slot is full after the charge succeeded, you cannot just ROLLBACK. The money already moved.
The Solution: A Saga: a sequence of local transactions, each at one service, where every step that mutates state ships with a compensating action. If step 4 fails, you run the compensations for steps 3, 2, and 1 in reverse to put the world back in a sensible state.
Real Impact: Saga-style coordination shows up wherever one business operation spans several independently committed stores: e-commerce checkout, ride dispatch, transfers between bank ledgers. If you have more than one database in your write path, you are already running a saga. The question is whether it’s an accidental one or a designed one.
Real-World Analogy
Booking a vacation involves three separate vendors:
- Book the flight — American Airlines charges your card and locks a seat.
- Book the hotel — Marriott charges your card and reserves a room.
- Book the rental car — Hertz tries to reserve a car — none available.
You cannot tell American “please undo that flight booking, the car was sold out.” The seat sale already happened. What you can do is initiate two new transactions: cancel the flight (refund issued, seat released) and cancel the hotel (refund issued, room released). Those are compensations. They are not rollbacks — they are forward-direction business operations that reverse the visible effect of an earlier commit.
That is exactly what a software saga does between your services.
ACID transactions across services are not something you can count on. Two-phase commit can technically span databases, but at production scale it rarely survives contact with reality: the locks, the blocking on a coordinator, the operational fragility. The saga pattern is the dominant alternative: accept that each step is independently committed, and design the reverse path as carefully as you design the forward path.
This tutorial is about how to design that reverse path well, when to use a central orchestrator versus event choreography, and what production systems — Temporal, AWS Step Functions, Camunda, Eventuate — actually do for you.
The Distributed Transaction Problem
Why 2PC Is Not The Answer
The Problem: Two-phase commit (2PC) gives you ACID across multiple resources by adding a coordinator that asks every participant “can you commit?” and only proceeds if all say yes. It works. But every participant that votes “yes” must hold its row locks until the coordinator delivers the final decision, and if the coordinator dies after that vote, the participant is stuck holding locks until the coordinator recovers or a human intervenes.
The Solution: Drop the strong consistency goal. Embrace BASE: Basically Available, Soft state, Eventually consistent. Each service commits independently, and you build the reconciliation as part of the business logic, not as an infrastructure guarantee.
Why 2PC fails at scale
| Failure Mode | What Happens | Impact |
|---|---|---|
| Coordinator dies in prepare phase | Participants are blocked with locks held, waiting for the verdict | Unbounded latency until manual recovery |
| Participant slow on prepare | Every other participant holds locks waiting | Hot rows become catastrophic; tail latency goes to seconds |
| Network partition mid-commit | Some participants commit, others don’t — coordinator can’t tell | Inconsistent state until partition heals; sometimes never |
| Adding a 5th service | Probability of a participant being slow or down compounds | System availability is the product of every dependency’s availability |
| Heterogeneous data stores | Postgres + DynamoDB + Kafka + Stripe? No common XA driver. | 2PC isn’t even theoretically applicable |
The deeper issue is philosophical: 2PC tries to make distributed systems look like a single database. The saga pattern accepts they aren’t and works with that fact instead of against it.
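To make the availability row in the table concrete, here is a rough calculation. It assumes each participant is independently available 99.9% of the time (an illustrative figure, not a measurement) and that a 2PC commit needs every participant up at the same moment. The function name is made up for this sketch.

```python
def combined_availability(per_service: float, participants: int) -> float:
    """Availability of an operation that needs every participant up at once.

    Assumes failures are independent, which is a simplification.
    """
    return per_service ** participants


# Print the combined figure for a few participant counts.
for n in (1, 2, 4, 5, 10):
    print(n, f"{combined_availability(0.999, n):.4%}")
```

Under these assumptions four participants land at roughly 99.6%, which is about four times the downtime of any single service, and each added participant pushes the number lower.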
BASE and the embrace of partial failure
BASE is the operating model under sagas:
- Basically Available — the system always responds, even if the response is “your order is being processed.”
- Soft state — the state of the system changes over time without external input, because background processes are still finalizing.
- Eventually consistent — given no new updates, the system will reach a consistent state. The question is “how long” — sometimes milliseconds, sometimes minutes, occasionally hours.
Under BASE, partial failure is a normal mode, not an exception. Your code path for “step 3 failed and we need to compensate steps 1 and 2” is just as production-critical as the happy path. If you don’t test it as such, you don’t have a saga — you have a multi-step write that will leave inconsistent state the first time something goes wrong.
Saga = Sequence of Local Transactions + Compensations
The Core Mental Model
The Problem: Each service has its own database. You cannot wrap them in a single transaction.
The Solution: A saga is an ordered list of local transactions T1, T2, …, Tn, each committed independently at one service. For every Ti that mutates state, there is a compensating transaction Ci that semantically undoes it. If Tk fails, the saga runs Ck-1, Ck-2, …, C1 in reverse order.
The 1987 paper by Hector Garcia-Molina and Kenneth Salem that defined sagas was about long-lived database transactions. The microservices community borrowed the term and the model: each service’s transaction is the “local” part, and the compensation is the per-service undo.
The shape of an order checkout saga
| Step | Local Transaction (Ti) | Compensation (Ci) |
|---|---|---|
| 1 | OrderService: create order in PENDING state | Mark order as CANCELED |
| 2 | PaymentService: charge card $99.00 | Issue refund of $99.00 |
| 3 | InventoryService: reserve 1x SKU-7 | Release the reservation |
| 4 | ShippingService: allocate carrier slot | Cancel slot allocation |
| 5 | OrderService: mark order CONFIRMED | (no compensation needed — this is the saga’s commit point) |
If step 4 fails — carrier API down — the saga runs C3 (release inventory), C2 (refund $99), C1 (cancel order). Each compensation is itself a local transaction at its respective service. From the customer’s perspective: charge appeared, then refund appeared, then a polite email arrived saying “sorry, we couldn’t fulfill this.” That is a successful saga — one that failed gracefully.
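The walk-through above can be sketched as a tiny in-process executor. This is a minimal illustration, not a production orchestrator: `Step`, `run_saga`, and the lambdas are hypothetical names, the "services" only append to a list, and nothing here is durable across a crash.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]                       # Ti: the local transaction
    compensate: Optional[Callable[[], None]] = None  # Ci: None if nothing to undo


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never committed, so only earlier steps are undone.
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate()
            return False
        completed.append(step)
    return True


log: list[str] = []


def fail_shipping() -> None:
    raise RuntimeError("carrier API down")


checkout = [
    Step("order", lambda: log.append("T1 create order"), lambda: log.append("C1 cancel order")),
    Step("payment", lambda: log.append("T2 charge"), lambda: log.append("C2 refund")),
    Step("inventory", lambda: log.append("T3 reserve"), lambda: log.append("C3 release")),
    Step("shipping", fail_shipping, lambda: log.append("C4 cancel slot")),
    Step("confirm", lambda: log.append("T5 confirm")),
]
ok = run_saga(checkout)
```

After the run, `ok` is `False` and `log` holds T1, T2, T3 followed by C3, C2, C1. C4 never runs because step 4 did not commit anything to undo.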
Compensations are business operations, not rollbacks
This is the single most misunderstood thing about sagas. ROLLBACK in SQL erases history — the transaction never happened from any observer’s perspective. A compensation runs after the original commit succeeded. The charge already showed up on the customer’s statement. The compensation issues a refund — a brand-new committed transaction that produces a new line item. Customers can see both. Auditors can see both. Your reporting will show both.
Design compensations with the same rigor as the original action. They have their own latency, their own failure modes, and their own user-visible effects.
Two ways to coordinate the sequence
The local-transactions-plus-compensations model is universal. What varies is who decides the order:
- Choreography — no central brain. Each service publishes events; other services subscribe and react. The sequence emerges from the event topology.
- Orchestration — a central orchestrator process holds the state machine. It tells service A to do its thing, waits for the result, then tells service B, and so on. If anything fails, the orchestrator runs the compensations.
Both are valid. Both are used in production at scale. They have very different operational characteristics, which we’ll cover next.
Choreography Sagas
Why Choreography
The Problem: A central orchestrator is an extra service to build, deploy, and own. Smaller teams want to express “when X happens, Y reacts” without standing up a workflow engine.
The Solution: Choreography. Services publish domain events to a message broker (Kafka, NATS, RabbitMQ, AWS EventBridge). Other services subscribe to the events they care about. The saga is implicit in the event graph.
Choreography in Python with Kafka
    # payment_service.py: reacts to OrderCreated, emits PaymentSucceeded or PaymentFailed
    import json

    import stripe
    from kafka import KafkaConsumer, KafkaProducer

    # already_processed() and record_processed() are this service's own
    # idempotency-table helpers (not shown here).

    consumer = KafkaConsumer(
        "order.events",
        bootstrap_servers="kafka:9092",
        group_id="payment-service",
        enable_auto_commit=False,  # commit offset only after side effects succeed
    )
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode(),
        enable_idempotence=True,  # dedupes producer retries at the broker; not end-to-end exactly-once
    )

    for msg in consumer:
        event = json.loads(msg.value)
        if event["type"] != "OrderCreated":
            consumer.commit()
            continue
        saga_id = event["saga_id"]
        # Idempotency: have we already processed this saga step?
        if already_processed(saga_id, step="charge"):
            consumer.commit()
            continue
        try:
            # Amount is assumed to be in cents and USD. The idempotency key makes a
            # redelivered event return the original charge instead of a second one.
            charge = stripe.Charge.create(
                amount=event["amount"],
                currency="usd",
                source=event["card_token"],
                idempotency_key=f"{saga_id}:charge",
            )
            record_processed(saga_id, step="charge", ref=charge.id)
            producer.send("order.events", {
                "type": "PaymentSucceeded",
                "saga_id": saga_id,
                "charge_id": charge.id,
                "order_id": event["order_id"],
            })
        except stripe.CardError as e:
            producer.send("order.events", {
                "type": "PaymentFailed",
                "saga_id": saga_id,
                "order_id": event["order_id"],
                "reason": str(e),
            })
        producer.flush()
        consumer.commit()
The order, inventory, and shipping services have the same shape. Each one subscribes to events that are relevant to it, performs its local transaction, and emits the next event. The saga is whatever the union of those subscriptions adds up to.
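To see how a failure path emerges from subscriptions alone, here is a toy in-memory version of the event flow. It is a sketch under loud assumptions: the `subscribe`/`publish`/`drain` functions stand in for a real broker, the event names other than `OrderCreated` and `PaymentSucceeded` are invented for illustration, and the inventory service is hard-coded to be out of stock.

```python
from collections import defaultdict, deque
from typing import Callable

Event = dict
handlers: dict[str, list[Callable[[Event], None]]] = defaultdict(list)
queue: deque = deque()
trace: list[str] = []  # event types in delivery order


def subscribe(event_type: str, handler: Callable[[Event], None]) -> None:
    handlers[event_type].append(handler)


def publish(event: Event) -> None:
    queue.append(event)


def drain() -> None:
    """Deliver queued events until none remain (stands in for the broker)."""
    while queue:
        event = queue.popleft()
        trace.append(event["type"])
        for handler in handlers[event["type"]]:
            handler(event)


# Payment service: forward step plus its compensation, both event-driven.
def on_order_created(e: Event) -> None:
    publish({"type": "PaymentSucceeded", "saga_id": e["saga_id"]})


def on_stock_failed(e: Event) -> None:
    publish({"type": "PaymentRefunded", "saga_id": e["saga_id"]})


# Inventory service: out of stock in this scenario, so it emits a failure.
def on_payment_succeeded(e: Event) -> None:
    publish({"type": "StockReservationFailed", "saga_id": e["saga_id"]})


# Order service: closes the saga once the refund is confirmed.
def on_payment_refunded(e: Event) -> None:
    publish({"type": "OrderCanceled", "saga_id": e["saga_id"]})


subscribe("OrderCreated", on_order_created)
subscribe("PaymentSucceeded", on_payment_succeeded)
subscribe("StockReservationFailed", on_stock_failed)
subscribe("PaymentRefunded", on_payment_refunded)

publish({"type": "OrderCreated", "saga_id": "s-1"})
drain()
```

No function in this sketch knows the whole sequence. The compensation chain exists only because the payment service happens to subscribe to the inventory failure event, which is the property the next sections weigh up.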
What choreography gives you
- Loose coupling. No service knows about the others. They only know about the events they consume and produce.
- No central coordinator to fail. The broker is the one shared dependency, and Kafka, NATS, and SQS are all designed to be highly available.
- Easy to add new participants. A new service that wants to award loyalty points just subscribes to OrderCompleted. No coordinator code changes.
What choreography costs you
The complete saga lives in nobody’s codebase. To answer “what happens when an order is placed” you have to read every service’s subscriber. Debugging a stuck order means tracing events across topics. There is no “status” endpoint — the system’s state is implicit in the events flowing through Kafka. Once you have more than ~5 services participating in a saga, this becomes a real problem.
Orchestration Sagas
Why Orchestration
The Problem: Once a saga has more than a handful of steps, branching logic, conditional compensations, or long-running waits (“wait for the customer to confirm”), choreography starts feeling like reading code by grepping for events.
The Solution: A central orchestrator owns the state machine. It calls each service via RPC or by sending command messages, holds the saga state durably, and runs compensations on failure. The flow is one piece of code you can read top-to-bottom.
Orchestration with Temporal (TypeScript)
Temporal (an open-source fork of Uber’s Cadence, built by Cadence’s original creators) is one of the most widely used production orchestrators. It records every workflow step in durable storage, so if the orchestrator crashes mid-saga, the workflow resumes exactly where it left off.
    // orderSaga.ts: a Temporal workflow expressing the entire saga linearly
    import { proxyActivities } from "@temporalio/workflow";
    import type * as activities from "./activities";
    // Order and OrderResult are assumed to live in a shared types module.
    import type { Order, OrderResult } from "./types";

    const {
      chargeCard, refundCard,
      reserveStock, releaseStock,
      allocateShipping, cancelShipping,
      creditLoyalty,
    } = proxyActivities<typeof activities>({
      startToCloseTimeout: "30 seconds",
      retry: { maximumAttempts: 3, initialInterval: "2s" },
    });

    export async function orderSaga(order: Order): Promise<OrderResult> {
      // Newest compensation goes to the front, so iterating runs them in reverse.
      const compensations: (() => Promise<void>)[] = [];
      try {
        const chargeId = await chargeCard(order.cardToken, order.amount, order.id);
        compensations.unshift(() => refundCard(chargeId, order.id));

        const reservationId = await reserveStock(order.sku, order.qty, order.id);
        compensations.unshift(() => releaseStock(reservationId, order.id));

        const shipmentId = await allocateShipping(order.address, order.id);
        compensations.unshift(() => cancelShipping(shipmentId, order.id));

        // Loyalty is best-effort: failure here does NOT roll back the order.
        try { await creditLoyalty(order.userId, order.amount); }
        catch (e) { /* log, alert, continue */ }

        return { status: "CONFIRMED", orderId: order.id };
      } catch (err) {
        // Run compensations in reverse order. Each one is retried by Temporal
        // according to the retry policy above.
        for (const compensate of compensations) {
          try {
            await compensate();
          } catch (compErr) {
            // Retries exhausted: keep going so the remaining compensations still
            // run, and flag this one for manual follow-up (alerting not shown).
          }
        }
        return { status: "FAILED", orderId: order.id, reason: String(err) };
      }
    }
Read that workflow function top-to-bottom and you have the entire business process. The deployment story is also clean: the workflow code runs on Temporal workers, and the activities (which call the actual services) can run on the same workers or on a separate pool. State persistence, retries, timer scheduling, and replay are all handled by the Temporal cluster and SDK.
The same saga as AWS Step Functions
If you don’t want to run a Temporal cluster, AWS Step Functions gives you the same model as a managed service. You declare the state machine as JSON, and Step Functions handles persistence and any retries you declare. Note that each Catch sets a ResultPath: without it, the error output replaces the whole state, and the compensation states would no longer find $.charge or $.reservation.

    {
      "Comment": "Order checkout saga with compensations",
      "StartAt": "ChargeCard",
      "States": {
        "ChargeCard": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "chargeCard", "Payload.$": "$" },
          "ResultPath": "$.charge",
          "Next": "ReserveStock",
          "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "FailOrder" }]
        },
        "ReserveStock": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "reserveStock", "Payload.$": "$" },
          "ResultPath": "$.reservation",
          "Next": "AllocateShipping",
          "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RefundCharge" }]
        },
        "AllocateShipping": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "allocateShipping", "Payload.$": "$" },
          "ResultPath": "$.shipment",
          "Next": "ConfirmOrder",
          "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "ReleaseStock" }]
        },
        "ConfirmOrder": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "confirmOrder", "Payload.$": "$" },
          "End": true
        },
        "ReleaseStock": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "releaseStock", "Payload.$": "$.reservation" },
          "ResultPath": null,
          "Next": "RefundCharge"
        },
        "RefundCharge": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "refundCard", "Payload.$": "$.charge" },
          "ResultPath": null,
          "Next": "FailOrder"
        },
        "FailOrder": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "failOrder", "Payload.$": "$" },
          "End": true
        }
      }
    }
Production orchestrator options
- Temporal — open source, code-first workflows in TypeScript / Go / Java / Python. Spun out of Uber’s Cadence project. The default choice for new orchestration in 2026.
- AWS Step Functions — managed service, JSON state machines, deep AWS integration. The default choice on AWS if you don’t want to run infra.
- Camunda — BPMN-driven, popular in regulated enterprise environments where business analysts edit the diagram.
- Apache Airflow — primarily a data pipeline orchestrator, but fine for non-realtime sagas (batch reconciliation, reporting workflows).
- Microsoft Orleans — virtual actors framework with built-in transactions; the .NET answer to Temporal.
- Eventuate Tram / Axon — saga support layered on top of event-sourcing and CQRS frameworks.
Implementing Compensations
Why Compensations Are Hard
The Problem: The forward path of the saga gets all the design attention. Compensations are written last, tested in isolation, and discovered to be wrong during the first real outage.
The Solution: Treat each compensation as a first-class business operation. Design it before you ship the forward step. Make it idempotent. Decide what to do when it itself fails.
Three properties every compensation must have
- Semantically correct. “Refund the charge” is correct. “Delete the row in the payments table” is not — that erases the audit trail and may break downstream reporting.
- Idempotent. The orchestrator may retry the compensation if its first attempt times out. Running “refund $99” twice should not refund $198.
- Commutative with neighbors when possible. If two compensations could run in either order without changing the outcome, you have more retry flexibility.
An idempotent compensation handler
    # refund_handler.py: the “C” for the payment step
    from dataclasses import dataclass

    import stripe

    # `db` is assumed to be this service's database helper (not shown here).


    @dataclass
    class RefundCommand:
        saga_id: str
        charge_id: str
        amount_cents: int
        idempotency_key: str  # derived from saga_id + step name


    def handle_refund(cmd: RefundCommand) -> str:
        # 1. Check the local idempotency table FIRST.
        existing = db.fetch_one(
            "SELECT refund_id FROM compensation_log WHERE idempotency_key = %s",
            cmd.idempotency_key,
        )
        if existing:
            return existing["refund_id"]  # already done; safe to return

        # 2. Stripe also accepts an idempotency key, so pass ours through.
        #    If Stripe has already seen this key, it returns the original refund.
        refund = stripe.Refund.create(
            charge=cmd.charge_id,
            amount=cmd.amount_cents,
            idempotency_key=cmd.idempotency_key,
        )

        # 3. Record locally so future retries short-circuit at step 1.
        db.execute(
            "INSERT INTO compensation_log (idempotency_key, saga_id, refund_id, "
            "created_at) VALUES (%s, %s, %s, NOW())",
            cmd.idempotency_key, cmd.saga_id, refund.id,
        )
        return refund.id
Notice the layering: the local compensation_log table is the first defense, the upstream provider’s idempotency key is the second. Either alone would mostly work; both together is what survives a Stripe outage in the middle of a refund retry.
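The handler above relies on an idempotency key "derived from saga_id + step name". One way to derive it, and to see why a retry cannot double-refund, is sketched below. The hashing scheme and the `FakeRefundProvider` class are assumptions made for illustration; the fake only imitates the dedupe-by-key behaviour and is not Stripe's API.

```python
import hashlib


def idempotency_key(saga_id: str, step: str) -> str:
    """Deterministic key: the same saga and step always map to the same value."""
    return hashlib.sha256(f"{saga_id}:{step}".encode()).hexdigest()


class FakeRefundProvider:
    """Stand-in for a payment API that dedupes requests on an idempotency key."""

    def __init__(self) -> None:
        self.refunds: dict[str, dict] = {}

    def refund(self, charge_id: str, amount_cents: int, key: str) -> str:
        if key not in self.refunds:
            self.refunds[key] = {
                "id": f"re_{len(self.refunds) + 1}",
                "charge": charge_id,
                "amount": amount_cents,
            }
        return self.refunds[key]["id"]  # a repeat call returns the original refund


provider = FakeRefundProvider()
key = idempotency_key("saga-42", "refund")
first = provider.refund("ch_1", 9900, key)
second = provider.refund("ch_1", 9900, key)  # retry after a timeout
```

Both calls return the same refund ID and the provider holds a single refund. Because the key is computed rather than randomly generated, an orchestrator that restarts and loses its in-memory state still produces the same key on the next attempt.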
When the compensation itself fails
This is the “dead letter” of sagas. The forward path failed, the orchestrator started compensating, and now a compensation is failing too. Options, roughly in order of how often they’re used:
| Strategy | When To Use | Trade-off |
|---|---|---|
| Retry forever with backoff | Compensation is genuinely transient (network blip, deploy) | Saga state hangs “in compensation” until it resolves |
| Park to manual queue | Compensation logic detects an inconsistency that needs human eyes | You need someone on call who can read the queue |
| Forward-only fix | Some operations literally can’t be undone (email already sent) | Decide ahead of time what the “sorry” user experience is |
| Open a support ticket automatically | Compensation involves money that has to be made right | Customer trust depends on follow-up SLA |
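The first two rows of the table often combine: retry transient failures with backoff, then park whatever is left for a human. A minimal sketch follows, assuming a hypothetical `TransientError` marker class and an in-memory list standing in for the manual queue. The attempt count and delays are placeholder values.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx)."""


manual_queue: list[dict] = []  # stands in for a real review queue


def compensate_with_retry(
    name: str,
    compensate: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the compensation ran; otherwise park it and return False."""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate()
            return True
        except TransientError:
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
        except Exception as exc:
            # Not transient: retrying will not help, so hand it to a human now.
            manual_queue.append({"step": name, "error": repr(exc)})
            return False
    manual_queue.append({"step": name, "error": "retries exhausted"})
    return False
```

The `sleep` parameter is injectable so the retry schedule can be tested without waiting. A durable orchestrator would persist the attempt count instead of keeping it in a local loop.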
Some things have no compensation
You sent the order confirmation email. There is no “unsend.” You shipped the package. There is no “unship” — just a future return. Identify these one-way operations early and place them after the saga’s point of no return, never in the middle. The classic pattern: send the email only after every other step has confirmed success.
Common Failure Modes
Sagas fail in distinctive ways. The patterns below show up in every production saga system; planning for them is the difference between an outage and a slightly-degraded afternoon.
Lost commit acknowledgement
The participant service committed locally, then the network dropped before the orchestrator received the response. The orchestrator thinks the step failed and either retries or starts compensating. If you wired idempotency in at design time, the retry is a no-op and the compensation is correctly preceded by an “is there anything to compensate?” check. If you didn’t, you double-charge a card or release inventory you never reserved.
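The "is there anything to compensate?" check can be as small as a state lookup keyed by saga ID. The sketch below uses a plain dictionary in place of the inventory database; the function names and state strings are illustrative assumptions.

```python
reservations: dict[str, str] = {}  # saga_id -> state; stands in for the inventory DB


def reserve_stock(saga_id: str) -> str:
    """Forward step: a retried call finds the existing row and changes nothing."""
    return reservations.setdefault(saga_id, "RESERVED")


def release_stock(saga_id: str) -> str:
    """Compensation: first ask whether there is anything to compensate."""
    state = reservations.get(saga_id)
    if state is None:
        return "NOTHING_TO_RELEASE"  # the forward step never committed
    if state == "RELEASED":
        return "ALREADY_RELEASED"    # a previous compensation attempt succeeded
    reservations[saga_id] = "RELEASED"
    return "RELEASED"
```

With this shape, an orchestrator that lost the acknowledgement can safely do either thing: retrying `reserve_stock` does not reserve twice, and calling `release_stock` for a reservation that never happened does not free stock that was never held.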
Long-running sagas with mid-flight customer action
A travel booking saga that says “reserve the seat, hold for 5 minutes for payment confirmation, then book” is a saga that lives for minutes. During those minutes, the customer might cancel, the seat might disappear, the payment provider might 5xx. Your saga state needs to be:
- Persistent — surviving orchestrator restarts (Temporal does this for free; rolling your own means a database table per saga instance).
- Cancelable — an explicit signal the customer can send (“I changed my mind”) that runs compensations in flight.
- Time-boxed — if no progress in N minutes, force-compensate and free held resources.
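The time-boxing item can be sketched with a plain asyncio timeout. This is only an in-process illustration with hypothetical names (`run_with_deadline`, `hold_seat_and_wait`, `release_seat`); a timer like this does not survive a restart, which is the gap durable workflow engines fill.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_deadline(
    forward: Callable[[], Awaitable[None]],
    compensate: Callable[[], Awaitable[None]],
    deadline_s: float,
) -> str:
    """Run the forward path; if it overruns the deadline, force-compensate."""
    try:
        await asyncio.wait_for(forward(), timeout=deadline_s)
        return "COMPLETED"
    except asyncio.TimeoutError:
        await compensate()
        return "COMPENSATED"


held = {"seat": False}


async def hold_seat_and_wait() -> None:
    held["seat"] = True
    await asyncio.sleep(1.0)  # simulates waiting on a payment confirmation


async def release_seat() -> None:
    held["seat"] = False


async def demo() -> str:
    # The deadline (50 ms) is far shorter than the simulated wait.
    return await run_with_deadline(hold_seat_and_wait, release_seat, deadline_s=0.05)
```

Running `asyncio.run(demo())` cancels the pending wait when the deadline passes and releases the seat, so the held resource is freed even though no step explicitly failed.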
Hot-spot lookup tables
Every saga reads order_status. Every step writes order_status. Suddenly the orders table is the bottleneck of the entire business. Mitigations: shard by order ID, keep saga state out of the order table (separate saga_state table or workflow engine), use append-only event sourcing instead of in-place status updates.
Cancellation arrives mid-saga
Step 2 just succeeded. Step 3 is in flight. The customer hits “Cancel.” What happens? In a well-designed saga: the cancel is a signal that gets queued; the orchestrator finishes the current step (or aborts it cleanly), then runs compensations for whatever has been done. In a badly designed one: the cancel races with step 3, both partially succeed, and you have an order that is canceled but also shipped.
    // Temporal makes this clean: signals are first-class
    import { defineSignal, setHandler } from "@temporalio/workflow";

    // chargeCard, refundCard, reserveStock, and releaseStock are activities
    // proxied with proxyActivities, as in the earlier workflow (omitted here).

    export const cancelSignal = defineSignal("cancel");

    export async function orderSaga(order: Order) {
      let canceled = false;
      setHandler(cancelSignal, () => { canceled = true; });

      const compensations: (() => Promise<void>)[] = [];
      try {
        if (canceled) throw new Error("canceled before start");
        const chargeId = await chargeCard(order);
        compensations.unshift(() => refundCard(chargeId));

        // The flag is checked between steps, so a cancel never interrupts a step midway.
        if (canceled) throw new Error("canceled after charge");
        const reservationId = await reserveStock(order);
        compensations.unshift(() => releaseStock(reservationId));
        // ... rest of saga
      } catch (err) {
        for (const c of compensations) await c();
        throw err;
      }
    }
Choosing Choreography vs Orchestration
Both are correct. The right answer depends on the shape of your system, not on which one is fashionable.
| Factor | Choreography Wins | Orchestration Wins |
|---|---|---|
| Number of services | 2–4 participants | 5+ participants |
| Branching logic | Linear flow, mostly happy path | If/else, parallel branches, loops |
| Long-running waits | Seconds to minutes | Hours, days, human approvals |
| Observability needs | You’re fine reading Kafka | You want a dashboard with state history |
| Team ownership | One team per service, no shared owner | One team owns the end-to-end business process |
| Tooling cost | You already have Kafka | You can run Temporal / pay for Step Functions |
| Compensation complexity | Simple per-event reactions | Multi-step compensations with their own decisions |
| Audit / compliance | Event log is the audit trail | Workflow state history is the audit trail |
The pragmatic rule of thumb
Start with choreography for your first one or two sagas. Once you find yourself debugging an order by tailing four Kafka topics simultaneously, switch that saga to orchestration. Keep the others as-is. Most large systems run a mix — choreography for high-volume simple flows (publish-and-react), orchestration for the high-value complex ones (checkout, onboarding, fraud review).
Real-World Examples
Amazon runs sagas at the scale of its retail business. Werner Vogels has talked publicly about how Amazon’s services use saga-style coordination — each step is a local commit at one service, and compensation is a deliberate forward action. The company’s investment in Step Functions exists in part because they needed an internal-grade orchestrator first.
Uber built Cadence, open-sourced in 2017, to coordinate driver assignment, fare calculation, and trip lifecycle workflows that can span hours (a long ride) or weeks (a multi-step refund). The team behind it later left to build Temporal, which is now used by Snap, Coinbase, Datadog, Box, Stripe, and many others. Uber’s production workflows still run on Cadence; Temporal is what most new users adopt.
AWS Step Functions case studies include Capital One (loan origination), Liberty Mutual (claims), and Coca-Cola Andina (logistics). The common pattern: a process with 5–15 steps, each a Lambda or service call, with explicit compensation paths and a need for the legal/compliance team to be able to point at the state machine diagram.
Microsoft Project Orleans introduced “virtual actors with transactions,” embedding saga semantics directly into the actor model. Used internally for Halo presence services and Skype messaging features.
Booking.com and Airbnb have both published on saga-driven reservations — Booking specifically discusses choreography via Kafka, Airbnb has talked about orchestration via internal workflow engines for guest-host coordination.
Eventuate (Chris Richardson’s framework) and Axon Framework are the two libraries that bring saga support to event-sourced JVM applications. They’re used heavily in financial services where event sourcing was already the architectural baseline.
Best Practices
The short list
- Design compensations before you ship the forward path. If you can’t describe how to undo step N, don’t add step N to the saga. Park it after the commit point or rethink the design.
- Make every step idempotent. Use a deterministic idempotency key (saga ID + step name). The participant should be able to receive the same command twice and produce one effect.
- Persist saga state. Either let Temporal / Step Functions / Camunda do it, or have a dedicated saga_state table that survives orchestrator restarts. Saga state in memory is saga state you will lose.
- Use the outbox pattern. When a service commits its local transaction, write the “next event” into an outbox table in the same transaction. A separate process publishes the outbox to the broker. This is how you avoid “committed locally but failed to publish.”
- Distinguish business failures from infrastructure failures. Card declined ≠ Stripe API down. The first is a saga “failure” that triggers compensation; the second is a transient that retries. Misclassify and you’ll refund customers because Stripe had a 500.
- Cap saga duration. Every saga gets a deadline (5 minutes for checkout, 7 days for refunds). When the deadline expires, force-compensate. Sagas that hang forever are how money gets locked away from customers.
- Instrument every step. Per-step success / failure / latency / retry-count metrics. A dashboard showing “sagas in compensation right now” is the single most useful operational view.
- Run game days for compensations. Inject a failure at step N in staging and watch the compensations fire. Most teams discover their compensation paths are broken the first time they actually run.
- One team owns the saga. Even in a choreography setup, somebody has to own “what happens when the customer places an order.” Without an owner, the saga drifts and nobody’s job is to fix it.
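Of the practices above, the outbox pattern is the one most easily shown in code. The sketch below uses an in-memory SQLite database so it is self-contained; the table names, `record_payment`, and `relay_once` are assumptions for illustration, and the `publish` callable stands in for a real broker client.

```python
import json
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (saga_id TEXT PRIMARY KEY, charge_id TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def record_payment(saga_id: str, charge_id: str) -> None:
    """Write the business row and the next event in ONE local transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO payments VALUES (?, ?)", (saga_id, charge_id))
        event = {"type": "PaymentSucceeded", "saga_id": saga_id, "charge_id": charge_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(publish: Callable[[dict], None]) -> int:
    """Publish pending rows in order, marking each only after publish() returns."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A crash between `publish()` and the `UPDATE` means the event is sent again on the next pass, so delivery is at-least-once and consumers still need the idempotency checks described earlier.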
The single most useful sentence about sagas
You will never have a database transaction across services. The saga pattern is not a workaround — it is the acknowledgement that the constraint is real, and the discipline of designing the reverse path with the same care as the forward one. Treat compensations as features, not as failure handling, and your distributed transactions will look surprisingly mundane.