The Saga Pattern | LIZIU Microservices

Why Sagas Matter

Why The Saga Pattern Matters

The Problem: A user clicks “Place Order.” You need to charge the card, reserve inventory, allocate a shipping slot, and award loyalty points — across four services with four databases. If the shipping slot is full after the charge succeeded, you cannot just ROLLBACK. The money already moved.

The Solution: A saga is a sequence of local transactions, each at one service, where every step that mutates state ships with a compensating action. If step 4 fails, you run the compensations for steps 3, 2, and 1, in that order, to put the world back in a sensible state.

Real Impact: Sagas are how Amazon checks out orders, how Uber dispatches rides, and how any bank that has more than one ledger reconciles transfers. If you have more than one database in your write path, you are already running a saga — the question is whether it’s an accidental one or a designed one.

Real-World Analogy

Booking a vacation involves three separate vendors:

Book the flight — American Airlines charges your card and locks a seat.
Book the hotel — Marriott charges your card and reserves a room.
Book the rental car — Hertz tries to reserve a car — none available.

You cannot tell American “please undo that flight booking, the car was sold out.” The seat sale already happened. What you can do is initiate two new transactions: cancel the flight (refund issued, seat released) and cancel the hotel (refund issued, room released). Those are compensations. They are not rollbacks — they are forward-direction business operations that reverse the visible effect of an earlier commit.

That is exactly what a software saga does between your services.

Practical ACID transactions across services do not exist. Two-phase commit can technically span databases, but at production scale it rarely survives contact with reality: the held locks, the blocking on a coordinator, the operational fragility. The saga pattern is the dominant alternative: accept that each step is independently committed, and design the reverse path as carefully as you design the forward path.

This tutorial is about how to design that reverse path well, when to use a central orchestrator versus event choreography, and what production systems — Temporal, AWS Step Functions, Camunda, Eventuate — actually do for you.

The Distributed Transaction Problem

Why 2PC Is Not The Answer

The Problem: Two-phase commit (2PC) gives you atomic commit across multiple resources by adding a coordinator that asks every participant “can you commit?” and only proceeds if all say yes. It works. It also forces every participant to hold its row locks from the prepare phase until the coordinator’s decision arrives, and if the coordinator dies after a participant votes “yes,” that participant is stuck holding locks until the coordinator recovers or a human intervenes.

The Solution: Drop the strong consistency goal. Embrace BASE: Basically Available, Soft state, Eventually consistent. Each service commits independently, and you build the reconciliation as part of the business logic, not as an infrastructure guarantee.

Why 2PC fails at scale

Failure Mode	What Happens	Impact
Coordinator dies in prepare phase	Participants are blocked with locks held, waiting for the verdict	Unbounded latency until manual recovery
Participant slow on prepare	Every other participant holds locks waiting	Hot rows become catastrophic; tail latency goes to seconds
Network partition mid-commit	Some participants commit, others don’t — coordinator can’t tell	Inconsistent state until partition heals; sometimes never
Adding a 5th service	Probability of a participant being slow or down compounds	System availability is the product of every dependency’s availability
Heterogeneous data stores	Postgres + DynamoDB + Kafka + Stripe? No common XA driver.	2PC isn’t even theoretically applicable

The deeper issue is philosophical: 2PC tries to make distributed systems look like a single database. The saga pattern accepts they aren’t and works with that fact instead of against it.
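The compounding effect in the "Adding a 5th service" row can be checked with a few lines of arithmetic. The sketch below assumes every participant fails independently and is available 99.9% of the time; both the independence assumption and the 99.9% figure are illustrative, not measurements from any real system.

```python
# Illustrative arithmetic only: assumes independent failures and a made-up
# 99.9% availability per participant.
from math import prod


def combined_availability(availabilities: list[float]) -> float:
    """A 2PC commit needs every participant up, so availabilities multiply."""
    return prod(availabilities)


def downtime_minutes_per_30_days(availability: float) -> float:
    """Expected unavailable minutes over a 30-day window."""
    return (1.0 - availability) * 30 * 24 * 60


four = combined_availability([0.999] * 4)
five = combined_availability([0.999] * 5)

for label, value in (("4 services", four), ("5 services", five)):
    minutes = downtime_minutes_per_30_days(value)
    print(f"{label}: {value:.4%} available, about {minutes:.0f} minutes down per 30 days")
```

Each extra participant multiplies in another factor below 1, so the combined figure can only go down as the transaction grows.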

BASE and the embrace of partial failure

BASE is the operating model under sagas:

Basically Available — the system always responds, even if the response is “your order is being processed.”
Soft state — the state of the system changes over time without external input, because background processes are still finalizing.
Eventually consistent — given no new updates, the system will reach a consistent state. The question is “how long” — sometimes milliseconds, sometimes minutes, occasionally hours.

Under BASE, partial failure is a normal mode, not an exception. Your code path for “step 3 failed and we need to compensate steps 1 and 2” is just as production-critical as the happy path. If you don’t test it as such, you don’t have a saga — you have a multi-step write that will leave inconsistent state the first time something goes wrong.

Saga = Sequence of Local Transactions + Compensations

The Core Mental Model

The Problem: Each service has its own database. You cannot wrap them in a single transaction.

The Solution: A saga is an ordered list of local transactions T₁, T₂, …, T_n, each committed independently at one service. For every T_i that mutates state, there is a compensating transaction C_i that semantically undoes it. If T_k fails, the saga runs C_{k-1}, C_{k-2}, …, C₁ in reverse order.

The original 1987 paper by Hector Garcia-Molina and Kenneth Salem that defined sagas was about long-lived database transactions. The microservices community borrowed the term and the math: each service’s transaction is the “local” part, and the compensation is the per-service undo.

The shape of an order checkout saga

Step	Local Transaction (T_i)	Compensation (C_i)
1	OrderService: create order in PENDING state	Mark order as CANCELED
2	PaymentService: charge card $99.00	Issue refund of $99.00
3	InventoryService: reserve 1x SKU-7	Release the reservation
4	ShippingService: allocate carrier slot	Cancel slot allocation
5	OrderService: mark order CONFIRMED	(no compensation needed — this is the saga’s commit point)

If step 4 fails — carrier API down — the saga runs C₃ (release inventory), C₂ (refund $99), C₁ (cancel order). Each compensation is itself a local transaction at its respective service. From the customer’s perspective: charge appeared, then refund appeared, then a polite email arrived saying “sorry, we couldn’t fulfill this.” That is a successful saga — one that failed gracefully.
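The table and the failure walk-through above can be sketched as a tiny in-memory executor. This is a minimal illustration under stated assumptions, not a production design: `Step`, `run_saga`, and the step names are hypothetical, the actions are no-ops, and nothing is persisted or retried.

```python
# Minimal in-memory sketch of "local transactions plus compensations".
# All names here are hypothetical; real steps would call real services.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]                   # T_i: the local transaction
    compensation: Optional[Callable[[], None]]   # C_i: None at the commit point


def run_saga(steps: list[Step]) -> tuple[str, list[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            log.append(f"FAILED {step.name}: {exc}")
            for done in reversed(completed):
                if done.compensation is not None:
                    done.compensation()
                    log.append(f"COMPENSATED {done.name}")
            return "FAILED", log
        completed.append(step)
        log.append(f"DONE {step.name}")
    return "CONFIRMED", log


def do_nothing() -> None:
    """Placeholder for a local transaction or compensation that succeeds."""


def carrier_down() -> None:
    raise RuntimeError("carrier API down")


checkout = [
    Step("create_order", do_nothing, do_nothing),
    Step("charge_card", do_nothing, do_nothing),
    Step("reserve_stock", do_nothing, do_nothing),
    Step("allocate_shipping", carrier_down, do_nothing),
    Step("confirm_order", do_nothing, None),
]

status, log = run_saga(checkout)
print(status)
for line in log:
    print(line)
```

The failed step itself is not compensated, because its local transaction never committed; only the steps recorded in `completed` are undone.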

Compensations are business operations, not rollbacks

This is the single most misunderstood thing about sagas. ROLLBACK in SQL erases history — the transaction never happened from any observer’s perspective. A compensation runs after the original commit succeeded. The charge already showed up on the customer’s statement. The compensation issues a refund — a brand-new committed transaction that produces a new line item. Customers can see both. Auditors can see both. Your reporting will show both.

Design compensations with the same rigor as the original action. They have their own latency, their own failure modes, and their own user-visible effects.
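To make the difference from ROLLBACK concrete, here is a small sketch of an append-only ledger in which the refund is a second entry rather than the removal of the first. `PaymentLedger`, `LedgerEntry`, and the `ch_demo` identifier are hypothetical names used only for illustration.

```python
# Sketch: a compensation appends a new record; it never erases the original.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LedgerEntry:
    kind: str           # "charge" or "refund"
    amount_cents: int   # positive for charges, negative for refunds
    reference: str      # the charge this entry belongs to


@dataclass
class PaymentLedger:
    entries: list[LedgerEntry] = field(default_factory=list)

    def charge(self, charge_id: str, amount_cents: int) -> None:
        self.entries.append(LedgerEntry("charge", amount_cents, charge_id))

    def refund(self, charge_id: str, amount_cents: int) -> None:
        # The compensation is a new committed entry, visible to customers and auditors.
        self.entries.append(LedgerEntry("refund", -amount_cents, charge_id))

    def net_cents(self) -> int:
        return sum(entry.amount_cents for entry in self.entries)


ledger = PaymentLedger()
ledger.charge("ch_demo", 9900)
ledger.refund("ch_demo", 9900)

print([entry.kind for entry in ledger.entries], ledger.net_cents())
```

The net amount returns to zero, but the history still shows two line items, which is exactly what a statement or an audit report would show.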

Two ways to coordinate the sequence

The local-transactions-plus-compensations model is universal. What varies is who decides the order:

Choreography — no central brain. Each service publishes events; other services subscribe and react. The sequence emerges from the event topology.
Orchestration — a central orchestrator process holds the state machine. It tells service A to do its thing, waits for the result, then tells service B, and so on. If anything fails, the orchestrator runs the compensations.

Both are valid. Both are used in production at scale. They have very different operational characteristics, which we’ll cover next.

Choreography Sagas

Why Choreography

The Problem: A central orchestrator is an extra service to build, deploy, and own. Smaller teams want to express “when X happens, Y reacts” without standing up a workflow engine.

The Solution: Choreography. Services publish domain events to a message broker (Kafka, NATS, RabbitMQ, AWS EventBridge). Other services subscribe to the events they care about. The saga is implicit in the event graph.

Choreography in Python with Kafka

# payment_service.py — reacts to OrderCreated, emits PaymentSucceeded or PaymentFailed
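import json

import stripe
from kafka import KafkaConsumer, KafkaProducer

# already_processed() and record_processed() are this service's own idempotency
# helpers (for example, backed by a local table); they are not shown here.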

consumer = KafkaConsumer(
    "order.events",
    bootstrap_servers="kafka:9092",
    group_id="payment-service",
    enable_auto_commit=False,    # commit offset only after side effects succeed
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
    enable_idempotence=True,        # producer retries cannot duplicate a write; needs a client version that supports this option
)

for msg in consumer:
    event = json.loads(msg.value)
    if event["type"] != "OrderCreated":
        consumer.commit()
        continue

    saga_id = event["saga_id"]
    # Idempotency: have we already processed this saga step?
    if already_processed(saga_id, step="charge"):
        consumer.commit()
        continue

    try:
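        # A deterministic idempotency key keeps a redelivered event from double-charging.
        charge = stripe.Charge.create(
            amount=event["amount"],          # integer amount in the smallest currency unit
            currency=event["currency"],      # assumes the event carries a currency code
            source=event["card_token"],
            idempotency_key=f"{saga_id}:charge",
        )
        charge_id = charge.id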
        record_processed(saga_id, step="charge", ref=charge_id)
        producer.send("order.events", {
            "type": "PaymentSucceeded",
            "saga_id": saga_id,
            "charge_id": charge_id,
            "order_id": event["order_id"],
        })
    except stripe.CardError as e:
        producer.send("order.events", {
            "type": "PaymentFailed",
            "saga_id": saga_id,
            "reason": str(e),
        })
    producer.flush()
    consumer.commit()

The order, inventory, and shipping services have the same shape. Each one subscribes to events that are relevant to it, performs its local transaction, and emits the next event. The saga is whatever the union of those subscriptions adds up to.
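The whole event graph can be seen at once in a toy version that swaps Kafka for an in-memory queue. This is a sketch under stated assumptions: `InMemoryBus`, `wire_services`, and the event names `StockReserved`, `StockRejected`, and `PaymentRefunded` are hypothetical, and the handlers only mutate dictionaries instead of calling real services.

```python
# Toy choreography: each handler reacts to one event type and publishes the next.
from collections import defaultdict, deque
from typing import Callable

Event = dict[str, str]
Handler = Callable[[Event], None]


class InMemoryBus:
    """Stand-in for a broker topic: queues events and delivers them in FIFO order."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)
        self._queue: deque[Event] = deque()
        self.history: list[str] = []   # event types in delivery order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: Event) -> None:
        self._queue.append(event)

    def run(self) -> None:
        """Deliver queued events until no handler publishes anything new."""
        while self._queue:
            event = self._queue.popleft()
            self.history.append(event["type"])
            for handler in self._subscribers[event["type"]]:
                handler(event)


def wire_services(bus: InMemoryBus, stock: dict[str, int], orders: dict[str, str]) -> None:
    def charge(event: Event) -> None:      # payment service, forward step
        bus.publish({**event, "type": "PaymentSucceeded"})

    def reserve(event: Event) -> None:     # inventory service, forward step
        sku = event["sku"]
        if stock.get(sku, 0) > 0:
            stock[sku] -= 1
            bus.publish({**event, "type": "StockReserved"})
        else:
            bus.publish({**event, "type": "StockRejected"})

    def refund(event: Event) -> None:      # payment service, compensation
        bus.publish({**event, "type": "PaymentRefunded"})

    def confirm(event: Event) -> None:     # order service, happy ending
        orders[event["order_id"]] = "CONFIRMED"

    def cancel(event: Event) -> None:      # order service, compensation
        orders[event["order_id"]] = "CANCELED"

    bus.subscribe("OrderCreated", charge)
    bus.subscribe("PaymentSucceeded", reserve)
    bus.subscribe("StockRejected", refund)
    bus.subscribe("StockReserved", confirm)
    bus.subscribe("PaymentRefunded", cancel)


bus = InMemoryBus()
stock = {"SKU-7": 0}                 # out of stock, so the saga must compensate
orders = {"o-1": "PENDING"}
wire_services(bus, stock, orders)
bus.publish({"type": "OrderCreated", "order_id": "o-1", "sku": "SKU-7"})
bus.run()
print(bus.history, orders)
```

No function in this sketch contains the full sequence; it exists only as the set of `subscribe` calls, which is the property the following sections weigh up.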

What choreography gives you

Loose coupling. No service knows about the others. They only know about the events they consume and produce.
No single point of failure. The broker is the dependency — and Kafka, NATS, and SQS are all designed to be highly available.
Easy to add new participants. A new service that wants to award loyalty points just subscribes to OrderCompleted. No coordinator code changes.

What choreography costs you

The complete saga lives in nobody’s codebase. To answer “what happens when an order is placed” you have to read every service’s subscriber. Debugging a stuck order means tracing events across topics. There is no “status” endpoint — the system’s state is implicit in the events flowing through Kafka. Once you have more than ~5 services participating in a saga, this becomes a real problem.
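One common mitigation is to rebuild a per-saga timeline from the events themselves. The sketch below assumes the events from all topics have already been merged into one list in order, that every event carries a `saga_id`, and that `OrderConfirmed` and `OrderCanceled` are the terminal event types; those names are hypothetical.

```python
# Sketch: derive "where is this saga?" from a merged, ordered event list.
from collections import defaultdict

# Hypothetical event types that mean the saga has finished, one way or the other.
TERMINAL_TYPES = {"OrderConfirmed", "OrderCanceled"}


def saga_timelines(events: list[dict]) -> dict[str, list[str]]:
    """Group event types by saga_id, preserving the order they arrived in."""
    timelines: dict[str, list[str]] = defaultdict(list)
    for event in events:
        timelines[event["saga_id"]].append(event["type"])
    return dict(timelines)


def stuck_sagas(events: list[dict]) -> list[str]:
    """Saga IDs whose most recent event is not a terminal one."""
    return sorted(
        saga_id
        for saga_id, types in saga_timelines(events).items()
        if types[-1] not in TERMINAL_TYPES
    )


events = [
    {"saga_id": "s-1", "type": "OrderCreated"},
    {"saga_id": "s-2", "type": "OrderCreated"},
    {"saga_id": "s-1", "type": "PaymentSucceeded"},
    {"saga_id": "s-2", "type": "PaymentSucceeded"},
    {"saga_id": "s-1", "type": "OrderConfirmed"},
]

print(saga_timelines(events))
print(stuck_sagas(events))
```

In practice a check like this would also need a time threshold, since a saga whose last event arrived a second ago is in flight rather than stuck.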

Orchestration Sagas

Why Orchestration

The Problem: Once a saga has more than a handful of steps, branching logic, conditional compensations, or long-running waits (“wait for the customer to confirm”), choreography starts feeling like reading code by grepping for events.

The Solution: A central orchestrator owns the state machine. It calls each service via RPC or by sending command messages, holds the saga state durably, and runs compensations on failure. The flow is one piece of code you can read top-to-bottom.

Orchestration with Temporal (TypeScript)

Temporal (the open-source successor to Uber’s Cadence) is one of the most widely adopted production orchestrators. It records every workflow step in durable storage, so if the orchestrator crashes mid-saga, the workflow resumes exactly where it left off.

// orderSaga.ts — a Temporal workflow expressing the entire saga linearly
import { proxyActivities } from "@temporalio/workflow";
import type * as activities from "./activities";

const {
    chargeCard, refundCard,
    reserveStock, releaseStock,
    allocateShipping, cancelShipping,
    creditLoyalty,
} = proxyActivities<typeof activities>({
    startToCloseTimeout: "30 seconds",
    retry: { maximumAttempts: 3, initialInterval: "2s" },
});

export async function orderSaga(order: Order): Promise<OrderResult> {
    const compensations: (() => Promise<void>)[] = [];

    try {
        const chargeId = await chargeCard(order.cardToken, order.amount, order.id);
        compensations.unshift(() => refundCard(chargeId, order.id));

        const reservationId = await reserveStock(order.sku, order.qty, order.id);
        compensations.unshift(() => releaseStock(reservationId, order.id));

        const shipmentId = await allocateShipping(order.address, order.id);
        compensations.unshift(() => cancelShipping(shipmentId, order.id));

        // Loyalty is best-effort — failure here does NOT roll back the order.
        try { await creditLoyalty(order.userId, order.amount); }
        catch (e) { /* log, alert, continue */ }

        return { status: "CONFIRMED", orderId: order.id };
    } catch (err) {
        // Run compensations in reverse order. Each one is itself retried by Temporal.
        for (const compensate of compensations) {
            await compensate();
        }
        return { status: "FAILED", orderId: order.id, reason: String(err) };
    }
}

Read that workflow function top-to-bottom and you have the entire business process. The deployment story is also clean: the workflow code runs on Temporal workers, and the activities (which call the actual services) run on workers too, either the same ones or a separate pool. State persistence, retries, timer scheduling, and replay are all handled by the Temporal cluster.

The same saga as AWS Step Functions

If you don’t want to run a Temporal cluster, AWS Step Functions gives you the same model as a managed service. You declare the state machine as JSON, and Step Functions handles persistence and retries.

{
  "Comment": "Order checkout saga with compensations",
  "StartAt": "ChargeCard",
  "States": {
    "ChargeCard": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "chargeCard", "Payload.$": "$" },
      "ResultPath": "$.charge",
      "Next": "ReserveStock",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "FailOrder" }]
    },
    "ReserveStock": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "reserveStock", "Payload.$": "$" },
      "ResultPath": "$.reservation",
      "Next": "AllocateShipping",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RefundCharge" }]
    },
    "AllocateShipping": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "allocateShipping", "Payload.$": "$" },
      "ResultPath": "$.shipment",
      "Next": "ConfirmOrder",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "ReleaseStock" }]
    },
    "ConfirmOrder":    { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
                         "Parameters": { "FunctionName": "confirmOrder", "Payload.$": "$" },
                         "End": true },
    "ReleaseStock":    { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
                         "Parameters": { "FunctionName": "releaseStock", "Payload.$": "$.reservation" },
                         "ResultPath": "$.release",
                         "Next": "RefundCharge" },
    "RefundCharge":    { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
                         "Parameters": { "FunctionName": "refundCard", "Payload.$": "$.charge" },
                         "ResultPath": "$.refund",
                         "Next": "FailOrder" },
    "FailOrder":       { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
                         "Parameters": { "FunctionName": "failOrder", "Payload.$": "$" },
                         "End": true }
  }
}

Production orchestrator options

Temporal — open source, code-first workflows in TypeScript / Go / Java / Python. Spun out of Uber’s Cadence project. The default choice for new orchestration in 2026.
AWS Step Functions — managed service, JSON state machines, deep AWS integration. The default choice on AWS if you don’t want to run infra.
Camunda — BPMN-driven, popular in regulated enterprise environments where business analysts edit the diagram.
Apache Airflow — primarily a data pipeline orchestrator, but fine for non-realtime sagas (batch reconciliation, reporting workflows).
Microsoft Orleans — virtual actor framework for .NET with built-in transactions; it is an actor runtime rather than a workflow engine, so it fits best when the saga maps naturally onto actors.
Eventuate Tram / Axon — saga support layered on top of event-sourcing and CQRS frameworks.

Implementing Compensations

Why Compensations Are Hard

The Problem: The forward path of the saga gets all the design attention. Compensations are written last, tested in isolation, and discovered to be wrong during the first real outage.

The Solution: Treat each compensation as a first-class business operation. Design it before you ship the forward step. Make it idempotent. Decide what to do when it itself fails.

Three properties every compensation must have

Semantically correct. “Refund the charge” is correct. “Delete the row in the payments table” is not — that erases the audit trail and may break downstream reporting.
Idempotent. The orchestrator may retry the compensation if its first attempt times out. Running “refund $99” twice should not refund $198.
Commutative with neighbors when possible. If two compensations could run in either order without changing the outcome, you have more retry flexibility.

An idempotent compensation handler

# refund_handler.py — the “C” for the payment step
from dataclasses import dataclass
import stripe

# `db` is assumed to be this service's database handle (connection setup not shown).

@dataclass
class RefundCommand:
    saga_id: str
    charge_id: str
    amount_cents: int
    idempotency_key: str   # derived from saga_id + step name

def handle_refund(cmd: RefundCommand) -> str:
    # 1. Check the local idempotency table FIRST.
    existing = db.fetch_one(
        "SELECT refund_id FROM compensation_log WHERE idempotency_key = %s",
        cmd.idempotency_key,
    )
    if existing:
        return existing["refund_id"]   # already done; safe to return

    # 2. Stripe also accepts an idempotency key — pass ours through.
    #    If Stripe has already seen this key, it returns the original refund.
    refund = stripe.Refund.create(
        charge=cmd.charge_id,
        amount=cmd.amount_cents,
        idempotency_key=cmd.idempotency_key,
    )

    # 3. Record locally so future retries short-circuit at step 1.
    db.execute(
        "INSERT INTO compensation_log (idempotency_key, saga_id, refund_id, "
        "created_at) VALUES (%s, %s, %s, NOW())",
        cmd.idempotency_key, cmd.saga_id, refund.id,
    )
    return refund.id

Notice the layering: the local compensation_log table is the first defense, the upstream provider’s idempotency key is the second. Either alone would mostly work; both together is what survives a Stripe outage in the middle of a refund retry.

When the compensation itself fails

This is the “dead letter” of sagas. The forward path failed, the orchestrator started compensating, and now a compensation is failing too. Options, roughly in order of how often they’re used:

Strategy	When To Use	Trade-off
Retry forever with backoff	Compensation is genuinely transient (network blip, deploy)	Saga state hangs “in compensation” until it resolves
Park to manual queue	Compensation logic detects an inconsistency that needs human eyes	You need someone on call who can read the queue
Forward-only fix	Some operations literally can’t be undone (email already sent)	Decide ahead of time what the “sorry” user experience is
Open a support ticket automatically	Compensation involves money that has to be made right	Customer trust depends on follow-up SLA
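The first two rows of the table combine naturally: retry with backoff while the failure looks transient, then park the saga for a human once retries run out or the failure is clearly not retryable. The sketch below is one way to express that; `TransientError`, `compensate_with_retry`, and the `park` callback are hypothetical names, and the attempt count and delays are arbitrary defaults.

```python
# Sketch: bounded retry with exponential backoff, then park for manual handling.
import time
from typing import Callable


class TransientError(Exception):
    """Raised by a compensation when a later retry may succeed (timeout, 503)."""


def compensate_with_retry(
    compensation: Callable[[], None],
    park: Callable[[str], None],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "COMPENSATED" on success, or "PARKED" after handing off to a human."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            compensation()
            return "COMPENSATED"
        except TransientError as exc:
            if attempt == max_attempts:
                park(f"retries exhausted after {attempt} attempts: {exc}")
                return "PARKED"
            # Exponential backoff, capped so a long outage does not grow the wait forever.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
        except Exception as exc:
            # Anything else is treated as needing human eyes rather than more retries.
            park(f"non-retryable failure: {exc}")
            return "PARKED"
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky_refund() -> None:
    """Fails twice with a transient error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("payment provider timeout")


delays: list[float] = []
parked: list[str] = []
outcome = compensate_with_retry(flaky_refund, parked.append, sleep=delays.append)
print(outcome, delays, parked)
```

Passing `sleep` in as a parameter keeps the example fast to run and makes the backoff schedule easy to assert on in tests.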

Some things have no compensation

You sent the order confirmation email. There is no “unsend.” You shipped the package. There is no “unship” — just a future return. Identify these one-way operations early and place them after the saga’s point of no return, never in the middle. The classic pattern: send the email only after every other step has confirmed success.

Common Failure Modes

Sagas fail in distinctive ways. The patterns below show up in every production saga system; planning for them is the difference between an outage and a slightly-degraded afternoon.

Lost commit acknowledgement

The participant service committed locally, then the network dropped before the orchestrator received the response. The orchestrator thinks the step failed and either retries or starts compensating. If you wired idempotency in at design time, the retry is a no-op and the compensation is correctly preceded by an “is there anything to compensate?” check. If you didn’t, you double-charge a card or release inventory you never reserved.
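A participant that survives this failure mode might look like the sketch below: the forward command is keyed by an idempotency key, and the compensation first checks whether there is anything to undo. `InventoryService` and its in-memory dictionaries are hypothetical stand-ins for a real service and its database.

```python
# Sketch: idempotent reserve plus a compensation that checks before it undoes.
class InventoryService:
    def __init__(self, stock: dict[str, int]) -> None:
        self.stock = dict(stock)
        self._reservations: dict[str, tuple[str, int]] = {}  # key -> (sku, qty)
        self._released: set[str] = set()

    def reserve(self, idempotency_key: str, sku: str, qty: int) -> str:
        """Apply the reservation once; a retry with the same key is a no-op."""
        if idempotency_key in self._reservations:
            return idempotency_key
        if self.stock.get(sku, 0) < qty:
            raise ValueError(f"insufficient stock for {sku}")
        self.stock[sku] -= qty
        self._reservations[idempotency_key] = (sku, qty)
        return idempotency_key

    def release(self, idempotency_key: str) -> bool:
        """Undo a reservation if one exists; return False when there is nothing to undo."""
        if idempotency_key not in self._reservations:
            return False          # the forward step never committed here
        if idempotency_key in self._released:
            return False          # already compensated; a retry changes nothing
        sku, qty = self._reservations[idempotency_key]
        self.stock[sku] += qty
        self._released.add(idempotency_key)
        return True


inventory = InventoryService({"SKU-7": 5})
key = "saga-42:reserve_stock"

inventory.reserve(key, "SKU-7", 1)
inventory.reserve(key, "SKU-7", 1)      # orchestrator retried after a lost acknowledgement
after_retry = inventory.stock["SKU-7"]

first_release = inventory.release(key)
second_release = inventory.release(key)  # compensation retried
never_reserved = inventory.release("saga-99:reserve_stock")

print(after_retry, first_release, second_release, never_reserved, inventory.stock)
```

Because released keys stay recorded, a late retry of `reserve` for a saga that has already been compensated does not take stock again.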

Long-running sagas with mid-flight customer action

A travel booking saga that says “reserve the seat, hold for 5 minutes for payment confirmation, then book” is a saga that lives for minutes. During those minutes, the customer might cancel, the seat might disappear, the payment provider might 5xx. Your saga state needs to be:

Persistent — surviving orchestrator restarts (Temporal does this for free; rolling your own means a database row per saga instance, updated after every step).
Cancelable — an explicit signal the customer can send (“I changed my mind”) that runs compensations in flight.
Time-boxed — if no progress in N minutes, force-compensate and free held resources.
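If you roll your own, the three properties above translate into a small state record plus a decision function that the orchestrator consults between steps. The sketch below is one possible shape; `SagaState`, `next_action`, and the field names are hypothetical, and writing the record to durable storage is left out.

```python
# Sketch: the per-saga record and the check an orchestrator runs between steps.
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class SagaState:
    saga_id: str
    deadline: datetime                      # time-boxed: compensate once this passes
    completed_steps: list[str] = field(default_factory=list)
    cancel_requested: bool = False          # cancelable: set when the customer signals
    status: str = "RUNNING"                 # RUNNING, COMPENSATING, CONFIRMED, FAILED


def next_action(state: SagaState, now: datetime) -> str:
    """Decide what the orchestrator should do before starting the next step."""
    if state.status != "RUNNING":
        return "NONE"
    if state.cancel_requested:
        return "COMPENSATE_CANCELED"
    if now >= state.deadline:
        return "COMPENSATE_TIMED_OUT"
    return "CONTINUE"


started = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
state = SagaState(
    saga_id="saga-42",
    deadline=started + timedelta(minutes=5),
    completed_steps=["reserve_seat"],
)

print(next_action(state, started + timedelta(minutes=1)))
print(next_action(state, started + timedelta(minutes=6)))
```

Keeping the decision in a pure function of the stored state and the current time means a restarted orchestrator reaches the same conclusion as the one that crashed.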

Hot-spot lookup tables

Every saga reads order_status. Every step writes order_status. Suddenly the orders table is the bottleneck of the entire business. Mitigations: shard by order ID, keep saga state out of the order table (separate saga_state table or workflow engine), use append-only event sourcing instead of in-place status updates.

Cancellation arrives mid-saga

Step 2 just succeeded. Step 3 is in flight. The customer hits “Cancel.” What happens? In a well-designed saga: the cancel is a signal that gets queued; the orchestrator finishes the current step (or aborts it cleanly), then runs compensations for whatever has been done. In a badly designed one: the cancel races with step 3, both partially succeed, and you have an order that is canceled but also shipped.

// Temporal makes this clean — signals are first-class
import { defineSignal, setHandler } from "@temporalio/workflow";

export const cancelSignal = defineSignal("cancel");

export async function orderSaga(order: Order) {
    let canceled = false;
    setHandler(cancelSignal, () => { canceled = true; });

    const compensations: (() => Promise<void>)[] = [];

    try {
        if (canceled) throw new Error("canceled before start");
        const chargeId = await chargeCard(order);
        compensations.unshift(() => refundCard(chargeId));

        if (canceled) throw new Error("canceled after charge");
        const reservationId = await reserveStock(order);
        compensations.unshift(() => releaseStock(reservationId));

        // ... rest of saga
    } catch (err) {
        for (const c of compensations) await c();
        throw err;
    }
}

Choosing Choreography vs Orchestration

Both are correct. The right answer depends on the shape of your system, not on which one is fashionable.

Factor	Choreography Wins	Orchestration Wins
Number of services	2–4 participants	5+ participants
Branching logic	Linear flow, mostly happy path	If/else, parallel branches, loops
Long-running waits	Seconds to minutes	Hours, days, human approvals
Observability needs	You’re fine reading Kafka	You want a dashboard with state history
Team ownership	One team per service, no shared owner	One team owns the end-to-end business process
Tooling cost	You already have Kafka	You can run Temporal / pay for Step Functions
Compensation complexity	Simple per-event reactions	Multi-step compensations with their own decisions
Audit / compliance	Event log is the audit trail	Workflow state history is the audit trail

The pragmatic rule of thumb

Start with choreography for your first one or two sagas. Once you find yourself debugging an order by tailing four Kafka topics simultaneously, switch that saga to orchestration. Keep the others as-is. Most large systems run a mix — choreography for high-volume simple flows (publish-and-react), orchestration for the high-value complex ones (checkout, onboarding, fraud review).

Real-World Examples

Amazon runs sagas at the scale of its retail business. Werner Vogels has talked publicly about how Amazon’s services use saga-style coordination — each step is a local commit at one service, and compensation is a deliberate forward action. The company’s investment in Step Functions exists in part because they needed an internal-grade orchestrator first.

Uber built Cadence, which it open-sourced in 2017, to coordinate driver assignment, fare calculation, and trip lifecycle workflows that can span hours (a long ride) or weeks (a multi-step refund). Its original creators later left to found Temporal, which is now used by Snap, Coinbase, Datadog, Box, Stripe, and many others. Uber’s production workflows still run on Cadence; Temporal is what most new users adopt.

AWS Step Functions case studies include Capital One (loan origination), Liberty Mutual (claims), and Coca-Cola Andina (logistics). The common pattern: a process with 5–15 steps, each a Lambda or service call, with explicit compensation paths and a need for the legal/compliance team to be able to point at the state machine diagram.

Microsoft Project Orleans introduced “virtual actors with transactions,” embedding saga semantics directly into the actor model. Used internally for Halo presence services and Skype messaging features.

Booking.com and Airbnb have both published on saga-driven reservations — Booking specifically discusses choreography via Kafka, Airbnb has talked about orchestration via internal workflow engines for guest-host coordination.

Eventuate (Chris Richardson’s framework) and Axon Framework are two well-known libraries that bring saga support to event-driven JVM applications. They’re used heavily in financial services where event sourcing was already the architectural baseline.

Best Practices

The short list

Design compensations before you ship the forward path. If you can’t describe how to undo step N, don’t add step N to the saga. Park it after the commit point or rethink the design.
Make every step idempotent. Use a deterministic idempotency key (saga ID + step name). The participant should be able to receive the same command twice and produce one effect.
Persist saga state. Either let Temporal / Step Functions / Camunda do it, or have a dedicated saga_state table that survives orchestrator restarts. Saga state in memory is saga state you will lose.
Use the outbox pattern. When a service commits its local transaction, write the “next event” into an outbox table in the same transaction. A separate process publishes the outbox to the broker. This is how you avoid “committed locally but failed to publish.”
Distinguish business failures from infrastructure failures. Card declined ≠ Stripe API down. The first is a saga “failure” that triggers compensation; the second is a transient that retries. Misclassify and you’ll refund customers because Stripe had a 500.
Cap saga duration. Every saga gets a deadline (5 minutes for checkout, 7 days for refunds). When the deadline expires, force-compensate. Sagas that hang forever are how money gets locked away from customers.
Instrument every step. Per-step success / failure / latency / retry-count metrics. A dashboard showing “sagas in compensation right now” is the single most useful operational view.
Run game days for compensations. Inject a failure at step N in staging and watch the compensations fire. Most teams discover their compensation paths are broken the first time they actually run.
One team owns the saga. Even in a choreography setup, somebody has to own “what happens when the customer places an order.” Without an owner, the saga drifts and nobody’s job is to fix it.
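Of the practices above, the outbox pattern is the one that most benefits from seeing the transaction boundary in code. The sketch below uses SQLite from the standard library as a stand-in for the service's database; the table layout, `create_order`, and `relay_outbox` are hypothetical, and `publish` stands for whatever sends to your broker.

```python
# Sketch of the outbox pattern: the state change and the event commit together.
import json
import sqlite3
from typing import Callable


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    """Insert the order and its event in one transaction: both commit or neither."""
    event = {"type": "OrderCreated", "order_id": order_id}
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PENDING')", (order_id,))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_outbox(conn: sqlite3.Connection, publish: Callable[[dict], None]) -> int:
    """Publish unpublished rows in order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))   # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


demo_conn = sqlite3.connect(":memory:")
init_schema(demo_conn)
create_order(demo_conn, "o-1")

published: list[dict] = []
sent_first = relay_outbox(demo_conn, published.append)
sent_second = relay_outbox(demo_conn, published.append)
print(sent_first, sent_second, published)
```

A crash between `publish` and the `UPDATE` means the event is sent again on the next pass, so this gives at-least-once delivery and still relies on idempotent consumers.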

The single most useful sentence about sagas

You will never have a database transaction across services. The saga pattern is not a workaround — it is the acknowledgement that the constraint is real, and the discipline of designing the reverse path with the same care as the forward one. Treat compensations as features, not as failure handling, and your distributed transactions will look surprisingly mundane.