
Saga Pattern

Coordinate a multi-service workflow with compensating actions instead of one big transaction.

A business workflow often spans several services: order, payment, inventory, shipping, email, loyalty points, and fraud checks. Each service owns its own database, so you cannot wrap the whole workflow in one ACID transaction. The Saga pattern coordinates the work as a sequence of local transactions, with explicit compensating actions that undo completed steps when a later step fails.

🔭Think of it like…

Booking a vacation is a saga. You reserve a flight, then a hotel, then a car. There is no single database transaction across the airline, hotel, and rental company. If the car is unavailable, you cancel the hotel and flight. Those cancellations are compensations: real business actions that move the world back toward a consistent outcome.

The problem: no one ACID transaction across services

In a monolith with one database, checkout can be one transaction: insert order, decrement inventory, insert payment, commit. In a microservice architecture, those records live behind different services and databases. Holding a global lock across all of them would be slow, fragile, and often impossible. Two-phase commit exists, but many modern services and message brokers avoid it because it couples availability to every participant.

why a single transaction does not fit

Order DB transaction starts
  → call Payment service       # remote network call
  → call Inventory service     # remote network call
  → call Shipping service      # remote network call
Order DB transaction commits?

Problems:
  - remote calls can hang while locks are held
  - each service owns its own database
  - one participant outage blocks the whole transaction
  - rollback cannot undo external side effects like a card authorization

The saga answer is to accept that the workflow will move through intermediate states. The system becomes eventually consistent instead of instantly atomic. Users may briefly see "payment captured, shipment pending", but the workflow has a clear path to completion or compensation.

The core idea

A saga is a state machine of local transactions. Each forward step commits in one service. If a later step fails, compensating steps run in reverse order for the steps that already committed.

How it works: local transactions plus compensations

Every saga step needs two pieces of design: the forward action and the compensation. The compensation is not always a perfect undo. A refund does not erase the fact that a charge happened; it creates a new financial event that balances it. A release-inventory action does not pretend the reservation never existed; it makes the units available again.

saga shape

Step 1: create order       Compensation: cancel order
Step 2: authorize payment  Compensation: void/refund payment
Step 3: reserve inventory  Compensation: release inventory
Step 4: create shipment    Compensation: cancel shipment if possible

If step 3 fails:
  run compensation for step 2
  run compensation for step 1
  mark saga failed with a reason
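
The shape above can be sketched as a small runner that executes forward actions in order and, on the first failure, calls the compensations of the already-committed steps in reverse. This is a minimal in-process illustration, not a production coordinator: `run_saga`, `Step`, and the stub functions are hypothetical names, and the "services" are plain Python callables.

```python
from typing import Callable, List, Tuple

# (name, forward action, compensation); all names here are illustrative.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[str, List[str]]:
    """Run steps in order; on failure, compensate committed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            # Only steps that already committed are compensated, newest first.
            for done_name, _, undo in reversed(completed):
                undo()
                log.append(f"compensated {done_name}")
            return "FAILED", log
        completed.append((name, action, compensation))
        log.append(f"{name} ok")
    return "COMPLETED", log


def noop() -> None:
    """Stand-in for a local transaction that succeeds."""


def fail_reserve() -> None:
    raise RuntimeError("out of stock")


status, log = run_saga([
    ("create order", noop, noop),
    ("authorize payment", noop, noop),
    ("reserve inventory", fail_reserve, noop),
    ("create shipment", noop, noop),
])
```

Here step 3 fails, so the runner compensates step 2 and then step 1, and step 4 never starts. A real saga would also persist this progress so that a crash in the middle of compensation can be resumed.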

Compensations must be idempotent

Sagas run over unreliable networks and queues, so both forward steps and compensations may be retried. A refund command might be delivered twice. A cancel-order command might race with an already-cancelled order. Each handler should use an idempotency key or a state check so repeated attempts have the same final effect.

Forward step	Compensation	Idempotency rule
Create pending order	Cancel order	Cancel only if not already completed or cancelled
Capture payment	Refund payment	Use one refund key per payment capture
Reserve inventory	Release reservation	Release by reservation ID; no-op if already released
Create shipment	Cancel shipment	No-op if carrier already cancelled it
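
As one way to picture the "one refund key per payment capture" rule, the sketch below records each refund under its idempotency key and returns the stored result when the same key arrives again. `PaymentLedger` is a hypothetical in-memory stand-in for a payment service's database, and amounts are assumed to be integer cents.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PaymentLedger:
    """Hypothetical in-memory stand-in for a payment service's storage."""
    captured: Dict[str, int] = field(default_factory=dict)  # payment_id -> cents
    refunds: Dict[str, int] = field(default_factory=dict)   # refund_key -> cents

    def refund(self, payment_id: str, refund_key: str) -> int:
        # Duplicate delivery: this key was already processed, so return the
        # recorded result instead of refunding a second time.
        if refund_key in self.refunds:
            return self.refunds[refund_key]
        amount = self.captured[payment_id]
        self.refunds[refund_key] = amount
        return amount

    def total_refunded(self) -> int:
        return sum(self.refunds.values())


ledger = PaymentLedger(captured={"pay-1": 2500})
first = ledger.refund("pay-1", refund_key="refund:pay-1")
second = ledger.refund("pay-1", refund_key="refund:pay-1")  # redelivered command
```

Both calls report the same refund, and the ledger holds a single refund entry. In a real service the key check and the refund write would need to happen in one local transaction, for example behind a unique constraint on the key.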

Choreography vs orchestration

There are two common ways to drive a saga. In choreography, services react to events and publish the next event. In orchestration, a coordinator commands each step and records the saga state. Both can work; the trade-off is where the workflow logic lives.

Dimension	Choreography	Orchestration
Control flow	Emerges from services reacting to events	Explicit in a coordinator
Coupling	No central coordinator; services are coupled through shared event contracts	Services depend on coordinator commands
Visibility	Harder to see the whole flow in one place	One state machine shows progress and failures
Best for	Simple flows with few steps and low branching	Complex flows with many failure paths
Failure handling	Each service must know what event to publish next	Coordinator decides retry and compensation order
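
To make the choreography column concrete, here is a rough sketch in which each service subscribes to one event and publishes the next, with no coordinator involved. `EventBus` is a hypothetical synchronous in-memory bus standing in for a real message broker, and the handlers are stubs that skip the actual local transactions.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Hypothetical synchronous in-memory bus; a real system would use a broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(event)


bus = EventBus()
orders: Dict[str, str] = {}


# Each service knows only which event it reacts to and which one it emits.
def payment_service(event: Event) -> None:
    bus.publish("PaymentCaptured", event)


def inventory_service(event: Event) -> None:
    bus.publish("InventoryReserved", event)


def shipping_service(event: Event) -> None:
    bus.publish("ShipmentCreated", event)


def order_service(event: Event) -> None:
    orders[event["order_id"]] = "CONFIRMED"


bus.subscribe("OrderCreated", payment_service)
bus.subscribe("PaymentCaptured", inventory_service)
bus.subscribe("InventoryReserved", shipping_service)
bus.subscribe("ShipmentCreated", order_service)

orders["o-1"] = "PENDING"
bus.publish("OrderCreated", {"order_id": "o-1"})
```

Notice that the overall flow is visible only by reading all four subscriptions together, which is the visibility cost listed in the table.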

orchestrated order saga

OrderSagaCoordinator:
  start(orderId):
    create_order(orderId)
    authorize_payment(orderId)
    reserve_inventory(orderId)
    create_shipment(orderId)
    mark_order_confirmed(orderId)

  on_failure(failed_step):
    compensate_completed_steps_in_reverse_order()
    mark_order_failed(orderId)
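
The pseudocode above could be fleshed out roughly as follows. This sketch focuses on the part that distinguishes orchestration: the coordinator keeps an explicit state record per saga. `SagaState`, `Status`, and the in-memory `store` dictionary are assumptions made for illustration; a real coordinator would keep this record in a durable table.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List

Action = Callable[[str], None]


class Status(Enum):
    RUNNING = "RUNNING"
    CONFIRMED = "CONFIRMED"
    FAILED = "FAILED"


@dataclass
class SagaState:
    order_id: str
    status: Status = Status.RUNNING
    completed: List[str] = field(default_factory=list)
    last_error: str = ""


class OrderSagaCoordinator:
    def __init__(self, steps: Dict[str, Action], compensations: Dict[str, Action]) -> None:
        self.steps = steps                      # dict insertion order is the step order
        self.compensations = compensations
        self.store: Dict[str, SagaState] = {}   # stand-in for a durable saga table

    def start(self, order_id: str) -> SagaState:
        state = self.store.setdefault(order_id, SagaState(order_id))
        if state.status is not Status.RUNNING:  # finished sagas are not re-run
            return state
        for name, action in self.steps.items():
            if name in state.completed:         # skip work recorded before a restart
                continue
            try:
                action(order_id)
            except Exception as exc:
                state.last_error = f"{name}: {exc}"
                self._compensate(state)
                return state
            state.completed.append(name)
        state.status = Status.CONFIRMED
        return state

    def _compensate(self, state: SagaState) -> None:
        for name in reversed(state.completed):
            self.compensations[name](state.order_id)
        state.status = Status.FAILED


calls: List[str] = []


def record(label: str) -> Action:
    return lambda order_id: calls.append(f"{label}({order_id})")


def fail_shipment(order_id: str) -> None:
    raise RuntimeError("carrier unavailable")


coordinator = OrderSagaCoordinator(
    steps={
        "create_order": record("create_order"),
        "authorize_payment": record("authorize_payment"),
        "reserve_inventory": record("reserve_inventory"),
        "create_shipment": fail_shipment,
    },
    compensations={
        "create_order": record("cancel_order"),
        "authorize_payment": record("void_payment"),
        "reserve_inventory": record("release_inventory"),
        "create_shipment": record("cancel_shipment"),
    },
)
state = coordinator.start("o-1")
```

Because the state record lists completed steps and the last error, an operator can see where the saga stopped without reading logs from four services.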

Beginner rule of thumb

Use choreography for small, obvious flows. Prefer orchestration once you need branching, timeouts, human-visible status, retries, or clear debugging, because a coordinator makes the saga state explicit.

Worked example: order → payment → inventory → shipping

Imagine an ecommerce checkout. The business wants either a confirmed order with payment, inventory, and shipment, or a clean failure where the customer is not charged and inventory is not stuck. The saga records each step, retries transient failures, and compensates permanent failures.

successful order saga

1. Order service:
     local transaction → create order {status: "PENDING"}
     publish OrderCreated

2. Payment service:
     local transaction → authorize/capture payment
     publish PaymentCaptured

3. Inventory service:
     local transaction → reserve SKU units
     publish InventoryReserved

4. Shipping service:
     local transaction → create shipment label
     publish ShipmentCreated

5. Order service:
     local transaction → mark order {status: "CONFIRMED"}

failure and compensation

Inventory reservation fails after payment succeeded:

forward history:
  order created
  payment captured
  inventory reserve failed

compensation:
  refund payment using refund idempotency key
  cancel order
  notify user that checkout failed

final state:
  order = FAILED
  payment = REFUNDED
  inventory = unchanged

Notice that the saga does not hide intermediate states. For a short period, the order may be pending while payment is captured and inventory is not yet reserved. That is acceptable only if the product and support teams understand the states and the system has repair jobs for stuck sagas.

Reliability: events, outbox, retries, and timeouts

Sagas need reliable messaging. If a service commits its local database transaction but crashes before publishing the event that starts the next step, the saga can get stuck. The common fix is the transactional outbox pattern: write the business row and an outbox event in the same local transaction, then let a separate relay publish the event asynchronously, either by polling the outbox table or by tailing the database log with change data capture (CDC).
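
A minimal sketch of the outbox idea, using Python's built-in `sqlite3` module with an in-memory database. The table layout, `create_order`, and `relay_once` are illustrative assumptions rather than a standard schema, and the relay shown here is a simple poller.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)"
)


def create_order(order_id: str) -> None:
    # One local transaction: both rows commit together or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(publish) -> int:
    """Publish unpublished outbox rows, then mark them; returns rows sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        # A crash between publish and the UPDATE causes a redelivery later,
        # which is why consumers must be idempotent.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
create_order("o-1")
relay_once(lambda event_type, payload: sent.append((event_type, payload)))
```

The relay gives at-least-once delivery: an event is never lost after the local commit, but it can be published more than once.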

Retries: transient failures should retry with backoff and jitter. Permanent business failures should trigger compensation.
Timeouts: a saga step that never responds needs a deadline. After the deadline, the coordinator retries, queries state, or compensates.
Idempotent handlers: every command and event handler should tolerate duplicate delivery.
Observability: store saga ID, current step, completed steps, compensation status, retry count, and last error.
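
The retry rule in the list above might look like the sketch below: transient failures are retried with exponential backoff and random jitter, while any other exception propagates immediately so the caller can compensate. `TransientError` and `call_with_retry` are hypothetical names, and the delay values are arbitrary defaults.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as timeouts."""


def call_with_retry(
    action: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted; the caller should compensate
            # Exponential backoff capped at max_delay, with a random delay
            # between zero and the cap ("full jitter").
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")


attempts = {"n": 0}
delays: List[float] = []


def flaky_reserve() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("inventory timeout")
    return "reserved"


# Pass delays.append as the sleep function so the example runs instantly.
result = call_with_retry(flaky_reserve, sleep=delays.append)
```

Injecting the `sleep` function keeps the example fast and testable. A permanent business failure, such as "out of stock", would be raised as a different exception type and skip the retry loop entirely.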

Compensation is a business decision

Some actions cannot be fully undone. A shipped package may require a return process, not a simple cancel. A sent email cannot be unsent. Design compensations with product, finance, and support teams, not only with code.

Trade-offs, gotchas, and when to avoid sagas

Sagas buy availability and service autonomy at the cost of complexity. They are powerful when a workflow truly crosses service boundaries, but they are not a reason to split a simple transaction across services too early.

Concern	What can go wrong	Mitigation
Intermediate states	Users see pending or inconsistent status	Model states explicitly and explain them in UI
Duplicate messages	A step or compensation runs twice	Use idempotency keys and state checks
Lost events	Saga stalls after a local commit	Use outbox/CDC and reconciliation jobs
Compensation failure	Rollback path also fails	Retry, alert, and provide manual repair tools
Debugging	Events span many services	Trace with saga ID and central status views

Keep a single ACID transaction if all data belongs in one service and the transaction is short.
Use a saga when steps are independently owned, long-running, or have external side effects.
Build admin repair tools before production. Eventually consistent systems need operational escape hatches.

Key takeaways

A saga coordinates a multi-service workflow as local transactions plus compensating actions, not one global ACID transaction.
Choreography uses events between services; orchestration uses a coordinator that records state and commands each step.
Compensations are real business actions, often imperfect, and must be idempotent because they may be retried.
Sagas are eventually consistent: intermediate states are normal, so model them explicitly and make them observable.
Use outbox/CDC, idempotency keys, retries, timeouts, tracing, and repair jobs to make sagas reliable in production.

Each service owns its own database and may call external systems. Holding locks across remote calls would be slow and fragile, and many side effects cannot be rolled back by a database abort. A saga commits each local step and uses compensating actions if a later step fails.

If inventory reservation fails after payment has succeeded, the saga should compensate the completed payment step, usually by refunding or voiding the payment with an idempotent refund key, then cancel or fail the order and notify the user. The final state is consistent even though the system passed through a temporary paid-but-not-reserved state.

Orchestration is easier when the workflow has many steps, branches, retries, timeouts, and compensations. The coordinator provides one place to see the current state and decide what happens after each success or failure.
