Saga Pattern Made Easy

The following article explains the Saga pattern, commonly used to design microservice systems.

https://dev.to/temporalio/saga-pattern-made-easy-4j42

So the last time your family went to the local park everyone was talking about the Saga Design Pattern, and now you want to know what it is, if you should make it part of your own distributed system design, and how to implement it. As we all know, software design is all about fashionable1 trends2.

The case for sagas

If you’re wondering if the saga pattern is right for your scenario, ask yourself: does your logic involve multiple steps, some of which span machines, services, shards, or databases, for which partial execution is undesirable? Turns out, this is exactly where sagas are useful. Maybe you are checking inventory, charging a user’s credit card, and then fulfilling the order. Maybe you are managing a supply chain. The saga pattern is helpful because it basically functions as a state machine storing program progress, preventing multiple credit card charges, reverting if necessary, and knowing exactly how to safely resume in a consistent state in the event of power loss.

A common life-based example used to explain the way the saga pattern compensates for failures is trip planning. Suppose you are itching to soak up the rain in Duwamish territory, Seattle. You’ll need to purchase an airplane ticket, reserve a hotel, and get a ticket for a guided backpacking experience on Mount Rainier. All three of these tasks are coupled: if you’re unable to purchase a plane ticket there’s no reason to get the rest. If you get a plane ticket but have nowhere to stay, you’re going to want to cancel that plane reservation (or retry the hotel reservation or find somewhere else to stay). Lastly if you can’t book that backpacking trip, there’s really no other reason to come to Seattle so you might as well cancel the whole thing. (Kidding!)

Above: a simplistic model of compensating in the face of trip planning failures

There are many “do it all, or don’t bother” software applications in the real-world: if you successfully charge the user for an item but your fulfillment service reports that the item is out of stock, you’re going to have upset users if you don’t refund the charge. If you have the opposite problem and accidentally deliver items “for free,” you’ll be out of business. If the machine coordinating a machine learning data processing pipeline crashes but the follower machines carry on processing the data with nowhere to report their data to, you may have a very expensive compute resources bill on your hands3. In all of these cases having some sort of “progress tracking” and compensation code to deal with these “do-it-all-or-don’t-do-any-of-it” tasks is exactly what the saga pattern provides. In saga parlance, these sorts of “all or nothing” tasks are called long-running transactions. This doesn’t necessarily mean such actions run for a “long” time, just that they require more steps in logical time4 than something running locally interacting with a single database.

How do you build a saga?

A saga is composed of two parts:

Defined behavior for “going backwards” if you need to “undo” something (i.e., compensations)
Behavior for striving towards forward progress (i.e., saving state to know where to recover from in the face of failure)

The avid reader of this blog will remember I recently wrote a post about compensating actions. As you can see from above, compensations are but one half of the saga design pattern. The other half, alluded to above, is essentially state management for the whole system. The compensating actions pattern helps you know how to recover if an individual step (or in Temporal terminology, an Activity) fails. But what if the whole system goes down? Where do you start back up? Since not every step might have a compensation attached, you’d be forced to do your best guess based on stored compensations. The saga pattern keeps track of where you are currently so that you can keep driving towards forward progress.

So how do I implement sagas in my own code?

I’m so glad you asked.

leans forward

whispers in ear

This is a little bit of a trick question because by running your code with Temporal, you automatically get your state saved and retries on failure at any level. That means the saga pattern with Temporal is as simple as coding up the compensation you wish to take when a step (Activity) fails. The end.

The _why _behind this magic is Temporal, by design, automatically keeps track of the progress of your program and can pick up where it left off in the face of catastrophic failure. Additionally, Temporal will retry Activities on failure, without you needing to add any code beyond specifying a Retry Policy, e.g.,:

RetryOptions retryoptions = RetryOptions.newBuilder()
       .setInitialInterval(Duration.ofSeconds(1))
       .setMaximumInterval(Duration.ofSeconds(100))
       .setBackoffCoefficient(2)
       .setMaximumAttempts(500).build();