Durable Execution: How Temporal, Restate, and DBOS Are Rethinking Distributed State



Durable Execution: How Temporal, Restate, and DBOS Are Rethinking Distributed State

Here’s a problem every distributed systems engineer knows: you write a multi-step process (charge a card, create a subscription, send a confirmation email, provision infrastructure). The card charge succeeds. Then your service crashes. Now you have a charged customer with no subscription, an unsent email, and zero visibility into where the process stopped.

The traditional solutions — idempotency keys, message queues, saga patterns, compensating transactions — work, but they require significant boilerplate, are error-prone to implement, and leave your business logic buried under infrastructure concerns.

Durable execution is a different approach: what if the framework guaranteed your code would run to completion, automatically resume after crashes, and retry failures — all while maintaining exactly-once semantics?

This is what Temporal, Restate, and DBOS are offering, and in 2026 the space has matured enough to evaluate seriously.

Distributed computing network Photo by Matthias Heyde on Unsplash


The Core Problem With Traditional Approaches

Message Queues + Saga: The Accidental Complexity Tax

A typical saga pattern for an e-commerce order:

# What you want to write:
def process_order(order):
    payment = charge_card(order.card, order.amount)
    subscription = create_subscription(order.user_id, order.plan)
    email = send_confirmation(order.user_id, payment.id)
    infra = provision_resources(order.user_id, order.plan)
    return OrderResult(payment, subscription, email, infra)

# What you actually have to write (saga + compensations + queue handlers):
# 400+ lines of state machine, event handlers, compensation logic,
# idempotency checks, retry logic, timeout handling, dead letter queues...

The saga implementation typically:

  • Defines a state machine with explicit transitions
  • Implements forward steps AND compensating steps for each
  • Handles idempotency by storing processed event IDs
  • Manages timeouts and dead letter queues
  • Provides observability into where any given saga currently sits

This is all necessary — but it’s infrastructure, not business logic. You’re writing plumbing.


Temporal: The Established Leader

Temporal (and its open-source predecessor Cadence from Uber) is the most battle-tested durable execution system. Used at Coinbase, Snap, Netflix, HashiCorp, and hundreds of others.

The Core Mental Model

In Temporal, you write normal code (called a workflow). The framework:

  1. Persists the workflow’s execution history to a durable store
  2. If the worker crashes, replays the history to reconstruct state
  3. Calls to activities (external systems) are retried with backoff automatically
  4. The workflow appears to execute as a single function, even across crashes and days
# temporalio SDK - Python
from datetime import timedelta
from temporalio import workflow, activity
from temporalio.common import RetryPolicy

# Activities: the calls that actually touch external systems
@activity.defn
async def charge_card(card_token: str, amount_cents: int) -> dict:
    # Temporal will retry this on failure with backoff
    result = await payment_gateway.charge(card_token, amount_cents)
    return {"payment_id": result.id, "status": result.status}

@activity.defn
async def create_subscription(user_id: str, plan: str) -> dict:
    sub = await billing_service.create_subscription(user_id, plan)
    return {"subscription_id": sub.id}

@activity.defn
async def send_confirmation_email(user_id: str, payment_id: str) -> bool:
    await email_service.send(user_id, template="order_confirmation", 
                             params={"payment_id": payment_id})
    return True

@activity.defn
async def provision_resources(user_id: str, plan: str) -> dict:
    resources = await infra_service.provision(user_id, plan)
    return {"resource_ids": resources}

# Workflow: your business logic — looks like sequential code
@workflow.defn
class ProcessOrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        
        retry_policy = RetryPolicy(
            maximum_attempts=5,
            initial_interval=timedelta(seconds=1),
            maximum_interval=timedelta(minutes=2),
            backoff_coefficient=2.0,
            non_retryable_error_types=["CardDeclinedError", "InvalidCardError"]
        )
        
        # Each activity call is automatically retried, idempotent, and durable
        payment = await workflow.execute_activity(
            charge_card,
            args=[order.card_token, order.amount_cents],
            schedule_to_close_timeout=timedelta(minutes=5),
            retry_policy=retry_policy
        )
        
        # If we crash here, Temporal will replay history and
        # know charge_card already succeeded — won't re-charge
        
        subscription = await workflow.execute_activity(
            create_subscription,
            args=[order.user_id, order.plan],
            schedule_to_close_timeout=timedelta(minutes=2)
        )
        
        # Run email and infra provisioning in parallel
        email_handle = workflow.execute_activity(
            send_confirmation_email,
            args=[order.user_id, payment["payment_id"]],
            schedule_to_close_timeout=timedelta(minutes=1)
        )
        
        infra_handle = workflow.execute_activity(
            provision_resources,
            args=[order.user_id, order.plan],
            schedule_to_close_timeout=timedelta(minutes=30)
        )
        
        email_result, infra_result = await asyncio.gather(email_handle, infra_handle)
        
        return OrderResult(
            payment_id=payment["payment_id"],
            subscription_id=subscription["subscription_id"],
            resources=infra_result["resource_ids"]
        )

Temporal for Long-Running Processes

Temporal workflows can run for years. This enables patterns that were previously very hard:

@workflow.defn
class SubscriptionLifecycleWorkflow:
    """A workflow that manages the entire lifecycle of a subscription — years potentially."""
    
    @workflow.run
    async def run(self, subscription_id: str, plan: str) -> None:
        
        # Wait for the first billing cycle (30 days)
        await asyncio.sleep(timedelta(days=30).total_seconds())
        
        while True:
            # Charge for renewal
            result = await workflow.execute_activity(
                charge_renewal,
                args=[subscription_id],
                retry_policy=RetryPolicy(maximum_attempts=3)
            )
            
            if result["status"] == "failed_permanently":
                # Grace period: try for 7 more days
                for day in range(7):
                    await asyncio.sleep(timedelta(days=1).total_seconds())
                    retry = await workflow.execute_activity(
                        retry_failed_payment,
                        args=[subscription_id]
                    )
                    if retry["status"] == "success":
                        break
                else:
                    # Cancel after grace period
                    await workflow.execute_activity(
                        cancel_subscription,
                        args=[subscription_id, "payment_failed"]
                    )
                    return
            
            # Wait for next billing cycle
            await asyncio.sleep(timedelta(days=30).total_seconds())

This runs durably — if your server restarts, the workflow resumes where it left off. No cron jobs, no database state machines, no timer management.


Restate: The Lightweight Challenger

Restate is newer than Temporal and takes a different approach: rather than a separate cluster, it’s a lightweight sidecar that intercepts your HTTP calls and adds durability.

// Restate - TypeScript SDK
import * as restate from "@restatedev/restate-sdk";

// Define a service
const orderService = restate.service({
  name: "OrderService",
  handlers: {
    
    processOrder: restate.handlers.handler(
      async (ctx: restate.Context, order: Order): Promise<OrderResult> => {
        
        // ctx.run() = durable execution — won't re-execute if we crash
        const payment = await ctx.run("charge-card", async () => {
          return await paymentGateway.charge(order.cardToken, order.amountCents);
        });
        
        // Parallel durable calls
        const [subscription, resources] = await Promise.all([
          ctx.run("create-subscription", () =>
            billingService.createSubscription(order.userId, order.plan)
          ),
          ctx.run("provision-resources", () =>
            infraService.provision(order.userId, order.plan)
          )
        ]);
        
        // Send email in background (fire-and-forget, still durable)
        ctx.send(emailService).sendConfirmation(order.userId, payment.id);
        
        return {
          paymentId: payment.id,
          subscriptionId: subscription.id,
          resourceIds: resources
        };
      }
    )
  }
});

Restate’s key differentiators:

  • No separate cluster required: runs as a sidecar or embedded library
  • Native HTTP/gRPC: works with your existing API stack
  • Virtual objects: stateful actors with automatic persistence
  • Simpler operations: lower operational complexity than Temporal

Restate Virtual Objects

// Stateful actor: each userId gets its own isolated state
const shoppingCart = restate.object({
  name: "ShoppingCart",
  handlers: {
    
    addItem: restate.handlers.object.exclusive(
      async (ctx: restate.ObjectContext, item: CartItem) => {
        // State is durable — survives restarts
        const items = (await ctx.get<CartItem[]>("items")) ?? [];
        items.push(item);
        ctx.set("items", items);
        return { itemCount: items.length };
      }
    ),
    
    checkout: restate.handlers.object.exclusive(
      async (ctx: restate.ObjectContext): Promise<CheckoutResult> => {
        const items = await ctx.get<CartItem[]>("items");
        if (!items?.length) throw new Error("Cart is empty");
        
        // Durable: won't double-charge on retry
        const result = await ctx.run("process-payment", () =>
          paymentService.charge(ctx.key, calculateTotal(items))
        );
        
        // Clear cart after successful checkout
        ctx.clear("items");
        
        // Durably schedule order fulfillment
        await ctx.send(orderService).fulfill(result.orderId);
        
        return result;
      }
    )
  }
});

DBOS: Database-Backed Durable Execution

DBOS takes the most radical approach: instead of a separate workflow engine, it uses your existing PostgreSQL as the durable execution backend. Workflows are stored as database rows; replaying them uses standard SQL.

# DBOS - Python SDK
from dbos import DBOS, workflow, step

dbos = DBOS()

@step()
def charge_card(card_token: str, amount: int) -> dict:
    return payment_gateway.charge(card_token, amount)

@step()
def create_subscription(user_id: str, plan: str) -> dict:
    return billing_service.create_subscription(user_id, plan)

@workflow()
def process_order(order_id: str, card_token: str, 
                  user_id: str, plan: str, amount: int) -> dict:
    
    # Each @step is automatically idempotent and retried on failure
    payment = charge_card(card_token, amount)
    subscription = create_subscription(user_id, plan)
    
    return {
        "order_id": order_id,
        "payment_id": payment["id"],
        "subscription_id": subscription["id"]
    }

DBOS advantages:

  • Zero new infrastructure: uses your existing PostgreSQL
  • Full SQL visibility: query workflow state with regular SQL
  • Transactional: workflow state changes participate in regular DB transactions
  • Lower operational cost: one fewer system to manage

Choosing the Right Framework

FactorTemporalRestateDBOS
MaturityHigh (production at scale)MediumEarly
Operational complexityHigh (Temporal cluster)LowLow
Long-running workflowsExcellentGoodGood
ScaleVery HighHighMedium
Language supportGo, Java, Python, TypeScript, .NET, PHPTypeScript, Kotlin, Java, Go, PythonPython, TypeScript
Best forComplex, long-running workflowsHTTP microservices, greenfieldPostgres-first shops

The Honest Recommendation

  • Established team, complex business processes, can manage infra: Temporal
  • Greenfield microservices, want low operational overhead: Restate
  • PostgreSQL-first, want simplest possible setup: DBOS
  • Just need reliable job queues: Sidekiq/BullMQ/Celery is probably fine

When Durable Execution Is Overkill

Not every problem needs durable execution. Use traditional approaches when:

  • Jobs complete in seconds and occasional duplicates are acceptable
  • Idempotency is naturally handled by your domain (e.g., idempotent HTTP APIs)
  • The saga logic is simple (2-3 steps, easy compensations)
  • Throughput requirements are very high (Temporal has latency overhead per step)

Durable execution shines when: workflows span minutes/hours/days, partial failure is catastrophic, the compensation logic is complex, or you need precise exactly-once semantics across unreliable external services.


Conclusion

Durable execution frameworks represent a genuine architectural shift: moving reliability guarantees from application code into the infrastructure layer. The result is business logic that actually reads like business logic, without the accident of distributed systems complexity layered on top.

Temporal is mature and proven at significant scale. Restate is elegant and simpler to operate. DBOS is intriguing for teams who want the minimum possible operational footprint.

If you’re currently maintaining complex saga implementations, dead-letter queue handlers, or distributed state machines, any of these frameworks likely represents a meaningful reduction in both code complexity and bug surface area. Start with a non-critical workflow, measure the operational overhead vs. the code simplification, and go from there.

The distributed systems problems don’t disappear — they just move to a place where experts have already solved them.

이 글이 도움이 되셨다면 공감 및 광고 클릭을 부탁드립니다 :)