Durable Execution: The Pattern That's Quietly Eating Microservices Architecture




If you’ve ever debugged a distributed transaction that half-failed at 2 AM, you already understand why durable execution is having a moment. The idea is deceptively simple: write your business logic as ordinary code, and let the platform guarantee it runs to completion, even across crashes, restarts, timeouts, and network failures. No hand-wired message queues. No saga orchestrators. No dead letter queues to drain. Just code that finishes.



The Problem With “Just Use Queues”

The traditional answer to distributed reliability was: publish to a queue, consume idempotently, retry on failure. It works. But it has costs:

  • State is implicit: workflow state is scattered across queue messages, database rows, and service memory
  • Visibility is poor: “where is this order in the pipeline?” requires querying multiple systems
  • Partial failure is hard: if step 3 of 7 fails, do you restart from 1? From 3? What if step 2 had side effects?
  • Timeouts are painful: “wait 24 hours, then send a reminder” requires a scheduler, a cron job, or a timer service
  • Testing is miserable: integration testing a multi-queue pipeline requires a running broker

Durable execution flips the model. You write a workflow function in regular code, and the runtime handles persistence, retries, and replay.


How Durable Execution Works

The core mechanism is event sourcing + deterministic replay:

  1. Every workflow action (activity call, timer, signal) is recorded as an event in a durable history
  2. If the worker crashes, the workflow replays from history to reconstruct in-memory state
  3. Determinism constraints ensure replay produces the same state as the original execution
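# Temporal workflow: looks like normal Python
import asyncio
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Order, ShipArgs, FulfillmentResult and the three activities are defined elsewhere.


@workflow.defn
class OrderFulfillmentWorkflow:
    @workflow.run
    async def run(self, order: Order) -> FulfillmentResult:
        # Each activity is automatically retried on failure
        payment = await workflow.execute_activity(
            charge_payment,
            order.payment_info,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # This timer survives worker restarts (asyncio.sleep takes seconds)
        await asyncio.sleep(timedelta(hours=2).total_seconds())  # wait for warehouse pick

        shipment = await workflow.execute_activity(
            ship_order,
            ShipArgs(order_id=order.id, payment_id=payment.id),
            start_to_close_timeout=timedelta(minutes=5),  # illustrative value
            retry_policy=RetryPolicy(maximum_attempts=5),
        )

        # Completion is recorded once in history. The activity itself can run
        # more than once if a worker dies mid-send, so keep it idempotent.
        await workflow.execute_activity(
            send_confirmation_email,
            shipment,
            start_to_close_timeout=timedelta(seconds=30),  # illustrative value
        )

        return FulfillmentResult(shipment_id=shipment.id)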

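If the worker dies between charge_payment and ship_order, a new worker replays the history, sees that charge_payment already completed, reuses its recorded result instead of re-executing it, and resumes at the sleep timer. Replay never repeats the charge. Activities are still at-least-once while they run (a worker can die after charging but before reporting completion), so pass an idempotency key to the payment provider to rule out a double charge.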
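
The replay mechanism can be sketched in a few lines of plain Python. This is a toy model, not Temporal's implementation: History, WorkflowContext, and the stub activities are hypothetical names, and the in-memory list stands in for a durable store.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class History:
    """Append-only log of completed activity results (stand-in for durable storage)."""
    events: list = field(default_factory=list)


class WorkflowContext:
    """Serves recorded results first, then executes and records new activities."""

    def __init__(self, history: History) -> None:
        self._history = history
        self._cursor = 0  # index of the next recorded event to replay

    def execute_activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self._history.events):
            recorded_name, result = self._history.events[self._cursor]
            if recorded_name != name:
                raise RuntimeError(f"replay mismatch: {recorded_name} != {name}")
            self._cursor += 1
            return result  # replayed: the side effect does not run again
        result = fn()
        self._history.events.append((name, result))
        self._cursor += 1
        return result


calls = []  # tracks which side effects really executed


def charge_payment() -> str:
    calls.append("charge")
    return "pay_1"


def ship_order(payment_id: str) -> str:
    calls.append("ship")
    return f"ship_for_{payment_id}"


def order_workflow(ctx: WorkflowContext, crash_before_ship: bool = False) -> str:
    payment_id = ctx.execute_activity("charge_payment", charge_payment)
    if crash_before_ship:
        raise ConnectionError("simulated worker crash")
    return ctx.execute_activity("ship_order", lambda: ship_order(payment_id))


history = History()
try:
    order_workflow(WorkflowContext(history), crash_before_ship=True)
except ConnectionError:
    pass  # the worker died after the charge was recorded

# A fresh worker reruns the same function against the same history.
result = order_workflow(WorkflowContext(history))
print(result, calls)
```

The second run calls the workflow function from the top, yet charge_payment executes only once: its result comes out of the history.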


The Major Players in 2026

Temporal

The most mature platform. It grew out of Cadence, the workflow engine its founders built at Uber, and is now open source with a managed cloud service (Temporal Cloud). Supports Go, Java, Python, TypeScript, .NET.

Strengths:

  • Battle-tested at massive scale (Coinbase, Netflix, Stripe)
  • Rich SDK ecosystem
  • Strong observability (workflow history, search attributes)
  • temporal server start-dev for local development

Considerations:

  • Self-hosting means operating the Temporal server plus a persistence store (Cassandra, MySQL, or PostgreSQL)
  • Temporal Cloud pricing scales with workflow actions
  • Learning curve around determinism constraints

Restate

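The new challenger, built for cloud-native, serverless-first architectures. A single-binary Restate Server sits in front of your services and pushes invocations to ordinary HTTP handlers, rather than having workers poll a central cluster.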

// Restate: runs as a regular HTTP handler
import * as restate from "@restatedev/restate-sdk";

const orderWorkflow = restate.workflow({
  name: "OrderFulfillment",
  handlers: {
    run: async (ctx: restate.WorkflowContext, order: Order) => {
      const payment = await ctx.run("charge", () => chargePayment(order));
      
      await ctx.sleep(2 * 60 * 60 * 1000); // 2 hours, durable
      
      const shipment = await ctx.run("ship", () => shipOrder(order.id, payment.id));
      await ctx.run("notify", () => sendConfirmation(shipment));
      
      return { shipmentId: shipment.id };
    }
  }
});

Strengths:

  • Works natively with serverless (Lambda, Cloud Run, Deno Deploy)
  • No central cluster to operate — Restate Cloud or self-hosted lightweight server
  • Lower operational overhead than Temporal

Considerations:

  • Smaller community (but growing fast)
  • Less mature SDK ecosystem

AWS Step Functions (Express + Standard)

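AWS-native durable execution. State machines are defined in Amazon States Language (JSON; tooling such as CloudFormation also accepts YAML), with SDK integrations covering more than 200 AWS services.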

Strengths:

  • Zero infrastructure to manage
  • Native AWS integration (Lambda, ECS, DynamoDB, etc.)
  • Standard Workflows: exactly-once workflow execution, runs of up to one year
  • Express Workflows: high-throughput, at-least-once for asynchronous runs, up to five minutes per execution

Considerations:

  • State machine DSL is verbose and hard to test locally
  • Limited programming model compared to code-first approaches
  • Vendor lock-in
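
To make the DSL point concrete, here is a rough sketch of the same order flow as an Amazon States Language definition, built as a Python dict so it can be checked before deployment. The Lambda ARNs and state names are hypothetical, and undefined_targets is an illustrative helper, not an AWS API.

```python
import json

# Hypothetical Lambda ARNs, for illustration only.
CHARGE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:charge-payment"
SHIP_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ship-order"

state_machine = {
    "Comment": "Order fulfillment sketch",
    "StartAt": "ChargePayment",
    "States": {
        "ChargePayment": {
            "Type": "Task",
            "Resource": CHARGE_ARN,
            "TimeoutSeconds": 30,
            "Next": "WaitForWarehouse",
        },
        "WaitForWarehouse": {"Type": "Wait", "Seconds": 7200, "Next": "ShipOrder"},
        "ShipOrder": {
            "Type": "Task",
            "Resource": SHIP_ARN,
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 5}],
            "End": True,
        },
    },
}


def undefined_targets(definition: dict) -> set:
    """Return state names referenced by StartAt or Next that are never defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    targets.update(s["Next"] for s in states.values() if "Next" in s)
    return targets - set(states)


print(json.dumps(state_machine, indent=2))
print("dangling transitions:", undefined_targets(state_machine))
```

A structural check like this catches typos in transition names locally, but it does not exercise the tasks themselves, which is where the local-testing pain mostly lies.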

Cloudflare Workflows

New in 2025, built on top of Durable Objects. Designed specifically for Cloudflare Workers.

import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";

export class MyWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    const result = await step.do("charge-payment", async () => {
      return await chargePayment(event.payload);
    });
    
    await step.sleep("wait-for-warehouse", "2 hours");
    
    await step.do("ship-order", async () => {
      return await shipOrder(result.paymentId);
    });
  }
}

Strengths:

  • Serverless-native, runs at edge
  • No separate infrastructure
  • Automatic checkpointing

Considerations:

  • Cloudflare Workers ecosystem only
  • New — limited production track record

When to Use Durable Execution

Great Fits

  • Long-running business processes: order fulfillment, onboarding flows, approval chains
  • Human-in-the-loop workflows: “wait for user to confirm email, then proceed”
  • Effectively-once side effects: payment processing, inventory deduction (each step is recorded once; the activity behind it should still be idempotent)
  • Fan-out/fan-in: process 10,000 items in parallel, aggregate results
  • Scheduled follow-ups: “send reminder if no response in 48 hours”
  • Compensating transactions: saga pattern without the saga framework
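
The compensating-transactions item is the easiest one to show in code. Below is a minimal sketch, assuming each step is a (name, action, compensation) tuple; run_saga and the step names are hypothetical. Inside a durable workflow the same try/except shape works as-is, because the list of completed steps is rebuilt by replay after a crash.

```python
from typing import Callable

Step = tuple  # (name, action, compensation)


def run_saga(steps: list) -> list:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Compensate only what actually finished, newest first.
            for done_name, done_compensate in reversed(completed):
                done_compensate()
                log.append(f"undone:{done_name}")
            return log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return log


def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")


def noop() -> None:
    pass


saga_log = run_saga([
    ("reserve_inventory", noop, noop),
    ("charge_payment", noop, noop),
    ("ship_order", fail_shipping, noop),
])
print(saga_log)
```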

Not Ideal For

  • High-throughput, sub-millisecond operations: every step is persisted before the workflow moves on, which adds overhead
  • Simple request/response APIs: no need for durability on a stateless GET
  • Stream processing: use Kafka/Flink for high-volume event streams
  • Latency-critical paths: each step costs a round trip to the history store, and recovery after a crash adds replay time

Migration Path: From Queue-Based to Durable

You don’t have to rewrite everything. A practical migration:

Phase 1 — Wrap your queue consumers

Take an existing queue-based flow and wrap it in a workflow. The activities call your existing logic. You get observability and retry management immediately.

Phase 2 — Move state into workflow context

Replace the “order status” table that you update per-step with workflow signals and queries. The workflow history becomes your audit log.

Phase 3 — Add long-duration logic

Now that you have durable timers, implement the “retry in 24 hours” and “escalate if not resolved in 3 days” flows that were painful with cron.
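
What makes a timer durable is that the runtime persists an absolute deadline, not a countdown held in process memory. A toy sketch of that idea follows; TimerStore and its methods are hypothetical names, and a dict stands in for durable storage.

```python
from dataclasses import dataclass


@dataclass
class TimerRecord:
    timer_id: str
    fires_at: float  # absolute time, written once when the timer is first started


class TimerStore:
    """Stand-in for durable storage of timer deadlines."""

    def __init__(self) -> None:
        self._timers = {}

    def start(self, timer_id: str, now: float, delay_seconds: float) -> TimerRecord:
        # Idempotent: a restarted worker that re-issues the same timer
        # gets the original deadline back, not a fresh countdown.
        if timer_id not in self._timers:
            self._timers[timer_id] = TimerRecord(timer_id, now + delay_seconds)
        return self._timers[timer_id]

    def remaining(self, timer_id: str, now: float) -> float:
        return max(0.0, self._timers[timer_id].fires_at - now)


DAY = 24 * 60 * 60
store = TimerStore()
store.start("escalate-ticket", now=0.0, delay_seconds=3 * DAY)

# One day later the worker restarts and replays the same "start timer" call.
record = store.start("escalate-ticket", now=1.0 * DAY, delay_seconds=3 * DAY)
print(record.fires_at, store.remaining("escalate-ticket", now=1.0 * DAY))
```

After the restart, two days remain instead of three, which is the behavior an "escalate if not resolved in 3 days" rule needs.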


Observability Built In

One underrated benefit: every workflow has a full history of what happened and when.

Workflow: order-fulfillment/ord_123abc
Status: Running (step: wait-for-warehouse)
Started: 2026-06-15T08:23:11Z
─────────────────────────────────────
[08:23:11] WorkflowExecutionStarted
[08:23:12] ActivityScheduled: charge-payment
[08:23:12] ActivityStarted: charge-payment
[08:23:13] ActivityCompleted: charge-payment → {paymentId: "pay_xyz"}
[08:23:13] TimerStarted: 2h (fires at 10:23:13)

Debugging production incidents becomes reading a log rather than correlating events across five services.
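
Because the history is an ordered list of events, "where is this order?" reduces to folding over it. A small sketch using the event names from the sample above; current_status is a hypothetical helper, and WorkflowExecutionCompleted is an assumed name for the terminal event.

```python
def current_status(history: list) -> str:
    """Fold (event_type, detail) pairs into a one-line status."""
    status = "Not started"
    for event_type, detail in history:
        if event_type == "WorkflowExecutionStarted":
            status = "Running"
        elif event_type in ("ActivityScheduled", "ActivityStarted"):
            status = f"Running (activity: {detail})"
        elif event_type == "ActivityCompleted":
            status = "Running"
        elif event_type == "TimerStarted":
            status = f"Running (timer: {detail})"
        elif event_type == "WorkflowExecutionCompleted":
            status = "Completed"
    return status


sample_history = [
    ("WorkflowExecutionStarted", ""),
    ("ActivityScheduled", "charge-payment"),
    ("ActivityStarted", "charge-payment"),
    ("ActivityCompleted", "charge-payment"),
    ("TimerStarted", "2h"),
]
print(current_status(sample_history))  # prints "Running (timer: 2h)"
```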


Determinism: The One Rule You Must Follow

Durable execution workflows must be deterministic — the same event history must always produce the same execution path. Common pitfalls:

# ❌ Non-deterministic — don't do this
@workflow.run
async def run(self):
    # random() changes every replay → breaks history
    delay = random.randint(1, 10)
    await asyncio.sleep(delay)
    
    # datetime.now() changes every replay
    cutoff = datetime.now() - timedelta(days=7)

# ✅ Correct — use workflow APIs
@workflow.run
async def run(self):
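    # Result is captured in history. Activities need a timeout.
    # Alternative: workflow.random().randint(1, 10) is seeded per workflow and replay-safe.
    delay = await workflow.execute_activity(
        get_random_delay,
        start_to_close_timeout=timedelta(seconds=10),
    )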
    await asyncio.sleep(delay)
    
    cutoff = workflow.now() - timedelta(days=7)  # uses replayed time
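
One way to see why the rule matters is to compare the commands a workflow emits on its first run with the commands it emits on replay. The sketch below is a toy check under stated assumptions: wall_clock is a counter standing in for the system clock, and both workflow functions are hypothetical. Real runtimes do something similar by checking replayed commands against recorded history.

```python
import itertools
from typing import Callable

# Stand-in for the system clock: every read returns a different value.
wall_clock = itertools.count(start=1000)


def bad_workflow(recorded_start: int) -> list:
    now = next(wall_clock)  # reads the "real" clock, so replay sees a new value
    return [("start_timer_until", now + 60)]


def good_workflow(recorded_start: int) -> list:
    # Uses the start time stored in history, so replay sees the same value.
    return [("start_timer_until", recorded_start + 60)]


def is_deterministic(workflow: Callable, recorded_start: int) -> bool:
    """Run the workflow twice with the same inputs and compare emitted commands."""
    original = workflow(recorded_start)
    replayed = workflow(recorded_start)
    return original == replayed


print(is_deterministic(bad_workflow, 1000))
print(is_deterministic(good_workflow, 1000))
```

The first workflow fails the check because its timer deadline depends on when the code happens to run, which is exactly what datetime.now() does inside a real workflow.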

Conclusion

Durable execution isn’t a replacement for every distributed system pattern — but it’s the right model for a large class of business workflows that teams have historically solved with fragile combinations of queues, cron jobs, status tables, and duct tape.

The 2026 ecosystem is mature enough to adopt in production. If you’re building a new service with multi-step business logic, start here before reaching for RabbitMQ or SQS.

