Durable Execution: The Pattern That's Quietly Eating Microservices Architecture
If you’ve ever debugged a distributed transaction that half-failed at 2 AM, you already understand why durable execution is having a moment. The idea is deceptively simple: write your business logic as ordinary code, and let the platform guarantee it runs to completion — even across crashes, restarts, timeouts, and network failures. No message queues. No saga orchestrators. No dead letter queues. Just code that finishes.
The Problem With “Just Use Queues”
The traditional answer to distributed reliability was: publish to a queue, consume idempotently, retry on failure. It works. But it has costs:
- State is implicit: workflow state is scattered across queue messages, database rows, and service memory
- Visibility is poor: “where is this order in the pipeline?” requires querying multiple systems
- Partial failure is hard: if step 3 of 7 fails, do you restart from 1? From 3? What if step 2 had side effects?
- Timeouts are painful: “wait 24 hours, then send a reminder” requires a scheduler, a cron job, or a timer service
- Testing is miserable: integration testing a multi-queue pipeline requires a running broker
Durable execution flips the model. You write a workflow function in regular code, and the runtime handles persistence, retries, and replay.
How Durable Execution Works
The core mechanism is event sourcing + deterministic replay:
- Every workflow action (activity call, timer, signal) is recorded as an event in a durable history
- If the worker crashes, the workflow replays from history to reconstruct in-memory state
- Determinism constraints ensure replay produces the same state as the original execution
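# Temporal workflow: looks like normal Python.
# Order, ShipArgs, FulfillmentResult and the three activities are assumed
# to be defined elsewhere in the project.
import asyncio
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy


@workflow.defn
class OrderFulfillmentWorkflow:
    @workflow.run
    async def run(self, order: Order) -> FulfillmentResult:
        # Each activity is automatically retried on failure
        payment = await workflow.execute_activity(
            charge_payment,
            order.payment_info,
            start_to_close_timeout=timedelta(seconds=30),
        )
        # This timer survives worker restarts (asyncio.sleep takes seconds)
        await asyncio.sleep(timedelta(hours=2).total_seconds())  # wait for warehouse pick
        shipment = await workflow.execute_activity(
            ship_order,
            ShipArgs(order_id=order.id, payment_id=payment.id),
            start_to_close_timeout=timedelta(minutes=5),  # an activity timeout is required
            retry_policy=RetryPolicy(maximum_attempts=5),
        )
        # Send confirmation. Its completion is recorded once in history, but the
        # activity itself may be retried, so the email sender should be idempotent.
        await workflow.execute_activity(
            send_confirmation_email,
            shipment,
            start_to_close_timeout=timedelta(seconds=30),
        )
        return FulfillmentResult(shipment_id=shipment.id)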
If the worker dies between charge_payment and ship_order, the workflow replays the history, sees that charge_payment already succeeded, skips re-executing it, and continues from the sleep timer. The customer is never double-charged.
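The replay behaviour described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how Temporal or any other platform is implemented: ReplayContext, step, and the in-memory history list are hypothetical names, and a real runtime would persist the history durably rather than keep it in a list.

```python
from typing import Any, Callable, List, Optional, Tuple


class ReplayContext:
    """Toy context: records each step result and replays it after a crash."""

    def __init__(self, history: Optional[List[Tuple[str, Any]]] = None):
        self.history = history if history is not None else []
        self._cursor = 0  # how many recorded events this run has consumed

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.history):
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise RuntimeError(
                    f"non-deterministic replay: expected {recorded_name!r}, got {name!r}"
                )
            self._cursor += 1
            return result  # replayed: fn is not executed again
        result = fn()  # first execution: run the side effect and record it
        self.history.append((name, result))
        self._cursor += 1
        return result


class Crash(Exception):
    """Stands in for a worker dying mid-workflow."""


charges: List[str] = []  # stands in for the payment provider's ledger


def charge() -> str:
    charges.append("charged")
    return "pay_1"


def order_workflow(ctx: ReplayContext, crash_before_ship: bool) -> str:
    payment_id = ctx.step("charge", charge)
    if crash_before_ship:
        raise Crash()
    return ctx.step("ship", lambda: f"shipment-for-{payment_id}")


history: List[Tuple[str, Any]] = []
try:
    order_workflow(ReplayContext(history), crash_before_ship=True)
except Crash:
    pass  # the "worker" died after charging

# A fresh worker picks up the same history and re-runs the function from the top.
result = order_workflow(ReplayContext(history), crash_before_ship=False)
print(result, charges)
```

On the second run the charge step is answered from history, so the payment function is called only once even though the workflow function ran twice.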
The Major Players in 2026
Temporal
The most mature platform. It grew out of Uber's Cadence project and is available both as open source and as the managed Temporal Cloud service. SDKs include Go, Java, Python, TypeScript, and .NET.
Strengths:
- Battle-tested at massive scale (Coinbase, Netflix, Stripe)
- Rich SDK ecosystem
- Strong observability (workflow history, search attributes)
- temporal server start-dev for local development
Considerations:
- Self-hosting means operating the Temporal services plus a persistence store (Cassandra, PostgreSQL, or MySQL)
- Temporal Cloud pricing scales with workflow actions
- Learning curve around determinism constraints
Restate
The new challenger, built for cloud-native, serverless-first architectures. A lightweight Restate server sits in front of your services and invokes your handlers over HTTP, rather than having workers poll a central cluster.
// Restate: runs as a regular HTTP handler.
// Order, chargePayment, shipOrder and sendConfirmation are assumed to be defined elsewhere.
import * as restate from "@restatedev/restate-sdk";

const orderWorkflow = restate.workflow({
  name: "OrderFulfillment",
  handlers: {
    run: async (ctx: restate.WorkflowContext, order: Order) => {
      // Each ctx.run result is journaled, so a retry does not re-execute it
      const payment = await ctx.run("charge", () => chargePayment(order));
      await ctx.sleep(2 * 60 * 60 * 1000); // 2 hours in milliseconds, durable
      const shipment = await ctx.run("ship", () => shipOrder(order.id, payment.id));
      await ctx.run("notify", () => sendConfirmation(shipment));
      return { shipmentId: shipment.id };
    },
  },
});
Strengths:
- Works natively with serverless (Lambda, Cloud Run, Deno Deploy)
- No central cluster to operate — Restate Cloud or self-hosted lightweight server
- Lower operational overhead than Temporal
Considerations:
- Smaller community (but growing fast)
- Less mature SDK ecosystem
AWS Step Functions (Express + Standard)
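AWS-native durable execution. State machines are defined in Amazon States Language (JSON, with YAML available through tooling) and can call more than 200 AWS services through SDK integrations.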
Strengths:
- Zero infrastructure to manage
- Native AWS integration (Lambda, ECS, DynamoDB, etc.)
- Standard Workflows: exactly-once workflow execution, runs lasting up to one year
- Express Workflows: high-throughput, at-least-once, runs lasting up to five minutes
Considerations:
- State machine DSL is verbose and hard to test locally
- Limited programming model compared to code-first approaches
- Vendor lock-in
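To make the DSL trade-off concrete, here is a rough sketch of the same order flow as an Amazon States Language definition, built as a Python dict so it can be checked before deployment. The Lambda ARNs are placeholders, and find_problems is a hypothetical helper that only understands the Task and Wait states used here, not the full language.

```python
import json
from typing import List

# Placeholder ARNs: substitute real function ARNs before deploying.
CHARGE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:charge-payment"
SHIP_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ship-order"

state_machine = {
    "Comment": "Order fulfillment sketch",
    "StartAt": "ChargePayment",
    "States": {
        "ChargePayment": {
            "Type": "Task",
            "Resource": CHARGE_ARN,
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 5,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "WaitForWarehouse",
        },
        "WaitForWarehouse": {"Type": "Wait", "Seconds": 7200, "Next": "ShipOrder"},
        "ShipOrder": {"Type": "Task", "Resource": SHIP_ARN, "End": True},
    },
}


def find_problems(definition: dict) -> List[str]:
    """Check that StartAt and every Next point at a state that exists."""
    states = definition["States"]
    problems = []
    if definition["StartAt"] not in states:
        problems.append(f"StartAt points to unknown state {definition['StartAt']!r}")
    for name, state in states.items():
        target = state.get("Next")
        if target is None and not state.get("End", False):
            problems.append(f"{name}: needs Next or End")
        elif target is not None and target not in states:
            problems.append(f"{name}: Next points to unknown state {target!r}")
    return problems


print(find_problems(state_machine))
print(json.dumps(state_machine, indent=2))
```

Even this three-step flow needs about twenty lines of JSON, which is the verbosity the first consideration refers to. A small local check like this catches dangling transitions without deploying anything.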
Cloudflare Workflows
New in 2025, built on top of Durable Objects. Designed specifically for Cloudflare Workers.
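// Env, Params, chargePayment and shipOrder are assumed to be defined elsewhere.
import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";

export class MyWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // The return value of each step is checkpointed and reused on retry
    const result = await step.do("charge-payment", async () => {
      return await chargePayment(event.payload);
    });
    await step.sleep("wait-for-warehouse", "2 hours");
    await step.do("ship-order", async () => {
      return await shipOrder(result.paymentId);
    });
  }
}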
Strengths:
- Serverless-native, runs at edge
- No separate infrastructure
- Automatic checkpointing
Considerations:
- Cloudflare Workers ecosystem only
- New — limited production track record
When to Use Durable Execution
Great Fits
- Long-running business processes: order fulfillment, onboarding flows, approval chains
- Human-in-the-loop workflows: “wait for user to confirm email, then proceed”
- Exactly-once semantics: payment processing, inventory deduction
- Fan-out/fan-in: process 10,000 items in parallel, aggregate results
- Scheduled follow-ups: “send reminder if no response in 48 hours”
- Compensating transactions: saga pattern without the saga framework
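The compensating-transactions case is worth a small illustration. The following is a minimal sketch in plain Python, assuming each step comes paired with an undo function. run_saga and the step names are made up for this example. In a durable runtime each action and compensation would be an activity, so the rollback itself would also survive a crash.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo the completed ones in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    try:
        for name, action, compensate in steps:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
    except Exception:
        for name, compensate in reversed(completed):
            compensate()
            log.append(f"undone:{name}")
        return False
    return True


def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")


def noop() -> None:
    pass


log: List[str] = []
ok = run_saga(
    [
        ("reserve-inventory", noop, noop),
        ("charge-payment", noop, noop),
        ("ship-order", fail_shipping, noop),
    ],
    log,
)
print(ok, log)
```

Because the workflow is ordinary code, the compensation logic is just a try/except and a reversed loop, with no separate saga coordinator.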
Not Ideal For
- High-throughput, sub-millisecond operations: durable execution adds overhead
- Simple request/response APIs: no need for durability on a stateless GET
- Stream processing: use Kafka/Flink for high-volume event streams
- Anything that needs sub-second reaction: the replay model has latency
Migration Path: From Queue-Based to Durable
You don’t have to rewrite everything. A practical migration:
Phase 1 — Wrap your queue consumers
Take an existing queue-based flow and wrap it in a workflow. The activities call your existing logic. You get observability and retry management immediately.
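As a rough picture of what "retry management" takes off your hands in Phase 1, here is a plain-Python sketch of the retry loop a platform's retry policy runs around an existing handler. run_with_retries and its parameters are hypothetical names, not any SDK's API. The sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retries(
    handler: Callable[[], T],
    max_attempts: int = 5,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call handler until it succeeds, backing off exponentially between attempts."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return handler()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")


calls = {"count": 0}


def existing_consumer_logic() -> str:
    """Stands in for your current queue consumer; fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("downstream unavailable")
    return "processed"


delays: List[float] = []
outcome = run_with_retries(existing_consumer_logic, sleep=delays.append)
print(outcome, delays)
```

The existing consumer function is untouched. Only the wrapper around it changes, which is what makes Phase 1 a low-risk first step.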
Phase 2 — Move state into workflow context
Replace the “order status” table that you update per-step with workflow signals and queries. The workflow history becomes your audit log.
Phase 3 — Add long-duration logic
Now that you have durable timers, implement the “retry in 24 hours” and “escalate if not resolved in 3 days” flows that were painful with cron.
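What makes a timer durable is that the deadline, not the countdown, is what gets persisted. A minimal sketch, assuming a JSON-serializable store stands in for the platform's history. start_timer and time_remaining are illustrative names.

```python
import json
from datetime import datetime, timedelta, timezone


def start_timer(store: dict, timer_id: str, duration: timedelta, now: datetime) -> None:
    """Persist the absolute fire time rather than an in-memory countdown."""
    store[timer_id] = (now + duration).isoformat()


def time_remaining(store: dict, timer_id: str, now: datetime) -> timedelta:
    """Recompute the remaining wait from the stored deadline."""
    fire_at = datetime.fromisoformat(store[timer_id])
    return max(fire_at - now, timedelta(0))


started = datetime(2026, 6, 15, 8, 0, tzinfo=timezone.utc)
store: dict = {}
start_timer(store, "retry-order-123", timedelta(hours=24), started)

# Simulate a worker restart: only the serialized store survives.
restored = json.loads(json.dumps(store))

after_restart = time_remaining(restored, "retry-order-123", started + timedelta(hours=5))
after_deadline = time_remaining(restored, "retry-order-123", started + timedelta(hours=30))
print(after_restart, after_deadline)
```

A worker that restarts five hours in simply waits the remaining nineteen. One that restarts after the deadline fires immediately.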
Observability Built In
One underrated benefit: every workflow has a full history of what happened and when.
Workflow: order-fulfillment/ord_123abc
Status: Running (step: wait-for-warehouse)
Started: 2026-06-15T08:23:11Z
─────────────────────────────────────
[08:23:11] WorkflowExecutionStarted
[08:23:12] ActivityScheduled: charge-payment
[08:23:12] ActivityStarted: charge-payment
[08:23:13] ActivityCompleted: charge-payment → {paymentId: "pay_xyz"}
[08:23:13] TimerStarted: 2h (fires at 10:23:13)
Debugging production incidents becomes reading a log rather than correlating events across five services.
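Because the history is plain data, "where is this order?" becomes a fold over events. The sketch below uses the event names from the illustrative log above, which are not any platform's exact schema. current_status is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, Optional[str]]  # (event type, detail)

history: List[Event] = [
    ("WorkflowExecutionStarted", None),
    ("ActivityScheduled", "charge-payment"),
    ("ActivityStarted", "charge-payment"),
    ("ActivityCompleted", "charge-payment"),
    ("TimerStarted", "2h"),
]


def current_status(events: List[Event]) -> str:
    """Derive a human-readable status from the last relevant event."""
    status = "Not started"
    for event_type, detail in events:
        if event_type == "WorkflowExecutionStarted":
            status = "Running"
        elif event_type in ("ActivityScheduled", "ActivityStarted"):
            status = f"Running (activity: {detail})"
        elif event_type in ("ActivityCompleted", "TimerFired"):
            status = "Running"
        elif event_type == "TimerStarted":
            status = f"Running (timer: {detail})"
        elif event_type == "WorkflowExecutionCompleted":
            status = "Completed"
    return status


print(current_status(history))
```

The same function answers the question for any point in the past by passing it a prefix of the history.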
Determinism: The One Rule You Must Follow
Durable execution workflows must be deterministic — the same event history must always produce the same execution path. Common pitfalls:
# ❌ Non-deterministic: don't do this
@workflow.run
async def run(self):
    # random() changes every replay → breaks history
    delay = random.randint(1, 10)
    await asyncio.sleep(delay)
    # datetime.now() changes every replay
    cutoff = datetime.now() - timedelta(days=7)

# ✅ Correct: use workflow APIs
@workflow.run
async def run(self):
    # The activity result is captured in history and reused on replay
    delay = await workflow.execute_activity(
        get_random_delay,
        start_to_close_timeout=timedelta(seconds=10),  # an activity timeout is required
    )
    await asyncio.sleep(delay)
    cutoff = workflow.now() - timedelta(days=7)  # uses replayed time
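An alternative to routing randomness through an activity is a generator seeded from something stored with the workflow, so a replay draws the same sequence. The sketch below shows the idea with the standard library. workflow_seed is an assumed value that a runtime would persist in history. Temporal's Python SDK offers workflow.random() in this spirit.

```python
import random
from typing import List


def deterministic_delays(workflow_seed: int, count: int) -> List[int]:
    """Draw delays from a generator seeded per workflow, so replays match."""
    rng = random.Random(workflow_seed)  # isolated from the global random state
    return [rng.randint(1, 10) for _ in range(count)]


workflow_seed = 20260615  # assumed to be recorded once when the workflow starts

first_run = deterministic_delays(workflow_seed, 5)
replay = deterministic_delays(workflow_seed, 5)
print(first_run == replay)
```

The values still look random to the business logic, but the execution path they drive is reproducible from the recorded seed.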
Conclusion
Durable execution isn’t a replacement for every distributed system pattern — but it’s the right model for a large class of business workflows that teams have historically solved with fragile combinations of queues, cron jobs, status tables, and duct tape.
The 2026 ecosystem is mature enough to adopt in production. If you’re building a new service with multi-step business logic, start here before reaching for RabbitMQ or SQS.
