Durable Execution: Why Temporal and Workflow Engines Are Replacing Queues and Cron Jobs



Durable Execution: Why Temporal and Workflow Engines Are Replacing Queues and Cron Jobs

Every sufficiently large backend system eventually builds the same things: retry queues for failed jobs, cron schedulers for periodic tasks, state machines for multi-step processes, distributed locks for coordination. You build them in Redis, in Postgres, in SQS + Lambda, in whatever you have. And then you spend years maintaining them as they fail in creative ways.

Durable execution is the paradigm that makes most of this complexity unnecessary. Temporal, Restate, and Cloudflare Durable Objects let you write plain functions that automatically survive process crashes, network failures, and service restarts — without any explicit state management code.

This post covers what durable execution actually is, how Temporal implements it, and practical patterns for common use cases that traditionally require complex infrastructure.

Abstract workflow diagram showing connected nodes and data flow Photo by Sigmund on Unsplash


The Problem: Everything Can Fail

Consider a standard “new user signup” flow:

1. Create user in database
2. Send welcome email
3. Create Stripe customer
4. Provision user's initial resources
5. Send to analytics
6. Notify Slack #new-signups

In traditional code, this is a nightmare of failure handling:

# Traditional approach — every step can fail, and you have to handle it
async def handle_signup(user_data: dict):
    # Step 1 succeeds
    user = await db.create_user(user_data)
    
    try:
        # Step 2: what if the email service is down?
        await email.send_welcome(user.email)
    except EmailServiceError:
        # Do we retry? How many times? When? Who tracks this?
        await retry_queue.push("send_welcome", user.id, delay=60)
    
    try:
        # Step 3: what if we crash here?
        # User is created, email is sending, but no Stripe customer
        # On retry, we'll create a DUPLICATE user!
        stripe_customer = await stripe.create_customer(user.email)
        await db.update_user(user.id, stripe_customer_id=stripe_customer.id)
    except StripeError as e:
        # Save state somehow so we can resume from here, not from the start
        await state_store.save(user.id, step=3, error=str(e))
        raise
    
    # ... and on and on for each step

The real code is full of retry logic, idempotency keys, checkpoint saves, and race condition guards. You’re writing infrastructure, not business logic.


Durable Execution: The Model

Durable execution frameworks give you one primitive: a function that always completes, even if the process crashes a hundred times before it finishes.

The system records every step as the workflow executes. On crash/restart, it replays the history to restore the function’s state — without re-executing completed steps.

# Temporal workflow — the same signup flow
@workflow.defn
class UserSignupWorkflow:
    @workflow.run
    async def run(self, user_data: dict) -> str:
        # Step 1: Create user
        # If this crashes halfway, Temporal retries from BEFORE this step
        # If this succeeds and we crash, Temporal replays and SKIPS re-executing
        user_id = await workflow.execute_activity(
            create_user,
            user_data,
            start_to_close_timeout=timedelta(seconds=10),
        )

        # Step 2–6: Each activity is durable
        await workflow.execute_activity(
            send_welcome_email,
            user_id,
            # Automatically retried with backoff if email service is down
            retry_policy=RetryPolicy(
                maximum_attempts=10,
                initial_interval=timedelta(seconds=5),
                maximum_interval=timedelta(minutes=10),
            ),
        )
        
        # These run concurrently — Temporal handles the fan-out
        await asyncio.gather(
            workflow.execute_activity(create_stripe_customer, user_id),
            workflow.execute_activity(provision_resources, user_id),
            workflow.execute_activity(track_analytics_event, user_id),
        )
        
        await workflow.execute_activity(notify_slack, user_id)
        
        return user_id

If the process crashes between step 2 and step 3, Temporal replays the workflow history, confirms that steps 1 and 2 completed, and continues from step 3. No duplicate database rows. No orphaned Stripe customers. No manual checkpoint saves.


Core Temporal Concepts

Workflows

Workflows are the orchestrators — they define the flow of steps, handle branching logic, and manage long-running state. They must be deterministic (no random numbers, no datetime.now() — use Temporal’s equivalents).

@workflow.defn
class OrderFulfillmentWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> OrderResult:
        # Long-running — this can take days and survive any number of restarts
        
        # Wait for payment confirmation (signal from external system)
        await workflow.wait_condition(lambda: self._payment_confirmed)
        
        # Schedule shipping — waits up to 24h for warehouse confirmation
        tracking = await workflow.execute_activity(
            schedule_shipping,
            order_id,
            schedule_to_close_timeout=timedelta(hours=24),
        )
        
        # Wait 3 days for delivery, then check status
        await asyncio.sleep(timedelta(days=3).total_seconds())
        
        delivery_status = await workflow.execute_activity(
            check_delivery_status,
            tracking,
        )
        
        if delivery_status == "delivered":
            await workflow.execute_activity(release_payment, order_id)
            return OrderResult(success=True)
        else:
            await workflow.execute_activity(initiate_investigation, order_id)
            return OrderResult(success=False, action="investigating")

    @workflow.signal
    async def payment_confirmed(self):
        self._payment_confirmed = True

This workflow can span days. The await asyncio.sleep(timedelta(days=3).total_seconds()) is not a real sleep — it schedules a timer in Temporal’s server and the worker process can be restarted any number of times during those 3 days.

Activities

Activities are the actual work — calling external APIs, writing to databases, sending notifications. They can be non-deterministic and can fail.

@activity.defn
async def create_stripe_customer(user_id: str) -> str:
    # Heartbeat so Temporal knows we're still alive
    activity.heartbeat("Creating Stripe customer...")
    
    user = await db.get_user(user_id)
    
    # Idempotency key — safe to retry without creating duplicates
    customer = await stripe.customers.create(
        email=user.email,
        idempotency_key=f"stripe-customer-{user_id}",
    )
    
    await db.update_user(user_id, stripe_customer_id=customer.id)
    return customer.id

Workers

Workers poll Temporal’s task queue and execute workflows and activities:

async def main():
    client = await Client.connect("localhost:7233")
    
    worker = Worker(
        client,
        task_queue="user-signup",
        workflows=[UserSignupWorkflow, OrderFulfillmentWorkflow],
        activities=[
            create_user,
            send_welcome_email,
            create_stripe_customer,
            provision_resources,
            track_analytics_event,
            notify_slack,
        ],
    )
    
    await worker.run()

Patterns: What Durable Execution Replaces

Replace: Cron + Database State

Old approach:

# Cron job that runs every hour
def hourly_billing_job():
    # Find all subscriptions due for renewal
    due = db.query("SELECT * FROM subscriptions WHERE next_billing < NOW()")
    for sub in due:
        try:
            charge(sub)
            db.update(sub.id, next_billing=next_month())
        except Exception:
            db.update(sub.id, retry_count=sub.retry_count+1)

Temporal approach:

@workflow.defn
class SubscriptionBillingWorkflow:
    @workflow.run
    async def run(self, subscription_id: str):
        while True:  # Infinite loop — Temporal handles this safely
            # Wait until next billing date
            next_billing = await workflow.execute_activity(
                get_next_billing_date, subscription_id
            )
            await workflow.sleep_until(next_billing)
            
            # Attempt charge with automatic retry
            await workflow.execute_activity(
                charge_subscription,
                subscription_id,
                retry_policy=RetryPolicy(
                    maximum_attempts=5,
                    non_retryable_error_types=["CardDeclinedError"],
                ),
            )
            
            # Update next billing date
            await workflow.execute_activity(
                advance_billing_period, subscription_id
            )

One workflow per subscription. No cron job. No retry tracking in the database. Temporal maintains the timer and state.

Replace: Queue + Worker + Dead Letter Queue

Old approach:

SQS queue → Lambda → retry up to 3 times → DLQ → manual investigation

Temporal approach:

@workflow.defn  
class DocumentProcessingWorkflow:
    @workflow.run
    async def run(self, document_id: str):
        # Extract text — retry 10 times with exponential backoff
        text = await workflow.execute_activity(
            extract_text, document_id,
            retry_policy=RetryPolicy(maximum_attempts=10),
        )
        
        # Classify — might need human review
        classification = await workflow.execute_activity(classify_document, text)
        
        if classification.confidence < 0.8:
            # Signal workflow to wait for human input
            # This can wait days without holding a thread
            await workflow.wait_condition(
                lambda: self._human_review_done,
                timeout=timedelta(days=7),
            )
            classification = self._human_classification
        
        await workflow.execute_activity(store_classification, document_id, classification)

No dead letter queue. Failed workflows are visible in Temporal’s UI. Human-in-the-loop is a native pattern (signals + wait_condition).

Developer looking at monitoring dashboards on multiple screens Photo by Ilya Pavlov on Unsplash


Temporal vs Alternatives

PlatformBest ForLanguage SupportHosting
TemporalComplex business workflows, long-runningPython, Go, Java, TypeScript, .NET, PHPSelf-hosted or Temporal Cloud
RestateMicroservice orchestration, event-drivenJava/Kotlin, TypeScript, GoSelf-hosted or Restate Cloud
AWS Step FunctionsAWS-native, visual workflowsAny (via Lambda)Managed (AWS)
Prefect / DagsterData pipelines, ML workflowsPythonSelf-hosted or cloud
Cloudflare Durable ObjectsEdge-local state, real-timeTypeScript/JavaScriptCloudflare Workers

Temporal is the most general and production-proven. Restate is gaining traction for its excellent developer experience and gRPC-native design. Step Functions are the path of least resistance if you’re all-in on AWS.


Getting Started with Temporal

Local Development

# Start Temporal server locally
brew install temporal
temporal server start-dev

# Open the UI
open http://localhost:8233

Python SDK Quickstart

pip install temporalio
# Complete working example

from datetime import timedelta
from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker

@activity.defn
async def say_hello(name: str) -> str:
    return f"Hello, {name}!"

@workflow.defn
class GreetingWorkflow:
    @workflow.run
    async def run(self, name: str) -> str:
        return await workflow.execute_activity(
            say_hello,
            name,
            start_to_close_timeout=timedelta(seconds=10),
        )

# Start a workflow
async def main():
    client = await Client.connect("localhost:7233")
    result = await client.execute_workflow(
        GreetingWorkflow.run,
        "World",
        id="greeting-workflow-1",
        task_queue="greeting-tasks",
    )
    print(result)  # "Hello, World!"

# Run a worker
async def run_worker():
    client = await Client.connect("localhost:7233")
    worker = Worker(client, task_queue="greeting-tasks",
                    workflows=[GreetingWorkflow], activities=[say_hello])
    await worker.run()

When NOT to Use Durable Execution

Durable execution is not always the right tool:

  • Simple background jobs — if your job is “resize this image” with no fan-out, retries, or long waits, a basic queue (BullMQ, SQS, Celery) is simpler and cheaper
  • Sub-100ms latency requirements — Temporal adds overhead for the persistence layer; it’s not suitable for hot paths
  • Stateless batch processing — Spark, Flink, or even simple Python scripts are better for “process 10M rows” jobs
  • Very high throughput events — Temporal is not an event streaming platform; use Kafka for high-throughput event pipelines

Key Takeaways

Durable execution is the right abstraction for:

  • Multi-step business processes that must complete reliably
  • Long-running workflows (hours to days)
  • Processes requiring human-in-the-loop
  • Complex retry and compensation logic
  • Replacing ad-hoc “retry job” tables in your database

The shift is conceptual: instead of designing a system that handles failures (queues, retries, checkpoints), you write the happy path and the framework handles failures for you. For the right use cases, this is a 5–10x reduction in code complexity and a dramatic improvement in reliability.

Start with your messiest, most fragile background job — the one with the “retry_count” column in the database and a Slack alert for when it fails. Replace it with Temporal. You’ll probably convert everything else shortly after.

이 글이 도움이 되셨다면 공감 및 광고 클릭을 부탁드립니다 :)