AI Agents in Production: Real-World Patterns That Actually Work
on AI, LLM, Agents, MLOps, Production
A year ago, “AI agents” meant demos on YouTube. Today, they’re pulling orders, triaging support tickets, writing code, and managing pipelines inside real production systems. The gap between what works in a notebook and what survives contact with production has never been more instructive.
This post is a distillation of patterns that are working — and anti-patterns that will ruin your week.
Why Agents Break in Production (And Why It’s Predictable)
The core problem with agents is that they’re non-deterministic systems running inside deterministic infrastructure. Your CI/CD pipeline expects reliable outputs. Your customers expect consistent behavior. Agents, by their nature, take different paths on different runs.
This mismatch produces three failure modes:
- Runaway loops: An agent calls a tool, interprets the result ambiguously, and retries indefinitely
- Silent corruption: The agent completes its task but corrupts state in a way that isn’t caught until hours later
- Confidence hallucination: The agent takes a confident action on incorrect premises, without flagging uncertainty
None of these are unsolvable. But they require different thinking than traditional software engineering.
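Runaway loops in particular yield to mundane engineering: give every agent loop a hard step budget. A minimal sketch — the `next_action`/`observe` interface here is hypothetical, not from any specific framework:

```python
MAX_STEPS = 10  # hard ceiling on agent iterations per task

def run_agent_loop(agent, task):
    """Guard against runaway loops with a hard step budget.

    Assumes a hypothetical agent interface: next_action() returns an
    action with is_done/result, execute() runs it, observe() feeds the
    observation back to the agent.
    """
    for step in range(MAX_STEPS):
        action = agent.next_action(task)
        if action.is_done:
            return action.result
        agent.observe(action.execute())
    # Fail loudly instead of burning tokens forever
    raise RuntimeError(f"Agent exceeded {MAX_STEPS} steps on task {task!r}")
```

The important part is the final `raise`: an exhausted budget is an error you alert on, not a condition to silently retry.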
Pattern 1: The Narrow-Scope Agent
The most reliable agents in production share one trait: they do exactly one thing.
class InvoiceExtractionAgent:
    """
    Single responsibility: extract structured data from invoice PDFs.
    Does NOT: classify invoices, route them, or update the database.
    """

    def __init__(self, llm_client, extraction_schema):
        self.client = llm_client
        self.schema = extraction_schema
        self.max_retries = 3

    def extract(self, pdf_path: str) -> InvoiceData:
        raw_text = self._read_pdf(pdf_path)
        for attempt in range(self.max_retries):
            result = self.client.complete(
                system=EXTRACTION_SYSTEM_PROMPT,
                user=f"Extract data from:\n\n{raw_text}",
                response_format=self.schema,
            )
            if self._validate(result):
                return result
        raise ExtractionFailure(f"Failed after {self.max_retries} attempts")
The key insight here: the agent doesn’t decide what to do with the result. That’s someone else’s job. Composition over monoliths.
Pattern 2: Structured Output as a Contract
Unstructured LLM output in production is a time bomb. A model that returns freeform text will, eventually, return something your downstream code can’t parse. And it will happen at 2 AM.
Structured outputs — enforced via JSON Schema, Pydantic, or the model provider’s native structured output API — transform LLM responses into actual contracts.
from pydantic import BaseModel, Field
from typing import Literal

class TicketTriage(BaseModel):
    priority: Literal["critical", "high", "medium", "low"]
    category: Literal["billing", "technical", "account", "other"]
    sentiment: Literal["frustrated", "neutral", "satisfied"]
    suggested_team: str = Field(max_length=50)
    requires_human: bool
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(max_length=500)

# Now the output is *guaranteed* to match this shape.
# If the model can't comply, the API returns an error — not garbage JSON.
triage = openai_client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[...],
    response_format=TicketTriage,
)
Notice the confidence field. This is critical: always ask the model to rate its own certainty. When confidence drops below a threshold, route to a human. It’s a simple but powerful escape hatch.
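Given a parsed `TicketTriage`, the escape hatch is a single guard. A minimal sketch — the threshold value and the `handle_automatically`/`escalate_to_human` callables are placeholders, not from the post:

```python
CONFIDENCE_THRESHOLD = 0.8  # tune per use case; start conservative

def route_triage(triage, handle_automatically, escalate_to_human):
    """Auto-handle only when the model is confident AND it didn't
    explicitly flag the ticket for a human."""
    if triage.requires_human or triage.confidence < CONFIDENCE_THRESHOLD:
        return escalate_to_human(triage)
    return handle_automatically(triage)
```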
Pattern 3: The Human-in-the-Loop Checkpoint
“Autonomy” is a spectrum, not a binary. The most successful production agents aren’t fully autonomous — they have well-defined checkpoints where humans can intervene.
Think of it as a confidence threshold system:
class AgentOrchestrator:
    def __init__(self, agent, approval_queue, confidence_threshold=0.85):
        self.agent = agent
        self.queue = approval_queue
        self.threshold = confidence_threshold

    async def run(self, task):
        result = await self.agent.plan(task)
        # High-confidence actions: execute immediately
        immediate = [a for a in result.actions if a.confidence >= self.threshold]
        # Low-confidence actions: queue for human review
        pending = [a for a in result.actions if a.confidence < self.threshold]
        if pending:
            await self.queue.submit(pending, context=result.reasoning)
            # Don't block — return partial results and continue async
        return await self.agent.execute(immediate)
This pattern has an important property: it degrades gracefully. Even if every action is below threshold, nothing breaks. The system just routes everything to humans. It’s the opposite of an agent that fails silently.
Pattern 4: Idempotent Tool Calls
Agents frequently retry. Network blips, ambiguous responses, context window resets — all of these can cause a tool to be called multiple times. If your tools aren’t idempotent, this will create duplicate orders, double-sent emails, and corrupted state.
The fix is the same as in distributed systems: idempotency keys.
def send_notification(user_id: str, message: str, idempotency_key: str) -> bool:
    """
    Idempotency key ensures this notification is sent at most once,
    even if the tool is called multiple times by the agent.
    """
    if redis.exists(f"notif:{idempotency_key}"):
        return True  # Already sent, silently succeed
    result = notification_service.send(user_id, message)
    if result.success:
        # redis-py setex(name, time, value): marker expires after 24h
        redis.setex(f"notif:{idempotency_key}", 86400, "sent")
    return result.success
Teach your agents to generate idempotency keys (a combination of task ID + tool name + input hash works well). Then enforce idempotency in every tool implementation.
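That key derivation can be made deterministic exactly as described — task ID plus tool name plus a hash of the inputs. A minimal sketch (the helper name is mine, not from the post):

```python
import hashlib
import json

def make_idempotency_key(task_id: str, tool_name: str, inputs: dict) -> str:
    """Derive a stable key: identical (task, tool, inputs) always map
    to the same key, so retries hit the same Redis entry."""
    # sort_keys makes the hash independent of dict insertion order
    input_hash = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()
    ).hexdigest()[:16]
    return f"{task_id}:{tool_name}:{input_hash}"
```

Because the key is derived rather than generated randomly, a retried tool call reproduces the same key without the agent having to remember anything.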
Pattern 5: Observability-First Design
You cannot debug what you cannot observe. Agents, with their multi-step reasoning and non-deterministic paths, require deeper observability than traditional services.
What you need to capture:
- Every prompt sent and response received (with token counts)
- Every tool call with input, output, and latency
- The full reasoning trace (if using chain-of-thought)
- The final action taken and its outcome
- Any retries and why they occurred
Frameworks like LangSmith, Weights & Biases, and Arize AI have purpose-built UIs for this. But even a structured logging approach gets you 80% of the value:
import time

import structlog

log = structlog.get_logger()

class InstrumentedAgent:
    async def call_tool(self, tool_name: str, inputs: dict) -> dict:
        start = time.monotonic()
        log.info("tool_call_start",
                 tool=tool_name,
                 inputs=inputs,
                 session_id=self.session_id)
        try:
            result = await self.tools[tool_name](**inputs)
            log.info("tool_call_success",
                     tool=tool_name,
                     latency_ms=(time.monotonic() - start) * 1000,
                     result_keys=list(result.keys()))
            return result
        except Exception as e:
            log.error("tool_call_failure",
                      tool=tool_name,
                      error=str(e),
                      latency_ms=(time.monotonic() - start) * 1000)
            raise
The Anti-Patterns Worth Naming
Don’t give agents unbounded internet access. Every unrestricted tool is a potential attack surface and a source of unpredictable behavior. Allowlist the tools an agent can call.
Don’t skip evals. Agents need test suites just like any other software. Golden-path tests, adversarial inputs, edge cases — all of it. LLM behavior drifts between model versions. Evals catch regressions.
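A golden-path eval can start as nothing more than a table of inputs and required output fields, re-run on every model bump. A sketch — `triage_ticket` is a hypothetical wrapper around your agent that returns a dict-like triage result:

```python
# (input text, fields the triage result MUST contain) — golden cases
GOLDEN_CASES = [
    ("My card was charged twice this month",
     {"category": "billing"}),
    ("The API returns 500 on every request",
     {"category": "technical", "priority": "critical"}),
]

def run_golden_evals(triage_ticket) -> list:
    """Return a list of failure descriptions; empty means all pass."""
    failures = []
    for text, expected in GOLDEN_CASES:
        result = triage_ticket(text)
        for field, want in expected.items():
            got = result[field]
            if got != want:
                failures.append(f"{text!r}: {field}={got!r}, want {want!r}")
    return failures
```

Wire this into CI and a model-version change that regresses triage behavior fails the build instead of failing a customer.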
Don’t assume the model knows your domain. System prompts should be dense with context: your business rules, edge cases, what to do when uncertain. A generic agent is a fragile agent.
Don’t deploy without a kill switch. Every agent in production should have a circuit breaker. When error rates spike, traffic should automatically route to a fallback (human queue, simpler rule-based system, or graceful failure message).
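The kill switch can be as simple as an error-rate window that flips the agent into fallback mode. A minimal sketch with hypothetical names — production versions usually live in your gateway or feature-flag system rather than in-process:

```python
from collections import deque

class CircuitBreaker:
    """Trip to fallback when the recent error rate exceeds a threshold."""

    def __init__(self, window: int = 50, max_error_rate: float = 0.2):
        self.outcomes = deque(maxlen=window)  # True = success
        self.max_error_rate = max_error_rate

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def open(self) -> bool:
        """Open circuit = stop calling the agent, use the fallback."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data to judge yet
        errors = self.outcomes.count(False)
        return errors / len(self.outcomes) > self.max_error_rate

def handle(task, agent, fallback, breaker: CircuitBreaker):
    if breaker.open:
        return fallback(task)  # human queue, rules engine, error page
    try:
        result = agent(task)
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        raise
```

Note the breaker only trips once the window is full — that keeps one early failure from disabling a freshly deployed agent.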
Where This Is All Going
The teams doing this well in 2026 aren’t building general-purpose agents. They’re building purpose-fit automations that happen to use LLMs as the reasoning layer. The agent framing is useful for thinking about design, but the production reality is narrower.
The next wave will be multi-agent systems — orchestrators coordinating specialist agents — but we’re not quite there yet in terms of reliability patterns. That’s a post for another day.
For now, narrow scope + structured outputs + human checkpoints + idempotent tools + deep observability. That’s the stack. That’s what ships.
If you’re building AI agents and want to compare notes, I’m @DevStarSJ on GitHub.
If this post helped you, a like (and an ad click) would be appreciated :)
