AI Agents in Production: Real-World Patterns That Actually Work
on AI, LLM, Agents, MLOps, Production
A year ago, “AI agents” meant demos on YouTube. Today, they’re pulling orders, triaging support tickets, writing code, and managing pipelines inside real production systems. The gap between what works in a notebook and what survives contact with production has never been more instructive.
This post is a distillation of patterns that are working — and anti-patterns that will ruin your week.
Why Agents Break in Production (And Why It’s Predictable)
The core problem with agents is that they’re non-deterministic systems running inside deterministic infrastructure. Your CI/CD pipeline expects reliable outputs. Your customers expect consistent behavior. Agents, by their nature, take different paths on different runs.
This mismatch produces three failure modes:
- Runaway loops: An agent calls a tool, interprets the result ambiguously, and retries indefinitely
- Silent corruption: The agent completes its task but corrupts state in a way that isn’t caught until hours later
- Confidence hallucination: The agent takes a confident action on incorrect premises, without flagging uncertainty
None of these are unsolvable. But they require different thinking than traditional software engineering.
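Runaway loops in particular yield to mundane engineering: give every agent loop a hard step budget. A minimal sketch — the `next_action`/`observe` interface here is hypothetical, not from any specific framework:

```python
MAX_STEPS = 10  # hard ceiling on agent iterations per task

def run_agent_loop(agent, task):
    """Guard against runaway loops with a hard step budget.

    Assumes a hypothetical agent interface: next_action() returns an
    action with is_done/result, execute() runs it, observe() feeds the
    observation back to the agent.
    """
    for step in range(MAX_STEPS):
        action = agent.next_action(task)
        if action.is_done:
            return action.result
        agent.observe(action.execute())
    # Fail loudly instead of burning tokens forever
    raise RuntimeError(f"Agent exceeded {MAX_STEPS} steps on task {task!r}")
```

The important part is the final `raise`: an exhausted budget is an error you alert on, not a condition to silently retry.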
Pattern 1: The Narrow-Scope Agent
The most reliable agents in production share one trait: they do exactly one thing.
class InvoiceExtractionAgent:
    """
    Single responsibility: extract structured data from invoice PDFs.
    Does NOT: classify invoices, route them, or update the database.
    """

    def __init__(self, llm_client, extraction_schema):
        self.client = llm_client
        self.schema = extraction_schema
        self.max_retries = 3

    def extract(self, pdf_path: str) -> InvoiceData:
        raw_text = self._read_pdf(pdf_path)
        for attempt in range(self.max_retries):
            result = self.client.complete(
                system=EXTRACTION_SYSTEM_PROMPT,
                user=f"Extract data from:\n\n{raw_text}",
                response_format=self.schema,
            )
            if self._validate(result):
                return result
        raise ExtractionFailure(f"Failed after {self.max_retries} attempts")
The key insight here: the agent doesn’t decide what to do with the result. That’s someone else’s job. Composition over monoliths.
Pattern 2: Structured Output as a Contract
Unstructured LLM output in production is a time bomb. A model that returns freeform text will, eventually, return something your downstream code can’t parse. And it will happen at 2 AM.
Structured outputs — enforced via JSON Schema, Pydantic, or the model provider’s native structured output API — transform LLM responses into actual contracts.
from pydantic import BaseModel, Field
from typing import Literal

class TicketTriage(BaseModel):
    priority: Literal["critical", "high", "medium", "low"]
    category: Literal["billing", "technical", "account", "other"]
    sentiment: Literal["frustrated", "neutral", "satisfied"]
    suggested_team: str = Field(max_length=50)
    requires_human: bool
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(max_length=500)

# Now the output is *guaranteed* to match this shape.
# If the model can't comply, the API returns an error — not garbage JSON.
triage = openai_client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[...],
    response_format=TicketTriage,
)
Notice the confidence field. This is critical: always ask the model to rate its own certainty. When confidence drops below a threshold, route to a human. It’s a simple but powerful escape hatch.
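Given a parsed `TicketTriage`, the escape hatch is a single guard. A minimal sketch — the threshold value and the `handle_automatically`/`escalate_to_human` callables are placeholders, not from the post:

```python
CONFIDENCE_THRESHOLD = 0.8  # tune per use case; start conservative

def route_triage(triage, handle_automatically, escalate_to_human):
    """Auto-handle only when the model is confident AND it didn't
    explicitly flag the ticket for a human."""
    if triage.requires_human or triage.confidence < CONFIDENCE_THRESHOLD:
        return escalate_to_human(triage)
    return handle_automatically(triage)
```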
Pattern 3: The Human-in-the-Loop Checkpoint
“Autonomy” is a spectrum, not a binary. The most successful production agents aren’t fully autonomous — they have well-defined checkpoints where humans can intervene.
Think of it as a confidence threshold system:
class AgentOrchestrator:
    def __init__(self, agent, approval_queue, confidence_threshold=0.85):
        self.agent = agent
        self.queue = approval_queue
        self.threshold = confidence_threshold

    async def run(self, task):
        result = await self.agent.plan(task)
        # High-confidence actions: execute immediately
        immediate = [a for a in result.actions if a.confidence >= self.threshold]
        # Low-confidence actions: queue for human review
        pending = [a for a in result.actions if a.confidence < self.threshold]
        if pending:
            await self.queue.submit(pending, context=result.reasoning)
            # Don't block — return partial results and continue async
        return await self.agent.execute(immediate)
This pattern has an important property: it degrades gracefully. Even if every action is below threshold, nothing breaks. The system just routes everything to humans. It’s the opposite of an agent that fails silently.
Pattern 4: Idempotent Tool Calls
Agents frequently retry. Network blips, ambiguous responses, context window resets — all of these can cause a tool to be called multiple times. If your tools aren’t idempotent, this will create duplicate orders, double-sent emails, and corrupted state.
The fix is the same as in distributed systems: idempotency keys.
def send_notification(user_id: str, message: str, idempotency_key: str) -> bool:
    """
    Idempotency key ensures this notification is sent at most once,
    even if the tool is called multiple times by the agent.
    """
    if redis.exists(f"notif:{idempotency_key}"):
        return True  # Already sent, silently succeed
    result = notification_service.send(user_id, message)
    if result.success:
        # redis-py setex(name, time, value): marker expires after 24h
        redis.setex(f"notif:{idempotency_key}", 86400, "sent")
    return result.success
Teach your agents to generate idempotency keys (a combination of task ID + tool name + input hash works well). Then enforce idempotency in every tool implementation.
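That key derivation can be made deterministic exactly as described — task ID plus tool name plus a hash of the inputs. A minimal sketch (the helper name is mine, not from the post):

```python
import hashlib
import json

def make_idempotency_key(task_id: str, tool_name: str, inputs: dict) -> str:
    """Derive a stable key: identical (task, tool, inputs) always map
    to the same key, so retries hit the same Redis entry."""
    # sort_keys makes the hash independent of dict insertion order
    input_hash = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()
    ).hexdigest()[:16]
    return f"{task_id}:{tool_name}:{input_hash}"
```

Because the key is derived rather than generated randomly, a retried tool call reproduces the same key without the agent having to remember anything.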
Pattern 5: Observability-First Design
You cannot debug what you cannot observe. Agents, with their multi-step reasoning and non-deterministic paths, require deeper observability than traditional services.
What you need to capture:
- Every prompt sent and response received (with token counts)
- Every tool call with input, output, and latency
- The full reasoning trace (if using chain-of-thought)
- The final action taken and its outcome
- Any retries and why they occurred
Frameworks like LangSmith, Weights & Biases, and Arize AI have purpose-built UIs for this. But even a structured logging approach gets you 80% of the value:
import time

import structlog

log = structlog.get_logger()

class InstrumentedAgent:
    async def call_tool(self, tool_name: str, inputs: dict) -> dict:
        start = time.monotonic()
        log.info("tool_call_start",
                 tool=tool_name,
                 inputs=inputs,
                 session_id=self.session_id)
        try:
            result = await self.tools[tool_name](**inputs)
            log.info("tool_call_success",
                     tool=tool_name,
                     latency_ms=(time.monotonic() - start) * 1000,
                     result_keys=list(result.keys()))
            return result
        except Exception as e:
            log.error("tool_call_failure",
                      tool=tool_name,
                      error=str(e),
                      latency_ms=(time.monotonic() - start) * 1000)
            raise
The Anti-Patterns Worth Naming
Don’t give agents unbounded internet access. Every unrestricted tool is a potential attack surface and a source of unpredictable behavior. Allowlist the tools an agent can call.
Don’t skip evals. Agents need test suites just like any other software. Golden-path tests, adversarial inputs, edge cases — all of it. LLM behavior drifts between model versions. Evals catch regressions.
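A golden-path eval can start as nothing more than a table of inputs and required output fields, re-run on every model bump. A sketch — `triage_ticket` is a hypothetical wrapper around your agent that returns a dict-like triage result:

```python
# (input text, fields the triage result MUST contain) — golden cases
GOLDEN_CASES = [
    ("My card was charged twice this month",
     {"category": "billing"}),
    ("The API returns 500 on every request",
     {"category": "technical", "priority": "critical"}),
]

def run_golden_evals(triage_ticket) -> list:
    """Return a list of failure descriptions; empty means all pass."""
    failures = []
    for text, expected in GOLDEN_CASES:
        result = triage_ticket(text)
        for field, want in expected.items():
            got = result[field]
            if got != want:
                failures.append(f"{text!r}: {field}={got!r}, want {want!r}")
    return failures
```

Wire this into CI and a model-version change that regresses triage behavior fails the build instead of failing a customer.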
Don’t assume the model knows your domain. System prompts should be dense with context: your business rules, edge cases, what to do when uncertain. A generic agent is a fragile agent.
Don’t deploy without a kill switch. Every agent in production should have a circuit breaker. When error rates spike, traffic should automatically route to a fallback (human queue, simpler rule-based system, or graceful failure message).
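The kill switch can be as simple as an error-rate window that flips the agent into fallback mode. A minimal sketch with hypothetical names — production versions usually live in your gateway or feature-flag system rather than in-process:

```python
from collections import deque

class CircuitBreaker:
    """Trip to fallback when the recent error rate exceeds a threshold."""

    def __init__(self, window: int = 50, max_error_rate: float = 0.2):
        self.outcomes = deque(maxlen=window)  # True = success
        self.max_error_rate = max_error_rate

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def open(self) -> bool:
        """Open circuit = stop calling the agent, use the fallback."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data to judge yet
        errors = self.outcomes.count(False)
        return errors / len(self.outcomes) > self.max_error_rate

def handle(task, agent, fallback, breaker: CircuitBreaker):
    if breaker.open:
        return fallback(task)  # human queue, rules engine, error page
    try:
        result = agent(task)
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        raise
```

Note the breaker only trips once the window is full — that keeps one early failure from disabling a freshly deployed agent.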
Where This Is All Going
The teams doing this well in 2026 aren’t building general-purpose agents. They’re building purpose-fit automations that happen to use LLMs as the reasoning layer. The agent framing is useful for thinking about design, but the production reality is narrower.
The next wave will be multi-agent systems — orchestrators coordinating specialist agents — but we’re not quite there yet in terms of reliability patterns. That’s a post for another day.
For now, narrow scope + structured outputs + human checkpoints + idempotent tools + deep observability. That’s the stack. That’s what ships.
If you’re building AI agents and want to compare notes, I’m @DevStarSJ on GitHub.
If this post helped you, a like (and an ad click) would be appreciated :)
