AI Agents in Production: Architecture Patterns That Actually Work in 2026
on Ai, Llm, Agents, Architecture, Production, Mlops
AI agents have graduated from research projects to production workloads. In 2025, the narrative was “agents are almost ready.” In 2026, the conversation has shifted to “how do we make them reliable, observable, and cost-effective at scale?” Having worked through several agent deployments — from customer support automation to code review pipelines — I want to share the patterns that hold up and the pitfalls that will wreck your on-call rotation.
The Fundamental Problem with Agents
An LLM call is stateless. An agent — something that takes actions, remembers things, and pursues goals across multiple steps — is not. Bridging that gap requires you to make explicit decisions about state management, failure handling, and observability that a standard REST API never forces you to confront.
Most teams discover this by shipping something that works great in demos and fails in production. Let’s shortcut that journey.
Pattern 1: The Finite State Machine Agent
The most reliable agent architecture is one where you constrain the possible states and transitions explicitly.
Photo by Possessed Photography on Unsplash
from enum import Enum
from dataclasses import dataclass
from typing import Optional
class AgentState(Enum):
IDLE = "idle"
PLANNING = "planning"
EXECUTING = "executing"
AWAITING_HUMAN = "awaiting_human"
REFLECTING = "reflecting"
DONE = "done"
FAILED = "failed"
@dataclass
class AgentContext:
state: AgentState
task: str
plan: Optional[list[str]] = None
completed_steps: list[str] = None
tool_results: list[dict] = None
error_count: int = 0
max_errors: int = 3
def __post_init__(self):
self.completed_steps = self.completed_steps or []
self.tool_results = self.tool_results or []
Why this works: when you enumerate states explicitly, you can:
- Write targeted tests for each state transition
- Add monitoring that alerts on unexpected states
- Implement clean recovery logic per state
- Reason about the agent’s behavior without reading LLM prompts
The anti-pattern is the “while True: call_llm()” loop with no explicit state — it’ll work until it doesn’t, and debugging is a nightmare.
Pattern 2: Hierarchical Agent Decomposition
Don’t build one giant agent that does everything. Build a hierarchy:
Orchestrator Agent
├── Research Sub-agent (web search, document retrieval)
├── Code Sub-agent (write, execute, test)
├── Communication Sub-agent (email drafts, summaries)
└── Verification Sub-agent (fact-check, review)
Each sub-agent has a narrow, well-defined capability surface. The orchestrator routes tasks and assembles results. This gives you:
Isolation: A failing code sub-agent doesn’t corrupt the research context.
Parallel execution: The orchestrator can fan out to multiple sub-agents and join results.
Independent optimization: You can use a cheap, fast model for simple classification tasks and a powerful model only for complex reasoning.
Testability: Each sub-agent is testable in isolation with defined inputs and expected outputs.
class OrchestratorAgent:
def __init__(self):
self.research = ResearchAgent(model="claude-3-haiku")
self.coder = CodeAgent(model="claude-sonnet-4-5")
self.verifier = VerificationAgent(model="claude-sonnet-4-5")
async def handle_task(self, task: str) -> AgentResult:
# Step 1: Plan with powerful model
plan = await self.plan(task)
# Step 2: Execute steps, routing to specialized sub-agents
results = []
for step in plan.steps:
if step.requires_research:
result = await self.research.execute(step)
elif step.requires_code:
result = await self.coder.execute(step)
results.append(result)
# Step 3: Verify before returning
verified = await self.verifier.check(results)
return verified
Pattern 3: Durable Execution with Checkpointing
The most expensive failure mode is an agent that completes 15 steps out of 20, hits an error, and starts from scratch. You need durable execution.
import json
import hashlib
from pathlib import Path
class DurableAgent:
def __init__(self, checkpoint_dir: str = "/tmp/agent_checkpoints"):
self.checkpoint_dir = Path(checkpoint_dir)
self.checkpoint_dir.mkdir(exist_ok=True)
def _checkpoint_key(self, task: str) -> str:
return hashlib.sha256(task.encode()).hexdigest()[:16]
def save_checkpoint(self, task: str, state: dict):
key = self._checkpoint_key(task)
path = self.checkpoint_dir / f"{key}.json"
path.write_text(json.dumps(state, default=str))
def load_checkpoint(self, task: str) -> Optional[dict]:
key = self._checkpoint_key(task)
path = self.checkpoint_dir / f"{key}.json"
if path.exists():
return json.loads(path.read_text())
return None
async def execute_with_recovery(self, task: str):
state = self.load_checkpoint(task) or {"steps": [], "step_index": 0}
plan = await self.plan(task)
start_from = state["step_index"]
for i, step in enumerate(plan.steps[start_from:], start=start_from):
result = await self.execute_step(step)
state["steps"].append(result)
state["step_index"] = i + 1
self.save_checkpoint(task, state) # Save after each step
return self.assemble_result(state["steps"])
In production, use a proper workflow engine (Temporal, Prefect, or Inngest) rather than rolling your own. The checkpoint pattern above is illustrative — production systems need distributed locks, TTLs, and proper error classification.
Pattern 4: Tool Call Budget Enforcement
Without limits, agents will use tools until they hit your rate limits or your invoice. Implement hard budgets:
Photo by Luke Chesser on Unsplash
from dataclasses import dataclass, field
@dataclass
class AgentBudget:
max_llm_calls: int = 20
max_tool_calls: int = 50
max_tokens_in: int = 100_000
max_tokens_out: int = 50_000
max_wall_time_seconds: int = 300
# Runtime tracking
llm_calls: int = field(default=0, init=False)
tool_calls: int = field(default=0, init=False)
tokens_in: int = field(default=0, init=False)
tokens_out: int = field(default=0, init=False)
start_time: float = field(default_factory=time.time, init=False)
def check(self, raise_on_exceeded: bool = True) -> bool:
exceeded = (
self.llm_calls >= self.max_llm_calls
or self.tool_calls >= self.max_tool_calls
or self.tokens_in >= self.max_tokens_in
or self.tokens_out >= self.max_tokens_out
or (time.time() - self.start_time) >= self.max_wall_time_seconds
)
if exceeded and raise_on_exceeded:
raise BudgetExceededException(self.summary())
return not exceeded
def summary(self) -> dict:
return {
"llm_calls": f"{self.llm_calls}/{self.max_llm_calls}",
"tool_calls": f"{self.tool_calls}/{self.max_tool_calls}",
"tokens_in": f"{self.tokens_in}/{self.max_tokens_in}",
}
Set per-task budgets based on empirical data from your staging environment, not intuition. A customer support task should rarely need more than 5 LLM calls. A deep research task might need 30. Know your workload.
Pattern 5: Human-in-the-Loop Escalation
The hardest decision in agent design is knowing when to escalate to a human. Build this into the architecture from day one, not as an afterthought:
class EscalationPolicy:
def should_escalate(self, context: AgentContext) -> tuple[bool, str]:
# Always escalate on high-stakes actions
if context.pending_action.affects_production:
return True, "Production changes require human approval"
# Escalate when confidence is low
if context.last_decision.confidence < 0.7:
return True, f"Low confidence ({context.last_decision.confidence:.0%})"
# Escalate on repeated failures
if context.error_count >= 2:
return True, f"Multiple failures: {context.last_error}"
# Escalate when cost is getting high
if context.budget.tokens_out > context.budget.max_tokens_out * 0.8:
return True, "Approaching token budget limit"
return False, ""
Make escalation fast and actionable. The human getting the escalation notification should see: what the agent was doing, why it escalated, and what decision is needed — all in under 30 seconds of reading.
Observability: The Missing Piece
Most agent failures I’ve seen were actually observable — the signals were there, but nobody was looking. Instrument these dimensions:
Trace every LLM call: input tokens, output tokens, latency, model, temperature.
Log tool calls with context: what the agent was trying to do, not just what tool it called.
Track goal completion rate: not just “did the agent finish” but “did it achieve the actual goal.”
Monitor reasoning quality: have a lightweight classifier check whether the agent’s plans are coherent. This catches prompt injection and context window issues early.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer("agent.runtime")
class InstrumentedAgent:
async def call_llm(self, messages: list, **kwargs) -> str:
with tracer.start_as_current_span("llm.call") as span:
span.set_attribute("llm.model", kwargs.get("model"))
span.set_attribute("llm.input_tokens", count_tokens(messages))
try:
response = await self._client.chat(messages, **kwargs)
span.set_attribute("llm.output_tokens", response.usage.output_tokens)
span.set_status(Status(StatusCode.OK))
return response.content
except Exception as e:
span.set_status(Status(StatusCode.ERROR, str(e)))
span.record_exception(e)
raise
Cost Model: What to Expect
Based on production deployments in 2026:
| Agent Type | Avg LLM Calls | Avg Cost/Task | P99 Cost/Task |
|---|---|---|---|
| Customer Support | 4-6 | $0.02 | $0.15 |
| Code Review | 8-12 | $0.08 | $0.40 |
| Research Summary | 15-25 | $0.25 | $1.20 |
| Autonomous Dev Task | 30-60 | $0.80 | $4.00 |
The P99 is the number that will surprise you. A small percentage of tasks spiral — often due to ambiguous goals, corrupted context, or adversarial inputs. Budget enforcement is what keeps those from being $40 tasks.
What Still Doesn’t Work Well
Being honest about current limitations:
Long-horizon planning: Agents still struggle to maintain coherent goals over 50+ steps. Context window management becomes critical, and current approaches (summarization, sliding windows) introduce information loss.
Multi-agent coordination: True multi-agent systems where multiple agents negotiate and share state are brittle. The “society of agents” vision is compelling but the engineering is hard. Stick to clear hierarchies.
Adversarial robustness: Prompt injection through tool results is a real attack surface. If your agent reads external content (web pages, emails, documents), it’s exposed. Sanitization and sandboxing are non-negotiable.
Getting Started
If you’re building your first production agent system:
- Start with a state machine. Seriously.
- Add budget enforcement before you write a single feature.
- Instrument everything before you ship.
- Build human escalation first, automation second.
- Run chaos experiments (kill random steps, inject bad tool results) before going live.
The teams shipping reliable agents aren’t the ones with the fanciest architectures. They’re the ones who treated agents like any other distributed system: with respect for failure modes and investment in observability.
References
- LangGraph Documentation — excellent for FSM-style agents
- Temporal Workflow Engine — production-grade durable execution
- OpenTelemetry for AI — emerging semantic conventions for LLM observability
- Anthropic’s Agent Framework Guide — canonical patterns from the model provider
이 글이 도움이 되셨다면 공감 및 광고 클릭을 부탁드립니다 :)
