Building Production AI Agents in 2026: Tool Use, Orchestration, and Reliability at Scale
Tags: AI agents, LLM, tool use, LangGraph, automation, Anthropic, OpenAI
AI agents broke into mainstream consciousness in 2024. By 2025, every enterprise had pilots running. In 2026, the question is no longer “should we build agents?” — it’s “why do our agents keep failing in production, and how do we fix that?”
This is the post for teams who have moved beyond demos and are wrestling with the hard problems: reliability, observability, cost, and knowing when to hand back control to a human.
What Actually Goes Wrong with AI Agents
Before solutions, let’s be honest about the failure modes:
- Tool call loops — the agent keeps calling the same tool expecting different results
- Context window exhaustion — long conversations cause the model to lose track of earlier state
- Hallucinated tool arguments — the model invents parameters that don’t exist
- Irreversible actions — the agent deletes a record or sends an email before you realize it misunderstood
- Cost explosions — a looping agent burns $500 in API calls overnight
- Silent failures — the agent returns a confident answer based on a failed tool call it didn’t notice
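The first failure mode on this list is also the easiest to catch mechanically. As a hedged sketch (the repeat threshold and hashing scheme are illustrative assumptions, not part of any framework), fingerprint each tool call and flag the run once the same call repeats too often:

```python
import hashlib
import json

class ToolLoopDetector:
    """Flags an agent that keeps repeating the same tool call with the same arguments."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts: dict[str, int] = {}

    def record(self, tool_name: str, args: dict) -> bool:
        """Returns True once this exact call exceeds the repeat budget."""
        # Canonical JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1} hash identically
        fingerprint = hashlib.sha256(
            (tool_name + json.dumps(args, sort_keys=True)).encode()
        ).hexdigest()
        self.counts[fingerprint] = self.counts.get(fingerprint, 0) + 1
        return self.counts[fingerprint] > self.max_repeats
```

Wire the detector into the tool-execution path so a tripped budget aborts the run or escalates to a human rather than burning more calls.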
Addressing these isn’t a prompt engineering problem. It’s a systems engineering problem.
The Architecture That Actually Works
The most reliable production agent architecture in 2026 follows this pattern:
┌─────────────────────────────────────────────────────┐
│ Agent Orchestrator │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Planning │ → │ Tool │ → │ Verification │ │
│ │ Loop │ │ Executor │ │ & Guard │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
│ ↑ │ │ │
│ └───────────────┘ │ │
│ ↓ │
│ ┌────────────┐ ┌──────────────────────────────┐ │
│ │ Memory │ │ Human Escalation Queue │ │
│ │ (short + │ │ (ambiguous / risky actions) │ │
│ │ long-term)│ └──────────────────────────────┘ │
│ └────────────┘ │
└─────────────────────────────────────────────────────┘
Key principle: the orchestrator controls the agent, not the other way around. The model decides what to do; the orchestrator decides whether to allow it.
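This veto can be made concrete as a small policy check that runs before every tool execution. The tool names, tiers, and thresholds below are illustrative assumptions, not a prescribed API:

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"
    DENY = "deny"

# Illustrative policy tables: real systems would load these from config
IRREVERSIBLE_TOOLS = {"send_email", "delete_record"}
BLOCKED_TOOLS = {"drop_table"}

def gate_tool_call(tool_name: str, calls_so_far: int, max_calls: int = 20) -> Verdict:
    """The orchestrator's veto, applied before the executor runs anything."""
    if tool_name in BLOCKED_TOOLS or calls_so_far >= max_calls:
        return Verdict.DENY
    if tool_name in IRREVERSIBLE_TOOLS:
        return Verdict.ESCALATE  # route to the human escalation queue
    return Verdict.ALLOW
```

The model proposes a call; the gate returns ALLOW, ESCALATE, or DENY, and only ALLOW reaches the tool executor.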
Implementing a Reliable Agent with LangGraph
LangGraph (from LangChain) has become the standard for stateful, graph-based agent workflows. Here’s a production-grade implementation:
Setup
pip install langgraph langchain-anthropic langchain-core
Define Tools with Strict Schemas
from langchain_core.tools import tool
from pydantic import BaseModel, Field
from typing import Literal

class SearchInput(BaseModel):
    query: str = Field(description="The search query, max 200 chars")
    max_results: int = Field(default=5, ge=1, le=20)

class DatabaseQueryInput(BaseModel):
    table: Literal["users", "orders", "products"] = Field(
        description="Table to query — only these three are allowed"
    )
    filter_field: str = Field(description="Column name to filter on")
    filter_value: str = Field(description="Value to filter by")
    limit: int = Field(default=10, ge=1, le=100)

@tool("web_search", args_schema=SearchInput)
def web_search(query: str, max_results: int = 5) -> list[dict]:
    """Search the web for current information."""
    # Real implementation would call Brave/Tavily/etc.
    return [{"title": f"Result for {query}", "url": "https://example.com", "snippet": "..."}]

@tool("query_database", args_schema=DatabaseQueryInput)
def query_database(table: str, filter_field: str, filter_value: str, limit: int = 10) -> list[dict]:
    """Query the internal database. Only specific tables are accessible."""
    # Parameterized query — no SQL injection possible
    return []

@tool("send_email")
def send_email(to: str, subject: str, body: str) -> dict:
    """Send an email. CAUTION: This action is irreversible."""
    # This tool should always require human approval
    return {"status": "sent", "message_id": "mock-123"}
Build the Agent Graph
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_anthropic import ChatAnthropic
from typing import TypedDict, Annotated, Literal
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    tool_call_count: int
    requires_human_approval: bool
    task_complete: bool

# High-capability model for planning
planner_model = ChatAnthropic(
    model="claude-opus-4-5",
    max_tokens=4096,
).bind_tools([web_search, query_database, send_email])

# Define which tools require human approval before execution
REQUIRES_APPROVAL = {"send_email"}
MAX_TOOL_CALLS = 20  # Guard against infinite loops
def should_continue(state: AgentState) -> Literal["tools", "human_review", "end"]:
    messages = state["messages"]
    last_message = messages[-1]
    # Check for completion
    if not last_message.tool_calls:
        return "end"
    # Guard: too many tool calls
    if state["tool_call_count"] >= MAX_TOOL_CALLS:
        return "end"
    # Check if any called tool requires approval
    called_tools = {tc["name"] for tc in last_message.tool_calls}
    if called_tools & REQUIRES_APPROVAL:
        return "human_review"
    return "tools"
def call_model(state: AgentState) -> AgentState:
    response = planner_model.invoke(state["messages"])
    return {
        "messages": [response],
        "tool_call_count": state["tool_call_count"] + len(getattr(response, "tool_calls", []) or []),
        "requires_human_approval": False,
        "task_complete": not bool(getattr(response, "tool_calls", None)),
    }
def request_human_review(state: AgentState) -> AgentState:
    """Pause execution and flag for human review."""
    last_message = state["messages"][-1]
    pending_calls = [tc["name"] for tc in last_message.tool_calls]
    print("\n⚠️ HUMAN APPROVAL REQUIRED")
    print(f"Pending tool calls: {pending_calls}")
    print(f"Last message: {last_message.content}")
    # In production: push to approval queue, suspend execution
    # Here we simulate rejection for demo
    return {
        # messages is reduced with operator.add, so returning the full list
        # would duplicate every message; return an empty delta instead
        "messages": [],
        "requires_human_approval": True,
        "task_complete": True,
        "tool_call_count": state["tool_call_count"],
    }
# Build the graph
tools = [web_search, query_database, send_email]
tool_node = ToolNode(tools)

workflow = StateGraph(AgentState)
workflow.add_node("agent", call_model)
workflow.add_node("tools", tool_node)
workflow.add_node("human_review", request_human_review)
workflow.set_entry_point("agent")
workflow.add_conditional_edges(
    "agent",
    should_continue,
    {
        "tools": "tools",
        "human_review": "human_review",
        "end": END,
    }
)
workflow.add_edge("tools", "agent")
workflow.add_edge("human_review", END)

agent = workflow.compile()
Run with Observability
from langchain_core.messages import HumanMessage

def run_agent(task: str) -> dict:
    initial_state = {
        "messages": [HumanMessage(content=task)],
        "tool_call_count": 0,
        "requires_human_approval": False,
        "task_complete": False,
    }
    print(f"🤖 Starting agent for task: {task[:100]}...")
    final_state = agent.invoke(initial_state)
    result = {
        "task": task,
        "tool_calls_made": final_state["tool_call_count"],
        "required_human": final_state["requires_human_approval"],
        "complete": final_state["task_complete"],
        "response": final_state["messages"][-1].content,
    }
    print(f"✅ Complete. Tool calls: {result['tool_calls_made']}, Human escalation: {result['required_human']}")
    return result
# Example
result = run_agent("Find the top 3 competitors to our product in the enterprise SaaS space and summarize their pricing models")
Memory: The Missing Layer
Most agent failures trace back to poor memory management. Here’s a practical layered approach:
from dataclasses import dataclass, field
import time

@dataclass
class AgentMemory:
    # In-context memory (lost after session)
    working_memory: list[dict] = field(default_factory=list)
    # Episodic memory (recent sessions, fast retrieval)
    episodic_store: dict = field(default_factory=dict)  # In prod: Redis
    # Semantic memory (vector DB for long-term knowledge)
    # semantic_store: VectorDB  # In prod: Pinecone, Weaviate, pgvector

    def add_observation(self, key: str, value: str, importance: float = 0.5):
        """Store an observation with importance score for later pruning."""
        self.working_memory.append({
            "key": key,
            "value": value,
            "importance": importance,
            "timestamp": time.time(),
        })
        # Prune low-importance items if working memory is large
        if len(self.working_memory) > 50:
            self.working_memory.sort(key=lambda x: x["importance"], reverse=True)
            self.working_memory = self.working_memory[:30]

    def get_context_summary(self) -> str:
        """Compress working memory into a context string for the next LLM call."""
        if not self.working_memory:
            return "No prior context."
        items = sorted(self.working_memory, key=lambda x: x["importance"], reverse=True)[:10]
        lines = [f"- {item['key']}: {item['value']}" for item in items]
        return "Key context:\n" + "\n".join(lines)
Cost Control: Essential for Production
Without guardrails, agents can burn your budget overnight:
from dataclasses import dataclass

@dataclass
class CostTracker:
    max_cost_usd: float = 1.0
    current_cost_usd: float = 0.0

    # Claude Sonnet 4.5 pricing (approximate)
    INPUT_COST_PER_MILLION = 3.0
    OUTPUT_COST_PER_MILLION = 15.0

    def record_usage(self, input_tokens: int, output_tokens: int) -> bool:
        """Raises RuntimeError once the budget is exceeded; returns True otherwise."""
        call_cost = (
            (input_tokens / 1_000_000) * self.INPUT_COST_PER_MILLION +
            (output_tokens / 1_000_000) * self.OUTPUT_COST_PER_MILLION
        )
        self.current_cost_usd += call_cost
        if self.current_cost_usd > self.max_cost_usd:
            raise RuntimeError(
                f"Agent budget exceeded: ${self.current_cost_usd:.4f} > ${self.max_cost_usd:.2f}"
            )
        return True

    @property
    def remaining_budget(self) -> float:
        return self.max_cost_usd - self.current_cost_usd
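To see why a hard cap matters, plug the rates above into a worked example. The token counts per step are illustrative assumptions, not measurements:

```python
INPUT_COST_PER_MILLION = 3.0    # approximate Sonnet-class input rate, as above
OUTPUT_COST_PER_MILLION = 15.0  # approximate output rate

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return ((input_tokens / 1_000_000) * INPUT_COST_PER_MILLION
            + (output_tokens / 1_000_000) * OUTPUT_COST_PER_MILLION)

# Assume 10k input + 2k output tokens per step, for a 20-step agent run:
per_step = call_cost(10_000, 2_000)  # $0.06 per step
run_total = per_step * 20            # $1.20: already past a $1 budget
```

A looping agent repeats steps like these indefinitely, which is how an unattended run reaches hundreds of dollars overnight.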
Observability for Agents
Standard logging doesn’t capture agent behavior well. Use structured tracing:
import structlog
from opentelemetry import trace
from typing import Any

logger = structlog.get_logger()
tracer = trace.get_tracer("agent.service")

def traced_tool_call(tool_name: str, args: dict, result: Any, duration_ms: float):
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.args", str(args)[:500])
        span.set_attribute("tool.success", result is not None)
        span.set_attribute("tool.duration_ms", duration_ms)
        logger.info(
            "tool_call",
            tool=tool_name,
            args_summary=str(args)[:200],
            success=result is not None,
            duration_ms=duration_ms,
        )
Key metrics to track:
- Tool call count per task (anomaly detection for loops)
- Task completion rate (success / failure / human escalation breakdown)
- Cost per task (P50/P95/P99)
- Time to completion
- Human escalation rate (target: <5% for well-defined tasks)
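The cost-per-task percentiles can be computed from logged runs with a simple nearest-rank estimator; the sample costs below are made up for illustration:

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: simple and good enough for dashboards."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Per-task costs in USD from a day of logged agent runs (illustrative)
costs = [0.02, 0.03, 0.03, 0.04, 0.05, 0.05, 0.06, 0.08, 0.20, 1.10]
p50 = percentile(costs, 50)  # typical task
p95 = percentile(costs, 95)  # the tail where looping agents live
```

The gap between P50 and P95 is the signal: a healthy agent fleet has a tight distribution, while a wide tail usually means loops or runaway context growth.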
The Human-in-the-Loop Pattern
For high-stakes actions, implement a proper approval workflow:
import asyncio
from enum import Enum

class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    TIMEOUT = "timeout"

class HumanApprovalGate:
    def __init__(self, timeout_seconds: int = 300):
        self.timeout_seconds = timeout_seconds
        self._pending: dict[str, asyncio.Future] = {}

    async def request_approval(self, action_id: str, action_description: str, risk_level: str) -> ApprovalStatus:
        """Suspends agent execution until a human responds."""
        future = asyncio.get_running_loop().create_future()
        self._pending[action_id] = future
        # In production: send to Slack/email/dashboard
        print(f"\n{'='*60}")
        print(f"⚠️ APPROVAL REQUIRED [{risk_level.upper()}]")
        print(f"Action: {action_description}")
        print(f"Action ID: {action_id}")
        print(f"Timeout: {self.timeout_seconds}s")
        print(f"{'='*60}\n")
        try:
            return await asyncio.wait_for(future, timeout=self.timeout_seconds)
        except asyncio.TimeoutError:
            self._pending.pop(action_id, None)
            return ApprovalStatus.TIMEOUT

    def respond(self, action_id: str, approved: bool):
        """Called by the human approval handler (webhook, UI, etc.)"""
        if action_id in self._pending:
            future = self._pending.pop(action_id)
            status = ApprovalStatus.APPROVED if approved else ApprovalStatus.REJECTED
            future.set_result(status)
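The suspend-until-answered mechanic at the heart of this pattern is just an asyncio.Future. A stripped-down, self-contained demonstration, with call_later standing in for a human clicking "approve" in some UI:

```python
import asyncio

async def main() -> str:
    loop = asyncio.get_running_loop()
    approval: asyncio.Future = loop.create_future()

    # Stand-in for the human responding 50 ms later via webhook/UI
    loop.call_later(0.05, approval.set_result, "approved")

    try:
        # The agent is parked here until the future resolves or times out
        return await asyncio.wait_for(approval, timeout=1.0)
    except asyncio.TimeoutError:
        return "timeout"

result = asyncio.run(main())
```

Because awaiting a future yields control to the event loop, one process can hold hundreds of agent runs suspended on approvals without consuming threads.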
Choosing the Right Model for Each Step
Not every step needs Claude Opus. A cost-efficient pattern:
# Routing model — cheap, fast, just decides what to do next
from langchain_anthropic import ChatAnthropic
router = ChatAnthropic(model="claude-haiku-4-5", max_tokens=256)
# Execution model — handles tool calls and reasoning
executor = ChatAnthropic(model="claude-sonnet-4-5", max_tokens=2048)
# Synthesis model — final answer generation
synthesizer = ChatAnthropic(model="claude-sonnet-4-5", max_tokens=4096)
# Complex multi-step reasoning — only when needed
deep_thinker = ChatAnthropic(model="claude-opus-4-5", max_tokens=8192)
Rule of thumb: use the cheapest model that can reliably do the job. Routing and simple classification rarely need Opus.
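One way to operationalize the rule of thumb is a tiny router in front of the model pool. The keyword heuristics and thresholds below are illustrative assumptions; real routers often use a cheap classifier model instead:

```python
CHEAP = "claude-haiku-4-5"      # routing, classification
MID = "claude-sonnet-4-5"       # tool use, synthesis
EXPENSIVE = "claude-opus-4-5"   # rare deep reasoning

def pick_model(task: str, tool_count: int) -> str:
    """Route each step to the cheapest model that can plausibly handle it."""
    text = task.lower()
    if any(kw in text for kw in ("prove", "architecture review", "multi-step plan")):
        return EXPENSIVE
    if tool_count > 0 or len(task) > 500:
        return MID
    return CHEAP
```

Even a crude router like this typically moves the bulk of calls off the most expensive tier, since most agent steps are short classifications or single tool invocations.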
Key Takeaways
- Production agents fail due to systems engineering gaps, not model capability
- Guard against loops: max tool call limits are non-negotiable
- REQUIRES_APPROVAL set: classify tools by reversibility and flag dangerous ones
- Memory management: layer working, episodic, and semantic memory appropriately
- Cost tracking per task: set hard budget limits before starting an agent run
- Structured observability: log every tool call with arguments, results, and timing
- Human escalation: design the happy path for automation, but design the failure path for humans
The teams winning with AI agents in 2026 are the ones who treat agents as distributed systems, not just fancy chatbots.
References: LangGraph Documentation, Anthropic Tool Use Guide, Building Effective Agents — Anthropic Research
