Building Production AI Agents in 2026: Tool Use, Orchestration, and Reliability at Scale



AI agents broke into mainstream consciousness in 2024. By 2025, every enterprise had pilots running. In 2026, the question is no longer “should we build agents?” — it’s “why do our agents keep failing in production, and how do we fix that?”

This is the post for teams who have moved beyond demos and are wrestling with the hard problems: reliability, observability, cost, and knowing when to hand back control to a human.



What Actually Goes Wrong with AI Agents

Before solutions, let’s be honest about the failure modes:

  1. Tool call loops — the agent keeps calling the same tool expecting different results
  2. Context window exhaustion — long conversations cause the model to lose track of earlier state
  3. Hallucinated tool arguments — the model invents parameters that don’t exist
  4. Irreversible actions — the agent deletes a record or sends an email before you realize it misunderstood
  5. Cost explosions — a looping agent burns $500 in API calls overnight
  6. Silent failures — the agent returns a confident answer based on a failed tool call it didn’t notice

Addressing these isn’t a prompt engineering problem. It’s a systems engineering problem.
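Failure mode 1, for instance, can be caught mechanically rather than prompted away: before executing a tool call, compare it against the agent's recent calls and refuse obvious repeats. A minimal sketch of such a guard (the window size and repeat threshold are arbitrary illustrative choices, not from any particular framework):

```python
from collections import deque

class LoopGuard:
    """Blocks a tool call when the same (tool, args) pair repeats within a sliding window."""

    def __init__(self, window: int = 6, max_repeats: int = 2):
        self.recent: deque = deque(maxlen=window)
        self.max_repeats = max_repeats

    def check(self, tool_name: str, args: dict) -> bool:
        """Returns True if the call is allowed, False if it looks like a loop."""
        signature = (tool_name, tuple(sorted(args.items())))
        if self.recent.count(signature) >= self.max_repeats:
            return False  # identical call already made max_repeats times recently
        self.recent.append(signature)
        return True

guard = LoopGuard()
assert guard.check("web_search", {"query": "pricing"})      # first call: allowed
assert guard.check("web_search", {"query": "pricing"})      # second call: allowed
assert not guard.check("web_search", {"query": "pricing"})  # third identical call: blocked
```

The remaining failure modes get the same treatment throughout this post: a mechanical check in the orchestrator, not a plea in the prompt.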


The Architecture That Actually Works

The most reliable production agent architecture in 2026 follows this pattern:

┌─────────────────────────────────────────────────────┐
│                   Agent Orchestrator                 │
│                                                      │
│  ┌──────────┐    ┌──────────┐    ┌──────────────┐   │
│  │ Planning │ →  │ Tool     │ →  │ Verification │   │
│  │  Loop    │    │ Executor │    │   & Guard    │   │
│  └──────────┘    └──────────┘    └──────────────┘   │
│        ↑               │                 │           │
│        └───────────────┘                 │           │
│                                          ↓           │
│  ┌────────────┐    ┌──────────────────────────────┐  │
│  │  Memory    │    │    Human Escalation Queue     │  │
│  │  (short +  │    │  (ambiguous / risky actions)  │  │
│  │  long-term)│    └──────────────────────────────┘  │
│  └────────────┘                                      │
└─────────────────────────────────────────────────────┘

Key principle: the orchestrator controls the agent, not the other way around. The model decides what to do; the orchestrator decides whether to allow it.
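That principle can be made concrete as a policy check sitting between the model's proposed tool call and its execution. A sketch, assuming a simple three-way verdict (the tool names match the examples later in this post; the policy table itself is illustrative):

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"
    DENY = "deny"

# Illustrative policy table: reversible tools run freely,
# irreversible ones go to a human, unknown tools are refused.
POLICY = {
    "web_search": Verdict.ALLOW,
    "query_database": Verdict.ALLOW,
    "send_email": Verdict.ESCALATE,
}

def gate(tool_name: str) -> Verdict:
    """The orchestrator's decision: the model proposes, this function disposes."""
    return POLICY.get(tool_name, Verdict.DENY)

assert gate("web_search") is Verdict.ALLOW
assert gate("send_email") is Verdict.ESCALATE
assert gate("rm_rf") is Verdict.DENY  # anything not whitelisted is refused
```

Note the default: a tool the orchestrator has never heard of is denied, not allowed.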


Implementing a Reliable Agent with LangGraph

LangGraph (from LangChain) has become a de facto standard for stateful, graph-based agent workflows. Here’s a production-grade implementation:

Setup

pip install langgraph langchain-anthropic langchain-core

Define Tools with Strict Schemas

from langchain_core.tools import tool
from pydantic import BaseModel, Field
from typing import Literal

class SearchInput(BaseModel):
    query: str = Field(description="The search query, max 200 chars")
    max_results: int = Field(default=5, ge=1, le=20)

class DatabaseQueryInput(BaseModel):
    table: Literal["users", "orders", "products"] = Field(
        description="Table to query — only these three are allowed"
    )
    filter_field: str = Field(description="Column name to filter on")
    filter_value: str = Field(description="Value to filter by")
    limit: int = Field(default=10, ge=1, le=100)

@tool("web_search", args_schema=SearchInput)
def web_search(query: str, max_results: int = 5) -> list[dict]:
    """Search the web for current information."""
    # Real implementation would call Brave/Tavily/etc.
    return [{"title": f"Result for {query}", "url": "https://example.com", "snippet": "..."}]

@tool("query_database", args_schema=DatabaseQueryInput)
def query_database(table: str, filter_field: str, filter_value: str, limit: int = 10) -> list[dict]:
    """Query the internal database. Only specific tables are accessible."""
    # Parameterized query — no SQL injection possible
    return []

@tool("send_email")
def send_email(to: str, subject: str, body: str) -> dict:
    """Send an email. CAUTION: This action is irreversible."""
    # This tool should always require human approval
    return {"status": "sent", "message_id": "mock-123"}
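These strict schemas are the first line of defense against hallucinated tool arguments (failure mode 3): Pydantic rejects unknown or out-of-range values before the tool body ever runs. A quick standalone check of that behaviour, using a trimmed-down copy of the schema above:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class DatabaseQueryInput(BaseModel):
    table: Literal["users", "orders", "products"]
    limit: int = Field(default=10, ge=1, le=100)

# A hallucinated table name fails validation instead of reaching the database
try:
    DatabaseQueryInput(table="secrets", limit=10)
    raise AssertionError("should have been rejected")
except ValidationError:
    pass

# An out-of-range limit is rejected the same way
try:
    DatabaseQueryInput(table="users", limit=500)
    raise AssertionError("should have been rejected")
except ValidationError:
    pass

print("invalid arguments rejected before execution")
```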

Build the Agent Graph

from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_anthropic import ChatAnthropic
from typing import TypedDict, Annotated, Literal
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    tool_call_count: int
    requires_human_approval: bool
    task_complete: bool

# High-capability model for planning
planner_model = ChatAnthropic(
    model="claude-opus-4-5",
    max_tokens=4096,
).bind_tools([web_search, query_database, send_email])

# Define which tools require human approval before execution
REQUIRES_APPROVAL = {"send_email"}
MAX_TOOL_CALLS = 20  # Guard against infinite loops

def should_continue(state: AgentState) -> Literal["tools", "human_review", "end"]:
    messages = state["messages"]
    last_message = messages[-1]

    # Check for completion
    if not last_message.tool_calls:
        return "end"

    # Guard: too many tool calls
    if state["tool_call_count"] >= MAX_TOOL_CALLS:
        return "end"

    # Check if any called tool requires approval
    called_tools = {tc["name"] for tc in last_message.tool_calls}
    if called_tools & REQUIRES_APPROVAL:
        return "human_review"

    return "tools"

def call_model(state: AgentState) -> AgentState:
    response = planner_model.invoke(state["messages"])
    return {
        "messages": [response],
        "tool_call_count": state["tool_call_count"] + len(getattr(response, "tool_calls", []) or []),
        "requires_human_approval": False,
        "task_complete": not bool(getattr(response, "tool_calls", None)),
    }

def request_human_review(state: AgentState) -> AgentState:
    """Pause execution and flag for human review."""
    last_message = state["messages"][-1]
    pending_calls = [tc["name"] for tc in last_message.tool_calls]

    print(f"\n⚠️  HUMAN APPROVAL REQUIRED")
    print(f"Pending tool calls: {pending_calls}")
    print(f"Last message: {last_message.content}")

    # In production: push to approval queue, suspend execution
    # Here we simulate rejection for demo
    return {
        "messages": [],  # messages uses operator.add; returning the full list would append a duplicate copy
        "requires_human_approval": True,
        "task_complete": True,
        "tool_call_count": state["tool_call_count"],
    }

# Build the graph
tools = [web_search, query_database, send_email]
tool_node = ToolNode(tools)

workflow = StateGraph(AgentState)
workflow.add_node("agent", call_model)
workflow.add_node("tools", tool_node)
workflow.add_node("human_review", request_human_review)

workflow.set_entry_point("agent")
workflow.add_conditional_edges(
    "agent",
    should_continue,
    {
        "tools": "tools",
        "human_review": "human_review",
        "end": END,
    }
)
workflow.add_edge("tools", "agent")
workflow.add_edge("human_review", END)

agent = workflow.compile()

Run with Observability

from langchain_core.messages import HumanMessage
import json

def run_agent(task: str) -> dict:
    initial_state = {
        "messages": [HumanMessage(content=task)],
        "tool_call_count": 0,
        "requires_human_approval": False,
        "task_complete": False,
    }

    print(f"🤖 Starting agent for task: {task[:100]}...")

    final_state = agent.invoke(initial_state)

    result = {
        "task": task,
        "tool_calls_made": final_state["tool_call_count"],
        "required_human": final_state["requires_human_approval"],
        "complete": final_state["task_complete"],
        "response": final_state["messages"][-1].content,
    }

    print(f"✅ Complete. Tool calls: {result['tool_calls_made']}, Human escalation: {result['required_human']}")
    return result

# Example
result = run_agent("Find the top 3 competitors to our product in the enterprise SaaS space and summarize their pricing models")

Memory: The Missing Layer

Most agent failures trace back to poor memory management. Here’s a practical layered approach:

from dataclasses import dataclass, field
import time

@dataclass
class AgentMemory:
    # In-context memory (lost after session)
    working_memory: list[dict] = field(default_factory=list)

    # Episodic memory (recent sessions, fast retrieval)
    episodic_store: dict = field(default_factory=dict)  # In prod: Redis

    # Semantic memory (vector DB for long-term knowledge)
    # semantic_store: VectorDB  # In prod: Pinecone, Weaviate, pgvector

    def add_observation(self, key: str, value: str, importance: float = 0.5):
        """Store an observation with importance score for later pruning."""
        self.working_memory.append({
            "key": key,
            "value": value,
            "importance": importance,
            "timestamp": time.time(),
        })
        # Prune low-importance items if working memory is large
        if len(self.working_memory) > 50:
            self.working_memory.sort(key=lambda x: x["importance"], reverse=True)
            self.working_memory = self.working_memory[:30]

    def get_context_summary(self) -> str:
        """Compress working memory into a context string for the next LLM call."""
        if not self.working_memory:
            return "No prior context."

        items = sorted(self.working_memory, key=lambda x: x["importance"], reverse=True)[:10]
        lines = [f"- {item['key']}: {item['value']}" for item in items]
        return "Key context:\n" + "\n".join(lines)
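The compression step is the part worth testing: the summary must stay bounded no matter how much the agent has observed. A standalone sketch of the same top-k-by-importance idea, separate from the class above:

```python
def summarize(observations: list[dict], top_k: int = 10) -> str:
    """Keep only the top_k most important observations in the prompt context."""
    if not observations:
        return "No prior context."
    ranked = sorted(observations, key=lambda o: o["importance"], reverse=True)[:top_k]
    return "Key context:\n" + "\n".join(f"- {o['key']}: {o['value']}" for o in ranked)

obs = [{"key": f"fact{i}", "value": "v", "importance": i / 100} for i in range(50)]
summary = summarize(obs, top_k=3)
assert summary.count("\n") == 3   # header line plus three bullet lines
assert "fact49" in summary        # highest-importance item survives
assert "fact0" not in summary     # low-importance items are dropped
```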

Cost Control: Essential for Production

Without guardrails, agents can burn your budget overnight:

from dataclasses import dataclass

@dataclass
class CostTracker:
    max_cost_usd: float = 1.0
    current_cost_usd: float = 0.0

    # Claude Sonnet 4.5 pricing (approximate)
    INPUT_COST_PER_MILLION = 3.0
    OUTPUT_COST_PER_MILLION = 15.0

    def record_usage(self, input_tokens: int, output_tokens: int) -> bool:
        """Returns False if budget exceeded."""
        call_cost = (
            (input_tokens / 1_000_000) * self.INPUT_COST_PER_MILLION +
            (output_tokens / 1_000_000) * self.OUTPUT_COST_PER_MILLION
        )
        self.current_cost_usd += call_cost

        if self.current_cost_usd > self.max_cost_usd:
            raise RuntimeError(
                f"Agent budget exceeded: ${self.current_cost_usd:.4f} > ${self.max_cost_usd:.2f}"
            )
        return True

    @property
    def remaining_budget(self) -> float:
        return self.max_cost_usd - self.current_cost_usd
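To sanity-check the arithmetic: at these rates, a call consuming 100k input tokens and 10k output tokens costs 0.30 + 0.15 = 0.45 USD, so a 1 USD budget survives two such calls but not a third. A standalone sketch of the same accounting (the per-million-token prices are the approximations used above, not authoritative pricing):

```python
INPUT_COST_PER_MILLION = 3.0
OUTPUT_COST_PER_MILLION = 15.0

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one model call at the approximate rates above."""
    return (input_tokens / 1_000_000) * INPUT_COST_PER_MILLION + \
           (output_tokens / 1_000_000) * OUTPUT_COST_PER_MILLION

assert abs(call_cost(100_000, 10_000) - 0.45) < 1e-9

budget, spent, calls_allowed = 1.0, 0.0, 0
while spent + call_cost(100_000, 10_000) <= budget:
    spent += call_cost(100_000, 10_000)
    calls_allowed += 1
assert calls_allowed == 2  # a third call would push spend to 1.35 and is refused
```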

Observability for Agents

Standard logging doesn’t capture agent behavior well. Use structured tracing:

import structlog
from opentelemetry import trace
from typing import Any

logger = structlog.get_logger()
tracer = trace.get_tracer("agent.service")

def traced_tool_call(tool_name: str, args: dict, result: Any, duration_ms: float):
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.args", str(args)[:500])
        span.set_attribute("tool.success", result is not None)
        span.set_attribute("tool.duration_ms", duration_ms)

    logger.info(
        "tool_call",
        tool=tool_name,
        args_summary=str(args)[:200],
        success=result is not None,
        duration_ms=duration_ms,
    )

Key metrics to track:

  • Tool call count per task (anomaly detection for loops)
  • Task completion rate (success / failure / human escalation breakdown)
  • Cost per task (P50/P95/P99)
  • Time to completion
  • Human escalation rate (target: <5% for well-defined tasks)
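The first metric is the easiest to act on: a task whose tool-call count sits far above the fleet's typical value is almost certainly looping. A minimal sketch of that anomaly check (the 3x-median cutoff is an arbitrary illustrative threshold):

```python
import statistics

def looping_tasks(tool_call_counts: dict[str, int]) -> list[str]:
    """Flag task ids whose tool-call count exceeds 3x the fleet median."""
    threshold = 3 * statistics.median(tool_call_counts.values())
    return [task for task, n in tool_call_counts.items() if n > threshold]

fleet = {f"task-{i}": c for i, c in enumerate([3, 4, 2, 5, 3, 4, 6, 3, 2, 4])}
fleet["task-looping"] = 120  # a runaway agent
assert looping_tasks(fleet) == ["task-looping"]
```

In production this runs against the traced tool-call spans, not an in-memory dict, but the decision rule is the same.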


The Human-in-the-Loop Pattern

For high-stakes actions, implement a proper approval workflow:

import asyncio
from enum import Enum
from typing import Callable

class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    TIMEOUT = "timeout"

class HumanApprovalGate:
    def __init__(self, timeout_seconds: int = 300):
        self.timeout_seconds = timeout_seconds
        self._pending: dict[str, asyncio.Future] = {}

    async def request_approval(self, action_id: str, action_description: str, risk_level: str) -> ApprovalStatus:
        """Suspends agent execution until human responds."""
        future = asyncio.get_event_loop().create_future()
        self._pending[action_id] = future

        # In production: send to Slack/email/dashboard
        print(f"\n{'='*60}")
        print(f"⚠️  APPROVAL REQUIRED [{risk_level.upper()}]")
        print(f"Action: {action_description}")
        print(f"Action ID: {action_id}")
        print(f"Timeout: {self.timeout_seconds}s")
        print(f"{'='*60}\n")

        try:
            result = await asyncio.wait_for(future, timeout=self.timeout_seconds)
            return result
        except asyncio.TimeoutError:
            del self._pending[action_id]
            return ApprovalStatus.TIMEOUT

    def respond(self, action_id: str, approved: bool):
        """Called by human approval handler (webhook, UI, etc.)"""
        if action_id in self._pending:
            future = self._pending.pop(action_id)
            status = ApprovalStatus.APPROVED if approved else ApprovalStatus.REJECTED
            future.set_result(status)
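The key mechanism above is that the agent suspends on an asyncio future which the human-facing handler later resolves. A compact, self-contained version of the same idea, with the human response simulated by a timer so it runs end to end:

```python
import asyncio

class MiniApprovalGate:
    """Stripped-down gate: one future per pending action."""

    def __init__(self):
        self._pending: dict[str, asyncio.Future] = {}

    async def request_approval(self, action_id: str, timeout: float = 5.0) -> str:
        future = asyncio.get_running_loop().create_future()
        self._pending[action_id] = future
        try:
            return await asyncio.wait_for(future, timeout=timeout)
        except asyncio.TimeoutError:
            self._pending.pop(action_id, None)
            return "timeout"

    def respond(self, action_id: str, approved: bool):
        """Called by the human-facing handler (webhook, UI, Slack button)."""
        self._pending.pop(action_id).set_result("approved" if approved else "rejected")

async def main() -> str:
    gate = MiniApprovalGate()
    # Simulate a human clicking "approve" 10 ms after the request appears
    asyncio.get_running_loop().call_later(0.01, gate.respond, "send-email-42", True)
    return await gate.request_approval("send-email-42")

assert asyncio.run(main()) == "approved"
```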

Choosing the Right Model for Each Step

Not every step needs Claude Opus. A cost-efficient pattern:

# Routing model — cheap, fast, just decides what to do next
from langchain_anthropic import ChatAnthropic

router = ChatAnthropic(model="claude-haiku-4-5", max_tokens=256)

# Execution model — handles tool calls and reasoning
executor = ChatAnthropic(model="claude-sonnet-4-5", max_tokens=2048)

# Synthesis model — final answer generation
synthesizer = ChatAnthropic(model="claude-sonnet-4-5", max_tokens=4096)

# Complex multi-step reasoning — only when needed
deep_thinker = ChatAnthropic(model="claude-opus-4-5", max_tokens=8192)

Rule of thumb: use the cheapest model that can reliably do the job. Routing and simple classification rarely need Opus.
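One way to enforce that rule of thumb is an explicit step-type-to-tier table, so no call site picks a model ad hoc. A sketch mirroring the tiers above (the step names are illustrative, not from any framework):

```python
# Illustrative mapping from step type to model tier; unknown steps
# fall back to the mid tier rather than the most expensive one.
MODEL_FOR_STEP = {
    "route": "claude-haiku-4-5",        # cheap, fast classification
    "execute": "claude-sonnet-4-5",     # tool calls and reasoning
    "synthesize": "claude-sonnet-4-5",  # final answer generation
    "deep_reason": "claude-opus-4-5",   # only when genuinely needed
}

def model_for(step: str) -> str:
    return MODEL_FOR_STEP.get(step, "claude-sonnet-4-5")

assert model_for("route") == "claude-haiku-4-5"
assert model_for("deep_reason") == "claude-opus-4-5"
assert model_for("unknown_step") == "claude-sonnet-4-5"  # safe mid-tier default
```

Centralizing the table also makes model upgrades a one-line change instead of a codebase-wide search.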


Key Takeaways

  • Production agents fail due to systems engineering gaps, not model capability
  • Guard against loops: max tool call limits are non-negotiable
  • REQUIRES_APPROVAL list: classify tools by reversibility and flag dangerous ones
  • Memory management: layer working, episodic, and semantic memory appropriately
  • Cost tracking per task: set hard budget limits before starting an agent run
  • Structured observability: log every tool call with arguments, results, and timing
  • Human escalation: design the happy path for automation, but design the failure path for humans

The teams winning with AI agents in 2026 are the ones who treat agents as distributed systems, not just fancy chatbots.


References: LangGraph Documentation, Anthropic Tool Use Guide, Building Effective Agents — Anthropic Research
