Building Production Multi-Agent Systems with LangGraph
A deep dive into architecting and deploying multi-agent AI systems using LangGraph, with real examples from building Fyncall.
Muhammad Ali
AI Solutions Engineer & CTO
Introduction
After spending months building production AI systems, I've learned that single-agent architectures quickly hit their limits. Complex real-world tasks require specialization, coordination, and the ability to handle diverse subtasks efficiently. This is where multi-agent systems shine.
At Fyncall, we built a customer service platform that processes thousands of conversations daily using a sophisticated multi-agent architecture powered by LangGraph. In this post, I'll share the architecture patterns, implementation details, and hard-won lessons from building this system.
Why Multi-Agent Systems?
Before diving into implementation, let's understand why you might need a multi-agent system:
- Specialization: Different agents can be optimized for different tasks (research, coding, customer service, data analysis)
- Scalability: You can scale individual agents independently based on workload
- Reliability: If one agent fails, others can continue or retry
- Maintainability: Smaller, focused agents are easier to test and debug
- Cost Optimization: Route simple tasks to cheaper models, complex tasks to more capable ones
The key insight is that orchestration is harder than individual agent capability. A system of GPT-3.5 agents with excellent coordination often outperforms a single GPT-4 agent trying to do everything.
LangGraph Fundamentals
LangGraph is a library for building stateful, multi-actor applications with LLMs. The key concepts are:
1. State Graph
Everything in LangGraph revolves around a state graph. The state is a TypedDict or Pydantic model that gets passed between nodes:
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from operator import add

class AgentState(TypedDict):
    messages: Annotated[list, add]
    current_agent: str
    task_result: str | None
    iteration_count: int
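The Annotated[list, add] piece is doing real work here: it tells LangGraph to merge updates to that field by appending rather than overwriting, while un-annotated fields are simply replaced by the latest write. A toy node to make the semantics concrete:

def greeter(state: AgentState) -> dict:
    # "messages" has the `add` reducer, so this list is appended to the
    # existing history; the other two fields are plain overwrites.
    return {
        "messages": ["hello from greeter"],
        "current_agent": "greeter",
        "iteration_count": state["iteration_count"] + 1,
    }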
2. Nodes
Nodes are functions that take the current state and return updates. Each node represents an agent or a processing step:
from langchain_core.messages import AIMessage

def router_node(state: AgentState) -> dict:
    """Routes tasks to the appropriate specialist agent."""
    last_message = state["messages"][-1]

    # Use an LLM to classify the intent
    classification = classify_intent(last_message)

    return {
        "current_agent": classification.agent,
        "messages": [AIMessage(content=f"Routing to {classification.agent}")],
    }
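The classify_intent helper is where the LLM earns its keep. One way to implement it, shown as a sketch rather than our production code (the model name and labels are placeholders), is structured output on a LangChain chat model:

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class IntentClassification(BaseModel):
    """Schema the routing LLM must fill in."""
    agent: str = Field(description="One of: researcher, coder, synthesizer")
    confidence: float = Field(description="Classifier confidence between 0 and 1")

# with_structured_output makes the model return an IntentClassification instance
classifier_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0).with_structured_output(
    IntentClassification
)

def classify_intent(message) -> IntentClassification:
    return classifier_llm.invoke([
        ("system", "Classify the user's request and pick the best specialist agent."),
        ("user", message.content),
    ])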
3. Edges
Edges define the flow between nodes. LangGraph supports conditional edges for dynamic routing:
def should_continue(state: AgentState) -> str:
    if state["task_result"]:
        return "synthesizer"
    if state["iteration_count"] > 5:
        return "fallback"
    return state["current_agent"]

graph.add_conditional_edges(
    "router",
    should_continue,
    {
        "researcher": "research_agent",
        "coder": "coding_agent",
        "synthesizer": "synthesis_agent",
        "fallback": "human_handoff",
    }
)
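For completeness, here is roughly how those pieces get wired together and compiled. This is a sketch: the specialist node functions are assumed to be defined elsewhere in the same shape as router_node, and they are expected to set task_result (or bump iteration_count) so the loop terminates:

from langchain_core.messages import HumanMessage

graph = StateGraph(AgentState)

# Register the router and the specialists
graph.add_node("router", router_node)
graph.add_node("research_agent", research_agent)
graph.add_node("coding_agent", coding_agent)
graph.add_node("synthesis_agent", synthesis_agent)
graph.add_node("human_handoff", human_handoff)

graph.set_entry_point("router")
# ...conditional edges from "router" added as shown above...

# Specialists hand control back to the router; terminal nodes go to END
graph.add_edge("research_agent", "router")
graph.add_edge("coding_agent", "router")
graph.add_edge("synthesis_agent", END)
graph.add_edge("human_handoff", END)

app = graph.compile()
result = app.invoke({
    "messages": [HumanMessage(content="Summarize last week's support tickets")],
    "current_agent": "",
    "task_result": None,
    "iteration_count": 0,
})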
Architecture Patterns
Through building multiple production systems, I've identified several effective patterns:
Pattern 1: Supervisor Architecture
A central "supervisor" agent coordinates specialist agents. This is what we use at Fyncall:
┌─────────────────────────────────────────┐
│               SUPERVISOR                │
│     (Routes, Monitors, Synthesizes)     │
└─────────────────────┬───────────────────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
      ┌────────┐  ┌────────┐  ┌────────┐
      │Research│  │  Code  │  │Customer│
      │ Agent  │  │ Agent  │  │Service │
      └────────┘  └────────┘  └────────┘
Pros: Clear hierarchy, easy to reason about, good for structured workflows
Cons: Supervisor can become a bottleneck, single point of failure
Pattern 2: Collaborative Swarm
Agents communicate peer-to-peer without a central coordinator:
┌───────┐     ┌───────┐
│Agent A│◄───►│Agent B│
└───┬───┘     └───┬───┘
    │             │
    ▼             ▼
┌───────┐     ┌───────┐
│Agent C│◄───►│Agent D│
└───────┘     └───────┘
Pros: No single point of failure, emergent behavior
Cons: Harder to debug, can have coordination issues
Pattern 3: Hierarchical Teams
Teams of agents with their own supervisors, coordinated by a top-level orchestrator:
┌─────────────────────────────────────────┐
│            TOP ORCHESTRATOR             │
└─────────────────────┬───────────────────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
      ┌────────┐  ┌────────┐  ┌────────┐
      │Research│  │  Dev   │  │Support │
      │  Team  │  │  Team  │  │  Team  │
      │  Lead  │  │  Lead  │  │  Lead  │
      └───┬────┘  └───┬────┘  └───┬────┘
          │           │           │
        ┌─┼─┐       ┌─┼─┐       ┌─┼─┐
        ▼ ▼ ▼       ▼ ▼ ▼       ▼ ▼ ▼
        Agents      Agents      Agents
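LangGraph makes this pattern fairly natural because a compiled graph is itself a runnable and can be mounted as a node of a parent graph. A rough sketch, assuming each build_*_team_graph helper (placeholder names) returns a team-level StateGraph that shares the parent's state schema:

# Each team is its own StateGraph with its own lead/supervisor node,
# compiled separately and then mounted as a single node here.
research_team = build_research_team_graph().compile()
dev_team = build_dev_team_graph().compile()
support_team = build_support_team_graph().compile()

orchestrator = StateGraph(AgentState)
orchestrator.add_node("top_router", router_node)
orchestrator.add_node("research_team", research_team)
orchestrator.add_node("dev_team", dev_team)
orchestrator.add_node("support_team", support_team)

orchestrator.set_entry_point("top_router")
orchestrator.add_conditional_edges(
    "top_router",
    lambda s: s["current_agent"],
    {"research": "research_team", "dev": "dev_team", "support": "support_team"},
)
orchestrator.add_edge("research_team", END)
orchestrator.add_edge("dev_team", END)
orchestrator.add_edge("support_team", END)

hierarchy = orchestrator.compile()

The main caveat is that the subgraphs need to share (or translate to) the parent's state schema; otherwise the orchestrator can't pass work down and read results back.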
Building Fyncall: A Case Study
At Fyncall, we handle customer service for e-commerce businesses. Here's our actual architecture:
The Agent Roster
- Intent Classifier: Determines what the customer wants (order status, refund, product question, etc.)
- Context Gatherer: Pulls relevant customer data (order history, previous tickets, preferences)
- Policy Agent: Checks what actions are allowed based on business rules
- Response Drafter: Generates the actual customer response
- Quality Checker: Reviews responses for accuracy and tone
- Action Executor: Performs actual operations (issue refunds, update orders)
Implementation Details
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver

class FyncallState(TypedDict):
    messages: list
    customer_id: str
    order_context: dict | None
    allowed_actions: list[str]
    draft_response: str | None
    quality_score: float
    actions_to_execute: list[dict]

def build_fyncall_graph():
    graph = StateGraph(FyncallState)

    # Add nodes
    graph.add_node("classifier", intent_classifier)
    graph.add_node("context", context_gatherer)
    graph.add_node("policy", policy_checker)
    graph.add_node("drafter", response_drafter)
    graph.add_node("quality", quality_checker)
    graph.add_node("executor", action_executor)

    # Define flow
    graph.set_entry_point("classifier")
    graph.add_edge("classifier", "context")
    graph.add_edge("context", "policy")
    graph.add_edge("policy", "drafter")
    graph.add_edge("drafter", "quality")

    # Conditional: only execute if quality is high enough
    graph.add_conditional_edges(
        "quality",
        lambda s: "executor" if s["quality_score"] > 0.8 else "drafter",
        {"executor": "executor", "drafter": "drafter"}
    )

    graph.add_edge("executor", END)

    # Use PostgreSQL for persistence
    checkpointer = PostgresSaver.from_conn_string(DATABASE_URL)
    return graph.compile(checkpointer=checkpointer)
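Because the graph compiles with a checkpointer, every call needs a thread identifier so LangGraph knows which conversation's state to load and persist. Roughly, where customer_id and incoming_message stand in for whatever the request handler provides:

fyncall_graph = build_fyncall_graph()

# The thread_id ties checkpoints to one customer conversation, so a crashed or
# interrupted run can be resumed from the last completed node.
config = {"configurable": {"thread_id": f"conversation-{customer_id}"}}

result = fyncall_graph.invoke(
    {
        "messages": [incoming_message],
        "customer_id": customer_id,
        "order_context": None,
        "allowed_actions": [],
        "draft_response": None,
        "quality_score": 0.0,
        "actions_to_execute": [],
    },
    config=config,
)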
Key Design Decisions
- Persistent State: We use PostgreSQL to checkpoint conversations, allowing resumption after failures
- Tool Integration: Each agent has access to specific tools (database queries, API calls, email sending)
- Observability: Every agent call is logged to our observability stack with LangSmith
- Human Escalation: Quality scores below threshold trigger human review
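For the escalation path specifically, LangGraph's interrupts pair nicely with the checkpointer: compile with interrupt_before and the graph pauses at that node until someone resumes the thread. A sketch of one way to wire it on top of the StateGraph from build_fyncall_graph (not our exact graph; here low-quality drafts go to a review node instead of looping straight back to the drafter):

# A no-op node where the graph waits; a reviewer edits state via the dashboard.
graph.add_node("human_review", lambda s: s)

graph.add_conditional_edges(
    "quality",
    lambda s: "executor" if s["quality_score"] > 0.8 else "human_review",
    {"executor": "executor", "human_review": "human_review"},
)
graph.add_edge("human_review", "executor")

checkpointer = PostgresSaver.from_conn_string(DATABASE_URL)
reviewed_graph = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["human_review"],  # pause here until a human resumes the thread
)

# After the reviewer approves or edits the draft, resume the same thread;
# invoking with None as input continues from the saved checkpoint.
config = {"configurable": {"thread_id": conversation_id}}
reviewed_graph.invoke(None, config=config)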
Production Considerations
1. Error Handling
In production, things fail. A lot. Here's how we handle it:
from openai import RateLimitError  # or the equivalent error from your LLM SDK
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def resilient_agent_call(agent, state):
    try:
        return await agent.ainvoke(state)
    except RateLimitError:
        # Switch to backup model
        return await backup_agent.ainvoke(state)
    except Exception as e:
        logger.error(f"Agent failed: {e}")
        raise
2. Cost Optimization
Not every task needs GPT-4. We use a tiered approach (a small helper that implements it is sketched after the list):
- Classification/Routing: GPT-3.5-turbo or Claude Haiku (fast, cheap)
- Complex Reasoning: GPT-4 or Claude Sonnet
- Quality Checking: GPT-4 (accuracy matters most here)
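In code this is just a lookup that nodes call instead of hard-coding a model. A sketch; the model identifiers are examples, not a recommendation:

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# Cheap and fast where a rough answer is fine, heavier models where
# reasoning depth or accuracy actually pays for itself.
MODEL_TIERS = {
    "routing":   ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    "reasoning": ChatAnthropic(model="claude-3-5-sonnet-latest"),
    "review":    ChatOpenAI(model="gpt-4", temperature=0),
}

def llm_for(tier: str):
    """Return the cheapest model that is adequate for the given tier."""
    return MODEL_TIERS.get(tier, MODEL_TIERS["reasoning"])

# e.g. inside the intent classifier node:
# classification = llm_for("routing").with_structured_output(IntentClassification).invoke(prompt)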
3. Latency
Multi-agent systems can be slow. Our optimizations:
- Parallel Execution: Run independent agents and lookups concurrently (see the sketch after this list)
- Caching: Cache common classifications and context lookups
- Streaming: Stream partial results to the frontend
- Async Everything: Use async/await throughout the stack
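For the parallelism point in particular, the easiest wins were inside single nodes: the context gatherer's lookups don't depend on each other, so they can run concurrently. A sketch, where the fetch_* helpers are placeholders for the data-access layer:

import asyncio

async def context_gatherer(state: FyncallState) -> dict:
    # Order history, past tickets, and preferences are independent lookups,
    # so issue them concurrently instead of one after another.
    orders, tickets, preferences = await asyncio.gather(
        fetch_order_history(state["customer_id"]),
        fetch_previous_tickets(state["customer_id"]),
        fetch_preferences(state["customer_id"]),
    )
    return {
        "order_context": {
            "orders": orders,
            "tickets": tickets,
            "preferences": preferences,
        }
    }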
4. Testing
Testing multi-agent systems is notoriously hard. Our approach:
import pytest
from unittest.mock import AsyncMock

from langchain_core.messages import AIMessage, HumanMessage

@pytest.fixture
def mock_llm():
    llm = AsyncMock()
    llm.ainvoke.return_value = AIMessage(content="Test response")
    return llm

@pytest.mark.asyncio
async def test_refund_flow(mock_llm):
    state = {
        "messages": [HumanMessage(content="I want a refund")],
        "customer_id": "test_123",
        "order_context": {"order_id": "ORD_456", "total": 99.99}
    }

    result = await graph.ainvoke(state, config={"llm": mock_llm})

    assert "refund" in result["actions_to_execute"][0]["type"]
    assert result["quality_score"] > 0.8
Lessons Learned
After 6 months of production operation, here are my key takeaways:
1. Start Simple, Add Complexity
We started with 3 agents and grew to 6. Don't over-engineer upfront. Start with the minimum viable multi-agent system and add specialization as you identify bottlenecks.
2. Observability is Non-Negotiable
You need to see exactly what each agent is doing, what it's receiving, and what it's producing. LangSmith has been invaluable for this.
3. Human-in-the-Loop is Essential
No matter how good your agents are, you need human oversight. We have a dashboard where human reviewers can:
- Override agent decisions
- Add feedback to improve prompts
- Handle edge cases agents can't
4. Prompts Are Your Code
Treat prompts like code: version them, test them, review changes. A small prompt change can have massive downstream effects.
5. The Router is the Most Important Agent
If your router misclassifies, everything downstream fails. We spend the most time optimizing our router accuracy.
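In practice, that optimization is mostly a regression suite. A sketch, not our actual cases: a small labeled set of real (anonymized) customer messages that the router must keep classifying correctly, run on every prompt or model change, assuming a classify_intent helper along the lines sketched earlier but with Fyncall's intent labels:

import pytest
from langchain_core.messages import HumanMessage

# Hypothetical regression set; it grows every time the router gets something wrong.
ROUTER_CASES = [
    ("Where is my order #1234?", "order_status"),
    ("I was charged twice and want my money back", "refund"),
    ("Does the blue version ship to Canada?", "product_question"),
]

@pytest.mark.parametrize("utterance,expected_intent", ROUTER_CASES)
def test_router_does_not_regress(utterance, expected_intent):
    classification = classify_intent(HumanMessage(content=utterance))
    assert classification.agent == expected_intent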
Conclusion
Building production multi-agent systems with LangGraph has been one of the most challenging and rewarding engineering experiences of my career. The key is to start simple, instrument everything, and iterate based on real user feedback.
If you're building something similar or have questions about multi-agent architectures, I'd love to chat. Reach out on LinkedIn or email me.
The future of AI isn't single models doing everything—it's orchestrated systems of specialized agents working together. And that future is here.
This post is based on my experience building Fyncall, an AI customer service platform with 100K+ lines of code and 150+ API endpoints. The system processes thousands of customer conversations daily using the architecture described above.