Building Production Multi-Agent Systems with LangGraph
A deep dive into architecting and deploying multi-agent AI systems using LangGraph, with real examples from building Fyncall.
Muhammad Ali
AI Solutions Engineer & CTO
Introduction
After spending months building production AI systems, I've learned that single-agent architectures quickly hit their limits. Complex real-world tasks require specialization, coordination, and the ability to handle diverse subtasks efficiently. This is where multi-agent systems shine.
At Fyncall, we built a customer service platform that processes thousands of conversations daily using a sophisticated multi-agent architecture powered by LangGraph. In this post, I'll share the architecture patterns, implementation details, and hard-won lessons from building this system.
Why Multi-Agent Systems?
Before diving into implementation, let's understand why you might need a multi-agent system:
- Specialization: Different agents can be optimized for different tasks (research, coding, customer service, data analysis)
- Scalability: You can scale individual agents independently based on workload
- Reliability: If one agent fails, others can continue or retry
- Maintainability: Smaller, focused agents are easier to test and debug
- Cost Optimization: Route simple tasks to cheaper models, complex tasks to more capable ones
The key insight is that orchestration is harder than individual agent capability. A system of GPT-3.5 agents with excellent coordination often outperforms a single GPT-4 agent trying to do everything.
LangGraph Fundamentals
LangGraph is a library for building stateful, multi-actor applications with LLMs. The key concepts are:
1. State Graph
Everything in LangGraph revolves around a state graph. The state is a TypedDict or Pydantic model that gets passed between nodes:
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from operator import add

class AgentState(TypedDict):
    messages: Annotated[list, add]
    current_agent: str
    task_result: str | None
    iteration_count: int
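The Annotated[list, add] piece is doing real work here: it tells LangGraph to merge updates to that field by appending rather than overwriting, while un-annotated fields are simply replaced by the latest write. A toy node to make the semantics concrete:

def greeter(state: AgentState) -> dict:
    # "messages" has the `add` reducer, so this list is appended to the
    # existing history; the other two fields are plain overwrites.
    return {
        "messages": ["hello from greeter"],
        "current_agent": "greeter",
        "iteration_count": state["iteration_count"] + 1,
    }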
2. Nodes
Nodes are functions that take the current state and return updates. Each node represents an agent or a processing step:
from langchain_core.messages import AIMessage

def router_node(state: AgentState) -> dict:
    """Routes tasks to the appropriate specialist agent."""
    last_message = state["messages"][-1]

    # Use an LLM to classify the intent
    classification = classify_intent(last_message)

    return {
        "current_agent": classification.agent,
        "messages": [AIMessage(content=f"Routing to {classification.agent}")],
    }
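The classify_intent helper is where the LLM earns its keep. One way to implement it, shown as a sketch rather than our production code (the model name and labels are placeholders), is structured output on a LangChain chat model:

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class IntentClassification(BaseModel):
    """Schema the routing LLM must fill in."""
    agent: str = Field(description="One of: researcher, coder, synthesizer")
    confidence: float = Field(description="Classifier confidence between 0 and 1")

# with_structured_output makes the model return an IntentClassification instance
classifier_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0).with_structured_output(
    IntentClassification
)

def classify_intent(message) -> IntentClassification:
    return classifier_llm.invoke([
        ("system", "Classify the user's request and pick the best specialist agent."),
        ("user", message.content),
    ])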
3. Edges
Edges define the flow between nodes. LangGraph supports conditional edges for dynamic routing:
def should_continue(state: AgentState) -> str:
    if state["task_result"]:
        return "synthesizer"
    if state["iteration_count"] > 5:
        return "fallback"
    return state["current_agent"]

graph.add_conditional_edges(
    "router",
    should_continue,
    {
        "researcher": "research_agent",
        "coder": "coding_agent",
        "synthesizer": "synthesis_agent",
        "fallback": "human_handoff",
    }
)
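For completeness, here is roughly how those pieces get wired together and compiled. This is a sketch: the specialist node functions are assumed to be defined elsewhere in the same shape as router_node, and they are expected to set task_result (or bump iteration_count) so the loop terminates:

from langchain_core.messages import HumanMessage

graph = StateGraph(AgentState)

# Register the router and the specialists
graph.add_node("router", router_node)
graph.add_node("research_agent", research_agent)
graph.add_node("coding_agent", coding_agent)
graph.add_node("synthesis_agent", synthesis_agent)
graph.add_node("human_handoff", human_handoff)

graph.set_entry_point("router")
# ...conditional edges from "router" added as shown above...

# Specialists hand control back to the router; terminal nodes go to END
graph.add_edge("research_agent", "router")
graph.add_edge("coding_agent", "router")
graph.add_edge("synthesis_agent", END)
graph.add_edge("human_handoff", END)

app = graph.compile()
result = app.invoke({
    "messages": [HumanMessage(content="Summarize last week's support tickets")],
    "current_agent": "",
    "task_result": None,
    "iteration_count": 0,
})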
Architecture Patterns
Through building multiple production systems, I've identified several effective patterns:
Pattern 1: Supervisor Architecture
A central "supervisor" agent coordinates specialist agents. This is what we use at Fyncall:
┌─────────────────────────────────────────┐
│               SUPERVISOR                │
│     (Routes, Monitors, Synthesizes)     │
└─────────────────────┬───────────────────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
      ┌────────┐  ┌────────┐  ┌────────┐
      │Research│  │  Code  │  │Customer│
      │ Agent  │  │ Agent  │  │Service │
      └────────┘  └────────┘  └────────┘
Pros: Clear hierarchy, easy to reason about, good for structured workflows
Cons: Supervisor can become a bottleneck, single point of failure
Pattern 2: Collaborative Swarm
Agents communicate peer-to-peer without a central coordinator:
┌───────┐     ┌───────┐
│Agent A│◄───►│Agent B│
└───┬───┘     └───┬───┘
    │             │
    ▼             ▼
┌───────┐     ┌───────┐
│Agent C│◄───►│Agent D│
└───────┘     └───────┘
Pros: No single point of failure, emergent behavior
Cons: Harder to debug, can have coordination issues
Pattern 3: Hierarchical Teams
Teams of agents with their own supervisors, coordinated by a top-level orchestrator:
┌─────────────────────────────────────────┐
│            TOP ORCHESTRATOR             │
└─────────────────────┬───────────────────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
      ┌────────┐  ┌────────┐  ┌────────┐
      │Research│  │  Dev   │  │Support │
      │  Team  │  │  Team  │  │  Team  │
      │  Lead  │  │  Lead  │  │  Lead  │
      └───┬────┘  └───┬────┘  └───┬────┘
          │           │           │
        ┌─┼─┐       ┌─┼─┐       ┌─┼─┐
        ▼ ▼ ▼       ▼ ▼ ▼       ▼ ▼ ▼
        Agents      Agents      Agents
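LangGraph makes this pattern fairly natural because a compiled graph is itself a runnable and can be mounted as a node of a parent graph. A rough sketch, assuming each build_*_team_graph helper (placeholder names) returns a team-level StateGraph that shares the parent's state schema:

# Each team is its own StateGraph with its own lead/supervisor node,
# compiled separately and then mounted as a single node here.
research_team = build_research_team_graph().compile()
dev_team = build_dev_team_graph().compile()
support_team = build_support_team_graph().compile()

orchestrator = StateGraph(AgentState)
orchestrator.add_node("top_router", router_node)
orchestrator.add_node("research_team", research_team)
orchestrator.add_node("dev_team", dev_team)
orchestrator.add_node("support_team", support_team)

orchestrator.set_entry_point("top_router")
orchestrator.add_conditional_edges(
    "top_router",
    lambda s: s["current_agent"],
    {"research": "research_team", "dev": "dev_team", "support": "support_team"},
)
orchestrator.add_edge("research_team", END)
orchestrator.add_edge("dev_team", END)
orchestrator.add_edge("support_team", END)

hierarchy = orchestrator.compile()

The main caveat is that the subgraphs need to share (or translate to) the parent's state schema; otherwise the orchestrator can't pass work down and read results back.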
Building Fyncall: A Case Study
At Fyncall, we handle customer service for e-commerce businesses. Here's our actual architecture:
The Agent Roster
- Intent Classifier: Determines what the customer wants (order status, refund, product question, etc.)
- Context Gatherer: Pulls relevant customer data (order history, previous tickets, preferences)
- Policy Agent: Checks what actions are allowed based on business rules
- Response Drafter: Generates the actual customer response
- Quality Checker: Reviews responses for accuracy and tone
- Action Executor: Performs actual operations (issue refunds, update orders)
Implementation Details
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver

class FyncallState(TypedDict):
    messages: list
    customer_id: str
    order_context: dict | None
    allowed_actions: list[str]
    draft_response: str | None
    quality_score: float
    actions_to_execute: list[dict]

def build_fyncall_graph():
    graph = StateGraph(FyncallState)

    # Add nodes
    graph.add_node("classifier", intent_classifier)
    graph.add_node("context", context_gatherer)
    graph.add_node("policy", policy_checker)
    graph.add_node("drafter", response_drafter)
    graph.add_node("quality", quality_checker)
    graph.add_node("executor", action_executor)

    # Define flow
    graph.set_entry_point("classifier")
    graph.add_edge("classifier", "context")
    graph.add_edge("context", "policy")
    graph.add_edge("policy", "drafter")
    graph.add_edge("drafter", "quality")

    # Conditional: only execute if quality is high enough
    graph.add_conditional_edges(
        "quality",
        lambda s: "executor" if s["quality_score"] > 0.8 else "drafter",
        {"executor": "executor", "drafter": "drafter"}
    )

    graph.add_edge("executor", END)

    # Use PostgreSQL for persistence
    checkpointer = PostgresSaver.from_conn_string(DATABASE_URL)
    return graph.compile(checkpointer=checkpointer)
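Because the graph compiles with a checkpointer, every call needs a thread identifier so LangGraph knows which conversation's state to load and persist. Roughly, where customer_id and incoming_message stand in for whatever the request handler provides:

fyncall_graph = build_fyncall_graph()

# The thread_id ties checkpoints to one customer conversation, so a crashed or
# interrupted run can be resumed from the last completed node.
config = {"configurable": {"thread_id": f"conversation-{customer_id}"}}

result = fyncall_graph.invoke(
    {
        "messages": [incoming_message],
        "customer_id": customer_id,
        "order_context": None,
        "allowed_actions": [],
        "draft_response": None,
        "quality_score": 0.0,
        "actions_to_execute": [],
    },
    config=config,
)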
Key Design Decisions
- Persistent State: We use PostgreSQL to checkpoint conversations, allowing resumption after failures
- Tool Integration: Each agent has access to specific tools (database queries, API calls, email sending)
- Observability: Every agent call is logged to our observability stack with LangSmith
- Human Escalation: Quality scores below threshold trigger human review
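For the escalation path specifically, LangGraph's interrupts pair nicely with the checkpointer: compile with interrupt_before and the graph pauses at that node until someone resumes the thread. A sketch of one way to wire it on top of the StateGraph from build_fyncall_graph (not our exact graph; here low-quality drafts go to a review node instead of looping straight back to the drafter):

# A no-op node where the graph waits; a reviewer edits state via the dashboard.
graph.add_node("human_review", lambda s: s)

graph.add_conditional_edges(
    "quality",
    lambda s: "executor" if s["quality_score"] > 0.8 else "human_review",
    {"executor": "executor", "human_review": "human_review"},
)
graph.add_edge("human_review", "executor")

checkpointer = PostgresSaver.from_conn_string(DATABASE_URL)
reviewed_graph = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["human_review"],  # pause here until a human resumes the thread
)

# After the reviewer approves or edits the draft, resume the same thread;
# invoking with None as input continues from the saved checkpoint.
config = {"configurable": {"thread_id": conversation_id}}
reviewed_graph.invoke(None, config=config)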
Production Considerations
1. Error Handling
In production, things fail. A lot. Here's how we handle it:
from openai import RateLimitError  # or the equivalent error from your LLM SDK
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def resilient_agent_call(agent, state):
    try:
        return await agent.ainvoke(state)
    except RateLimitError:
        # Switch to backup model
        return await backup_agent.ainvoke(state)
    except Exception as e:
        logger.error(f"Agent failed: {e}")
        raise
2. Cost Optimization
Not every task needs GPT-4. We use a tiered approach (a small helper that implements it is sketched after the list):
- Classification/Routing: GPT-3.5-turbo or Claude Haiku (fast, cheap)
- Complex Reasoning: GPT-4 or Claude Sonnet
- Quality Checking: GPT-4 (accuracy matters most here)
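In code this is just a lookup that nodes call instead of hard-coding a model. A sketch; the model identifiers are examples, not a recommendation:

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# Cheap and fast where a rough answer is fine, heavier models where
# reasoning depth or accuracy actually pays for itself.
MODEL_TIERS = {
    "routing":   ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    "reasoning": ChatAnthropic(model="claude-3-5-sonnet-latest"),
    "review":    ChatOpenAI(model="gpt-4", temperature=0),
}

def llm_for(tier: str):
    """Return the cheapest model that is adequate for the given tier."""
    return MODEL_TIERS.get(tier, MODEL_TIERS["reasoning"])

# e.g. inside the intent classifier node:
# classification = llm_for("routing").with_structured_output(IntentClassification).invoke(prompt)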
3. Latency
Multi-agent systems can be slow. Our optimizations:
- Parallel Execution: Run independent agents and lookups concurrently (see the sketch after this list)
- Caching: Cache common classifications and context lookups
- Streaming: Stream partial results to the frontend
- Async Everything: Use async/await throughout the stack
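For the parallelism point in particular, the easiest wins were inside single nodes: the context gatherer's lookups don't depend on each other, so they can run concurrently. A sketch, where the fetch_* helpers are placeholders for the data-access layer:

import asyncio

async def context_gatherer(state: FyncallState) -> dict:
    # Order history, past tickets, and preferences are independent lookups,
    # so issue them concurrently instead of one after another.
    orders, tickets, preferences = await asyncio.gather(
        fetch_order_history(state["customer_id"]),
        fetch_previous_tickets(state["customer_id"]),
        fetch_preferences(state["customer_id"]),
    )
    return {
        "order_context": {
            "orders": orders,
            "tickets": tickets,
            "preferences": preferences,
        }
    }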
4. Testing
Testing multi-agent systems is notoriously hard. Our approach:
import pytest
from unittest.mock import AsyncMock

from langchain_core.messages import AIMessage, HumanMessage

@pytest.fixture
def mock_llm():
    llm = AsyncMock()
    llm.ainvoke.return_value = AIMessage(content="Test response")
    return llm

@pytest.mark.asyncio
async def test_refund_flow(mock_llm):
    state = {
        "messages": [HumanMessage(content="I want a refund")],
        "customer_id": "test_123",
        "order_context": {"order_id": "ORD_456", "total": 99.99}
    }

    result = await graph.ainvoke(state, config={"llm": mock_llm})

    assert "refund" in result["actions_to_execute"][0]["type"]
    assert result["quality_score"] > 0.8
Lessons Learned
After 6 months of production operation, here are my key takeaways:
1. Start Simple, Add Complexity
We started with 3 agents and grew to 6. Don't over-engineer upfront. Start with the minimum viable multi-agent system and add specialization as you identify bottlenecks.
2. Observability is Non-Negotiable
You need to see exactly what each agent is doing, what it's receiving, and what it's producing. LangSmith has been invaluable for this.
3. Human-in-the-Loop is Essential
No matter how good your agents are, you need human oversight. We have a dashboard where human reviewers can:
- Override agent decisions
- Add feedback to improve prompts
- Handle edge cases agents can't
4. Prompts Are Your Code
Treat prompts like code: version them, test them, review changes. A small prompt change can have massive downstream effects.
5. The Router is the Most Important Agent
If your router misclassifies, everything downstream fails. We spend the most time optimizing our router accuracy.
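In practice, that optimization is mostly a regression suite. A sketch, not our actual cases: a small labeled set of real (anonymized) customer messages that the router must keep classifying correctly, run on every prompt or model change, assuming a classify_intent helper along the lines sketched earlier but with Fyncall's intent labels:

import pytest
from langchain_core.messages import HumanMessage

# Hypothetical regression set; it grows every time the router gets something wrong.
ROUTER_CASES = [
    ("Where is my order #1234?", "order_status"),
    ("I was charged twice and want my money back", "refund"),
    ("Does the blue version ship to Canada?", "product_question"),
]

@pytest.mark.parametrize("utterance,expected_intent", ROUTER_CASES)
def test_router_does_not_regress(utterance, expected_intent):
    classification = classify_intent(HumanMessage(content=utterance))
    assert classification.agent == expected_intent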
Conclusion
Building production multi-agent systems with LangGraph has been one of the most challenging and rewarding engineering experiences of my career. The key is to start simple, instrument everything, and iterate based on real user feedback.
If you're building something similar or have questions about multi-agent architectures, I'd love to chat. Reach out on LinkedIn or email me.
The future of AI isn't single models doing everything—it's orchestrated systems of specialized agents working together. And that future is here.
This post is based on my experience building Fyncall, an AI customer service platform with 100K+ lines of code and 150+ API endpoints. The system processes thousands of customer conversations daily using the architecture described above.