
Chapter 6 — Validation & Error Handling

Making workflows robust against failure

LLMs are probabilistic. Tools can fail. APIs time out. Networks drop. An agentic workflow that works on the “happy path” but falls over at the first unexpected event is not production-ready.

This chapter covers the defensive design patterns that make workflows reliable: input validation, output validation, iteration limits, error handling in tools, and graceful degradation.

6.1 — What Can Go Wrong

Before discussing solutions, let’s catalog the failure modes:

| Failure Mode | Example | Impact |
|---|---|---|
| Runaway tool loops | Agent keeps calling tools without converging | Infinite execution, unbounded cost |
| Wrong tool selection | Agent calls delete_file instead of read_file | Incorrect or destructive actions |
| Tool execution failure | API returns 500, file not found, timeout | Agent gets an error instead of data |
| Invalid routing | Router hallucinates a target node that doesn't exist | Execution goes to the wrong place or crashes |
| State corruption | Node writes the wrong type to a state field | Downstream nodes crash on type mismatch |
| Context overflow | Too many messages fill the context window | LLM loses early context, makes poor decisions |
| LLM refusal or hallucination | Model refuses to use a tool, or invents fake output | Workflow proceeds with incorrect data |

6.2 — Iteration Limits: The Circuit Breaker

The iteration limit is the single most important safeguard in any agentic workflow. Every LLM Node with tools has a max_tool_iterations setting that caps how many times the tool loop can execute.

How It Works

```python
# In the agent configuration:
#   max_tool_iterations: 15
#   iteration_warning_message: "You are running low on tool calls.
#     Wrap up your analysis and return a final answer."

# In the generated Tool Node code:
def tool_2_node(global_state):
    current_iteration = global_state.get("node_2_tool_iteration_count", 0)

    # ── Hard limit: force stop ──
    if current_iteration >= 15:
        return Command(
            update={
                "messages": [HumanMessage(
                    content="You are out of tool calls. Now, based on everything "
                            "you have analyzed, return the final output."
                )]
            },
            goto="llm_2_node"  # Force LLM to give final answer
        )

    # ── Soft warning: nudge toward completion ──
    if current_iteration >= 12:  # ~80% of limit
        warning = HumanMessage(
            content="You are running low on tool calls. Wrap up your analysis "
                    "and return a final answer."
        )
        # Warning is appended to messages alongside tool results

    # ── Normal: execute tool, increment counter ──
    result = execute_tool(tool_call)
    return Command(
        update={
            "node_2_tool_iteration_count": current_iteration + 1,
            "messages": [ToolMessage(content=result)]
        },
        goto="llm_2_node"
    )
```

Choosing Iteration Limits

| Agent Type | Suggested Limit | Rationale |
|---|---|---|
| Classifier (no tools) | 0 | No tools, no loop |
| Simple lookup agent | 5–10 | Query a database, format result |
| Investigation agent | 15–30 | Multiple data sources, iterative analysis |
| CTF exploit agent | 30–50 | Trial-and-error, multiple attempts |
| Aggressively autonomous agent | 50–100 | Long-running tasks, but monitor closely |
Rule of thumb: Start with 15. Watch actual behavior. If the agent consistently hits the limit before finishing, increase it. If the agent finishes in 3 calls, lower it to avoid wasting tokens on a runaway case.
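The ~80% soft-warning point in the example above can be derived from the hard limit rather than hard-coded. A minimal sketch (the helper name is hypothetical, not part of any framework):

```python
def warning_threshold(max_iterations: int, fraction: float = 0.8) -> int:
    """Return the iteration count at which to inject a wrap-up warning.

    Clamped to at least 1 so even tiny limits get a warning before the hard stop.
    """
    return max(1, int(max_iterations * fraction))

print(warning_threshold(15))  # 12
print(warning_threshold(50))  # 40
```

Deriving the threshold this way keeps the warning proportional when you tune the limit later.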

6.3 — Tool Error Handling

When a tool fails, the LLM should know what failed and why, so it can decide whether to retry, try a different approach, or give up gracefully. Never let an exception escape a tool unhandled, and never swallow a failure silently: catch it and return a structured error the LLM can read.

Pattern: Structured Error Returns

```python
@tool
def check_ip_reputation(ip_address: str, state: dict = None) -> dict:
    """Check an IP address against threat intelligence databases."""
    try:
        result = threat_intel_api.lookup(ip_address)
        return {
            "success": True,
            "ip": ip_address,
            "reputation": result.score,
            "tags": result.tags,
            "error": None
        }
    except ConnectionError:
        return {
            "success": False,
            "ip": ip_address,
            "reputation": None,
            "tags": [],
            "error": "Could not reach threat intel API. Service may be down."
        }
    except ValueError as e:
        return {
            "success": False,
            "ip": ip_address,
            "reputation": None,
            "tags": [],
            "error": f"Invalid IP address format: {str(e)}"
        }
```

When the LLM sees "success": false and a clear error message, it can reason about what to do next: retry, skip, or try an alternative tool.
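The orchestration layer can also act on these structured errors directly, for example by retrying transient failures before the LLM ever sees them. A minimal sketch, assuming the `{"success": ..., "error": ...}` convention above; the wrapper name and the "transient" heuristic are hypothetical:

```python
import time

def call_with_retry(tool_fn, args: dict, max_attempts: int = 3,
                    backoff_s: float = 1.0) -> dict:
    """Invoke a tool returning {"success": bool, "error": str|None};
    retry only failures that look transient."""
    last = {"success": False, "error": "tool was never invoked"}
    for attempt in range(max_attempts):
        last = tool_fn(**args)
        if last.get("success"):
            return last
        # Assumed convention: transient errors mention the service being down.
        if "Service may be down" not in (last.get("error") or ""):
            break  # permanent error (e.g. bad input): retrying won't help
        time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return last
```

Permanent errors (like a malformed IP) still flow back to the LLM immediately, which is what lets it choose a different approach.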

Pattern: Tool Not Found

If the LLM requests a tool that doesn’t exist in the node’s tool set, the Tool Node returns an error message rather than crashing:

```python
# Generated Tool Node handles unknown tools:
def tool_2_node(global_state):
    tool_name = tool_call["name"]
    tool_func = tools_by_name_for_node_2.get(tool_name)

    if tool_func is None:
        result = (f"Error: Tool '{tool_name}' not found. "
                  f"Available tools: {list(tools_by_name_for_node_2.keys())}")
    else:
        result = tool_func.invoke(tool_call["args"])
```

6.4 — Router Validation

What if the Router LLM returns a next_node value that doesn’t match any of the connected targets?

```python
# Router validation in generated code:
def router_3_node(global_state):
    decision = model_with_schema.invoke(messages)

    # Validate the decision
    available_targets = ["llm_4_node", "llm_5_node", "llm_6_node"]

    if decision.next_node in available_targets:
        # Valid choice
        return Command(
            update={"routing_reason": decision.reason},
            goto=decision.next_node
        )
    else:
        # Invalid choice — fall back to first target + warning
        print(f"⚠️ Router chose invalid target '{decision.next_node}'. "
              f"Falling back to '{available_targets[0]}'")
        return Command(
            update={"routing_reason": f"FALLBACK: {decision.reason}"},
            goto=available_targets[0]
        )
```

This ensures the workflow never crashes due to a routing hallucination. The fallback is safe (always a valid node), and the warning is logged so you can improve the router’s prompt.

6.5 — Structured Output Validation

When you configure an LLM Node to use structured output (a Pydantic schema), you’re asking the LLM to return JSON matching that schema. This provides type-level validation automatically:

```python
# Define expected output structure:
class AlertClassification(BaseModel):
    category: str      # "malware", "intrusion", "misconfig"
    severity: str      # "critical", "high", "medium", "low"
    confidence: float  # 0.0 to 1.0
    reasoning: str     # Why this classification

# LangChain ensures the LLM's output matches the schema.
# If the LLM returns {"severity": 42} instead of a string,
# Pydantic validation catches it.
model_with_output = model.with_structured_output(AlertClassification)
result = model_with_output.invoke(messages)

# result.category   → str ✓
# result.severity   → str ✓
# result.confidence → float ✓
```
When to use structured output: Use it whenever downstream logic depends on specific fields. If another agent’s prompt uses {classification}, make sure the upstream agent outputs a structured result with a classification field — don’t rely on parsing free text.
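To see the schema enforcement concretely, here is a standalone sketch using Pydantic directly, with no LLM involved. It assumes Pydantic v2; note that v1 would silently coerce `42` to `"42"` rather than reject it:

```python
from pydantic import BaseModel, ValidationError

class AlertClassification(BaseModel):
    category: str
    severity: str
    confidence: float
    reasoning: str

# Well-formed data parses cleanly:
ok = AlertClassification(
    category="malware", severity="high",
    confidence=0.92, reasoning="Beaconing to a known C2 domain."
)

# A type mismatch is rejected before it reaches downstream nodes:
try:
    AlertClassification(
        category="malware", severity=42,  # wrong type: int, not str
        confidence=0.92, reasoning="..."
    )
except ValidationError as exc:
    print(f"rejected with {exc.error_count()} validation error(s)")
```

The `ValidationError` is raised at the node boundary, which is exactly where you want bad output to stop.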

6.6 — State Type Safety

When your shared state has typed fields, you get type-level safety. Common type mismatches to watch for:

| Mistake | Symptom | Fix |
|---|---|---|
| Writing "5" to an int field | TypeError on arithmetic downstream | Ensure tools return correct types |
| Writing a single message to a List field | TypeError: 'AIMessage' is not iterable | Wrap it in a list: [message] |
| Forgetting to initialize a field | KeyError when reading | Use .get(field, default) |

When using a typed state schema, the framework can catch type mismatches at compile time. As long as your tools return the right types, the system stays consistent.
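A minimal sketch of a typed state schema that avoids all three mistakes above. The field names are hypothetical, and the `Annotated` reducer follows LangGraph-style conventions (here demonstrated with plain `typing` so the snippet stands alone):

```python
import operator
from typing import Annotated, List, TypedDict

class WorkflowState(TypedDict, total=False):
    # Reducer convention: writes to `messages` are concatenated, so every
    # write must itself be a list — wrap single messages as [message].
    messages: Annotated[List[str], operator.add]
    node_2_tool_iteration_count: int
    routing_reason: str

def read_iteration_count(state: WorkflowState) -> int:
    # Read optional fields with a default to avoid KeyError on first access.
    return state.get("node_2_tool_iteration_count", 0)

state: WorkflowState = {"messages": ["alert received"]}
print(read_iteration_count(state))  # 0 — field not yet written, default applies
```

A static type checker (mypy, pyright) will additionally flag writes like `state["node_2_tool_iteration_count"] = "5"` before the workflow ever runs.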

6.7 — Defense in Depth Checklist

Before deploying a workflow, verify these safeguards are in place:

| Layer | Safeguard | How to Configure |
|---|---|---|
| Tool loops | max_tool_iterations | Set per-agent in the workflow editor (default: 30) |
| Tool loops | Iteration warning message | Custom message at ~80% of limit |
| Tool failures | Try/except in tool implementations | Return {"success": false, "error": "..."} |
| Routing | Router fallback validation | Auto-generated by the framework |
| Output format | Structured output (Pydantic) | Enable in node config + define schema |
| State types | TypedDict + reducers | Define schema carefully in editor |
| Graph cycles | Termination conditions | Router + max iteration counter |

6.8 — Compile-Time Topology Validation

Before a workflow runs, the system should validate the graph structure to catch problems that would otherwise surface as runtime errors (infinite loops, unreachable nodes, and so on). Good workflow frameworks run these checks automatically at compile or export time.

| Check | What It Catches | Severity |
|---|---|---|
| Entry node has outgoing edges | Empty graph (entry goes nowhere) | Error |
| Terminal LLM node exists | No LLM node without outgoing edges — graph can never reach END | Error |
| No orphan LLM/Router nodes | Nodes with no incoming edge are unreachable | Error |
| Workers connected to LLM nodes | Disconnected workers will never be called | Error |
| Router fanout ≥ 2 | Single-output routers are pointless | Error |
| Loop target has loop_mode | LLM nodes with multiple incoming edges and no loop_mode configured | Error |
| Cycle with no Router | Guaranteed infinite loop (no decision point to exit) | Error |
| Cycle with Router but no exit | Router in cycle has all edges within the cycle | Error |
| Cycle with Router + exit path | Valid loop, but worth flagging for awareness | Warning |
| Continue mode with no feedback key | Loop re-entry with accumulated history but no new context | Warning |

Errors should block compilation. Warnings allow compilation but indicate potential issues worth investigating.
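A few of these checks are simple enough to sketch in a dozen lines. The function below is a hypothetical illustration of the first three rows of the table, not any framework's actual validator:

```python
def validate_topology(nodes: set, edges: list, entry: str) -> list:
    """Return error strings for a workflow graph given as (src, dst) edge pairs."""
    errors = []
    outgoing = {n: [] for n in nodes}
    incoming = {n: [] for n in nodes}
    for src, dst in edges:
        outgoing[src].append(dst)
        incoming[dst].append(src)

    # Entry node must lead somewhere.
    if not outgoing.get(entry):
        errors.append(f"entry node '{entry}' has no outgoing edges")

    # At least one terminal node (no outgoing edges) must exist to reach END.
    if nodes and all(outgoing[n] for n in nodes):
        errors.append("no terminal node: graph can never reach END")

    # Orphan nodes (no incoming edge, not the entry) are unreachable.
    for n in sorted(nodes):
        if n != entry and not incoming[n]:
            errors.append(f"node '{n}' is unreachable (no incoming edges)")
    return errors

errs = validate_topology({"a", "b", "c"}, [("a", "b")], entry="a")
print(errs)  # flags the unreachable node 'c'
```

The cycle checks in the table need a proper cycle-detection pass (e.g. DFS over strongly connected components), which is why they are best left to the framework.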

Chapter Summary

Key Takeaways:

- Cap every tool loop with max_tool_iterations; warn the agent at ~80% of the limit and force a final answer at the hard limit.
- Catch exceptions inside tools and return structured results ({"success": false, "error": "..."}) so the LLM can decide to retry, switch tools, or stop.
- Validate router decisions against the list of connected targets and fall back to a safe default when the router hallucinates a target.
- Use structured output (Pydantic schemas) whenever downstream logic depends on specific fields.
- Type your shared state, wrap single messages in lists, and read optional fields with defaults.
- Validate graph topology before running: errors block compilation, warnings flag risks worth investigating.
← Chapter 5: Information Flow Chapter 7: Putting It All Together →