Chapter 6 Validation & Error Handling
Making workflows robust against failure
LLMs are probabilistic. Tools can fail. APIs time out. Networks drop. An agentic workflow that works on the “happy path” but falls over at the first unexpected event is not production-ready.
This chapter covers the defensive design patterns that make workflows reliable: input validation, output validation, iteration limits, error handling in tools, and graceful degradation.
6.1 — What Can Go Wrong
Before discussing solutions, let’s catalog the failure modes:
| Failure Mode | Example | Impact |
|---|---|---|
| Runaway tool loops | Agent keeps calling tools without converging | Infinite execution, unbounded cost |
| Wrong tool selection | Agent calls delete_file instead of read_file | Incorrect or destructive actions |
| Tool execution failure | API returns 500, file not found, timeout | Agent gets error instead of data |
| Invalid routing | Router hallucinates a target node that doesn't exist | Execution goes to wrong place or crashes |
| State corruption | Node writes wrong type to a state field | Downstream nodes crash on type mismatch |
| Context overflow | Too many messages fill the context window | LLM loses early context, makes poor decisions |
| LLM refusal or hallucination | Model refuses to use a tool, invents fake output | Workflow proceeds with incorrect data |
6.2 — Iteration Limits: The Circuit Breaker
The iteration limit is the single most important safeguard in any agentic workflow.
Every LLM Node with tools has a max_tool_iterations setting
that caps how many times the tool loop can execute.
How It Works

```yaml
# In the agent configuration:
max_tool_iterations: 15
iteration_warning_message: "You are running low on tool calls. Wrap up your analysis and return a final answer."
```

```python
# In the generated Tool Node code:
def tool_2_node(global_state):
    current_iteration = global_state.get("node_2_tool_iteration_count", 0)

    # ── Hard limit: force stop ──
    if current_iteration >= 15:
        return Command(
            update={
                "messages": [HumanMessage(
                    content="You are out of tool calls. Now, based on everything "
                            "you have analyzed, return the final output."
                )]
            },
            goto="llm_2_node"  # Force LLM to give final answer
        )

    # ── Soft warning: nudge toward completion ──
    warnings = []
    if current_iteration >= 12:  # ~80% of limit
        # Warning is appended to messages alongside tool results
        warnings.append(HumanMessage(
            content="You are running low on tool calls. "
                    "Wrap up your analysis and return a final answer."
        ))

    # ── Normal: execute tool, increment counter ──
    result = execute_tool(tool_call)
    return Command(
        update={
            "node_2_tool_iteration_count": current_iteration + 1,
            "messages": [ToolMessage(content=result)] + warnings
        },
        goto="llm_2_node"
    )
```

Choosing Iteration Limits
| Agent Type | Suggested Limit | Rationale |
|---|---|---|
| Classifier (no tools) | 0 | No tools, no loop |
| Simple lookup agent | 5–10 | Query a database, format result |
| Investigation agent | 15–30 | Multiple data sources, iterative analysis |
| CTF exploit agent | 30–50 | Trial-and-error, multiple attempts |
| Aggressively autonomous agent | 50–100 | Long-running tasks, but monitor closely |
6.3 — Tool Error Handling
When a tool fails, the LLM should know what failed and why, so it can decide whether to retry, try a different approach, or give up gracefully. Never let exceptions bubble up silently.
Pattern: Structured Error Returns
```python
@tool
def check_ip_reputation(ip_address: str, state: dict = None) -> dict:
    """Check an IP address against threat intelligence databases."""
    try:
        result = threat_intel_api.lookup(ip_address)
        return {
            "success": True,
            "ip": ip_address,
            "reputation": result.score,
            "tags": result.tags,
            "error": None
        }
    except ConnectionError:
        return {
            "success": False,
            "ip": ip_address,
            "reputation": None,
            "tags": [],
            "error": "Could not reach threat intel API. Service may be down."
        }
    except ValueError as e:
        return {
            "success": False,
            "ip": ip_address,
            "reputation": None,
            "tags": [],
            "error": f"Invalid IP address format: {str(e)}"
        }
```
When the LLM sees "success": false and a clear error message,
it can reason about what to do next: retry, skip, or try an alternative tool.
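For transient failures, retrying inside the tool layer can spare the LLM a pointless round trip. A minimal sketch of a retry wrapper, assuming tools follow the {"success": ..., "error": ...} convention above (the substring check for retryable errors is a heuristic for illustration, not part of any framework):

```python
import time

def with_retries(tool_fn, max_attempts=3, backoff_seconds=1.0):
    """Wrap a structured-error tool so transient failures are retried.

    Treats connection-style errors ("Could not reach ...") as retryable;
    other errors, such as invalid input, are returned to the LLM immediately.
    """
    def wrapped(*args, **kwargs):
        delay = backoff_seconds
        for attempt in range(1, max_attempts + 1):
            result = tool_fn(*args, **kwargs)
            retryable = (not result["success"]
                         and "Could not reach" in (result["error"] or ""))
            if result["success"] or not retryable or attempt == max_attempts:
                return result
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
        return result
    return wrapped
```

The wrapper returns the last structured result either way, so the LLM still gets a clear error object when all attempts fail.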
Pattern: Tool Not Found
If the LLM requests a tool that doesn’t exist in the node’s tool set, the Tool Node returns an error message rather than crashing:
```python
# Generated Tool Node handles unknown tools:
def tool_2_node(global_state):
    tool_name = tool_call["name"]
    tool_func = tools_by_name_for_node_2.get(tool_name)

    if tool_func is None:
        result = (f"Error: Tool '{tool_name}' not found. "
                  f"Available tools: {list(tools_by_name_for_node_2.keys())}")
    else:
        result = tool_func.invoke(tool_call["args"])
```

6.4 — Router Validation
What if the Router LLM returns a next_node value that doesn’t
match any of the connected targets?
```python
# Router validation in generated code:
def router_3_node(global_state):
    decision = model_with_schema.invoke(messages)

    # Validate the decision
    available_targets = ["llm_4_node", "llm_5_node", "llm_6_node"]

    if decision.next_node in available_targets:
        # Valid choice
        return Command(
            update={"routing_reason": decision.reason},
            goto=decision.next_node
        )
    else:
        # Invalid choice — fallback to first target + warning
        print(f"⚠️ Router chose invalid target '{decision.next_node}'. "
              f"Falling back to '{available_targets[0]}'")
        return Command(
            update={"routing_reason": f"FALLBACK: {decision.reason}"},
            goto=available_targets[0]
        )
```

This ensures the workflow never crashes due to a routing hallucination. The fallback is safe (always a valid node), and the warning is logged so you can improve the router's prompt.
6.5 — Structured Output Validation
When you configure an LLM Node to use structured output (a Pydantic schema), you’re asking the LLM to return JSON matching that schema. This provides type-level validation automatically:
```python
# Define expected output structure:
class AlertClassification(BaseModel):
    category: str      # "malware", "intrusion", "misconfig"
    severity: str      # "critical", "high", "medium", "low"
    confidence: float  # 0.0 to 1.0
    reasoning: str     # Why this classification

# LangChain ensures the LLM's output matches the schema.
# If the LLM returns {"severity": 42} instead of a string,
# Pydantic validation catches it.
model_with_output = model.with_structured_output(AlertClassification)
result = model_with_output.invoke(messages)
# result.category   → str ✓
# result.severity   → str ✓
# result.confidence → float ✓
```

If a downstream prompt references {classification}, make sure the upstream agent outputs a structured result with a classification field — don't rely on parsing free text.
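Type checks alone won't catch a confidence of 3.5 or an empty reasoning string. Pydantic field constraints can tighten the schema further; a sketch under the assumption that the allowed category and severity values are exactly those in the comments above:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class AlertClassificationStrict(BaseModel):
    category: Literal["malware", "intrusion", "misconfig"]
    severity: Literal["critical", "high", "medium", "low"]
    confidence: float = Field(ge=0.0, le=1.0)  # reject out-of-range scores
    reasoning: str = Field(min_length=1)       # reject empty rationales

# confidence=3.5 or category="phishing" now raises ValidationError
# at parse time instead of corrupting downstream state.
```

The stricter the schema, the more invalid outputs are caught at the boundary rather than several nodes later.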
6.6 — State Type Safety
When your shared state has typed fields, you get type-level safety. Common type mismatches to watch for:
| Mistake | Symptom | Fix |
|---|---|---|
| Writing "5" to an int field | TypeError on arithmetic downstream | Ensure tools return correct types |
| Writing a single message to a List field | TypeError: 'AIMessage' is not iterable | Wrap in a list: [message] |
| Forgetting to initialize a field | KeyError when reading | Use .get(field, default) |
When using a typed state schema, the framework can catch type mismatches at compile time. As long as your tools return the right types, the system stays consistent.
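A typed state schema with a list reducer might look like the following sketch. The field names mirror the examples earlier in this chapter; using operator.add as the reducer is one common choice, shown here with a hand-rolled merge function purely to illustrate the semantics:

```python
import operator
from typing import Annotated, TypedDict

class WorkflowState(TypedDict):
    # Annotated reducer: updates are appended, so nodes must return
    # lists — writing a bare message is exactly the mistake in the
    # table above.
    messages: Annotated[list, operator.add]
    node_2_tool_iteration_count: int
    routing_reason: str

def apply_update(state: WorkflowState, update: dict) -> WorkflowState:
    """Illustrative merge: append to reduced fields, overwrite the rest."""
    merged = dict(state)
    for key, value in update.items():
        if key == "messages":
            merged[key] = state["messages"] + value  # reducer semantics
        else:
            merged[key] = value
    return merged
```

In a real graph the framework applies the reducer for you; the point of the sketch is that reduced fields accumulate while plain fields are replaced.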
6.7 — Defense in Depth Checklist
Before deploying a workflow, verify these safeguards are in place:
| Layer | Safeguard | How to Configure |
|---|---|---|
| Tool loops | max_tool_iterations | Set per-agent in the workflow editor (default: 30) |
| Tool loops | Iteration warning message | Custom message at ~80% of limit |
| Tool failures | Try/except in tool implementations | Return {"success": false, "error": "..."} |
| Routing | Router fallback validation | Auto-generated by the framework |
| Output format | Structured output (Pydantic) | Enable in node config + define schema |
| State types | TypedDict + reducers | Define schema carefully in editor |
| Graph cycles | Termination conditions | Router + max iteration counter |
6.8 — Compile-Time Topology Validation
Before a workflow runs, the system should validate the graph structure to catch problems that would otherwise surface as runtime errors (infinite loops, unreachable nodes, and so on). Good workflow frameworks run these checks automatically at compile time or export time.
| Check | What It Catches | Severity |
|---|---|---|
| Entry node has outgoing edges | Empty graph (entry goes nowhere) | Error |
| Terminal LLM node exists | No LLM node without outgoing edges — graph can never reach END | Error |
| No orphan LLM/Router nodes | Nodes with no incoming edge are unreachable | Error |
| Workers connected to LLM nodes | Disconnected workers will never be called | Error |
| Router fanout ≥ 2 | Single-output routers are pointless | Error |
| Loop target has loop_mode | LLM nodes with multiple incoming edges and no loop_mode configured | Error |
| Cycle with no Router | Guaranteed infinite loop (no decision point to exit) | Error |
| Cycle with Router but no exit | Router in cycle has all edges within the cycle | Error |
| Cycle with Router + exit path | Valid loop, but worth flagging for awareness | Warning |
| Continue mode with no feedback key | Loop re-entry with accumulated history but no new context | Warning |
Errors should block compilation. Warnings allow compilation but indicate potential issues worth investigating.
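A few of these checks — entry edges, orphan nodes, and a missing terminal node — can be sketched over a plain edge list. The dict-free graph representation below is an assumption for illustration, not the framework's internal model:

```python
def validate_topology(nodes, edges, entry):
    """Return a list of error strings for a graph given as node names,
    (src, dst) edge pairs, and an entry node name.
    Illustrative subset of the checks in the table above."""
    errors = []
    sources = {s for s, _ in edges}
    targets = {t for _, t in edges}

    # Entry node has outgoing edges
    if entry not in sources:
        errors.append("entry node has no outgoing edges")

    # No orphan nodes: every non-entry node needs an incoming edge
    for n in nodes:
        if n != entry and n not in targets:
            errors.append(f"orphan node: {n}")

    # Terminal node exists: at least one node with no outgoing edges
    if not any(n not in sources for n in nodes):
        errors.append("no terminal node — graph can never reach END")

    return errors
```

Cycle-related checks need a real graph traversal (e.g. DFS for back edges) and are omitted here for brevity.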
Chapter Summary
- Agentic workflows have many failure modes: runaway loops, tool errors, bad routing, type mismatches, context overflow.
- max_tool_iterations is the most important safeguard. Set it for every node with tools. Start at 15.
- Use soft warnings at ~80% of the limit to nudge the LLM toward a final answer.
- Tools should return structured error objects ({"success": false, "error": "..."}), not throw exceptions.
- Router validation falls back to the first target if the LLM hallucinates a non-existent node.
- Structured output (Pydantic schemas) provides type-level validation for LLM outputs.
- Apply defense in depth: multiple layers of safeguards catch different failure modes.