Chapter 6 Validation & Error Handling
Making workflows robust against failure
LLMs are probabilistic. Tools can fail. APIs time out. Networks drop. An agentic workflow that works on the “happy path” but falls over at the first unexpected event is not production-ready.
This chapter covers the defensive design patterns that make workflows reliable: input validation, output validation, iteration limits, error handling in tools, and graceful degradation.
6.1 — What Can Go Wrong
Before discussing solutions, let’s catalog the failure modes:
| Failure Mode | Example | Impact |
|---|---|---|
| Runaway tool loops | Agent keeps calling tools without converging | Infinite execution, unbounded cost |
| Wrong tool selection | Agent calls delete_file instead of read_file | Incorrect or destructive actions |
| Tool execution failure | API returns 500, file not found, timeout | Agent gets error instead of data |
| Invalid routing | Router hallucinates a target node that doesn't exist | Execution goes to wrong place or crashes |
| State corruption | Node writes wrong type to a state field | Downstream nodes crash on type mismatch |
| Context overflow | Too many messages fill the context window | LLM loses early context, makes poor decisions |
| LLM refusal or hallucination | Model refuses to use a tool, invents fake output | Workflow proceeds with incorrect data |
6.2 — Iteration Limits: The Circuit Breaker
The iteration limit is the single most important safeguard in any agentic workflow.
Every LLM Node with tools has a max_tool_iterations setting
that caps how many times the tool loop can execute.
How It Works

```yaml
# In the agent configuration:
max_tool_iterations: 15
iteration_warning_message: "You are running low on tool calls. Wrap up your analysis and return a final answer."
```

```python
# In the generated Tool Node code:
def tool_2_node(global_state):
    current_iteration = global_state.get("node_2_tool_iteration_count", 0)

    # ── Hard limit: force stop ──
    if current_iteration >= 15:
        return Command(
            update={
                "messages": [HumanMessage(
                    content="You are out of tool calls. Now, based on everything "
                            "you have analyzed, return the final output."
                )]
            },
            goto="llm_2_node"  # Force LLM to give final answer
        )

    # ── Soft warning: nudge toward completion ──
    warnings = []
    if current_iteration >= 12:  # ~80% of limit
        # Warning is appended to messages alongside tool results
        warnings.append(HumanMessage(
            content="You are running low on tool calls. "
                    "Wrap up your analysis and return a final answer."
        ))

    # ── Normal: execute tool, increment counter ──
    result = execute_tool(tool_call)
    return Command(
        update={
            "node_2_tool_iteration_count": current_iteration + 1,
            "messages": [ToolMessage(content=result)] + warnings
        },
        goto="llm_2_node"
    )
```

Choosing Iteration Limits
| Agent Type | Suggested Limit | Rationale |
|---|---|---|
| Classifier (no tools) | 0 | No tools, no loop |
| Simple lookup agent | 5–10 | Query a database, format result |
| Investigation agent | 15–30 | Multiple data sources, iterative analysis |
| CTF exploit agent | 30–50 | Trial-and-error, multiple attempts |
| Aggressively autonomous agent | 50–100 | Long-running tasks, but monitor closely |
6.3 — Tool Error Handling
When a tool fails, the LLM should know what failed and why, so it can decide whether to retry, try a different approach, or give up gracefully. Never let exceptions bubble up silently.
Pattern: Structured Error Returns
```python
@tool
def check_ip_reputation(ip_address: str, state: dict = None) -> dict:
    """Check an IP address against threat intelligence databases."""
    try:
        result = threat_intel_api.lookup(ip_address)
        return {
            "success": True,
            "ip": ip_address,
            "reputation": result.score,
            "tags": result.tags,
            "error": None
        }
    except ConnectionError:
        return {
            "success": False,
            "ip": ip_address,
            "reputation": None,
            "tags": [],
            "error": "Could not reach threat intel API. Service may be down."
        }
    except ValueError as e:
        return {
            "success": False,
            "ip": ip_address,
            "reputation": None,
            "tags": [],
            "error": f"Invalid IP address format: {str(e)}"
        }
```
When the LLM sees "success": false and a clear error message,
it can reason about what to do next: retry, skip, or try an alternative tool.
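For transient failures, retrying inside the tool layer can spare the LLM a pointless round trip. A minimal sketch of a retry wrapper, assuming tools follow the {"success": ..., "error": ...} convention above (the substring check for retryable errors is a heuristic for illustration, not part of any framework):

```python
import time

def with_retries(tool_fn, max_attempts=3, backoff_seconds=1.0):
    """Wrap a structured-error tool so transient failures are retried.

    Treats connection-style errors ("Could not reach ...") as retryable;
    other errors, such as invalid input, are returned to the LLM immediately.
    """
    def wrapped(*args, **kwargs):
        delay = backoff_seconds
        for attempt in range(1, max_attempts + 1):
            result = tool_fn(*args, **kwargs)
            retryable = (not result["success"]
                         and "Could not reach" in (result["error"] or ""))
            if result["success"] or not retryable or attempt == max_attempts:
                return result
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
        return result
    return wrapped
```

The wrapper returns the last structured result either way, so the LLM still gets a clear error object when all attempts fail.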
Pattern: Tool Not Found
If the LLM requests a tool that doesn’t exist in the node’s tool set, the Tool Node returns an error message rather than crashing:
```python
# Generated Tool Node handles unknown tools:
def tool_2_node(global_state):
    tool_name = tool_call["name"]
    tool_func = tools_by_name_for_node_2.get(tool_name)

    if tool_func is None:
        result = (f"Error: Tool '{tool_name}' not found. "
                  f"Available tools: {list(tools_by_name_for_node_2.keys())}")
    else:
        result = tool_func.invoke(tool_call["args"])
```

6.4 — Router Validation
What if the Router LLM returns a next_node value that doesn’t
match any of the connected targets?
```python
# Router validation in generated code:
def router_3_node(global_state):
    decision = model_with_schema.invoke(messages)

    # Validate the decision
    available_targets = ["llm_4_node", "llm_5_node", "llm_6_node"]

    if decision.next_node in available_targets:
        # Valid choice
        return Command(
            update={"routing_reason": decision.reason},
            goto=decision.next_node
        )
    else:
        # Invalid choice — fallback to first target + warning
        print(f"⚠️ Router chose invalid target '{decision.next_node}'. "
              f"Falling back to '{available_targets[0]}'")
        return Command(
            update={"routing_reason": f"FALLBACK: {decision.reason}"},
            goto=available_targets[0]
        )
```

This ensures the workflow never crashes due to a routing hallucination. The fallback is safe (always a valid node), and the warning is logged so you can improve the router's prompt.
6.5 — Structured Output Validation
When you configure an LLM Node to use structured output (a Pydantic schema), you’re asking the LLM to return JSON matching that schema. This provides type-level validation automatically:
```python
# Define expected output structure:
class AlertClassification(BaseModel):
    category: str      # "malware", "intrusion", "misconfig"
    severity: str      # "critical", "high", "medium", "low"
    confidence: float  # 0.0 to 1.0
    reasoning: str     # Why this classification

# LangChain ensures the LLM's output matches the schema.
# If the LLM returns {"severity": 42} instead of a string,
# Pydantic validation catches it.
model_with_output = model.with_structured_output(AlertClassification)
result = model_with_output.invoke(messages)
# result.category   → str ✓
# result.severity   → str ✓
# result.confidence → float ✓
```

If a downstream prompt references {classification}, make sure the upstream agent outputs a structured result with a classification field — don't rely on parsing free text.
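Type checks alone won't catch a confidence of 3.5 or an empty reasoning string. Pydantic field constraints can tighten the schema further; a sketch under the assumption that the allowed category and severity values are exactly those in the comments above:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class AlertClassificationStrict(BaseModel):
    category: Literal["malware", "intrusion", "misconfig"]
    severity: Literal["critical", "high", "medium", "low"]
    confidence: float = Field(ge=0.0, le=1.0)  # reject out-of-range scores
    reasoning: str = Field(min_length=1)       # reject empty rationales

# confidence=3.5 or category="phishing" now raises ValidationError
# at parse time instead of corrupting downstream state.
```

The stricter the schema, the more invalid outputs are caught at the boundary rather than several nodes later.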
6.6 — State Type Safety
When your shared state has typed fields, you get type-level safety. Common type mismatches to watch for:
| Mistake | Symptom | Fix |
|---|---|---|
| Writing "5" to an int field | TypeError on arithmetic downstream | Ensure tools return correct types |
| Writing a single message to a List field | TypeError: 'AIMessage' is not iterable | Wrap in a list: [message] |
| Forgetting to initialize a field | KeyError when reading | Use .get(field, default) |
When using a typed state schema, the framework can catch type mismatches at compile time. As long as your tools return the right types, the system stays consistent.
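A typed state schema with a list reducer might look like the following sketch. The field names mirror the examples earlier in this chapter; using operator.add as the reducer is one common choice, shown here with a hand-rolled merge function purely to illustrate the semantics:

```python
import operator
from typing import Annotated, TypedDict

class WorkflowState(TypedDict):
    # Annotated reducer: updates are appended, so nodes must return
    # lists — writing a bare message is exactly the mistake in the
    # table above.
    messages: Annotated[list, operator.add]
    node_2_tool_iteration_count: int
    routing_reason: str

def apply_update(state: WorkflowState, update: dict) -> WorkflowState:
    """Illustrative merge: append to reduced fields, overwrite the rest."""
    merged = dict(state)
    for key, value in update.items():
        if key == "messages":
            merged[key] = state["messages"] + value  # reducer semantics
        else:
            merged[key] = value
    return merged
```

In a real graph the framework applies the reducer for you; the point of the sketch is that reduced fields accumulate while plain fields are replaced.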
6.7 — Defense in Depth Checklist
Before deploying a workflow, verify these safeguards are in place:
| Layer | Safeguard | How to Configure |
|---|---|---|
| Tool loops | max_tool_iterations | Set per-agent in the workflow editor (default: 30) |
| Tool loops | Iteration warning message | Custom message at ~80% of limit |
| Tool failures | Try/except in tool implementations | Return {"success": false, "error": "..."} |
| Routing | Router fallback validation | Auto-generated by the framework |
| Output format | Structured output (Pydantic) | Enable in node config + define schema |
| State types | TypedDict + reducers | Define schema carefully in editor |
| Graph cycles | Termination conditions | Router + max iteration counter |
6.8 — Compile-Time Topology Validation
Before a workflow runs, the system should validate the graph structure to catch problems that would otherwise surface as runtime errors (infinite loops, unreachable nodes, and so on). Good workflow frameworks run these checks automatically at compile time or export time.
| Check | What It Catches | Severity |
|---|---|---|
| Entry node has outgoing edges | Empty graph (entry goes nowhere) | Error |
| Terminal LLM node exists | No LLM node without outgoing edges — graph can never reach END | Error |
| No orphan LLM/Router nodes | Nodes with no incoming edge are unreachable | Error |
| Workers connected to LLM nodes | Disconnected workers will never be called | Error |
| Router fanout ≥ 2 | Single-output routers are pointless | Error |
| Loop target has loop_mode | LLM nodes with multiple incoming edges and no loop_mode configured | Error |
| Cycle with no Router | Guaranteed infinite loop (no decision point to exit) | Error |
| Cycle with Router but no exit | Router in cycle has all edges within the cycle | Error |
| Cycle with Router + exit path | Valid loop, but worth flagging for awareness | Warning |
| Continue mode with no feedback key | Loop re-entry with accumulated history but no new context | Warning |
Errors should block compilation. Warnings allow compilation but indicate potential issues worth investigating.
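A few of these checks — entry edges, orphan nodes, and a missing terminal node — can be sketched over a plain edge list. The dict-free graph representation below is an assumption for illustration, not the framework's internal model:

```python
def validate_topology(nodes, edges, entry):
    """Return a list of error strings for a graph given as node names,
    (src, dst) edge pairs, and an entry node name.
    Illustrative subset of the checks in the table above."""
    errors = []
    sources = {s for s, _ in edges}
    targets = {t for _, t in edges}

    # Entry node has outgoing edges
    if entry not in sources:
        errors.append("entry node has no outgoing edges")

    # No orphan nodes: every non-entry node needs an incoming edge
    for n in nodes:
        if n != entry and n not in targets:
            errors.append(f"orphan node: {n}")

    # Terminal node exists: at least one node with no outgoing edges
    if not any(n not in sources for n in nodes):
        errors.append("no terminal node — graph can never reach END")

    return errors
```

Cycle-related checks need a real graph traversal (e.g. DFS for back edges) and are omitted here for brevity.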
Chapter Summary
- Agentic workflows have many failure modes: runaway loops, tool errors, bad routing, type mismatches, context overflow.
- max_tool_iterations is the most important safeguard. Set it for every node with tools. Start at 15.
- Use soft warnings at ~80% of the limit to nudge the LLM toward a final answer.
- Tools should return structured error objects ({"success": false, "error": "..."}), not throw exceptions.
- Router validation falls back to the first target if the LLM hallucinates a non-existent node.
- Structured output (Pydantic schemas) provides type-level validation for LLM outputs.
- Apply defense in depth: multiple layers of safeguards catch different failure modes.