AI agent error handling patterns
A practical guide to error handling for AI agents — retry strategies, fallback behaviors, cost spike prevention, and graceful degradation patterns.
OpenAI’s best practices for GPT applications cover prompt engineering patterns that reduce errors — including the structured output and retry strategies recommended in this guide.
TL;DR: Your agent will fail — the difference between demo and production is how you handle failures. This post covers 6 battle-tested patterns: retry with exponential backoff, model fallback, circuit breakers, cost caps, graceful degradation, and structured logging. Implement these to cut your incident rate from weekly to monthly.
Your agent will fail. Not sometimes. Regularly.
I learned this the hard way. My first production agent ran beautifully in testing. In production, it failed within an hour — infinite loop on a tool call, burned through ₹800 in API costs before I noticed.
The difference between a demo agent and a production agent isn’t the model or the prompt. It’s the error handling. A demo agent works when everything goes right. A production agent works because it handles everything going wrong.
Here are the error handling patterns I’ve battle-tested across shipping agents for clients. These patterns cut my incident rate from weekly to monthly.
Key takeaways:
- Three categories of agent failures: LLM API errors, tool execution failures, and logic/reasoning errors
- Retry with exponential backoff and model fallback for API errors
- Circuit breaker pattern prevents cascading failures
- Structured logging with context is essential for debugging
The three types of agent failures
Every agent failure I’ve seen falls into one of three categories:
1. LLM API errors
The API is down. You hit a rate limit. The model is overloaded. Your request timed out.
These are the easiest to handle because they’re predictable. Every LLM provider has documented error codes and rate limits.
2. Tool execution failures
Your agent tries to read a file that doesn’t exist. The API it calls returns a 500. The database query times out. The shell command fails.
These are harder because the agent has to interpret the error and decide what to do next.
3. Logic/reasoning errors
The agent loops on the same tool call. It misreads the tool output and picks the wrong branch. It hallucinates a tool that doesn’t exist. It goes off on a tangent and never comes back to the task.
These are the hardest to catch because nothing technically fails — the agent just produces wrong or useless output.
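Since nothing raises an exception in this category, you need an explicit check. Here is a minimal sketch of one such check, a detector for the "loops on the same tool call" case. The `LoopDetector` name and the `max_repeats` threshold are my own illustrative assumptions, not part of any library:

```python
import json
from collections import Counter


class LoopDetector:
    """Flags a run when the same tool call (name + arguments) repeats too often."""

    def __init__(self, max_repeats=3):
        # Hypothetical threshold: how many identical calls are tolerated per run
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, tool_name, tool_input):
        """Record one tool call; return True once it has exceeded max_repeats."""
        # Sorted keys make the fingerprint independent of argument order
        fingerprint = (tool_name, json.dumps(tool_input, sort_keys=True, default=str))
        self.seen[fingerprint] += 1
        return self.seen[fingerprint] > self.max_repeats
```

Call `record()` before executing each tool call and stop the run (or inject a corrective message) when it returns `True`. It only catches exact repeats, so treat it as a first line of defense rather than a full reasoning check.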
Pattern 1: Retry with exponential backoff
For LLM API errors, the simplest and most effective pattern is exponential backoff with jitter:
import random
import time

from anthropic import Anthropic, APITimeoutError

RETRYABLE_STATUS_CODES = (429, 500, 502, 503, 529)


def call_llm_with_retry(client, messages, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                messages=messages,
            )
        except Exception as e:
            error_code = getattr(e, "status_code", 0)
            # Retry on rate limits, server errors, and timeouts. The SDK raises its
            # own APITimeoutError, which is not a subclass of the builtin TimeoutError.
            retryable = error_code in RETRYABLE_STATUS_CODES or isinstance(
                e, (APITimeoutError, TimeoutError)
            )
            # Don't retry client errors (400, 401, 403, 404), unrecognized errors,
            # or anything that still fails on the final attempt
            if not retryable or attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"API error {error_code}, retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)
    raise ValueError("max_retries must be at least 1")
Expected impact: In my experience, this pattern alone absorbs about 90% of transient API errors. Rate limits almost always clear within 3 retries.
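To see what the retry loop costs in wall-clock time, it helps to compute the sleep schedule on its own. The sketch below uses the same formula as the retry function (`base_delay * 2 ** attempt` plus up to 0.5s of jitter). The `max_delay` cap and the `rng` parameter are assumptions I've added for illustration; they are not in the function above:

```python
import random


def backoff_delays(max_retries=3, base_delay=1.0, max_delay=30.0, jitter=0.5, rng=random):
    """Return the sleep before each retry: base_delay * 2**attempt plus jitter, capped."""
    delays = []
    # There is no sleep after the final attempt, so the schedule has max_retries - 1 entries
    for attempt in range(max_retries - 1):
        delay = base_delay * (2 ** attempt) + rng.uniform(0, jitter)
        # max_delay is an assumed cap so large retry counts don't sleep for minutes
        delays.append(min(delay, max_delay))
    return delays
```

With the defaults that is two sleeps, of roughly 1s and 2s, so a call gives up after about 3 to 4 seconds of waiting. If your rate limits reset on a longer window than that, raise `base_delay` or `max_retries` accordingly.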
Pattern 2: Model fallback
Some errors are model-specific. The model might be overloaded, or a specific model might consistently fail on a particular task. Fall back to a different model:
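# (provider, model) pairs in priority order. The GPT models need an OpenAI
# client, so each provider gets its own call function.
MODEL_PRIORITY = [
    ("anthropic", "claude-sonnet-4-20250514"),
    ("anthropic", "claude-3-haiku-20240307"),
    ("openai", "gpt-4o"),
    ("openai", "gpt-4o-mini"),
]


def call_with_fallback(callers, messages, tools=None):
    # callers maps a provider name to a function (model, messages, tools) -> response
    # that wraps that provider's SDK and translates messages and tool schemas
    errors = []
    for provider, model in MODEL_PRIORITY:
        try:
            return callers[provider](model, messages, tools)
        except Exception as e:
            errors.append(f"{model}: {str(e)[:100]}")
            print(f"Model {model} failed, falling back...")
    raise RuntimeError("All models failed:\n" + "\n".join(errors))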
When to use: This is critical for high-availability agents. If you’re running a customer-facing agent, you can’t have it go down because Claude is having an outage. Fall back to GPT-4o and keep running.
Cost implication: Your fallback model might be more expensive or less capable. Track which model was used and log any fallback events for later review.
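One way to do that tracking is to return the model name alongside the response and log a warning whenever a fallback served the request. This is a sketch, not a drop-in replacement: `call_and_record_fallback`, the `call_model` callable, and the `agent.fallback` logger name are all hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger("agent.fallback")


def call_and_record_fallback(models, call_model):
    """Try each model in order; return (response, model_used, failed_models)."""
    failed = []
    for model in models:
        try:
            # call_model is an assumed callable: model name -> response
            response = call_model(model)
        except Exception as exc:
            failed.append(model)
            logger.warning("model %s failed: %s", model, str(exc)[:100])
            continue
        if failed:
            # A fallback event: the primary did not serve this request
            logger.warning("served by fallback model %s after %s failed", model, failed)
        return response, model, failed
    raise RuntimeError(f"All models failed: {failed}")
```

Store `model_used` with the run so cost calculations use the right rate, and review the warning log periodically to spot a primary model that is failing more than you expect.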
Pattern 3: Circuit breaker for agents
The circuit breaker pattern prevents a failing agent from repeatedly hitting the same error. After N consecutive failures, the circuit “opens” and subsequent calls fail fast without hitting the LLM:
from datetime import datetime, timedelta


class AgentCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def call(self, agent_fn, *args, **kwargs):
        if self.state == "open":
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.recovery_timeout):
                self.state = "half-open"
                print("Circuit breaker: half-open, trying one request")
            else:
                raise Exception("Circuit breaker is open. Agent is unavailable.")
        try:
            result = agent_fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = datetime.now()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
                print(f"Circuit breaker: OPEN after {self.failure_count} failures")
            raise
        # Any success resets the count, so only consecutive failures trip the breaker
        if self.state == "half-open":
            self.state = "closed"
            print("Circuit breaker: closed again")
        self.failure_count = 0
        return result
Expected savings: A circuit breaker caught a bug in one of my agents where a malformed tool response was causing the agent to retry every 30 seconds. Without the breaker, that agent would have run ₹200/day in failed attempts. With it, it failed fast and I got alerted within minutes.
Pattern 4: Cost caps and alerts
Cost spikes are the silent killer of agent deployments. An agent that loops without producing useful output can burn through your API budget before you notice:
class CostTracker:
    def __init__(self, max_cost_per_run=100):  # ₹100 max per run
        self.total_cost = 0.0
        self.max_cost_per_run = max_cost_per_run
        self.per_run_costs = {}

    def track(self, run_id, input_tokens, output_tokens, model="claude-sonnet"):
        # Approximate pricing in INR per 1K tokens
        rates = {
            "claude-sonnet": {"input": 0.25, "output": 1.25},
            "claude-haiku": {"input": 0.03, "output": 0.15},
            "gpt-4o": {"input": 0.20, "output": 0.80},
            "gpt-4o-mini": {"input": 0.01, "output": 0.04},
        }
        # Unknown models are billed at the claude-sonnet rate
        rate = rates.get(model, rates["claude-sonnet"])
        cost = (input_tokens / 1000 * rate["input"]) + (output_tokens / 1000 * rate["output"])
        self.total_cost += cost
        # Track per-run cost
        self.per_run_costs[run_id] = self.per_run_costs.get(run_id, 0.0) + cost
        # Abort the run if it exceeds the per-run limit
        if self.per_run_costs[run_id] > self.max_cost_per_run:
            raise Exception(
                f"Run {run_id} exceeded cost limit of ₹{self.max_cost_per_run}. "
                f"Current cost: ₹{self.per_run_costs[run_id]:.2f}"
            )
        return cost
Four cost guards I use on every production agent:
- Per-run token budget — max 50,000 tokens per agent run
- Circuit breaker — stops after 3 consecutive failures
- Concurrent run cap — max 5 concurrent agent executions
- Daily cost alert — email/Slack if daily cost exceeds ₹500
These four guards have stopped every cost spike I’ve encountered in the last 6 months.
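The cost tracker above covers spend per run, and the circuit breaker is Pattern 3. For the other two in-process guards, the token budget and the concurrent run cap, here is a rough sketch of how they might look. `RunGuards` and `TokenBudgetExceeded` are hypothetical names, and rejecting (rather than queueing) extra runs is a design choice I'm assuming for the example:

```python
import threading


class TokenBudgetExceeded(Exception):
    """Raised when a single run uses more tokens than its budget allows."""


class RunGuards:
    def __init__(self, max_tokens_per_run=50_000, max_concurrent_runs=5):
        self.max_tokens_per_run = max_tokens_per_run
        self._slots = threading.BoundedSemaphore(max_concurrent_runs)
        self._tokens = {}
        self._lock = threading.Lock()

    def start_run(self, run_id):
        # Non-blocking acquire: reject the run instead of queueing it
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("Too many concurrent agent runs")
        with self._lock:
            self._tokens[run_id] = 0

    def add_tokens(self, run_id, input_tokens, output_tokens):
        """Add usage from one LLM call; raise once the run is over budget."""
        with self._lock:
            self._tokens[run_id] += input_tokens + output_tokens
            used = self._tokens[run_id]
        if used > self.max_tokens_per_run:
            raise TokenBudgetExceeded(
                f"Run {run_id} used {used} tokens (limit {self.max_tokens_per_run})"
            )
        return used

    def end_run(self, run_id):
        # Call this in a finally block so the slot is always released
        with self._lock:
            self._tokens.pop(run_id, None)
        self._slots.release()
```

This only works within one process. If you run agents across several workers, the counters would need to live somewhere shared, such as a database or cache.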
Pattern 5: Graceful degradation
When a tool fails, the agent needs to decide what to do next. Should it retry? Try a different tool? Report the failure to the user? The answer depends on the tool and the context:
def _tool_failure(error_type, message, suggested_fix, degradation):
    """Build the structured error object returned for a failed tool call."""
    return {
        "success": False,
        "result": None,
        "error": {
            "type": error_type,
            "message": message,
            "suggested_fix": suggested_fix,
            "degradation": degradation,
        },
    }


def safe_tool_call(tool_fn, *args, context=None, **kwargs):
    """Execute a tool call with graceful degradation."""
    try:
        result = tool_fn(*args, **kwargs)
        return {"success": True, "result": result, "error": None}
    except FileNotFoundError:
        # Skip this tool and continue
        return _tool_failure(
            "file_not_found", f"File not found: {args}",
            "Check if the path is correct and the file exists", "skip",
        )
    except PermissionError:
        # Try an alternative approach
        return _tool_failure(
            "permission_denied", f"No permission to access: {args}",
            "Check file permissions or use a different path", "fallback",
        )
    except ConnectionError:
        # Can't proceed without this tool
        return _tool_failure(
            "network_error", "Could not connect to service",
            "Check network connectivity or retry later", "retry_later",
        )
    except Exception as e:
        # Report to user
        return _tool_failure(
            "unexpected_error", str(e)[:200],
            "Check logs for details", "report",
        )
The key insight: return a structured error object that the LLM can understand and act on. The degradation field tells the LLM how to proceed:
- skip — this tool failed, but the agent can continue without it
- fallback — try a different approach or tool
- retry_later — this step is essential but can be retried
- report — critical failure, inform the user
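The LLM reads the degradation field, but the agent loop can act on it too. As a sketch of that mapping, the function below turns a result in the shape shown above into the loop's next move. The action names (`continue`, `try_alternative`, `retry`, `abort`) and the `retries_left` parameter are assumptions made for illustration:

```python
def next_step(tool_result, retries_left=1):
    """Map a structured tool result to the agent loop's next move."""
    if tool_result["success"]:
        return "continue"
    mode = tool_result["error"]["degradation"]
    if mode == "skip":
        # The agent can carry on without this tool's output
        return "continue"
    if mode == "fallback":
        return "try_alternative"
    if mode == "retry_later":
        # Essential step: retry while the (assumed) retry budget lasts
        return "retry" if retries_left > 0 else "abort"
    # "report" and any unknown mode: stop and inform the user
    return "abort"
```

Treating unknown modes like `report` is the conservative choice: a typo in a degradation value stops the run instead of silently continuing.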
Pattern 6: Log everything with context
You can’t debug agent failures without logs. But agent logs are different from regular application logs — they need to capture the decision-making process:
import json
import os
from datetime import datetime


class AgentLogger:
    def __init__(self, agent_name, run_id):
        self.agent_name = agent_name
        self.run_id = run_id
        self.events = []

    def log_llm_call(self, model, messages, response, latency_ms):
        self.events.append({
            "type": "llm_call",
            "timestamp": datetime.now().isoformat(),
            "model": model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "latency_ms": latency_ms,
            "has_tool_calls": any(b.type == "tool_use" for b in response.content),
            "run_id": self.run_id,
        })

    def log_tool_call(self, tool_name, args, result, latency_ms, success):
        self.events.append({
            "type": "tool_call",
            "timestamp": datetime.now().isoformat(),
            "tool": tool_name,
            "args": args,
            "result_truncated": str(result)[:500],
            "latency_ms": latency_ms,
            "success": success,
            "run_id": self.run_id,
        })

    def log_routing_decision(self, from_node, to_node, reason):
        self.events.append({
            "type": "routing",
            "timestamp": datetime.now().isoformat(),
            "from": from_node,
            "to": to_node,
            "reason": reason,
            "run_id": self.run_id,
        })

    def flush(self):
        # Write to file or send to logging service
        os.makedirs("logs", exist_ok=True)
        with open(f"logs/agent_{self.agent_name}_{self.run_id}.jsonl", "a") as f:
            for event in self.events:
                # default=str keeps non-JSON-serializable tool args from crashing the logger
                f.write(json.dumps(event, default=str) + "\n")
        # Clear the buffer so a second flush doesn't write duplicates
        self.events = []
What to log in every agent run:
- Every LLM API call — model, tokens, latency, whether tools were called
- Every tool call — tool name, arguments, result (truncated), success/failure
- Every routing decision — which node is next and why
- Every retry or fallback — what failed and what the recovery action was
- Start and end timestamps — to calculate total cost and duration
With these logs, you can replay any agent run and understand exactly what happened. Without them, debugging is guessing.
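Replaying a run can start with something as small as a summary function over the JSONL file. The sketch below assumes events in the shape the logger above writes (`type`, `tool`, `success`, `from`, `to`, token counts); `summarize_run` and the `path` entry format are hypothetical:

```python
import json


def summarize_run(lines):
    """Summarize one run from an iterable of JSONL lines (for example, an open file)."""
    summary = {
        "llm_calls": 0,
        "input_tokens": 0,
        "output_tokens": 0,
        "tool_calls": 0,
        "failed_tools": [],
        "path": [],  # execution path in the order events were logged
    }
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        event = json.loads(line)
        if event["type"] == "llm_call":
            summary["llm_calls"] += 1
            summary["input_tokens"] += event["input_tokens"]
            summary["output_tokens"] += event["output_tokens"]
            summary["path"].append("llm_call")
        elif event["type"] == "tool_call":
            summary["tool_calls"] += 1
            summary["path"].append(f"tool:{event['tool']}")
            if not event["success"]:
                summary["failed_tools"].append(event["tool"])
        elif event["type"] == "routing":
            summary["path"].append(f"{event['from']}->{event['to']}")
    return summary
```

Pass it an open log file (`with open(path) as f: summarize_run(f)`) to get token totals for cost estimates and the ordered path for tracing what the agent did.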
Putting it all together
Here’s the skeleton of a production agent with all error handling patterns applied:
import time

from anthropic import Anthropic


class ProductionAgent:
    def __init__(self, agent_name):
        self.agent_name = agent_name
        self.llm_client = Anthropic()
        self.circuit_breaker = AgentCircuitBreaker()
        self.cost_tracker = CostTracker()

    def run(self, task, run_id):
        logger = AgentLogger(self.agent_name, run_id)
        try:
            # Pass run_id through so the cost tracker can attribute spend to this run
            return self.circuit_breaker.call(self._execute_agent, task, run_id, logger)
        except Exception as e:
            logger.log_routing_decision("agent", "failed", str(e)[:200])
            return {
                "status": "failed",
                "error": str(e)[:500],
                "run_id": run_id,
                "logs": f"logs/agent_{self.agent_name}_{run_id}.jsonl",
            }
        finally:
            # Flush on both the success and failure paths
            logger.flush()

    def _execute_agent(self, task, run_id, logger):
        # The core agent loop with error handling
        messages = [{"role": "user", "content": task}]
        for turn in range(10):
            start = time.time()
            response = call_llm_with_retry(self.llm_client, messages)
            latency = (time.time() - start) * 1000
            logger.log_llm_call("claude-sonnet", messages, response, latency)
            self.cost_tracker.track(run_id, response.usage.input_tokens, response.usage.output_tokens)
            messages.append({"role": "assistant", "content": response.content})
            tool_uses = [b for b in response.content if b.type == "tool_use"]
            if not tool_uses:
                return response.content[0].text
            tool_results = []
            for tool_use in tool_uses:
                start = time.time()
                # execute_tool is your own dispatcher: (tool_name, tool_input) -> result
                result = safe_tool_call(execute_tool, tool_use.name, tool_use.input)
                latency = (time.time() - start) * 1000
                logger.log_tool_call(tool_use.name, tool_use.input, result, latency, result["success"])
                if not result["success"] and result["error"]["degradation"] == "report":
                    return {"status": "partial", "error": result["error"]["message"]}
                # Every tool_use needs a matching tool_result, even for skipped
                # failures, so the model sees the structured error and can move on
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": str(result),
                    "is_error": not result["success"],
                })
            messages.append({"role": "user", "content": tool_results})
        return {"status": "max_turns_reached"}
The debugging workflow
When an agent fails in production, here’s my process:
- Check the logs first. Find the run_id, read the JSONL file, and trace the execution path. What did the LLM decide? Which tool was called? What did it return?
- Reproduce the failure. Run the same input against your agent in development. Is it deterministic or does the LLM respond differently each time?
- Add guardrails. Based on what went wrong, add one of the patterns above — a retry, a cost cap, a structured error handler.
- Monitor the fix. Watch the next 100 runs to confirm the pattern works. If the failure doesn’t reoccur within 100 runs, the fix is probably solid.
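For the reproduction step, a small harness that runs the same input several times tells you quickly whether a failure is deterministic. This is only a sketch: `reproduce` and `run_agent` are hypothetical names, and it assumes `run_agent` raises on failure rather than returning an error dict.

```python
from collections import Counter


def reproduce(run_agent, task, attempts=5):
    """Run the same task several times and count outcomes by exception type."""
    outcomes = Counter()
    for _ in range(attempts):
        try:
            # run_agent is an assumed callable: task -> result, raising on failure
            run_agent(task)
            outcomes["ok"] += 1
        except Exception as exc:
            outcomes[type(exc).__name__] += 1
    return outcomes
```

If every attempt lands in the same bucket, the failure is likely in your tools or input handling. A mix of `ok` and failures points at the LLM's varying responses, which usually calls for a guardrail rather than a code fix.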
Related: AI agent cost optimization: 10 tips to reduce your LLM bill — strategies for keeping agent costs under control.
Related: How to build an AI customer support agent (that actually works) — error handling patterns for customer support agents including escalation logic and confidence thresholds.
If you only implement one pattern from this article, make it structured logging. You can't fix what you can't see. Add logging today, add the rest as you encounter each failure mode. Every production agent I've built started with logging and grew the error handling patterns organically as each failure type appeared.