A multi-step agent picks a tool, calls it, reads the result, picks the next. Recovery is what happens when a tool fails: the agent picks a different tool and continues.

```python
def agent_with_recovery(goal, tools):
    used = []
    last_error = None
    for tool in tools:
        try:
            result = tool(goal)
            used.append(tool.__name__)
            return {"used": used, "result": result}
        except Exception as e:
            used.append(f"{tool.__name__}-FAILED")
            last_error = e
            continue
    raise RuntimeError(f"all tools failed; last error: {last_error}")
```

This is just fallback chains again, isn't it?
The structure is similar; the responsibility is different. Fallback chain → "answer a query". Multi-step → "complete a goal that may need multiple successful tool calls in sequence". When a step in the sequence fails, the agent doesn't restart — it picks a different tool for that step and continues.
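
To make the "continue, don't restart" point concrete, here is a minimal sketch of a sequence runner. The name `run_sequence`, the `(name, options)` step format, and the three tools are all hypothetical, made up for illustration; the assumption is that each step's output feeds the next step.

```python
def run_sequence(goal, steps):
    """Run steps in order; each step is (name, [tool, ...]) in fallback order.

    A failing tool only moves the *current* step on to its next option.
    Steps that already succeeded are not rerun.
    """
    value = goal
    used = []
    for name, options in steps:
        last_error = None
        for tool in options:
            try:
                value = tool(value)
                used.append(f"{name}:{tool.__name__}")
                break  # this step is done; move on to the next step
            except Exception as e:
                used.append(f"{name}:{tool.__name__}-FAILED")
                last_error = e
        else:
            # The inner loop ended without a break: no option worked.
            raise RuntimeError(f"step {name!r} failed; last error: {last_error}")
    return {"used": used, "result": value}


# Hypothetical tools, for illustration only.
def fetch_primary(query):
    raise TimeoutError("primary source timed out")


def fetch_cache(query):
    return f"cached page for {query}"


def summarize(text):
    return text.upper()


out = run_sequence(
    "python",
    [("fetch", [fetch_primary, fetch_cache]), ("summarize", [summarize])],
)
print(out["used"])
print(out["result"])
```

Here the fetch step recovers by falling back to `fetch_cache`, and the summarize step then runs on that result. Nothing before the failure is repeated.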
And how does the agent know which tool to pick next?
Two strategies. Static — a hardcoded ordered list of fallback tools (today's lesson). Dynamic — the LLM picks a tool based on the failure (production agent loops). For this curriculum we stay with static; agent loops are a separate track.

```python
class StepResult:
    """Outcome of one step: which tool worked, or why every option failed."""

    def __init__(self, status, value=None, error=None, used_tool=None):
        self.status = status  # 'ok' | 'failed'
        self.value = value
        self.error = error
        self.used_tool = used_tool


def try_step(goal, tool_options):
    """Try each tool until one succeeds. Return StepResult with the tool that worked."""
    last_error = None
    for tool in tool_options:
        try:
            value = tool(goal)
            return StepResult("ok", value, used_tool=tool.__name__)
        except Exception as e:
            last_error = e  # remember why this option failed, then try the next
    return StepResult("failed", error=f"all tools exhausted; last error: {last_error!r}")
```

| Style | How tool is picked | Fits |
|---|---|---|
| Static | Hardcoded ordered list — try A, then B | Known failure modes; deterministic |
| Dynamic | LLM sees failure, picks next tool | Unknown failure space; production agents |
Dynamic is more powerful but adds an LLM call per recovery, which makes it costlier and harder to test. Static is simpler and covers the cases where the failure modes are known in advance. Today's lesson is static.
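
The two styles differ only in how the next tool is chosen, which the following sketch isolates behind a `picker` function. Everything here is an assumption for illustration: `run_with_picker`, the picker signature, the `offline` attribute, and the tools are hypothetical. The "dynamic" picker is a plain rule standing in for an LLM call so that the example runs without a model.

```python
def static_picker(error, remaining):
    """Static strategy: always take the next tool in the hardcoded order."""
    return remaining[0]


def rule_based_picker(error, remaining):
    """Stand-in for a dynamic strategy. A real agent loop would ask an LLM
    here; this stub only inspects the error type."""
    if isinstance(error, TimeoutError):
        for tool in remaining:
            if getattr(tool, "offline", False):
                return tool  # skip tools that would hit the network again
    return remaining[0]


def run_with_picker(goal, tools, picker):
    if not tools:
        raise ValueError("no tools given")
    remaining = list(tools)
    tool = remaining.pop(0)
    used = []
    while True:
        try:
            result = tool(goal)
            used.append(tool.__name__)
            return {"used": used, "result": result}
        except Exception as e:
            used.append(f"{tool.__name__}-FAILED")
            if not remaining:
                raise RuntimeError(f"all tools failed; last error: {e}") from e
            tool = picker(e, remaining)
            remaining.remove(tool)


# Hypothetical tools, for illustration only.
def web_search(q):
    raise TimeoutError("network down")


def api_lookup(q):
    raise TimeoutError("network down")


def local_index(q):
    return f"local hit for {q}"


local_index.offline = True  # made-up marker read by rule_based_picker

tools = [web_search, api_lookup, local_index]
static_run = run_with_picker("q", tools, static_picker)
dynamic_run = run_with_picker("q", tools, rule_based_picker)
print(static_run["used"])
print(dynamic_run["used"])
```

The static run walks the list in order and fails twice before succeeding. The rule-based run reads the error and jumps straight to the offline tool, which is the kind of shortcut a dynamic picker buys at the price of extra logic to test.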
The distinction blurs at the edges. The pattern is the same: when the primary path fails, route around the failure.

```python
used = ["tool_A-FAILED", "tool_B"]
```

Keep the trail of attempts. When the agent succeeds, you know how it got there. When it fails, you know what was tried. Production observability builds on this.
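
One possible way to make the trail more useful is to store structured records instead of strings, so the error behind each failure is kept too. This is a sketch under assumptions: the `Attempt` record and `run_with_trail` are hypothetical names, and returning `None` with the trail on total failure (rather than raising) is a design choice made here so the caller always gets the trail back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    tool: str
    ok: bool
    error: Optional[str] = None  # repr of the exception when ok is False


def run_with_trail(goal, tools):
    """Try tools in order and return (result, trail); result is None if all fail."""
    trail = []
    for tool in tools:
        try:
            result = tool(goal)
        except Exception as e:
            trail.append(Attempt(tool.__name__, ok=False, error=repr(e)))
            continue
        trail.append(Attempt(tool.__name__, ok=True))
        return result, trail
    return None, trail


# Hypothetical tools, for illustration only.
def tool_A(goal):
    raise ValueError("bad input")


def tool_B(goal):
    return f"done: {goal}"


result, trail = run_with_trail("report", [tool_A, tool_B])
for attempt in trail:
    print(attempt)
```

Each record can be logged or serialized as-is, so the same trail answers "how did it succeed" and "what was tried before it failed".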