When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Abstract
The ToolMaze benchmark shows that real-world tool failures significantly degrade Tool-Integrated Reasoning (TIR) performance: implicit semantic failures cause the most severe drops, and dynamic replanning emerges as a key bottleneck.
Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
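To make the 2×2 perturbation taxonomy concrete, here is a minimal Python sketch. It is not taken from the ToolMaze code base: the names `PerturbedTool`, `ToolError`, `call_with_fallback`, and the `transient_failures` parameter are all hypothetical, and the "corruption" is simply an off-by-one result. It only illustrates why a naive retry-then-fallback policy can handle explicit failures yet silently accept implicit ones.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # the tool signals the failure by raising an error
    IMPLICIT = "implicit"  # the tool returns a plausible but wrong value


class Duration(Enum):
    TRANSIENT = "transient"  # only the first few calls fail
    PERMANENT = "permanent"  # every call fails


class ToolError(Exception):
    """Raised by an explicitly failing tool (hypothetical error type)."""


@dataclass
class PerturbedTool:
    """Wraps a tool function and injects one cell of the 2x2 taxonomy."""
    fn: Callable[[int], int]
    visibility: Visibility
    duration: Duration
    transient_failures: int = 1  # assumption: calls that fail before recovery
    calls: int = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        failing = (self.duration is Duration.PERMANENT
                   or self.calls <= self.transient_failures)
        if not failing:
            return self.fn(x)
        if self.visibility is Visibility.EXPLICIT:
            raise ToolError("tool unavailable")
        return self.fn(x) + 1  # silently corrupted output, no error raised


def call_with_fallback(primary, fallback, x, max_retries=1):
    """Retry the primary tool on explicit errors, then use the fallback."""
    for _ in range(max_retries + 1):
        try:
            return primary(x)
        except ToolError:
            continue
    return fallback(x)


def double(x: int) -> int:
    return x * 2


if __name__ == "__main__":
    for visibility in Visibility:
        for duration in Duration:
            tool = PerturbedTool(double, visibility, duration)
            result = call_with_fallback(tool, double, 5)
            print(visibility.value, duration.value, result)
```

With the correct answer being 10, this toy policy recovers in both explicit cells (a retry fixes the transient case, the fallback fixes the permanent one) but returns 11 in both implicit cells, because nothing ever raises. Catching those cases would require validating tool outputs rather than trusting them, which is the kind of over-trust the abstract describes.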
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents (2026)
- ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox (2026)
- Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents (2026)
- The Amazing Agent Race: Strong Tool Users, Weak Navigators (2026)
- DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints (2026)
- Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling (2026)
- CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend (2026)