Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use


[CVPR 2024] Trace-Free+: Solving the "Cold-Start" Bottleneck in LLM-Agent Tool Use


Abstract

Trace-Free+ is a curriculum learning framework that improves LLM-agent tool use by automatically rewriting tool descriptions and parameter schemas. Its training procedure transfers knowledge from trace-rich environments to trace-free deployment, achieving SFT-level performance on unseen tools in benchmarks such as StableToolBench and RestBench.

TL;DR

While the industry has obsessed over making agents "smarter" through fine-tuning, Trace-Free+ shifts the focus to the Tool Interface. By training a model to rewrite cryptic, human-oriented API descriptions into agent-optimized instructions via curriculum learning, this framework enables reliable tool use even when no execution traces are available. It effectively bridges the gap between trace-heavy optimization and real-world "cold-start" deployments.

The Motivation: The "Useless Manual" Problem

Imagine giving an expert craftsman a tool manual written in a language they barely understand. No matter how skilled the craftsman (the LLM Agent), the job will fail. Currently, API descriptions in datasets like ToolBench are often:

Inconsistent: Different terminologies for the same concepts.
Incomplete: Missing critical constraints (e.g., "only accepts IPv6").
Human-Centric: Fluff text that confuses an LLM's attention mechanism.

Previous SOTA methods (like DRAFT or Play2Prompt) tried to fix this by "playing" with the tools first, seeing what breaks, and then fixing the description. But what if you can't run the tool? In privacy-restricted or new API environments, you don't have the luxury of "failure traces."

Methodology: Teaching the Model to "Anticipate" Failure

The authors introduce Trace-Free+, a learning-based approach that doesn't just fix one tool, but learns how tools generally break.

1. The Data Synthesis Pipeline

To train a generalizable rewriter, they built a massive dataset:

Agentic Annotation: Used Smolagents to probe 9,640 API providers to find "healthy" tools and collect real response examples.
Dependency-Aware Synthesis: Instead of simple queries, they forced the synthesis of multi-hop tasks where API-B must follow API-A based on real calling patterns.
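
The dependency-aware idea can be illustrated with a minimal sketch: collect which API was observed to follow which in real call logs, then sample multi-hop chains only along those observed transitions. Everything below is an assumption made for illustration and is not taken from the paper: the API names, the call logs, and the helper names `build_transition_graph` and `sample_chain` are hypothetical.

```python
import random
from collections import defaultdict


def build_transition_graph(call_sequences):
    """Map each API to the set of APIs observed to follow it directly."""
    graph = defaultdict(set)
    for seq in call_sequences:
        for first, second in zip(seq, seq[1:]):
            graph[first].add(second)
    return graph


def sample_chain(graph, start, hops, rng):
    """Walk from `start`, taking at most `hops` observed transitions."""
    chain = [start]
    while len(chain) <= hops:
        successors = sorted(graph.get(chain[-1], ()))
        if not successors:
            break  # no observed follow-up call, so the chain ends here
        chain.append(rng.choice(successors))
    return chain


# Hypothetical call logs; the API names are invented for this example.
call_sequences = [
    ["search_city", "get_weather"],
    ["search_city", "get_weather", "send_alert"],
]
graph = build_transition_graph(call_sequences)
chain = sample_chain(graph, "search_city", hops=2, rng=random.Random(0))
print(chain)
```

A task synthesizer could then write a query that requires every call in the sampled chain, which keeps API-B dependent on the output of API-A rather than pairing unrelated tools.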

2. Curriculum Learning

This is the "secret sauce." The model is trained in stages:

Stage 1 (Trace-Rich): The model sees the original description + execution traces (failure logs) and learns to generate a "perfect" description (D2).
Stage 2 (Trace-Free Transition): The traces are gradually removed. Because the model was "primed" with traces, it learns to infer potential failure points just by looking at the raw schema.
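
As a rough sketch of the two stages above, the rewriter's training input could keep every trace during Stage 1 and then drop traces with increasing probability during Stage 2. The linear schedule, the step counts, the prompt layout, and the function names `trace_keep_probability` and `build_input` are all assumptions made for illustration; the source only says the traces are gradually removed.

```python
import random


def trace_keep_probability(step, stage1_steps, stage2_steps):
    """Probability of keeping each trace at a given training step.

    Stage 1 keeps all traces; Stage 2 decays linearly to zero (assumed schedule).
    """
    if step < stage1_steps:
        return 1.0
    progress = (step - stage1_steps) / stage2_steps
    return max(0.0, 1.0 - progress)


def build_input(description, traces, step, stage1_steps, stage2_steps, rng):
    """Assemble the rewriter's input, dropping traces according to the schedule."""
    p = trace_keep_probability(step, stage1_steps, stage2_steps)
    kept = [t for t in traces if rng.random() < p]
    parts = ["Original description:", description]
    if kept:
        parts += ["Execution traces:"] + kept
    return "\n".join(parts)


# Hypothetical tool description and failure trace.
rng = random.Random(0)
desc = "Looks up an IP address."
traces = ["call(ip='1.2.3.4') -> error: only IPv6 accepted"]
early = build_input(desc, traces, step=0, stage1_steps=100, stage2_steps=100, rng=rng)
late = build_input(desc, traces, step=300, stage1_steps=100, stage2_steps=100, rng=rng)
print(early)
print(late)
```

The target output (the improved description) stays the same in both stages; only the amount of trace evidence in the input shrinks, so the model has to rely more and more on the raw description and schema.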

Figure: The SFT Data Synthesis Pipeline

Experiments: Does it Scale?

The researchers tested the model on StableToolBench and RestBench. The results were telling:

Superior Generalization: On unseen tools, Trace-Free+ improved subtask-level success significantly. For the most complex multi-hop queries (G3), it nearly doubled the success rate of the original descriptions.
The Scaling Challenge: In real-world apps, an agent might see 100+ tools. Most agents "hallucinate" or lose track as the tool list grows. Trace-Free+ descriptions proved much more robust, maintaining a higher Query-Level (QL) success rate as the number of "distractor" tools increased.
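
To make the two metrics and the distractor setup concrete, here is a minimal sketch. It assumes that a query counts as a query-level success only when every one of its subtasks succeeds, and that distractors are sampled uniformly from a pool of irrelevant tools; neither detail is confirmed by the source, and all names and data are hypothetical.

```python
import random


def build_tool_list(relevant, pool, num_distractors, rng):
    """Mix the tools a query needs with randomly sampled distractor tools."""
    candidates = [t for t in pool if t not in relevant]
    distractors = rng.sample(candidates, num_distractors)
    tools = list(relevant) + distractors
    rng.shuffle(tools)
    return tools


def subtask_level_success(results):
    """Fraction of individual subtasks that succeeded, pooled over all queries."""
    flat = [ok for query in results for ok in query]
    return sum(flat) / len(flat)


def query_level_success(results):
    """Fraction of queries whose subtasks all succeeded (assumed definition)."""
    return sum(all(query) for query in results) / len(results)


# Hypothetical outcomes: one inner list of subtask results per query.
results = [[True, True], [True, False]]
pool = [f"tool_{i}" for i in range(20)]
tools = build_tool_list(["tool_0", "tool_1"], pool, num_distractors=5, rng=random.Random(0))
print(len(tools), subtask_level_success(results), query_level_success(results))
```

Under this definition the query-level number is always at most the subtask-level one, which is why it is the stricter signal when the tool list grows.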

Figure: Scaling Experiment Results

Critical Insights & Takeaways

Interface > Reasoning: Sometimes the "reasoning" failure of an agent is actually a "documentation" failure.
The Power of SFT: Unlike prompt-based optimizers (e.g., EasyTool), which can be brittle across different LLM versions, a fine-tuned description generator (like the Qwen3-4B used here) internalizes the "physics" of API interaction.
Limitation: The model still relies on the quality of the base schema. If the JSON schema itself fundamentally misrepresents the API, even the best rewrite can only do so much.

Conclusion

Trace-Free+ proves that we can "learn to describe." By moving away from iterative trial-and-error at inference time and toward a robust, pre-trained interface optimizer, we make LLM agents truly deployable in complex, enterprise-grade environments.

Technical Review by Senior Academic Tech Editor

