Trace-Free+ is a curriculum learning framework designed to improve LLM-agent tool use by automatically rewriting tool descriptions and parameter schemas. It uses a training approach that transfers knowledge from trace-rich environments to trace-free deployment, achieving SFT-level performance on unseen tools in benchmarks like StableToolBench and RestBench.
TL;DR
While the industry has obsessed over making agents "smarter" through fine-tuning, Trace-Free+ shifts the focus to the Tool Interface. By training a model to rewrite cryptic, human-oriented API descriptions into agent-optimized instructions via curriculum learning, this framework enables reliable tool use even when no execution traces are available. It effectively bridges the gap between trace-heavy optimization and real-world "cold-start" deployments.
The Motivation: The "Useless Manual" Problem
Imagine giving an expert craftsman a tool manual written in a language they barely understand. No matter how skilled the craftsman (the LLM Agent), the job will fail. Currently, API descriptions in datasets like ToolBench are often:
- Inconsistent: Different terminologies for the same concepts.
- Incomplete: Missing critical constraints (e.g., "only accepts IPv6").
- Human-Centric: Fluff text that confuses an LLM's attention mechanism.
Previous SOTA methods (like DRAFT or Play2Prompt) tried to fix this by "playing" with the tools first, seeing what breaks, and then fixing the description. But what if you can't run the tool? In privacy-restricted or new API environments, you don't have the luxury of "failure traces."
Methodology: Teaching the Model to "Anticipate" Failure
The authors introduce Trace-Free+, a learning-based approach that doesn't just fix one tool, but learns how tools generally break.
1. The Data Synthesis Pipeline
To train a generalizable rewriter, they built a massive dataset:
- Agentic Annotation: Used Smolagents to probe 9,640 API providers to find "healthy" tools and collect real response examples.
- Dependency-Aware Synthesis: Instead of simple queries, they forced the synthesis of multi-hop tasks where API-B must follow API-A based on real calling patterns.
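To make the dependency-aware idea concrete, here is a minimal sketch of how "API-B must follow API-A" pairs could be found. It is not the authors' pipeline: the summary says dependencies come from real calling patterns, whereas this sketch substitutes a simpler rule (a tool depends on another if it needs a field the other returns). The `Tool` class, the tool names, and the field names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Tool:
    # Hypothetical record; not the dataset's actual format.
    name: str
    inputs: frozenset   # parameter names the tool requires
    outputs: frozenset  # field names observed in its responses


def dependency_edges(tools):
    """Return (producer, consumer, shared_fields) for every ordered pair
    where the consumer requires a field that the producer returns."""
    edges = []
    for a, b in permutations(tools, 2):
        shared = a.outputs & b.inputs
        if shared:
            edges.append((a.name, b.name, sorted(shared)))
    return edges


# Illustrative tools only (assumption, not taken from the benchmarks).
TOOLS = [
    Tool("search_city", frozenset({"query"}), frozenset({"city_id", "country"})),
    Tool("get_forecast", frozenset({"city_id"}), frozenset({"temperature"})),
    Tool("convert_units", frozenset({"temperature", "unit"}), frozenset({"converted"})),
]

EDGES = dependency_edges(TOOLS)
for producer, consumer, fields in EDGES:
    print(f"{producer} -> {consumer} via {fields}")
```

Each edge is a candidate two-hop task: a query that can only be answered by calling the producer first and feeding its output to the consumer. Chaining edges gives longer multi-hop tasks.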
2. Curriculum Learning
This is the "secret sauce." The model is trained in stages:
- Stage 1 (Trace-Rich): The model sees the original description + execution traces (failure logs) and learns to generate a "perfect" description (D2).
- Stage 2 (Trace-Free Transition): The traces are gradually removed. Because the model was "primed" with traces, it learns to infer potential failure points just by looking at the raw schema.
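The two stages can be sketched as a schedule that controls how often traces appear in the rewriter's input. The summary only says the traces are "gradually removed", so the linear decay, the `rich_fraction` parameter, the prompt layout, and the example tool below are assumptions made for illustration, not the authors' settings.

```python
import random


def trace_keep_prob(step, total_steps, rich_fraction=0.3):
    """Probability of including traces at a training step (assumed schedule)."""
    rich_steps = int(total_steps * rich_fraction)
    if step < rich_steps:
        return 1.0  # Stage 1: traces are always shown
    remaining = total_steps - rich_steps
    if remaining <= 0:
        return 1.0  # no transition stage configured
    # Stage 2: linear decay from 1.0 down to 0.0
    return max(0.0, 1.0 - (step - rich_steps) / remaining)


def build_input(description, traces, step, total_steps, rng):
    """Assemble the rewriter's input, dropping traces more often over time."""
    parts = ["ORIGINAL DESCRIPTION:\n" + description]
    if traces and rng.random() < trace_keep_prob(step, total_steps):
        parts.append("EXECUTION TRACES:\n" + "\n".join(traces))
    return "\n\n".join(parts)


rng = random.Random(0)
# Hypothetical tool and failure log, for illustration only.
description = "get_ip_info(ip: str) -> dict"
traces = ["call get_ip_info(ip='10.0.0.1') -> error: only accepts IPv6"]

early = build_input(description, traces, step=0, total_steps=100, rng=rng)
late = build_input(description, traces, step=100, total_steps=100, rng=rng)
```

In this sketch the training target stays the same throughout (the improved description); only the input changes. By the end of training the model sees the raw description alone, which matches the trace-free deployment setting.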

Experiments: Does it Scale?
The researchers tested the model on StableToolBench and RestBench. The results were telling:
- Superior Generalization: On unseen tools, Trace-Free+ improved subtask-level success significantly. For the most complex multi-hop queries (G3), it nearly doubled the success rate of the original descriptions.
- The Scaling Challenge: In real-world apps, an agent might see 100+ tools. Most agents "hallucinate" or lose track as the tool list grows. Trace-Free+ descriptions proved much more robust, maintaining a higher Query-Level (QL) success rate as the "distractor" tools increased.
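The difference between the two metrics mentioned above can be illustrated with a small sketch. The definitions here are assumptions, since the summary does not spell them out: subtask-level success is taken as the fraction of individual subtasks solved, and query-level success as the fraction of queries whose subtasks were all solved. The outcome data is made up for the example.

```python
def subtask_level_success(results):
    """Fraction of all subtasks solved, pooled across queries (assumed definition)."""
    flat = [ok for query in results for ok in query]
    return sum(flat) / len(flat)


def query_level_success(results):
    """Fraction of queries in which every subtask succeeded (assumed definition)."""
    return sum(all(query) for query in results) / len(results)


# Made-up outcomes: one inner list per query, one boolean per subtask.
RESULTS = [
    [True, True],
    [True, False],
    [True, True, True],
    [False],
]

SL = subtask_level_success(RESULTS)
QL = query_level_success(RESULTS)
print(f"subtask-level: {SL:.2f}, query-level: {QL:.2f}")
```

Under these definitions the query-level number is the stricter one: a single failed call in a multi-hop chain fails the whole query, which is why it is the more sensitive metric as distractor tools are added.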

Critical Insights & Takeaways
- Interface > Reasoning: Sometimes the "reasoning" failure of an agent is actually a "documentation" failure.
- The Power of SFT: Unlike prompt-based optimizers (such as EasyTool), which can be brittle across different LLM versions, a fine-tuned description generator (like the Qwen3-4B used here) internalizes the "physics" of API interaction.
- Limitation: The model still relies on the quality of the base schema. If the JSON schema itself misrepresents the API (for example, by describing an endpoint that does not exist), even the best rewrite can only do so much.
Conclusion
Trace-Free+ proves that we can "learn to describe." By moving away from iterative trial-and-error at inference time and toward a robust, pre-trained interface optimizer, we make LLM agents truly deployable in complex, enterprise-grade environments.
Technical Review by Senior Academic Tech Editor
