[ArXiv 2026] Trace2Skill: Distilling Trajectory-Local Lessons into Transferable Agent Skills
Abstract

Trace2Skill is a novel framework for the automated creation and optimization of Large Language Model (LLM) agent skills by distilling trajectory-local lessons into transferable, declarative artifacts. It utilizes a parallel multi-agent architecture to analyze execution experiences and hierarchically consolidates them into a single, conflict-free domain directory, achieving SOTA performance in spreadsheet manipulation and reasoning tasks.

TL;DR

Trace2Skill is a breakthrough framework that automates the "Human Expert" approach to skill authoring for LLM agents. By analyzing hundreds of execution trajectories in parallel and using hierarchical inductive reasoning to merge "patches," it creates highly transferable, declarative skills. It allows a 35B model to generate skills that boost a 122B model's performance by over 50%, requiring no parameter updates or external retrieval modules.

Why Current Agents "Forget" or "Overfit"

The industry currently faces a "Skill Gap." Human-written skills are high quality but don't scale. Conversely, automated "Online Learning" (updating skills one trajectory at a time) often leads to:

  1. Fragmentation: Creating thousands of tiny, specific rules that break retrieval systems.
  2. Sequential Bias: The model learns from Task A, changes its skill, and then misinterprets Task B because the underlying "textbook" changed mid-stream.
  3. Lack of Intuition: Most systems treat every error as a unique event rather than looking for the systemic reason why a specific tool (like pandas) consistently fails in a specific environment.

Methodology: The "Many-to-One" Architecture

Trace2Skill replaces the reactive, sequential update loop with a Holistic Consolidation Pipeline.

1. Parallel Fleet of Analysts

Instead of one model thinking about one error, Trace2Skill dispatches a fleet of sub-agents.

  • Success Analysts: Identify "Golden Paths."
  • Error Analysts: Use an Agentic Loop (looking at files, testing fixes) to find the true root cause, preventing the hallucinated diagnoses common in single-pass LLM prompts.
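The fan-out step above can be sketched in a few lines. This is an illustrative skeleton, not the paper's implementation: the function names (`analyze_trajectory`, `dispatch_fleet`) and the trajectory fields are assumptions, and the agentic error-analysis loop is stubbed out with the recorded root cause.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_trajectory(trajectory):
    """One analyst per trace: classify it and distill a local lesson."""
    if trajectory["success"]:
        # Success analysts record the plan that worked ("golden path").
        return {"kind": "golden_path", "lesson": trajectory["plan"]}
    # A real error analyst would re-read files and test candidate fixes
    # in a loop here; this sketch just surfaces the recorded root cause.
    return {"kind": "root_cause", "lesson": trajectory["error"]}

def dispatch_fleet(trajectories, max_workers=8):
    """Analyze all trajectories in parallel, yielding one patch each."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_trajectory, trajectories))

patches = dispatch_fleet([
    {"success": True, "plan": "run recalc.py after writes", "error": None},
    {"success": False, "plan": None, "error": "stale formula cells"},
])
```

Because each analyst sees only its own trajectory, patches are independent of ordering, which is what makes the later tree merge possible.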

2. Hierarchical Consolidation (Inductive Reasoning)

The core innovation is how these "patches" are merged. By merging patches in a tree-like hierarchy, the system performs Prevalence-Weighted Induction. If 50 independent agents all suggest a "Checklist for Formula Recalculation," the system promotes this to a "Global Standard Operating Procedure (SOP)."
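Prevalence-weighted induction can be made concrete with a small tree-reduce over lesson counts. A minimal sketch, assuming patches are plain lesson strings and a simple count threshold (`quorum`) decides promotion; none of these names come from the paper.

```python
from collections import Counter

def merge(a, b):
    """Combine two patch summaries, summing lesson prevalence counts."""
    return a + b

def consolidate(patches, quorum):
    """Tree-reduce patches pairwise; promote prevalent lessons to SOPs."""
    layer = [Counter([p]) for p in patches]
    while len(layer) > 1:
        # Merge adjacent pairs; an odd leftover is carried up unchanged.
        layer = [merge(layer[i], layer[i + 1]) if i + 1 < len(layer)
                 else layer[i]
                 for i in range(0, len(layer), 2)]
    counts = layer[0]
    return {lesson for lesson, n in counts.items() if n >= quorum}

sops = consolidate(
    ["checklist: recalc formulas"] * 50 + ["one-off: rename sheet"],
    quorum=10,
)
# The lesson suggested by 50 independent patches clears the quorum and
# becomes a global SOP; the one-off lesson is filtered out.
```

The tree shape matters: each merge sees two already-summarized inputs rather than one raw trajectory at a time, which is how the pipeline avoids the sequential-bias problem described earlier.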

Figure 1: The three-stage pipeline: Generation, Parallel Patching, and Hierarchical Consolidation.

Experimental Evidence: Success Across Scales

The most striking result is Cross-Model Transferability. Traditionally, we assume a small model (35B) can't teach a large model (122B). Trace2Skill proves otherwise.

| Metric | 122B (No Skill) | 122B (w/ 35B-Authored Skill) | Delta |
| :--- | :---: | :---: | :---: |
| WikiTableQuestions (OOD) | 21.50% | 81.38% | +59.88% |
| SpreadsheetBench (Vrf) | 27.67% | 65.00% | +37.33% |

Table 1: Evolution results showing massive gains in OOD and Cross-Model scenarios.

Key Insights from the Data:

  • Agentic > Single Call: Using a loop for error analysis (allowing the analyst to "double-check" its fix) provided a massive +13.3% boost in helpfulness.
  • Parallel > Sequential: Parallel merging is not only 20x faster but also prevents "Parameter Drift," resulting in more stable and generalizable skills.

Real-World SOPs Discovered

The system didn't just find "tricks"; it distilled engineering best practices that even human experts sometimes overlook:

  1. Mandatory Recalculation: Always run recalc.py after writing Excel formulas (prevents stale cells).
  2. Tool Hierarchy: Prefer openpyxl over pandas for structural edits to avoid destroying cell formatting.
  3. Bottom-Up Deletion: Delete rows in descending order to avoid index-shift corruption.
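The third SOP is easy to verify with plain list indices (a stand-in for spreadsheet rows; this is an illustration of the index-shift effect, not the paper's tooling). Deleting top-down shifts every later target, while descending order leaves earlier indices untouched.

```python
def delete_rows(rows, indices):
    """Delete the given 0-based indices safely, highest index first."""
    for i in sorted(indices, reverse=True):
        del rows[i]
    return rows

rows = ["header", "a", "b", "c", "d"]
# Naive top-down deletion of indices 1 and 2 would remove "a", then
# "c" (which slid into position 2), silently leaving "b" behind.
safe = delete_rows(list(rows), [1, 2])
# Bottom-up: index 2 ("b") goes first, then index 1 ("a"),
# so exactly the intended rows are removed.
```

The same ordering rule applies to `openpyxl`'s `Worksheet.delete_rows`, where each call renumbers every row below the deletion point.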

Conclusion: A New Paradigm for Agentic Memory

Trace2Skill shifts the focus from Retrieved Episodic Memory (storing thousands of old logs) to Distilled Procedural Knowledge (one clean, updated manual). This study proves that LLM experience can be "compressed" into declarative markdown that is architecture-agnostic, portable, and remarkably robust.

Future Directions: The authors suggest moving toward "Causal Attribution," where we can quantify exactly which trajectory-lesson led to a given bump in accuracy.
