EvoSkill: Automated Skill Discovery for Multi-Agent Systems

[Research Deep-Dive] EvoSkill: Transitioning from Prompt Engineering to Automated Skill Evolution

Abstract

EvoSkill is a self-evolving framework designed to automatically discover, refine, and materialize reusable agent skills through iterative failure analysis. By using a "Proposer-Builder" architecture, it enables coding agents like Claude Code to achieve new SOTA results on grounded reasoning (OfficeQA) and search-augmented QA (SealQA) benchmarks.

TL;DR

EvoSkill is a framework that allows AI agents to "learn from their mistakes" by automatically writing their own SOPs (Standard Operating Procedures) and helper scripts. By analyzing where it failed on a task, the system generates a structured Skill, evaluates its impact on a validation set, and adds it to a permanent library only if it helps. It achieved a 12.1% boost on search tasks and showed that an evolved skill can be handed to another agent on a different benchmark and still improve results zero-shot.

The Problem: The "Fragility" of Prompt Engineering

In the world of AI agents, we usually try to improve performance in two ways:

Prompt Optimization: Tweaking instructions until the model works (fragile and model-specific).
Fine-tuning: Training the model on new data (expensive and opaque).

The authors of EvoSkill argue that both methods fail to provide transferable domain expertise. When a human expert tackles a complex task like analyzing U.S. Treasury data, they don't just "think harder"; they develop tools, checklists, and repeatable workflows. EvoSkill automates this "tool-building" process.

Methodology: The Proposer-Builder Loop

EvoSkill operates on a "Social Learning" architecture consisting of three specialized roles:

The Executor: Actually tries to solve the task using current skills.
The Proposer: The "Critic." It looks at the logs of failed attempts, compares them to the ground truth, and says, "We failed because we didn't verify the currency conversion; we need a skill for that."
The Skill-Builder: The "Engineer." It takes the Proposer's suggestion and actually writes the SKILL.md (instructions) and analysis.py (scripts).
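The three roles above can be sketched as a single iteration of a loop. This is a minimal illustration, not the authors' implementation: the `Skill` and `Failure` types, the `evolve_once` function, and the toy stand-ins for each role are all hypothetical names, and in a real system each role would wrap a call to a coding agent rather than a hard-coded function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    name: str
    instructions: str  # stands in for the contents of SKILL.md


@dataclass
class Failure:
    question: str
    predicted: str
    expected: str


# Hypothetical role signatures; each would wrap an LLM call in practice.
Executor = Callable[[str, list[Skill]], str]
Proposer = Callable[[list[Failure]], str]
Builder = Callable[[str], Skill]


def evolve_once(tasks, skills, executor: Executor, proposer: Proposer, builder: Builder):
    """Run one Executor -> Proposer -> Skill-Builder iteration."""
    failures = []
    for question, expected in tasks:
        predicted = executor(question, skills)
        if predicted != expected:  # ground truth is needed to detect a failure
            failures.append(Failure(question, predicted, expected))
    if not failures:
        return skills, failures  # nothing to learn from this round
    proposal = proposer(failures)
    return skills + [builder(proposal)], failures


# Toy stand-ins so the sketch runs end to end.
def toy_executor(question, skills):
    # Answers correctly only once a currency-check skill is available.
    has_check = any(s.name == "verify-currency" for s in skills)
    return "100 USD" if has_check else "100 EUR"


def toy_proposer(failures):
    return "verify-currency"


def toy_builder(proposal):
    return Skill(proposal, "Confirm the currency before reporting an amount.")


tasks = [("What was the total outlay?", "100 USD")]
skills, failures = evolve_once(tasks, [], toy_executor, toy_proposer, toy_builder)
skills_after, failures_after = evolve_once(tasks, skills, toy_executor, toy_proposer, toy_builder)
```

In this toy run the first iteration records one failure and adds a skill; the second iteration finds no failures and leaves the library unchanged.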

The Evolutionary Pareto Frontier

Instead of just keeping the latest version, EvoSkill maintains a Pareto frontier of the best "agent programs." Every new skill is tested on a validation set; if it doesn't improve the score, it's discarded. This prevents "skill bloat" and ensures only high-quality expertise is retained.
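One way to picture the frontier update is as a dominance check. The sketch below assumes each agent program carries a tuple of validation scores where higher is better; the article does not say which objectives the frontier uses, so the two-score vectors and the names `AgentProgram`, `dominates`, and `update_frontier` are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProgram:
    name: str
    scores: tuple[float, ...]  # hypothetical validation scores, higher is better


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_frontier(frontier, candidate):
    """Return the frontier after considering `candidate`."""
    for member in frontier:
        if dominates(member.scores, candidate.scores) or member.scores == candidate.scores:
            return frontier  # candidate improves nothing, so it is discarded
    # Drop any existing members that the candidate makes obsolete.
    survivors = [m for m in frontier if not dominates(candidate.scores, m.scores)]
    return survivors + [candidate]


frontier = []
for program in [
    AgentProgram("base", (0.50, 0.40)),
    AgentProgram("base+skillA", (0.60, 0.40)),  # beats "base" outright
    AgentProgram("base+skillB", (0.55, 0.35)),  # worse than skillA on both scores
    AgentProgram("base+skillC", (0.45, 0.70)),  # trades one score for the other
]:
    frontier = update_frontier(frontier, program)

frontier_names = [p.name for p in frontier]
```

Here `base+skillB` is dropped because an existing program is at least as good on every score, while `base+skillC` survives alongside `base+skillA` because neither dominates the other.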

Figure: EvoSkill Loop Architecture

Experiments: Breaking the SOTA

The researchers tested EvoSkill on two grueling benchmarks:

OfficeQA: 89,000 pages of complex financial tables.
SealQA: Open-web search where results are intentionally noisy or conflicting.

Key Breakthrough: Skill Merging

The study found that "merging" skills from different evolutionary runs yielded the best results (67.9% accuracy). This suggests that different runs discover different "failure modes," and like a team of experts, their combined knowledge is greater than the sum of its parts.
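The article does not describe how merging is carried out, so the following is only a plausible sketch: take the union of the skill libraries from several runs and, when two runs produced a skill with the same name, keep the version that showed the larger validation gain. The `EvolvedSkill` type, the `validation_gain` field, and the skill names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvolvedSkill:
    name: str
    instructions: str
    validation_gain: float  # hypothetical: improvement seen when the skill was accepted


def merge_skill_libraries(*libraries):
    """Union several runs' skills; on a name clash keep the higher-gain version."""
    merged = {}
    for library in libraries:
        for skill in library:
            current = merged.get(skill.name)
            if current is None or skill.validation_gain > current.validation_gain:
                merged[skill.name] = skill
    return sorted(merged.values(), key=lambda s: s.name)


run_a = [
    EvolvedSkill("table-unit-check", "Confirm units in table headers.", 0.03),
    EvolvedSkill("fiscal-year-alignment", "Map dates to fiscal years.", 0.02),
]
run_b = [
    EvolvedSkill("table-unit-check", "Confirm units and scale in table headers.", 0.05),
    EvolvedSkill("footnote-lookup", "Read footnotes before quoting a figure.", 0.01),
]
merged = merge_skill_libraries(run_a, run_b)
merged_names = [s.name for s in merged]
```

A merged library would still need to be re-scored on the validation set, since skills that helped in separate runs are not guaranteed to help in combination.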

Figure: Performance Comparison on OfficeQA

Deep Insight: Why "Skills" Matter More than "Prompts"

The most exciting result isn't the accuracy boost; it's the zero-shot transfer. The team took search-persistence-protocol (a skill evolved to handle tricky web searches in SealQA) and gave it to a new agent on a different benchmark (BrowseComp).

Result: Even though the agent had never seen BrowseComp, the "skill" improved its performance by 5.3%.

This suggests that procedural knowledge (e.g., "Always check three sources before answering") is a reusable asset that is not tied to the specific task it was learned on.
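Part of what makes this kind of transfer practical is that a skill is a plain file rather than a change to model weights. The sketch below assumes a one-directory-per-skill layout containing a SKILL.md file and a made-up prompt format; the helper names are hypothetical, and the instruction text is a placeholder because the real skill's contents are not reproduced in this article.

```python
import tempfile
from pathlib import Path


def write_skill(root: Path, name: str, instructions: str) -> Path:
    """Materialize a skill as <root>/<name>/SKILL.md (assumed layout)."""
    skill_dir = root / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    (skill_dir / "SKILL.md").write_text(instructions, encoding="utf-8")
    return skill_dir


def load_skills(root: Path) -> dict[str, str]:
    """Read every SKILL.md under `root`, keyed by skill directory name."""
    return {
        path.parent.name: path.read_text(encoding="utf-8")
        for path in sorted(root.glob("*/SKILL.md"))
    }


def build_system_prompt(agent_name: str, skills: dict[str, str]) -> str:
    """Attach the shared skills to an agent-specific preamble (hypothetical format)."""
    sections = [f"You are {agent_name}."]
    sections += [f"## Skill: {name}\n{text}" for name, text in skills.items()]
    return "\n\n".join(sections)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_skill(root, "search-persistence-protocol", "Placeholder instructions.")
    shared = load_skills(root)

# The same skill files feed two differently configured agents.
sealqa_prompt = build_system_prompt("the SealQA agent", shared)
browsecomp_prompt = build_system_prompt("the BrowseComp agent", shared)
```

Nothing in the skill is rewritten for the second agent; only the agent-specific preamble differs.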

Critical Analysis & Conclusion

EvoSkill represents a shift toward Meta-Cognitive Engineering. Rather than trying to bake knowledge into the model's weights, we are teaching the model how to build its own external "brain" of tools and documents.

Limitations: Currently, the "Proposer" relies on ground-truth answers to diagnose failures, and these aren't always available in real-world, unlabeled scenarios. Future work will likely need to move toward "unsupervised" failure analysis using self-consistency or multi-agent debate.
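As a rough illustration of the self-consistency idea, disagreement among several sampled answers can serve as a label-free failure signal. This is not part of EvoSkill as described here; the function names, the normalization step, and the 0.6 threshold are arbitrary assumptions.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement) for a list of sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    majority, count = Counter(normalized).most_common(1)[0]
    return majority, count / len(normalized)


def flag_for_review(answers, threshold=0.6):
    """Treat low agreement as a proxy failure signal (threshold is an assumption)."""
    _, agreement = self_consistency(answers)
    return agreement < threshold


stable = ["4.2%", "4.2%", "3.9%", "4.2%", "4.2%"]
unstable = ["4.2%", "3.9%", "5.1%", "3.9%", "4.0%"]

stable_flagged = flag_for_review(stable)      # agreement is 4 of 5
unstable_flagged = flag_for_review(unstable)  # agreement is 2 of 5
```

A flagged task could then be passed to the Proposer in place of a ground-truth mismatch, though high agreement does not guarantee a correct answer.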

Takeaway: The future of AGI might not be one giant model that knows everything, but a "frozen" core model that manages an ever-growing, self-evolved library of expert skills.
