Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

[CVPR 2026] AgentSkillOS: Scaling the Skill Ecosystem through Hierarchical Discovery and DAG Orchestration

Abstract

This paper introduces AgentSkillOS, a principled framework for managing and orchestrating large-scale AI agent skill ecosystems (up to 200,000 skills). It uses a hierarchical "Capability Tree" for efficient skill discovery and a DAG-based orchestration engine that composes multiple skills for complex task execution, achieving state-of-the-art performance on artifact-rich benchmarks.

TL;DR

AgentSkillOS is the first framework designed to handle the "ecosystem-scale" explosion of agent skills. By organizing up to 200,000 skills into a Capability Tree and executing them via Directed Acyclic Graphs (DAGs), it lets agents solve complex, artifact-rich tasks (videos, web pages, professional docs) that "flat" single-skill agents handle poorly.

Positioning: This work is a foundational infrastructure piece. It moves the conversation from "how to build a skill" to "how to manage and compose 200k skills."

The "Flat" Invocation Wall

Current LLM agents typically interact with tools or "skills" in a flat manner: the model is given a list, and it chooses one. However, as of February 2026, there are over 280,000 publicly available skills.

The authors identify two fatal flaws in prior work:

Discovery Failure: LLMs cannot "reason" through 200k manual descriptions at once. Semantic search (RAG) often misses non-obvious but functionally superior skills.
Orchestration Failure: Native agents struggle to manage data flow between multiple tools. They lose track of dependencies, leading to "fragmented" outputs rather than a cohesive project (like a full presentation with custom animations).

Methodology: Managed Discovery and Graph Execution

1. Capability Tree Construction (Manage Skills)

Instead of a flat list, AgentSkillOS recursively partitions skills into a hierarchy. Starting from five root categories (Content Creation, Data Processing, etc.), it uses an LLM to discover sub-groups and assign skills until leaf nodes are reached. This supports coarse-to-fine localization, allowing the agent to "zoom in" on a capability field.
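The recursive partitioning and the coarse-to-fine descent can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the six skill names, their labels, and the helpers `build_tree`, `choose_child`, and `locate` are all hypothetical, and both LLM steps (proposing sub-groups, picking a branch) are replaced by trivial stand-ins.

```python
from dataclasses import dataclass, field

# Hypothetical mini-ecosystem: skill name -> coarse-to-fine labels.
# In the framework an LLM discovers these groupings; here they are hard-coded.
SKILLS = {
    "manim-animate": ["Content Creation", "Video"],
    "ffmpeg-cut": ["Content Creation", "Video"],
    "slide-deck": ["Content Creation", "Documents"],
    "csv-clean": ["Data Processing", "Tabular"],
    "sql-query": ["Data Processing", "Tabular"],
    "json-flatten": ["Data Processing", "Structured"],
}


@dataclass
class Node:
    name: str
    skills: list = field(default_factory=list)    # filled only on leaves
    children: dict = field(default_factory=dict)  # sub-group name -> Node


def build_tree(name, skills, depth=0, max_leaf=2):
    """Recursively partition skills until a group is small enough to be a leaf."""
    node = Node(name)
    out_of_labels = any(depth >= len(SKILLS[s]) for s in skills)
    if len(skills) <= max_leaf or out_of_labels:
        node.skills = sorted(skills)
        return node
    groups = {}
    for s in skills:
        # Stand-in for the LLM assigning a skill to a sub-group.
        groups.setdefault(SKILLS[s][depth], []).append(s)
    for group_name, members in groups.items():
        node.children[group_name] = build_tree(group_name, members, depth + 1, max_leaf)
    return node


def choose_child(query, options):
    """Stand-in for the LLM picking a branch: most words shared with the query."""
    words = set(query.lower().split())
    return max(options, key=lambda o: len(words & set(o.lower().split())))


def locate(root, query):
    """Coarse-to-fine descent: follow one branch per level down to a leaf."""
    node, path = root, [root.name]
    while node.children:
        node = node.children[choose_child(query, list(node.children))]
        path.append(node.name)
    return path, node.skills


tree = build_tree("root", list(SKILLS))
path, candidates = locate(tree, "video content")
print(" > ".join(path), candidates)
```

The point of the structure is that the agent only ever compares a handful of sibling categories at each level, rather than every skill description at once.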

Figure: AgentSkillOS workflow.

2. DAG-based Orchestration (Solve Tasks)

Once skills are retrieved, the framework doesn't just hand them to the model. It builds a Directed Acyclic Graph (DAG) of skill invocations, shaped by one of three orchestration strategies:

Quality-First: Adds stages for preparation and refinement.
Efficiency-First: Maximizes parallelism (e.g., generating 5 images simultaneously).
Simplicity-First: Minimizes the footprint for speed.
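To make the graph execution concrete, here is a small sketch using Python's standard `graphlib.TopologicalSorter`. The plan, the node names, and the `run_skill` stub are hypothetical; the plan loosely mirrors the strategies above, with a preparation node, three image nodes that can run in parallel, and a final refinement node. It is an illustration of DAG scheduling in general, not the framework's engine.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical plan: node -> set of nodes whose outputs it depends on.
PLAN = {
    "outline": set(),                                    # preparation stage
    "image_1": {"outline"},
    "image_2": {"outline"},
    "image_3": {"outline"},
    "assemble_slides": {"image_1", "image_2", "image_3"},
    "refine": {"assemble_slides"},                       # refinement stage
}


def run_skill(name, inputs):
    """Hypothetical skill call: returns a string recording what it consumed."""
    return f"{name}({', '.join(sorted(inputs))})"


def execute(plan):
    """Run the plan batch by batch; nodes in one batch have no mutual dependencies."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    outputs, batches = {}, []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            batches.append(ready)
            # Every ready node already has all of its upstream outputs available.
            futures = {
                n: pool.submit(run_skill, n, [outputs[d] for d in plan[n]])
                for n in ready
            }
            for n, future in futures.items():
                outputs[n] = future.result()
                sorter.done(n)
    return batches, outputs


batches, outputs = execute(PLAN)
for i, batch in enumerate(batches):
    print(f"batch {i}: {batch}")
print(outputs["refine"])
```

Because each node receives the recorded outputs of its predecessors, the data flow between skills is explicit in the graph rather than left for the model to track.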

Experimental Proof: Orchestration is the Key

The team constructed a benchmark of 30 "artifact-rich" tasks. They didn't just measure "Pass/Fail," but used a Bradley-Terry Model for pairwise comparison of result quality (a much more rigorous standard for creative work).
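For readers unfamiliar with the Bradley-Terry model: it assigns each method a positive strength p, with the probability that method i is preferred over method j equal to p_i / (p_i + p_j). The sketch below fits those strengths from pairwise preference counts using the standard iterative (minorization-maximization) update. The win counts are invented for illustration and are not the paper's data, and rescaling so the best method reads 100 is an assumption made here only to resemble how the reported scores look.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths.

    wins[a][b] is the number of comparisons in which a was preferred over b.
    Returns strengths normalized to sum to 1.
    """
    items = sorted(wins)
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            total_wins = sum(wins[i].get(j, 0) for j in items if j != i)
            # Each opponent contributes (games played against j) / (p_i + p_j).
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0)) / (p[i] + p[j])
                for j in items
                if j != i
            )
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        p = {i: v / norm for i, v in updated.items()}
    return p


def scale_to_100(p):
    """Rescale so the strongest item scores 100 (an assumed presentation choice)."""
    best = max(p.values())
    return {i: 100.0 * (v / best) for i, v in p.items()}


# Invented judgments: 10 comparisons per pair.
WINS = {
    "dag": {"flat": 8, "vanilla": 9},
    "flat": {"dag": 2, "vanilla": 6},
    "vanilla": {"dag": 1, "flat": 4},
}

scores = scale_to_100(fit_bradley_terry(WINS))
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Unlike a pass/fail rate, the fitted strengths use every head-to-head judgment, so two outputs that both "pass" can still be separated by how often one is preferred.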

| Ecosystem Size | Method | Bradley-Terry Score |
| :--- | :--- | :--- |
| 200 | Quality-First | 100.0 |
| 200 | w/ Full Pool (Flat) | 24.3 |
| 200K | Quality-First | 100.0 |
| 200K | w/ Full Pool (Flat) | 17.2 |

The gap is staggering: the "Flat" baseline scores 24.3 against 100.0 with 200 skills, and drops further to 17.2 with 200K. Having more skills available did not help the "Flat" agent; it became "blind" to the right tools.

Figure: Performance radar charts.

The radar charts show that AgentSkillOS variants (large polygons) maintain balanced capabilities across Data, Document, Video, Visual, and Web tasks, while flat baselines collapse as complexity grows.

Qualitative Leap

The difference isn't just numerical; it's visual.

Vanilla agent: Produces basic Matplotlib plots.
AgentSkillOS: Invokes internal Manim skills to produce high-fidelity mathematical animations with smooth transitions and professional annotations.

Figure: Qualitative comparison.

Critical Insights

The most profound takeaway is that structured composition is the real "Intelligence" multiplier. Even when the agent was given the "Oracle" (perfect) set of skills, if it invoked them in a flat sequence, it failed to match the quality of the DAG-orchestrated approach.

Limitations & Future Work

Skill Quality: The framework assumes skills are high-quality. Autonomously evaluating third-party skills before inclusion remains an open challenge.
Self-Evolution: The authors suggest that because skills are Markdown-based, agents should eventually start editing their own skill files to optimize execution.

Conclusion

AgentSkillOS provides the "Operating System" for the agent era. Its results suggest that as AI tools grow more numerous and decentralized, the winner won't be the one with the most tools, but the one with the best "Manager" to organize and orchestrate them.
