SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents


SKILLRET: Solving the "Needle in a Skill-Stack" Problem for LLM Agents


Abstract

SKILLRET is a large-scale benchmark for skill retrieval in LLM agents, featuring 17,810 public agent skills and more than 63,000 samples. The authors propose a taxonomy-driven evaluation and introduce the SkillRet model family, which achieves a state-of-the-art NDCG@10 of 83.5, significantly outperforming general-purpose retrievers.

TL;DR

As LLM agents move from "toy apps" to complex ecosystems, they face a scaling bottleneck: how to find the right script or prompt (a "skill") among thousands of candidates. This paper introduces SKILLRET, which the authors present as the largest benchmark for agent skill retrieval to date. The key finding? General-purpose retrievers (even those topping MTEB) are surprisingly bad at this. By fine-tuning specialized models, the authors reach an NDCG@10 of 83.5, well above the off-the-shelf retrievers they compare against.

Background: The Invisible Bottleneck

Modern agents don't just "talk"; they execute. They use "skills"—reusable modules like CI/CD scripts, data analysis workflows, or specialized prompts.

The Scale Problem: You can't fit 17,000 skill descriptions into a Claude or GPT-4 prompt.
The Retrieval Gap: Standard search engines look for keywords. LLM tasks require "intent matching"—understanding that a user asking about "scaling my cloud cluster" needs a specific Kubernetes skill, even if the words don't match exactly.
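To make "intent matching" concrete, here is a minimal sketch of embedding-based retrieval: the query and every skill are represented as vectors, and the skills are ranked by cosine similarity. The skill names and the 3-dimensional vectors are hypothetical, hand-written values for illustration; a real system would get its vectors from an embedding model, and nothing here is taken from the paper's code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, skill_vecs, k=2):
    """Return the names of the k skills most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in skill_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical embeddings (assumption: made up for this example).
SKILL_VECS = {
    "kubernetes-autoscaling": [0.9, 0.1, 0.0],
    "csv-data-cleaning": [0.0, 0.2, 0.9],
    "ci-cd-pipeline": [0.6, 0.7, 0.1],
}
# Stands in for the embedding of "scaling my cloud cluster".
QUERY_VEC = [0.8, 0.2, 0.1]

top = retrieve_top_k(QUERY_VEC, SKILL_VECS, k=2)
print(top)
```

The query shares no keywords with the Kubernetes skill name, yet the skill ranks first because the two vectors point in nearly the same direction. Only the short list of top-ranked skills would then go into the agent's prompt.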

The SKILLRET Benchmark: 17k Skills, 63k Samples

The authors didn't just crawl GitHub; they built a structured ecosystem.

Filtering: They pruned 22,000+ raw entries down to 17,810 high-quality skills.
Taxonomy: They categorized skills into six major domains (e.g., Software Engineering, AI Agents, Data & ML).
Complex Queries: Using Claude and Qwen, they generated queries that require multiple skills simultaneously, mimicking complex real-world requests.

Figure 1 (SKILLRET Pipeline): The data generation pipeline involving seed queries, LLM synthesis, and human validation.

Why Off-the-Shelf Models Fail

The paper reveals a "Pre-training vs. Task-Specific" paradox. Models like NV-Embed-v1 (7B parameters), which dominate general leaderboards, were outperformed by the much smaller Harrier-OSS (0.6B) in the skill domain.

The authors argue that skill retrieval is a long-document matching problem. A skill's documentation (the Markdown body) contains the "how-to," but the actual request in a user's query is often buried in "noise" (e.g., "Hi, I'm working on a project and I noticed that... could you help me with X?").

Methodology: Focus through Fine-Tuning

The authors fine-tuned the Qwen3-Embedding family. The "Secret Sauce" wasn't just more data, but a focus on Intent Isolation.

Through an analysis called Sentence Erasure, they show that fine-tuned models learn to ignore the "fluff" in a user's query and zero in on the exact sentence that demands a skill.

Table 1 (Experimental Results): Comparison of embedding models. Note the massive jump from off-the-shelf (50-60 NDCG) to SkillRet-trained models (78-83 NDCG).
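For readers unfamiliar with the metric, the sketch below computes NDCG@k using the common textbook definition with binary relevance (a skill is either relevant to the query or not). This is an assumption: the summary does not say which NDCG variant or gain function the paper uses, and the skill IDs are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at 1-based rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance; returns a value in [0, 1]."""
    gains = [1.0 if sid in relevant_ids else 0.0 for sid in ranked_ids]
    # The ideal ranking places every relevant skill at the top.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal

# Hypothetical query that needs two skills, "a" and "c".
relevant = {"a", "c"}
imperfect = ndcg_at_k(["a", "b", "c"], relevant, k=10)
perfect = ndcg_at_k(["a", "c", "b"], relevant, k=10)
print(round(imperfect * 100, 1), round(perfect * 100, 1))
```

Scores such as 83.5 are this value multiplied by 100 and averaged over all queries. Because the ideal ranking accounts for every relevant skill, a query that needs several skills only scores highly when all of them appear near the top.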

Key Insights & Results

MTEB is Not Enough: There is only a moderate correlation (ρ=0.71) between general retrieval ability and skill retrieval. You cannot trust general leaderboards for agentic systems.
The Reranking Limit: Rerankers help, but only if the first-stage retriever is already decent. If your base model is "lost," the reranker often makes things worse by introducing domain mismatch.
Hard Categories: "Information Retrieval" and "AI Agents" remain the hardest skills to retrieve, despite fine-tuning. This suggests that as agents get meta (agents helping agents), the language becomes increasingly abstract and hard to index.
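The ρ quoted above is a rank correlation. As a sketch of how such a number is obtained, the code below computes Spearman's ρ between two lists of per-model scores, assuming Spearman is the statistic meant (the summary only gives the symbol). The five score pairs are invented for illustration and are not the paper's data.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for five hypothetical models (not from the paper).
general_scores = [60, 62, 65, 68, 70]   # e.g. a general retrieval leaderboard
skill_scores = [50, 58, 52, 61, 57]     # e.g. NDCG@10 on skill retrieval

rho = spearman(general_scores, skill_scores)
print(rho)
```

A ρ of 1.0 would mean the general leaderboard orders models exactly as the skill benchmark does. Values well below 1.0, like the 0.71 reported, mean that some models swap places when moved to the skill domain.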

Deep Insight: "Actionable Signaling"

The most profound takeaway is the Sentence Erasure Analysis. When the authors "masked" the most important sentence in a query:

Base models saw a 23% drop in performance.
Fine-tuned models saw a 29% drop.

The larger drop means the fine-tuned model has a "higher peak": it relies more heavily on the one sentence that carries the capability signal. It has learned the "language of action."
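One way to picture an erasure analysis is the sketch below: remove each sentence of the query in turn, re-score the query against the correct skill, and record the relative drop. The exact procedure in the paper is not described in this summary, so treat this as an illustration under assumptions. In particular, `overlap_score` is a toy word-overlap scorer standing in for an embedding model, and the query and skill text are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_score(query, skill_doc):
    """Toy relevance score: fraction of skill-doc words that appear in the query."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", skill_doc.lower()))
    return len(q & d) / len(d) if d else 0.0

def erasure_drops(query, skill_doc, score_fn=overlap_score):
    """For each sentence, the relative score drop caused by deleting it."""
    sentences = split_sentences(query)
    base = score_fn(query, skill_doc)
    drops = []
    for i, sentence in enumerate(sentences):
        remaining = " ".join(sentences[:i] + sentences[i + 1:])
        erased = score_fn(remaining, skill_doc)
        relative_drop = (base - erased) / base if base else 0.0
        drops.append((sentence, relative_drop))
    return drops

# Hypothetical query and skill description.
QUERY = ("Hi, I am working on a side project. "
         "Please autoscale my kubernetes cluster. Thanks a lot!")
SKILL_DOC = "autoscale kubernetes cluster nodes"

drops = erasure_drops(QUERY, SKILL_DOC)
most_important = max(drops, key=lambda pair: pair[1])
print(most_important)
```

With a real retriever, `score_fn` would be the similarity between the query embedding and the skill embedding. A model whose score collapses only when the request sentence is removed is concentrating on the capability signal, which is the behavior the erasure numbers above attribute to the fine-tuned models.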

Conclusion

SKILLRET proves that if we want "agents at scale," we need to treat the retrieval layer as a first-class citizen, not a solved problem. The release of their 0.6B and 8B checkpoints provides a powerful new foundation for anyone building multi-tool or multi-skill agents.

Takeaway for Developers: Stop relying on generic vector search for your agent's tools. Fine-tune your embedding layer on your specific skill library—the performance gains (up to 16.9 points) are too large to ignore.

Figure 2 (Sentence Erasure Visualization): Importance heatmaps showing how fine-tuned models (Trained) concentrate on relevant signals compared to Base models.
