[IBM Research] Exgentic: Breaking the Specialization Trap in AI Agent Evaluation
Abstract

The paper introduces Exgentic, a framework and "Unified Protocol" for the systematic evaluation of general-purpose AI agents across diverse, non-specialized environments. It establishes the first Open General Agent Leaderboard, benchmarking 5 agent architectures (e.g., OpenAI Solo, Claude Code) across 6 environments (e.g., SWE-Bench, AppWorld) and 3 frontier LLMs, identifying Claude Opus 4.5 as the current SOTA backbone for general agents.

TL;DR

The promise of "General Agents"—systems that can drop into any environment and just work—has long been hampered by the fact that our benchmarks are silos. Every new task comes with its own API and hidden assumptions. IBM Research's Exgentic framework introduces a Unified Protocol that allows general agents to be tested across 6 major benchmarks simultaneously. The core finding? General agents can now match specialized SOTA, but your choice of "brain" (LLM) matters 50x more than your "skeleton" (agent architecture).

The "Specialization Trap"

Until now, if you wanted to beat SWE-Bench, you built an agent specifically for software engineering. If you wanted to win at AppWorld, you tuned your agent for its specific Python API.

This has created a blind spot: we don't know how well these systems generalize. Traditional benchmarks use "bespoke communication protocols" that force developers to hard-code environment semantics into the agent. This isn't artificial intelligence; it's domain engineering.

Methodology: The Unified Protocol

The authors solve this with a "narrow waist" architecture. Instead of building $N \times M$ bespoke adapters for $N$ agents and $M$ benchmarks, they introduce the Unified Protocol: each agent and each benchmark connects once to a shared interface, so only $N + M$ adapters are needed.
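The narrow-waist idea can be sketched as a single shared interface that every benchmark adapter implements. This is a hedged illustration of the pattern, not Exgentic's actual API; the class and method names (`UnifiedEnvironment`, `reset`, `step`) are assumptions.

```python
from abc import ABC, abstractmethod

class UnifiedEnvironment(ABC):
    """The 'narrow waist': every benchmark adapter exposes this one
    surface, so any agent that speaks it can run on any benchmark."""

    @abstractmethod
    def reset(self) -> dict:
        """Return the initial task payload (task, context, actions)."""

    @abstractmethod
    def step(self, action: dict) -> dict:
        """Apply one agent action and return the observation."""

class EchoBenchAdapter(UnifiedEnvironment):
    """Toy benchmark adapter, used only to illustrate the pattern."""

    def reset(self) -> dict:
        return {"task": "Repeat the input string.",
                "context": "",
                "actions": ["echo"]}

    def step(self, action: dict) -> dict:
        # A real adapter would translate `action` into the benchmark's
        # bespoke API here; this toy just echoes the arguments back.
        return {"observation": action.get("args", ""), "done": True}
```

With 5 agents and 6 benchmarks this means 11 adapters instead of 30, and a new benchmark costs one adapter rather than one per agent.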

[Figure: Evolution of Agentic Evaluation]

The protocol breaks every task down into three standardized fields:

  1. Task: Textual description of the goal.
  2. Context: Necessary static knowledge (e.g., corporate policies, API documentation).
  3. Actions: A typed set of operations (e.g., bash, search, send_message).
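The three standardized fields above could be rendered as a simple record. This is a speculative sketch of the shape, since the summary does not specify Exgentic's wire format; the `UnifiedTask` name and field types are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class UnifiedTask:
    task: str                       # textual description of the goal
    context: str                    # static knowledge (policies, API docs)
    actions: dict = field(default_factory=dict)  # tool name -> typed signature

# A software-engineering task and a customer-service task fit the
# same three-field container, which is what lets one agent run both.
swe_task = UnifiedTask(
    task="Fix the failing test in the repository.",
    context="Repository README and contribution guide ...",
    actions={
        "bash": {"cmd": "str"},
        "search": {"query": "str"},
        "send_message": {"to": "str", "body": "str"},
    },
)
```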

This allows an agent like Claude Code to walk into a customer service benchmark ($\tau^2$-Bench) and treat it exactly like a coding task, because the underlying communication (MCP or tool-calling) is mediated by Exgentic's adapters.

The Open General Agent Leaderboard

The paper presents a massive evaluation spanning 90 configurations and costing $22,000 in API fees.

[Figure: Open General Agent Leaderboard Performance]

Key Technical Insights:

  • Model vs. Scaffold: In a striking variance decomposition, model quality explained 28.2% of the variance in success rate, while the agent scaffold (ReAct vs. Smolagent vs. OpenAI Solo) explained a mere 0.6%.
  • The SOTA Frontier: Claude Opus 4.5 emerged as the undisputed king of general agents, showing high stability across different architectures.
  • The Cost of Failure: Failed runs aren't just disappointing; they are expensive. On interaction-heavy tasks like AppWorld, failed tasks took 54% more steps than successful ones, as agents wandered aimlessly before hitting the limit.
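The model-vs-scaffold decomposition is essentially a between-group variance ratio (eta squared). The sketch below shows the computation on made-up scores, purely to illustrate the method; the numbers are not the paper's data, and the model/scaffold names are placeholders.

```python
from statistics import mean

# (model, scaffold, success_rate) -- illustrative values only.
runs = [
    ("model_a", "react", 0.62), ("model_a", "solo", 0.60),
    ("model_b", "react", 0.41), ("model_b", "solo", 0.43),
    ("model_c", "react", 0.30), ("model_c", "solo", 0.28),
]

def eta_squared(runs, factor_index):
    """Fraction of total variance explained by grouping on one factor:
    sum-of-squares between group means over total sum-of-squares."""
    grand = mean(r[2] for r in runs)
    groups = {}
    for r in runs:
        groups.setdefault(r[factor_index], []).append(r[2])
    ss_between = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())
    ss_total = sum((r[2] - grand) ** 2 for r in runs)
    return ss_between / ss_total

print(f"model explains:    {eta_squared(runs, 0):.1%}")
print(f"scaffold explains: {eta_squared(runs, 1):.1%}")
```

On these toy numbers the model factor dominates and the scaffold factor is near zero, mirroring the direction (though not the magnitude) of the paper's finding.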

Design Components that Actually Matter

What makes a "general" agent work? The authors identify a few critical "scaffolding" components that provide the best ROI:

  • Tool Schema Guards: Essential for MCP-based agents. They catch invalid API calls internally, allowing the LLM to self-correct without crashing the environment.
  • Tool Shortlisting: When an environment (like AppWorld) has 400+ tools, it breaks most LLM context windows. Shortlisting relevant tools is the difference between a 0% success rate and being competitive.
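The two components above can be sketched in a few lines. Both functions are assumptions for illustration, not Exgentic's implementation: a schema guard that returns corrective feedback to the LLM instead of crashing, and a naive keyword shortlister standing in for whatever retrieval the real system uses.

```python
# Hypothetical tool schemas: argument name -> expected Python type.
TOOL_SCHEMAS = {
    "send_message": {"to": str, "body": str},
    "bash": {"cmd": str},
}

def guard_tool_call(name, args):
    """Return (ok, feedback). Invalid calls yield a message the LLM
    can use to self-correct, rather than an environment crash."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return False, f"Unknown tool '{name}'. Available: {sorted(TOOL_SCHEMAS)}"
    missing = [k for k in schema if k not in args]
    if missing:
        return False, f"Tool '{name}' missing arguments: {missing}"
    bad = [k for k, t in schema.items() if not isinstance(args[k], t)]
    if bad:
        return False, f"Tool '{name}' has wrongly typed arguments: {bad}"
    return True, "ok"

def shortlist_tools(all_tools, task_description, k=20):
    """Keep the k tool names sharing the most words with the task,
    so a 400-tool environment fits in the LLM's context window."""
    words = set(task_description.lower().split())
    scored = sorted(all_tools,
                    key=lambda t: -len(words & set(t.lower().split("_"))))
    return scored[:k]
```

A real shortlister would use embedding retrieval rather than word overlap, but the contract is the same: the agent only ever sees a context-sized subset of the tool catalog.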

[Figure: Cost-Performance Tradeoffs]

Critical Analysis & Future Outlook

Takeaway: General agents are no longer a fantasy. The fact that a zero-shot general agent can match a hand-tuned SWE agent is a watershed moment for the industry.

Limitations: The current study is text-heavy. The next frontier for Exgentic must be multimodal integration. As agents move to "Computer Use" (GUI navigation), the Unified Protocol will need to evolve to handle coordinate-based actions and visual observations.

The Verdict: If you are building an agentic system today, stop tuning for a single benchmark. Use the Exgentic protocol to ensure your "General Agent" is actually general.
