The paper introduces Exgentic, a framework and "Unified Protocol" for the systematic evaluation of general-purpose AI agents across diverse, non-specialized environments. It establishes the first Open General Agent Leaderboard, benchmarking 5 agent architectures (e.g., OpenAI Solo, Claude Code) across 6 environments (e.g., SWE-Bench, AppWorld) and 3 frontier LLMs, identifying Claude Opus 4.5 as the current SOTA backbone for general agents.
TL;DR
The promise of "General Agents"—systems that can drop into any environment and just work—has long been hampered by the fact that our benchmarks are silos. Every new task comes with its own API and hidden assumptions. IBM Research's Exgentic framework introduces a Unified Protocol that allows general agents to be tested across 6 major benchmarks simultaneously. The core finding? General agents can now match specialized SOTA, but your choice of "brain" (LLM) matters 50x more than your "skeleton" (agent architecture).
The "Specialization Trap"
Until now, if you wanted to beat SWE-Bench, you built an agent specifically for software engineering. If you wanted to win at AppWorld, you tuned your agent for its specific Python API.
This has created a blind spot: we don't know how well these systems generalize. Traditional benchmarks use "bespoke communication protocols" that force developers to hard-code environment semantics into the agent. This isn't artificial intelligence; it's domain engineering.
Methodology: The Unified Protocol
The authors solve this with a "narrow waist" architecture. Instead of building $N \times M$ adapters for $N$ agents and $M$ benchmarks, they introduce the Unified Protocol: each agent and each benchmark is adapted to the protocol once, so only $N + M$ adapters are needed.

The protocol breaks every task down into three standardized fields:
- Task: Textual description of the goal.
- Context: Necessary static knowledge (e.g., corporate policies, API documentation).
- Actions: A typed set of operations (e.g., `bash`, `search`, `send_message`).
This allows an agent like Claude Code to walk into a customer service benchmark ($\tau^2$-Bench) and treat it exactly like a coding task, because the underlying communication (MCP or tool-calling) is mediated by Exgentic's adapters.
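The three-field decomposition above can be sketched as a small data model plus a benchmark-side adapter. This is a minimal illustration, not the paper's actual implementation; all class, field, and function names here are assumptions.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical sketch of the Unified Protocol's three standardized
# fields; names are illustrative, not taken from the paper.
@dataclass
class Action:
    name: str                 # e.g. "bash", "search", "send_message"
    description: str          # what the operation does
    schema: dict[str, str]    # typed parameters, e.g. {"cmd": "string"}

@dataclass
class UnifiedTask:
    task: str                 # textual description of the goal
    context: str              # static knowledge (policies, API docs)
    actions: list[Action]     # the typed set of operations

# A benchmark-side adapter maps its bespoke task format onto the
# protocol once; every agent can then consume it unchanged.
def to_unified(raw_issue: dict[str, Any]) -> UnifiedTask:
    return UnifiedTask(
        task=raw_issue["problem_statement"],
        context=raw_issue.get("repo_docs", ""),
        actions=[Action("bash", "Run a shell command", {"cmd": "string"})],
    )

task = to_unified({"problem_statement": "Fix the failing test in utils.py"})
print(task.task)             # -> Fix the failing test in utils.py
print(task.actions[0].name)  # -> bash
```

The payoff of the narrow waist shows up in `to_unified`: the benchmark's idiosyncrasies live entirely inside one adapter function, and the agent only ever sees `task`, `context`, and `actions`.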
The Open General Agent Leaderboard
The paper presents a massive evaluation spanning 90 configurations and costing $22,000 in API fees.

Key Technical Insights:
- Model vs. Scaffold: In a shocking variance decomposition, model quality explained 28.2% of the variance in success rates, while the agent scaffold (ReAct vs. Smolagent vs. OpenAI Solo) explained a mere 0.6%.
- The SOTA Frontier: Claude Opus 4.5 emerged as the undisputed king of general agents, showing high stability across different architectures.
- The Cost of Failure: Failed runs aren't just disappointing; they are expensive. On interaction-heavy tasks like AppWorld, failed tasks took 54% more steps than successful ones, as agents wandered aimlessly before hitting the limit.
Design Components that Actually Matter
What makes a "general" agent work? The authors identify a few critical "scaffolding" components that provide the best ROI:
- Tool Schema Guards: Essential for MCP-based agents. They catch invalid API calls internally, allowing the LLM to self-correct without crashing the environment.
- Tool Shortlisting: When an environment (like AppWorld) has 400+ tools, it breaks most LLM context windows. Shortlisting relevant tools is the difference between a 0% success rate and being competitive.
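Both components can be sketched in a few lines. The schema guard validates a proposed call and returns a correctable error message to the LLM instead of forwarding a malformed call to the environment; the shortlister ranks tools by relevance before they are exposed to the model. Everything here is a hedged illustration; the tool names, schemas, and the keyword-overlap scoring are assumptions, not the paper's implementation (a real system would likely use embeddings for shortlisting).

```python
# Hypothetical tool registry: name -> required arguments / description.
TOOL_SCHEMAS = {
    "send_message": {"required": {"recipient", "body"}},
    "bash": {"required": {"cmd"}},
}
TOOLS = {
    "send_message": "send a chat message to a user",
    "bash": "run a shell command in the repo",
    "calendar_add": "add an event to the user calendar",
}

def guard_call(tool: str, args: dict) -> tuple[bool, str]:
    """Catch invalid calls and phrase the failure as feedback the
    LLM can use to self-correct, rather than crashing the env."""
    if tool not in TOOL_SCHEMAS:
        return False, f"Unknown tool '{tool}'. Available: {sorted(TOOL_SCHEMAS)}"
    missing = TOOL_SCHEMAS[tool]["required"] - args.keys()
    if missing:
        return False, f"'{tool}' is missing arguments: {sorted(missing)}"
    return True, "ok"

def shortlist_tools(query: str, tools: dict[str, str], k: int = 2) -> list[str]:
    """Naive keyword-overlap shortlisting of tool descriptions."""
    q = set(query.lower().split())
    scored = sorted(tools, key=lambda t: -len(q & set(tools[t].lower().split())))
    return scored[:k]

ok, msg = guard_call("send_message", {"recipient": "support"})
print(ok, "->", msg)   # False -> 'send_message' is missing arguments: ['body']
print(shortlist_tools("send the user a message", TOOLS, k=1))  # ['send_message']
```

The key design point in both helpers is that failure stays inside the agent loop: an invalid call or an oversized tool list becomes text the model can reason over, rather than a hard error from the environment.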

Critical Analysis & Future Outlook
Takeaway: General agents are no longer a fantasy. The fact that a zero-shot general agent can match a hand-tuned SWE agent is a watershed moment for the industry.
Limitations: The current study is text-heavy. The next frontier for Exgentic must be multimodal integration. As agents move to "Computer Use" (GUI navigation), the Unified Protocol will need to evolve to handle coordinate-based actions and visual observations.
The Verdict: If you are building an agentic system today, stop tuning for a single benchmark. Use the Exgentic protocol to ensure your "General Agent" is actually general.
