[NeurIPS 2025] Gym-Anything: Breaking the Scaling Wall for Computer-Use Agents
Abstract

The paper introduces Gym-Anything, a framework that automates the conversion of any software into an interactive Gymnasium-style environment for computer-use agents (CUAs). Using this pipeline, the authors built CUA-World, a dataset of over 10,000 long-horizon tasks across 200 software applications, setting a new state of the art in the scale of environments available for agent training and evaluation.

TL;DR

Current AI agents excel at simple web tasks but fail in specialized professional software. Gym-Anything addresses this with a multi-agent pipeline that automatically turns any software (from 3D Slicer to SAP) into a training environment. The resulting CUA-World dataset provides 10,000+ tasks, with software selection grounded in U.S. GDP data, and shows that even the best models (GPT-4/Gemini) are far from mastering "real work" that requires hundreds of steps.

The Motivation: Why Agents Can't Do "Real Work"

If you look at current benchmarks like WebArena or Mind2Web, agents are mostly "playing" in sandboxes: ordering pizza or changing a wallpaper. However, the software that drives the global economy—radiology tools, financial ERPs, and engineering CAD software—remains untouched.

The problem isn't just model intelligence; it's infrastructure. Setting up a professional environment manually takes weeks of expert time. Without diverse, complex environments, we cannot generate the "long-horizon" training data agents need to learn professional workflows.

Methodology: The Creation-Audit Loop

The core insight of Gym-Anything is that environment setup is itself a computer-use task. Instead of humans writing Dockerfiles, the authors use a three-agent system:

  1. Creation Agent: Researches the software, writes installation/configuration scripts, and populates it with real-world data (e.g., actual clinical CT scans).
  2. Audit Agent: Acts as an adversary. It ignores the Creation Agent's claims and strictly checks "evidence" (screenshots, logs) to ensure the software isn't stuck on a setup screen or using fake data.
  3. Summarization Agent: Consolidates "learnings" into a shared memory, allowing the system to get faster as it encounters more software.
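
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Attempt`, `Memory`, and `creation_audit_loop`, the callable signatures, and the `max_rounds` cap are all hypothetical, and the three agents are replaced by toy functions so the control flow can be run as is.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One setup attempt: what the creator claims and the evidence it left."""
    claim: str
    evidence: List[str]  # e.g. descriptions of screenshots or logs


@dataclass
class Memory:
    """Shared notes written by the summarization step (hypothetical structure)."""
    learnings: List[str] = field(default_factory=list)


def creation_audit_loop(
    create: Callable[[str, List[str]], Attempt],
    audit: Callable[[List[str]], Tuple[bool, str]],
    summarize: Callable[[str, List[str]], str],
    software: str,
    memory: Memory,
    max_rounds: int = 3,
) -> bool:
    """Run create -> audit rounds until the audit passes or rounds run out."""
    feedback: List[str] = []
    passed = False
    for _ in range(max_rounds):
        attempt = create(software, memory.learnings + feedback)
        # The auditor only receives evidence, never the creator's claim.
        passed, reason = audit(attempt.evidence)
        if passed:
            break
        feedback.append(reason)
    # Consolidate what went wrong into shared memory for future software.
    memory.learnings.append(summarize(software, feedback))
    return passed


# Toy stand-ins for the three agents (assumptions, for illustration only).
def toy_create(software: str, hints: List[str]) -> Attempt:
    if any("setup wizard" in h for h in hints):
        return Attempt("installed", ["screenshot: main window with sample data"])
    return Attempt("installed", ["screenshot: setup wizard"])


def toy_audit(evidence: List[str]) -> Tuple[bool, str]:
    if any("setup wizard" in e for e in evidence):
        return False, "stuck on setup wizard"
    return True, "ok"


def toy_summarize(software: str, feedback: List[str]) -> str:
    return f"{software}: " + ("; ".join(feedback) or "no issues")


memory = Memory()
ok = creation_audit_loop(toy_create, toy_audit, toy_summarize, "demo-app", memory)
```

In this toy run the first attempt is rejected because the evidence shows a setup screen, the rejection reason is fed back to the creator, and the second attempt passes. The point of the design is that the audit decision depends only on `attempt.evidence`, so an over-optimistic `claim` cannot influence it.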

Figure 1 (Gym-Anything architecture): The phase-based pipeline from GDP-grounded selection to task amplification and VLM verification.

CUA-World: A GDP-Grounded Benchmark

The authors didn't pick software at random. They mapped ~16,600 applications to U.S. GDP data, ensuring the benchmark covers all 22 major occupation groups.

Granular Verification with "Privileged Information"

A major contribution is how they score these agents. Traditional "binary" pass/fail is too noisy. They use a Checklist-based VLM Verifier. Crucially, the verifier has access to Privileged Information—data extracted from the setup scripts that the agent doesn't see (e.g., the exact ground-truth coordinates of a tumor in a medical scan). This allows for partial credit and prevents "hallucinated" success.
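
To make the partial-credit idea concrete, here is a minimal sketch of checklist scoring with privileged information. It is an assumption-laden illustration: in the paper the checks are judged by a VLM, whereas this sketch swaps in plain Python predicates, and the names `ChecklistItem`, `checklist_score`, the state keys, and the 5-pixel tolerance are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ChecklistItem:
    description: str
    # check(final_state, privileged) -> True if the item is satisfied
    check: Callable[[Dict, Dict], bool]


def checklist_score(items: List[ChecklistItem], final_state: Dict, privileged: Dict) -> float:
    """Return the fraction of checklist items satisfied (partial credit)."""
    if not items:
        return 0.0
    passed = sum(1 for item in items if item.check(final_state, privileged))
    return passed / len(items)


def marker_near_truth(state: Dict, priv: Dict) -> bool:
    """Compare the agent's marker with ground truth the agent never saw."""
    if "marker_xy" not in state:
        return False
    return math.dist(state["marker_xy"], priv["tumor_xy"]) <= 5.0  # assumed tolerance


items = [
    ChecklistItem("a marker was placed", lambda s, p: "marker_xy" in s),
    ChecklistItem("marker is near the true location", marker_near_truth),
    ChecklistItem("report was exported", lambda s, p: s.get("report_exported", False)),
]

# Privileged information comes from the setup script, not from the agent.
privileged = {"tumor_xy": (120, 88)}
final_state = {"marker_xy": (118, 90), "report_exported": False}

score = checklist_score(items, final_state, privileged)
```

Here the agent placed the marker close to the true location but never exported the report, so it earns two of three items rather than a flat failure. An agent that merely claimed success would score nothing, because every check reads the final state and the hidden ground truth rather than the agent's own report.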

Table 1 (task examples): Examples of tasks across AstroImageJ, Apache Writer, and Aerobridge showing the use of Privileged Information for verification.

Experimental Insights

The results are a "reality check" for the field:

  • The Power of Small Models: By distilling teacher trajectories from CUA-World, a 2B-parameter model outscored original models twice its size.
  • The Long-Horizon Gap: On the "CUA-World-Long" split, tasks often require over 200 steps. GPT-4 and Gemini-3 often "give up" or get stuck in retry loops.
  • Test-Time Auditing (TTA): The authors found that agents often stop prematurely, claiming a task is done. By adding an "Auditor" at test-time to give feedback (e.g., "You missed the final export step"), performance on the hardest tasks jumped significantly.
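
The test-time auditing idea in the last bullet can be sketched as a small wrapper around an agent. This is a hedged illustration, not the paper's code: `run_with_test_time_audit`, the callable signatures, the dictionary state, and the `max_audits` cap are hypothetical, and the agent and auditor are toy functions.

```python
from typing import Callable, Dict, Optional, Tuple


def run_with_test_time_audit(
    act_until_done: Callable[[Dict, Optional[str]], Dict],
    audit: Callable[[Dict], Optional[str]],
    initial_state: Dict,
    max_audits: int = 3,
) -> Tuple[Dict, int]:
    """Let the agent run, then re-run it with auditor feedback until the
    auditor has no complaints or the audit budget is spent."""
    state = act_until_done(initial_state, None)
    audits_used = 0
    while audits_used < max_audits:
        feedback = audit(state)  # None means the auditor is satisfied
        if feedback is None:
            break
        audits_used += 1
        state = act_until_done(state, feedback)
    return state, audits_used


# Toy agent: does the edit, but only exports when reminded (assumption).
def toy_agent(state: Dict, feedback: Optional[str]) -> Dict:
    new_state = dict(state, edited=True)
    if feedback and "export" in feedback:
        new_state["exported"] = True
    return new_state


# Toy auditor: checks the state instead of trusting the agent's "done".
def toy_auditor(state: Dict) -> Optional[str]:
    if not state.get("exported"):
        return "You missed the final export step"
    return None


final_state, rounds = run_with_test_time_audit(
    toy_agent, toy_auditor, {"exported": False}
)
```

In the toy run the agent stops early, the auditor points out the missing export, and one extra round completes the task. The `max_audits` cap is a design choice of this sketch to guarantee termination if the agent never satisfies the auditor.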

Figure 2 (experimental results): Training data scaling shows a consistent log-linear improvement in performance as more software is added.
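
One way to read "log-linear" here, written as a formula. This is an interpretation, not an equation taken from the paper: $N$ denotes the number of software applications in the training data, and $a$ and $b$ are fitted constants whose values this summary does not report.

```latex
\mathrm{score}(N) \approx a + b \log N,
\qquad
\mathrm{score}(2N) - \mathrm{score}(N) \approx b \log 2
```

Under this reading, every doubling of the software pool adds roughly the same fixed increment to the score, which means the gains continue as more software is added but each additional application contributes less than the one before.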

Critical Analysis & Conclusion

Gym-Anything shifts the focus from model architecture to data orchestration.

Limitations: While the GDP grounding is brilliant, "sandboxable" software is often the open-source alternative to the industry standard (e.g., GIMP instead of Photoshop). Whether skills transfer from these open alternatives to proprietary ones remains an open question.

Future Work: This framework opens the door for Autonomous RL. Since Gym-Anything can spin up environments and verify success automatically, we can finally move towards "self-evolving" agents that practice professional tasks in the background, much like AlphaGo practiced Go.

Takeaway: If we want agents to move the needle on global GDP, we need them to master the heterogeneous, messy, and long-horizon world of professional software. Gym-Anything provides the first scalable "gym" for that training.
