The paper introduces Gym-Anything, a framework that automates the conversion of software applications into interactive Gymnasium-style environments for computer-use agents (CUAs). Using this pipeline, the authors built CUA-World, a dataset of over 10,000 long-horizon tasks across 200 diverse software applications, which they present as a new state of the art in environment scale for agent training and evaluation.
TL;DR
Current AI agents excel at simple web tasks but fail in specialized professional software. Gym-Anything addresses this with a multi-agent pipeline that automatically turns software (from 3D Slicer to SAP) into training environments. The resulting CUA-World dataset provides 10,000+ tasks, with software selection grounded in U.S. GDP data, and shows that even the best models (GPT-4/Gemini) are far from mastering "real work" that requires hundreds of steps.
The Motivation: Why Agents Can't Do "Real Work"
If you look at current benchmarks like WebArena or Mind2Web, agents are mostly "playing" in sandboxes: ordering pizza or changing a wallpaper. However, the software that drives the global economy (radiology tools, financial ERPs, and engineering CAD software) remains largely untouched.
The problem isn't just model intelligence; it's infrastructure. Setting up a professional environment manually takes weeks of expert time. Without diverse, complex environments, we cannot generate the "long-horizon" training data agents need to learn professional workflows.
Methodology: The Creation-Audit Loop
The core insight of Gym-Anything is that environment setup is itself a computer-use task. Instead of humans writing Dockerfiles, the authors use a three-agent system:
- Creation Agent: Researches the software, writes installation/configuration scripts, and populates it with real-world data (e.g., actual clinical CT scans).
- Audit Agent: Acts as an adversary. It ignores the Creation Agent's claims and strictly checks "evidence" (screenshots, logs) to ensure the software isn't stuck on a setup screen or using fake data.
- Summarization Agent: Consolidates "learnings" into a shared memory, allowing the system to get faster as it encounters more software.
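The interplay of the three agents can be sketched as a retry loop. This is a minimal illustration, not the authors' implementation: the names `Evidence`, `AuditResult`, and `creation_audit_loop`, the round limit, and the choice to summarize after every round are all assumptions, and the toy agents stand in for what would be LLM-driven components.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """Artifacts the auditor inspects (the creator's own claims are excluded)."""
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)


@dataclass
class AuditResult:
    passed: bool
    feedback: str = ""


def creation_audit_loop(
    software: str,
    create: Callable[[str, List[str], str], Evidence],
    audit: Callable[[Evidence], AuditResult],
    summarize: Callable[[str, AuditResult], str],
    memory: List[str],
    max_rounds: int = 3,  # assumed limit, not from the paper
) -> bool:
    """Alternate creation and audit until the audit passes or rounds run out."""
    feedback = ""
    for _ in range(max_rounds):
        evidence = create(software, memory, feedback)
        result = audit(evidence)
        # Assumption: a learning is recorded after every round for later reuse.
        memory.append(summarize(software, result))
        if result.passed:
            return True
        feedback = result.feedback
    return False


# Toy stand-ins for the three agents (hypothetical).
def toy_create(software: str, memory: List[str], feedback: str) -> Evidence:
    # The first attempt stalls on a setup wizard; the retry acts on the feedback.
    if "setup wizard" in feedback:
        return Evidence(screenshots=["main_window.png"], logs=["data loaded"])
    return Evidence(screenshots=["setup_wizard.png"], logs=[])


def toy_audit(evidence: Evidence) -> AuditResult:
    if any("setup_wizard" in s for s in evidence.screenshots):
        return AuditResult(False, "Still on the setup wizard; finish configuration.")
    if not evidence.logs:
        return AuditResult(False, "No log evidence that data was loaded.")
    return AuditResult(True)


def toy_summarize(software: str, result: AuditResult) -> str:
    status = "passed" if result.passed else f"failed: {result.feedback}"
    return f"{software}: audit {status}"


shared_memory: List[str] = []
ok = creation_audit_loop("ExampleApp", toy_create, toy_audit, toy_summarize, shared_memory)
print(ok, len(shared_memory))  # True 2
```

The auditor only ever receives `Evidence`, never the creator's description of what it did, which mirrors the adversarial separation described above.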
Figure 1: The phase-based pipeline from GDP-grounded selection to task amplification and VLM verification.
CUA-World: A GDP-Grounded Benchmark
The authors didn't pick software at random. They mapped ~16,600 applications to U.S. GDP data, ensuring the benchmark covers all 22 major occupation groups.
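One way to picture GDP-grounded selection with guaranteed coverage is a two-pass ranking. The sketch below is purely illustrative: the weighting formula, the `usage_share` field, the budget parameter, and all numbers are invented for the example and are not taken from the paper.

```python
from typing import Dict, List, NamedTuple


class App(NamedTuple):
    name: str
    occupation_group: str
    usage_share: float  # hypothetical: share of the group's work done in this app


def select_apps(apps: List[App], group_gdp: Dict[str, float], budget: int) -> List[str]:
    """Pick `budget` apps by economic weight while covering every occupation group."""
    groups = {a.occupation_group for a in apps}
    if budget < len(groups):
        raise ValueError("budget is too small to cover every occupation group")

    # Assumed weight: GDP attributed to the group times the app's share of it.
    ranked = sorted(
        apps,
        key=lambda a: group_gdp[a.occupation_group] * a.usage_share,
        reverse=True,
    )

    chosen: List[App] = []
    covered = set()
    # Pass 1: take the top-weighted app of each group to guarantee coverage.
    for app in ranked:
        if app.occupation_group not in covered:
            chosen.append(app)
            covered.add(app.occupation_group)
    # Pass 2: spend any remaining budget on the highest-weighted leftovers.
    for app in ranked:
        if len(chosen) >= budget:
            break
        if app not in chosen:
            chosen.append(app)
    return [a.name for a in chosen]


# Made-up toy data: three occupation groups, four candidate apps.
toy_apps = [
    App("ImagingTool", "Healthcare", 0.6),
    App("ChartTool", "Healthcare", 0.3),
    App("LedgerTool", "Finance", 0.5),
    App("CadTool", "Engineering", 0.4),
]
toy_gdp = {"Healthcare": 2.0, "Finance": 3.0, "Engineering": 1.0}

print(select_apps(toy_apps, toy_gdp, budget=3))
# ['LedgerTool', 'ImagingTool', 'CadTool']
```

Note that `CadTool` is selected ahead of the higher-weighted `ChartTool` because the coverage pass runs first; a purely global ranking would have left the Engineering group empty.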
Granular Verification with "Privileged Information"
A major contribution is how they score these agents. Traditional "binary" pass/fail is too noisy. They use a Checklist-based VLM Verifier. Crucially, the verifier has access to Privileged Information—data extracted from the setup scripts that the agent doesn't see (e.g., the exact ground-truth coordinates of a tumor in a medical scan). This allows for partial credit and prevents "hallucinated" success.
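The partial-credit idea can be sketched as a weighted checklist whose checks may read privileged data. This is an assumption-laden toy: in the paper a VLM judges each item, whereas here deterministic functions stand in for it, and the item weights, field names, coordinates, and tolerance are all invented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ChecklistItem:
    description: str
    weight: float
    # check(final_state, privileged) -> True when the item is satisfied.
    check: Callable[[Dict[str, Any], Dict[str, Any]], bool]


def score_episode(
    items: List[ChecklistItem],
    final_state: Dict[str, Any],
    privileged: Dict[str, Any],
) -> float:
    """Return the weighted fraction of satisfied checklist items, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("checklist weights must sum to a positive value")
    earned = sum(item.weight for item in items if item.check(final_state, privileged))
    return earned / total


def marker_near_tumor(state: Dict[str, Any], priv: Dict[str, Any]) -> bool:
    # Compares the agent's marker with ground truth that the agent never saw.
    return all(
        abs(a - b) <= priv["tolerance"]
        for a, b in zip(state["marker_xyz"], priv["tumor_xyz"])
    )


checklist = [
    ChecklistItem("Scan was opened", 1.0, lambda s, p: s["scan_opened"]),
    ChecklistItem("Marker placed on the tumor", 2.0, marker_near_tumor),
    ChecklistItem("Report was exported", 1.0, lambda s, p: s["report_exported"]),
]

# Hypothetical values: ground truth comes from the setup script, not the agent.
privileged_info = {"tumor_xyz": (120, 85, 40), "tolerance": 5}
agent_final_state = {
    "scan_opened": True,
    "marker_xyz": (122, 84, 41),
    "report_exported": False,
}

print(score_episode(checklist, agent_final_state, privileged_info))  # 0.75
```

A binary verifier would report this run as a failure because the export is missing; the checklist shows that the agent completed the harder localization step, and an agent that merely claimed success without placing the marker would score lower.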
Table 1: Examples of tasks across AstroImageJ, Apache Writer, and Aerobridge showing the use of Privileged Information for verification.
Experimental Insights
The results are a "reality check" for the field:
- The Power of Small Models: After distillation on teacher trajectories from CUA-World, a 2B-parameter model outscored original models twice its size.
- The Long-Horizon Gap: On the "CUA-World-Long" split, tasks often require over 200 steps. GPT-4 and Gemini-3 often "give up" or get stuck in retry loops.
- Test-Time Auditing (TTA): The authors found that agents often stop prematurely, claiming a task is done. By adding an "Auditor" at test-time to give feedback (e.g., "You missed the final export step"), performance on the hardest tasks jumped significantly.
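The test-time auditing idea from the last bullet can be sketched as a loop that intercepts the agent's completion claim. Everything here is a simplified assumption: the `"DONE"` sentinel, the step and audit limits, and the toy agent and auditor are hypothetical and do not reflect the paper's actual interface.

```python
from typing import Callable, List, Optional

DONE = "DONE"  # hypothetical sentinel the agent emits when it claims completion


def run_with_auditor(
    agent_step: Callable[[List[str], Optional[str]], str],
    auditor: Callable[[List[str]], Optional[str]],
    max_steps: int = 20,
    max_audits: int = 2,
) -> List[str]:
    """Run the agent; on each completion claim, let the auditor object."""
    history: List[str] = []
    feedback: Optional[str] = None
    audits_used = 0
    for _ in range(max_steps):
        action = agent_step(history, feedback)
        feedback = None  # feedback is delivered once, then cleared
        history.append(action)
        if action != DONE:
            continue
        if audits_used >= max_audits:
            break  # audit budget exhausted: accept the claim as-is
        audits_used += 1
        feedback = auditor(history)
        if feedback is None:
            break  # the auditor found nothing missing
    return history


def toy_agent(history: List[str], feedback: Optional[str]) -> str:
    # Acts on auditor feedback if given; otherwise follows a short script
    # that (wrongly) stops before exporting.
    if feedback is not None and "export" in feedback:
        return "export_report"
    script = ["open_file", "apply_edit"]
    steps_taken = len([a for a in history if a != DONE])
    return script[steps_taken] if steps_taken < len(script) else DONE


def toy_auditor(history: List[str]) -> Optional[str]:
    if "export_report" not in history:
        return "You missed the final export step."
    return None


print(run_with_auditor(toy_agent, toy_auditor))
# ['open_file', 'apply_edit', 'DONE', 'export_report', 'DONE']
```

With `max_audits=0` the same agent stops after its first `DONE`, which reproduces the premature-stopping failure the auditor is meant to catch.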
Figure 2: Training data scaling shows a consistent log-linear improvement in performance as more software is added.
Critical Analysis & Conclusion
Gym-Anything shifts the focus from model architecture to data orchestration.
Limitations: While the GDP grounding is brilliant, "sandboxable" software is often the open-source alternative to the industry standard (e.g., GIMP instead of Photoshop). Whether skills transfer from these open alternatives to proprietary ones remains an open question.
Future Work: This framework opens the door for Autonomous RL. Since Gym-Anything can spin up environments and verify success automatically, we can finally move towards "self-evolving" agents that practice professional tasks in the background, much like AlphaGo practiced Go.
Takeaway: If we want agents to move the needle on global GDP, we need them to master the heterogeneous, messy, and long-horizon world of professional software. Gym-Anything provides the first scalable "gym" for that training.
