[CUA-SUITE] Beyond Screenshots: Scaling Continuous Video for Professional Desktop Agents
Abstract

The paper introduces CUA-SUITE, a massive-scale ecosystem for training and evaluating Computer-Use Agents (CUAs). It features VIDEOCUA, a dataset of 10,000 human-demonstrated desktop tasks (55 hours of 30 fps video) across 87 applications, exceeding existing open datasets by more than 2.5x in frame count.

TL;DR

The field of Computer-Use Agents (CUAs) has long been stuck in a "screenshot-to-action" paradigm. CUA-SUITE breaks this bottleneck by releasing the largest open expert video corpus to date: VIDEOCUA. With 55 hours of 30 fps video, 6 million frames, and dense "Chain-of-Thought" annotations, it provides the temporal and causal depth needed to move agents from simple web browsing to mastery of professional tools such as Blender, VS Code, and Krita.

Problem: The "Sparsity" Trap in UI Automation

Most existing datasets (e.g., Mind2Web, ScaleCUA) rely on static screenshots. While efficient to store, they discard the kinematic "glue" of human interaction: the subtle mouse deceleration, the hover effects, and the visual feedback that occur between clicks. Without this, agents remain brittle, especially in complex desktop software where buttons are tiny and layouts are non-standard.

Methodology: The CUA-SUITE Ecosystem

The authors propose a three-tiered architecture to achieve "Full-Stack" intelligence:

  1. VIDEOCUA (The Trajectory Engine): Captures uncut 30 fps expert videos. Every action is logged with millisecond precision, including the full cursor path (see the sketch after this list).
  2. GROUNDCUA (The Perception Engine): 56K human-verified screenshots with 3.6 million UI element labels, focusing on pixel-level precision for specialized software.
  3. UI-VISION (The Diagnostic Bench): A rigorous evaluation suite for grounding, layout understanding, and action planning.
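
To make the millisecond-precision logging concrete, here is a minimal sketch of what a cursor-aligned event record could look like in Python. The class and field names (CursorSample, ActionEvent) are illustrative assumptions, not the released VIDEOCUA schema.

```python
from dataclasses import dataclass, field

@dataclass
class CursorSample:
    """One point on the cursor path, sampled alongside the 30 fps video."""
    t_ms: int  # milliseconds since recording start
    x: int     # screen x-coordinate in pixels
    y: int     # screen y-coordinate in pixels

@dataclass
class ActionEvent:
    """One logged input event with the cursor trajectory that led to it."""
    t_ms: int     # millisecond-precision timestamp of the event
    kind: str     # e.g. "click", "drag", "key_press", "scroll"
    target: str   # description of the UI element acted upon
    path: list[CursorSample] = field(default_factory=list)
```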

Multi-layered Reasoning Synthesis

To bridge the gap between raw pixels and semantic intent, every step in VIDEOCUA is enriched with four layers of annotation (powered by Claude-Sonnet-4.5 synthesis); a schema sketch follows the list:

  • Observation: Detailed screen state description.
  • Thought: Logical reasoning connecting goals to actions.
  • Action: Natural language grounding of the intent.
  • Reflection: Post-action analysis for self-correction.
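
As a sketch, the four layers can be thought of as one structured record per step. The class name and the example strings below are illustrative, not the dataset's actual format.

```python
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    """The four reasoning layers attached to one step (illustrative schema)."""
    observation: str  # detailed description of the current screen state
    thought: str      # reasoning connecting the task goal to the next action
    action: str       # natural-language grounding of the intended action
    reflection: str   # post-action analysis used for self-correction

step = StepAnnotation(
    observation="Krita is open; the brush-settings panel is collapsed.",
    thought="Resizing the brush requires the settings panel to be expanded first.",
    action="Click the chevron on the right edge of the brush-settings panel.",
    reflection="The panel expanded as expected; the size slider is now visible.",
)
```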

Figure (System Overview): The CUA-SUITE pipeline, from human expert recording to dense reasoning trajectories.

Experiments: The Reality Check

The researchers evaluated the current state-of-the-art (OpenCUA-32B). The results were a wake-up call for the community:

  • Task Failure Rate: In professional apps (like Krita or FreeCAD), the failure rate remains high (~60%).
  • Spatial Bottleneck: While models are getting better at "what" to click, they struggle with "where" (Spatial Grounding), especially in dense multi-panel layouts.
  • Scaling Performance: Moving from 7B to 32B parameters improved accuracy by 21.2 points, but absolute performance is still far from human-level reliability.

Figure (Action Prediction Failures): Common failure modes, including cross-panel confusion and misinterpreting dense toolbars.

Impact: The Future of "World Models"

CUA-SUITE isn't just a dataset; it's a catalyst for several emerging research directions:

  • Continuous Control: Using kinematic traces to train agents that move the mouse like a human, in line with Fitts's Law (see the sketch after this list).
  • Visual World Models: Predicting the next frame $s_{t+1}$ given $(s_t, a_t)$ to allow agents to "imagine" the result of a click before committing to it.
  • Generalist Screen Parsing: Moving beyond the DOM/HTML limitation to interpret any pixel-based GUI.
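
As one illustration of how kinematic traces could be used, here is a minimal sketch of Fitts's Law in its Shannon formulation, which predicts pointing time from target distance and width. The a and b constants are placeholder values that would in practice be fitted to human cursor traces.

```python
import math

def fitts_movement_time(distance: float, width: float,
                        a: float = 0.1, b: float = 0.15) -> float:
    """Predicted pointing time in seconds (Shannon formulation of Fitts's Law).

    distance: cursor-to-target distance in pixels
    width:    target width along the movement axis in pixels
    a, b:     empirically fitted intercept and slope (placeholders here)
    """
    index_of_difficulty = math.log2(distance / width + 1.0)
    return a + b * index_of_difficulty

# A tiny toolbar icon far away is much "harder" than a large nearby panel:
print(fitts_movement_time(600, 20))   # ~0.84 s
print(fitts_movement_time(100, 200))  # ~0.19 s
```

An agent whose cursor dynamics respect this speed-accuracy tradeoff would decelerate near small targets, exactly the behavior that dense 30 fps traces make learnable.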

Conclusion

CUA-SUITE marks a pivot point in the GUI agent evolutionary tree. By releasing high-fidelity, high-framerate data with semantic reasoning, it provides the "raw material" needed to build truly autonomous digital coworkers.

Key Takeaway: To solve desktop automation, we need to stop treating the screen as a sequence of still images and start treating it as a dynamic, causal environment.
