[State-of-the-Art] UI-Voyager: Outperforming Humans with a Self-Evolving 4B GUI Agent
Abstract

UI-Voyager is a self-evolving mobile GUI agent framework that addresses long-horizon task automation using a two-stage optimization pipeline: Rejection Fine-Tuning (RFT) and Group Relative Self-Distillation (GRSD). A 4B parameter model trained with this method achieved a state-of-the-art 81.0% success rate on the AndroidWorld benchmark, surpassing both massive 235B models and human-level performance (80.0%).

The quest for autonomous mobile GUI agents has often been a story of "large vs. smart." While massive models like Gemini or Qwen-VL-235B show raw power, they often stumble over the nuances of long-horizon mobile tasks, where a single wrong click out of 30 can lead to total failure.

Enter UI-Voyager, a novel framework that demonstrates how a compact 4B parameter model can not only compete with giants but also surpass human-level performance on the rigorous AndroidWorld benchmark (81.0% vs. 80.0%).

TL;DR

UI-Voyager is a two-stage self-evolving GUI agent. By combining Rejection Fine-Tuning (RFT) with a novel Group Relative Self-Distillation (GRSD) mechanism, it learns from both successful and failed trajectories to provide dense, step-level supervision. This effectively solves the "credit assignment" problem—identifying exactly where a model went wrong in a long sequence of actions.


The Core Bottleneck: "Where Did I Fail?"

Navigating a mobile app is a sequence of actions (clicks, swipes, types). Existing GUI agents face two major issues:

  1. Inefficient Learning from Failure: Failed attempts are usually discarded, wasting valuable context.
  2. Sparse Rewards: If a task fails at step 25 of 30, a standard RL agent only knows it failed at the end. It doesn't know which step was the culprit.

Methodology: The Two-Stage Evolution

UI-Voyager tackles these bottlenecks in two distinct phases:

Phase 1: Rejection Fine-Tuning (RFT)

The model acts as its own data generator. It generates multiple attempts for a task; only the successful ones are kept via a "rule-based verifier." These successes are then used for Supervised Fine-Tuning (SFT) in an iterative loop. This "warm-starts" the model to a high performance level (jumping from 37% to 73% success in three rounds).
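To make this loop concrete, here is a rough sketch of how the iterative RFT procedure might be wired up. The helpers sample_trajectory, rule_based_verifier, and supervised_finetune are illustrative stand-ins, not APIs from the paper:

```python
# Minimal sketch of the rejection fine-tuning loop (names are illustrative).
def rejection_finetune(model, tasks, rounds=3, attempts_per_task=8):
    for _ in range(rounds):
        accepted = []
        for task in tasks:
            for _ in range(attempts_per_task):
                traj = sample_trajectory(model, task)     # model acts as its own data generator
                if rule_based_verifier(task, traj):       # keep only verified successes
                    accepted.append(traj)
        model = supervised_finetune(model, accepted)      # SFT on the filtered successes
    return model
```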

Phase 2: Group Relative Self-Distillation (GRSD)

This is the "secret sauce." When an agent tries a task multiple times, it creates a "group" of trajectories—some succeed, some fail. UI-Voyager identifies Fork Points: moments where a failed trajectory and a successful one saw the same screen but chose different actions.

Overall Architecture
Figure: The UI-Voyager pipeline showing RFT and GRSD stages.

How Fork Point Detection Works

Using SSIM (Structural Similarity Index), the model compares screenshots from failed and successful runs. If it finds two frames that are semantically identical (the same UI state) but the actions taken afterwards diverge, it labels this a "fork point."
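A minimal sketch of how such fork-point matching might look, assuming trajectories are stored as lists of (screenshot, action) pairs of grayscale arrays and using scikit-image's SSIM; the threshold and function names are illustrative, not the paper's implementation:

```python
# Hypothetical fork-point search between a failed and a successful trajectory.
from skimage.metrics import structural_similarity as ssim

SSIM_THRESHOLD = 0.95  # assumed cutoff for "same UI state"

def find_fork_point(failed_traj, success_traj, threshold=SSIM_THRESHOLD):
    """Return (i, j) indices of matching screens whose next actions diverge.

    Each trajectory is a list of (screenshot, action) pairs, where screenshots
    are same-shaped grayscale uint8 arrays.
    """
    for i, (shot_f, act_f) in enumerate(failed_traj):
        for j, (shot_s, act_s) in enumerate(success_traj):
            if ssim(shot_f, shot_s) >= threshold and act_f != act_s:
                return i, j  # same UI state, different decisions: a fork point
    return None
```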

  • Teacher: The successful path's action at that state.
  • Student: The failed path's history/context.

The model is then trained to predict the "teacher's" action given the "student's" context.
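Continuing the illustrative trajectory format from above, a fork point can be turned into a single dense training example roughly like this (a sketch, not the authors' data pipeline):

```python
# Build a step-level supervised example from one fork point.
def build_distillation_example(failed_traj, success_traj, fork):
    i, j = fork
    student_context = failed_traj[:i + 1]     # the failed run's history up to the mistake
    _, teacher_action = success_traj[j]       # what the successful run did at that state
    return {
        "context": student_context,           # model input at training time
        "target_action": teacher_action,      # supervised label for this step
    }
```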

Fork Point Logic
Figure: Illustration of identifying fork points to extract dense supervision.


Experimental Results: Size Isn't Everything

The results on the AndroidWorld benchmark (116 tasks) are striking: UI-Voyager (4B) beats models more than 50 times its size.

| Model | Params | Success Rate |
| :--- | :--- | :--- |
| Gemini-2.5-Pro | - | 69.7% |
| UI-Tars-2 | 230B | 73.3% |
| Human Expert | - | 80.0% |
| UI-Voyager | 4B | 81.0% |

Performance Curve
Figure: Comparison of RFT iterations and standard RL (PPO/GRPO). Standard RL methods often plateau, while UI-Voyager continues to improve through GRSD.

Case Study: Solving BrowserMaze

In a navigation task called "BrowserMaze," the agent must move a cursor to a target. A failed run might try to move right into a wall. By identifying the fork point at Step 12—where the screen was identical to a successful run that moved "Down"—UI-Voyager learns the correct logic ("I need to move down, not right") without any human labels.


Critical Analysis & Conclusion

Takeaways: UI-Voyager proves that dense self-distillation is a superior alternative to sparse-reward RL for long-horizon agent tasks. By using its own successful rollouts as "local teachers," it sidesteps the sample inefficiency of PPO/GRPO.

Limitations:

  • Action Space: The agent operates on high-level primitives (click, swipe). Real-world apps might require more nuanced continuous gestures.
  • Visual Noise: SSIM matching can be tricked by transient UI elements (like a blinking cursor or a toast notification). Future work could use OCR or accessibility tree structural signals to improve state matching.
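As a rough illustration of that direction, a state-equivalence check could combine pixel SSIM with a structural signature of the accessibility tree, so transient pixels cannot flip the match on their own. Everything below is an assumption for illustration, not part of UI-Voyager:

```python
# Hedged sketch: hybrid state matching from screenshots plus accessibility tree.
import hashlib
from skimage.metrics import structural_similarity as ssim

def tree_signature(a11y_tree):
    """Hash element classes and bounds, ignoring text that may animate."""
    nodes = sorted((n["class"], tuple(n["bounds"])) for n in a11y_tree)
    return hashlib.sha1(repr(nodes).encode()).hexdigest()

def same_state(shot_a, shot_b, tree_a, tree_b, pixel_threshold=0.90):
    # Require both structural agreement and near-identical pixels.
    return (tree_signature(tree_a) == tree_signature(tree_b)
            and ssim(shot_a, shot_b) >= pixel_threshold)
```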

Future Insight: This paradigm and its high performance suggest we are nearing a point where 4B-8B parameter models can reliably serve as on-device personal assistants, capable of self-correcting their errors based on their past successes.
