TraceR1 is a novel two-stage reinforcement learning framework that enables multimodal AI agents to perform anticipatory planning rather than just reactive execution. By forecasting short-horizon trajectories before acting, it achieves State-of-the-Art (SOTA) performance on benchmarks like OSWorld-Verified and GAIA, significantly outperforming reactive open-source baselines.
TL;DR
Most AI agents are "blind" to the future—they click based on what they see now without considering where they need to be five steps later. TraceR1 changes this by introducing a two-stage Reinforcement Learning (RL) framework that trains agents to forecast entire trajectories. By combining high-level foresight with grounded execution feedback, it elevates open-source models to a level of planning stability previously reserved for proprietary giants like GPT-4o and Claude 4.
Problem: The "Reactive" Bottleneck
In the world of GUI automation and tool-use, the biggest hurdle isn't perception; it's coherence. Traditional agents optimize for the next immediate token or action. While this works for simple tasks, it fails in "long-horizon" scenarios—like booking a complex multi-city flight—where a small mistake in step 2 causes a total collapse by step 10.
Existing methods fall into two traps:
- Model-free RL: Focuses on local correctness but lacks global vision.
- Model-based Planning: Requires "world models" that are extremely difficult to build for visually dense, ever-changing digital interfaces.
Methodology: Anticipatory Planning via TraceR1
The authors argue that an agent should act like a human: visualize a few steps ahead, then refine the immediate move based on reality.
Stage 1: Anticipatory Trajectory Optimization
Instead of just training the model to predict the next action $a_t$, TraceR1 forces the model to predict a sequence $(a_t, a_{t+1}, \dots, a_{t+H})$ of future actions (see the sketch after this list). It uses Group Relative Policy Optimization (GRPO) with a trajectory-level reward that penalizes:
- Plan Deviations: How much the predicted future differs from a known "expert" path.
- Action Repetition: Loop-prevention logic to stop agents from clicking the same button endlessly.
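To make the reward concrete, here is a minimal sketch of how such a trajectory-level reward and GRPO's group-relative advantage could be computed. The penalty weights, the mismatch-based deviation measure, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def trajectory_reward(predicted, expert, w_dev=1.0, w_rep=0.5):
    """Score a predicted action sequence against an expert path.

    Hypothetical reward shaping: penalize deviation from the expert
    trajectory and penalize consecutive repeated actions (loops).
    """
    # Plan deviation: fraction of steps that disagree with the expert path
    # (the paper's exact distance measure may differ).
    horizon = max(min(len(predicted), len(expert)), 1)
    deviation = sum(p != e for p, e in zip(predicted, expert)) / horizon

    # Action repetition: fraction of steps that merely repeat the previous one.
    repeats = sum(a == b for a, b in zip(predicted, predicted[1:]))
    repetition = repeats / max(len(predicted) - 1, 1)

    return -(w_dev * deviation + w_rep * repetition)

def grpo_advantages(rewards):
    """GRPO-style advantage: normalize each sampled trajectory's reward
    against the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: a group of 4 sampled plans scored against one expert path.
expert = ["open_app", "click_search", "type_query", "click_result"]
group = [
    ["open_app", "click_search", "type_query", "click_result"],    # matches expert
    ["open_app", "click_search", "click_search", "click_result"],  # loops
    ["open_app", "type_query", "click_search", "click_result"],    # reordered
    ["open_app", "click_search", "type_query", "scroll_down"],     # drifts at end
]
rewards = [trajectory_reward(p, expert) for p in group]
print(grpo_advantages(rewards))  # the faithful, loop-free plan gets the highest advantage
```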
Figure 1: TraceR1 enables the planner to see several steps ahead while only executing the immediate next step.
Stage 2: Grounded Reinforcement Fine-tuning
Foresight is useless if you can't hit the target. Stage 2 uses real feedback from "tool agents" (like a GUI executor). If the model predicts a click at $(x, y)$ but that coordinate misses the button, the RL signal corrects the model's grounding. This ensures that the "vision" of the plan is matched by "precision" in execution.
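A hedged sketch of what that grounding signal could look like; the binary in-the-box reward and the bounding-box interface are assumptions about the general shape of the feedback, not the paper's exact reward.

```python
def grounding_reward(click_xy, target_bbox):
    """Binary grounding signal: did the predicted click land inside the
    target element's bounding box? (Illustrative; a denser, e.g.
    distance-shaped, reward is equally plausible.)

    click_xy:    (x, y) pixel coordinates predicted by the model
    target_bbox: (x_min, y_min, x_max, y_max) of the ground-truth element
    """
    x, y = click_xy
    x_min, y_min, x_max, y_max = target_bbox
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0

# The first click misses the button's box, so the RL signal is 0 and
# gradient updates push future predictions toward the target.
print(grounding_reward((412, 305), (430, 290, 520, 330)))  # 0.0
print(grounding_reward((455, 310), (430, 290, 520, 330)))  # 1.0
```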
Experiments: Surpassing the "Reactive" Baselines
TraceR1 was tested across 7 grueling benchmarks, including OSWorld (Desktop) and GAIA (General Assistant).
Key Results:
- OSWorld-Verified: Boosted Qwen3-VL-32B performance by over 15%, reaching 41.2% success—competing head-to-head with proprietary GPT-4.1-based systems.
- GAIA Benchmark: Achieved a substantial +8.7-point improvement in Answer Accuracy over the baseline Qwen3-VL-8B.
- Ablation Insight: Removing Stage 2 (Grounded Feedback) caused a 6% performance drop, showing that high-level planning must be anchored in physical/digital reality.
Table 1: TraceR1 vs. Proprietary and Open-Source Models on AndroidWorld and OSWorld.
Deep Insight: The "Prediction Horizon" Sweet Spot
One of the paper's most fascinating findings is the Horizon Trade-off. Training a model to look 5-10 steps ahead significantly improves success. However, if you force the model to predict 20+ steps, performance plummets. Why? Because the uncertainty of the far future accumulates too fast, creating "noisy" rewards that confuse the learning process.
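This trade-off is easy to reproduce with a toy model. The sketch below assumes each predicted step contributes independent noise to the trajectory-level reward, so the reward's standard deviation grows roughly with the square root of the horizon; the error magnitudes are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_noise_std(horizon, step_error_std=0.3, n_rollouts=10_000):
    """Toy model: each predicted step adds independent error, so the
    trajectory-level reward's noise grows roughly with sqrt(horizon)."""
    errors = rng.normal(0.0, step_error_std, size=(n_rollouts, horizon))
    trajectory_reward_noise = errors.sum(axis=1)
    return trajectory_reward_noise.std()

for h in (5, 10, 20, 40):
    print(f"horizon={h:>2}  reward-noise std ≈ {reward_noise_std(h):.2f}")
# Noise roughly doubles between horizon 5 and 20, diluting the learning
# signal that distinguishes good plans from bad ones.
```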
Figure 2: Analysis of the temporal discount factor and its impact on planning stability.
Conclusion
TraceR1 provides a scalable recipe for building smarter agents. By shifting the focus from "What do I do now?" to "Where is this sequence leading me?", it achieves a level of deliberate reasoning that moves us closer to truly autonomous digital assistants.
Future Work: The authors suggest moving toward "hierarchical planning," where agents can revise their long-term plans mid-execution if the environment changes unexpectedly—essentially adding a "self-correction" loop to the foresight mechanism.
