The paper introduces Ego2Web, the first benchmark bridging egocentric (first-person) video perception with web agent execution. It features 500 high-quality video-instruction pairs across diverse domains like e-commerce and navigation, evaluated using a novel "Ego2WebJudge" LLM-as-a-Judge framework on live websites.
TL;DR
Ego2Web is a groundbreaking benchmark that challenges AI agents to step out of the digital vacuum and into the user’s physical reality. By pairing egocentric videos (what a user sees through AR glasses/wearables) with online web tasks, it evaluates an agent's ability to identify real-world objects and execute related actions on live websites. Despite the prowess of models like GPT-4o and Gemini, a massive 40% performance gap remains relative to humans, showing that "seeing" and "doing" are still far from integrated.
The Motivation: Why "Digital-Only" Agents are Failing Us
Imagine wearing AR glasses and looking at a specific brand of coffee in a boutique. You tell your AI assistant, "Order a bag of this." Current state-of-the-art (SOTA) agents would likely fail because they are trained on screenshots, not physical context.
Existing benchmarks like VisualWebArena or Mind2Web focus on how to navigate a DOM tree or click buttons based on text instructions. They miss the "Why": the trigger for the action often lies in the unstructured, dynamic physical world surrounding the user. Ego2Web was born to fix this "missing link" between egocentric perception and digital execution.
Methodology: Grounding the Digital in the Physical
The authors created a sophisticated LLM + Human collaborative pipeline to ensure high-quality data (a minimal code sketch follows the list):
- Visual Parsing: Using MLLMs (like Qwen3-VL) to turn raw video into structured "video profiles" (timestamped captions of objects and actions).
- Task Synthesis: An LLM planner (GPT-5) looks at the video profile and active websites (Amazon, YouTube, etc.) to create instructions that cannot be solved without the video.
- Human Verification: Humans check for "Visual Grounding" (is the item actually in the video?), "Web Feasibility," and instruction clarity.
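The paper does not ship pipeline code, so the following is only a minimal sketch of how the three stages could be wired together. The data classes and the `caption_video`, `synthesize_task`, and `human_review` callables are hypothetical stand-ins for the MLLM parser, the LLM planner, and the annotation step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class VideoEvent:
    timestamp: float  # seconds into the egocentric video
    caption: str      # e.g. "user picks up a bag of single-origin coffee"

@dataclass
class Ego2WebSample:
    video_id: str
    profile: List[VideoEvent]  # structured "video profile"
    instruction: str           # web task grounded in the video
    target_site: str           # e.g. "amazon.com"

def build_sample(
    video_id: str,
    raw_video_path: str,
    sites: List[str],
    caption_video: Callable[[str], List[VideoEvent]],   # MLLM parser (e.g. Qwen3-VL)
    synthesize_task: Callable[..., Tuple[str, str]],    # LLM planner (e.g. GPT-5)
    human_review: Callable[[Ego2WebSample], bool],      # annotator check
) -> Optional[Ego2WebSample]:
    # 1. Visual parsing: raw video -> timestamped captions of objects and actions.
    profile = caption_video(raw_video_path)

    # 2. Task synthesis: an instruction that cannot be solved without the video,
    #    targeting one of the live websites (Amazon, YouTube, ...).
    instruction, target_site = synthesize_task(profile, sites)

    # 3. Human verification: visual grounding, web feasibility, instruction clarity.
    sample = Ego2WebSample(video_id, profile, instruction, target_site)
    return sample if human_review(sample) else None
```

Returning `None` when a candidate fails review mirrors the role of human verification in the paper: cheap automated generation first, with annotators acting as the final filter.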
Figure 1: The Ego2Web task flow—from first-person visual cue to web execution.
Ego2WebJudge: A More Rigorous Evaluator
Standard evaluation methods often rely on final text matching, which is prone to "hallucinated success." The authors introduced Ego2WebJudge, an evaluator (sketched in code after the list) that specifically checks:
- Key-Point Identification: What must be achieved?
- Key Screenshot Selection: Filtering out "noise" actions (like loading pages).
- Visual Consistency: Does the final product page on Amazon actually match the brand and color seen in the 3rd minute of the video?
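Since the judge itself is an LLM, its three-stage structure could be sketched roughly as below; the `llm` callable and the `Verdict` container are illustrative assumptions, not the paper's actual prompts or API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    key_points: List[str]     # what the instruction requires
    key_screenshots: List     # decision-relevant steps after filtering noise
    visually_consistent: bool # final page matches the object seen in the video
    success: bool

def ego2web_judge(
    instruction: str,
    video_profile: List,   # timestamped captions from the egocentric video
    trajectory: List,      # (screenshot, action) pairs from the agent run
    llm: Callable,         # hypothetical judge-LLM call: prompt -> structured answer
) -> Verdict:
    # 1. Key-point identification: what must be achieved for success?
    key_points = llm(f"List the requirements implied by: {instruction}")

    # 2. Key screenshot selection: drop noise (loading pages, blank states),
    #    keep only screenshots that show task-relevant state.
    key_shots = llm(("Select task-relevant screenshots", trajectory))

    # 3. Visual consistency: does the final page match the object from the video
    #    (brand, color, when it appeared), not just the instruction text?
    consistent = bool(llm(("Does the final screenshot match the video profile?",
                           key_shots[-1] if key_shots else None, video_profile)))

    fulfilled = bool(llm(("Are all key points satisfied?", key_points, key_shots)))
    return Verdict(key_points, key_shots, consistent, consistent and fulfilled)
```

The decisive difference from plain answer matching is stage 3: success requires the final page to agree with what the video actually showed, not merely with the instruction text.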
Experiments: A Reality Check for LLM Agents
The researchers tested 6 mainstream agents, including Claude 3.5/4.5 Sonnet, GPT-5.4, and Gemini-3-Flash.
1. The Performance Ceiling
The best-performing agent, Browser-Use (BU) with Gemini-3-Flash, reached only 58.6% success (Human Eval). Other models, such as the Claude and GPT-5 variants, struggled significantly because they often lack direct access to the raw video and rely instead on "lossy" text captions.
2. The Power of "Raw Video" vs. "Captions"
A critical ablation study (Table 5) showed that agents given raw video achieved roughly twice the success rate (48.2% SR) of those given text-based captions (23.6% SR). This shows that text is an insufficient proxy for the rich, dynamic information contained in first-person video.
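Concretely, the ablation varies only what the agent receives as context. A hedged sketch of the two conditions (the `agent.run` interface and the `VideoEvent`-style profile entries below are assumptions, not the paper's API):

```python
def run_with_raw_video(agent, instruction, video_frames):
    # Raw-video condition: the agent's multimodal model sees sampled frames directly,
    # so fine-grained cues (brand, color, which item was picked up when) survive.
    return agent.run(instruction=instruction, context=video_frames)

def run_with_captions(agent, instruction, video_profile):
    # Caption condition: the video is first collapsed into text, a lossy
    # intermediate that drops visual detail and temporal nuance.
    captions = "\n".join(f"[{e.timestamp:.0f}s] {e.caption}" for e in video_profile)
    return agent.run(instruction=instruction, context=captions)
```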
Table 2: Comparison of success rates across major agents and evaluation methods.
Critical Analysis & Future Outlook
The paper identifies Object Misidentification (36%) and Temporal Misunderstanding (18%) as the primary killers of agent success. Agents often get confused about which object was picked up second or third in a sequence.
Limitations & Takeaways
- Sequential Bottleneck: Errors in temporal grounding at the start of a task propagate through the entire web navigation, leading to "compositional failure" (see the toy calculation after this list).
- The "Caption" Trap: Developers should stop relying on intermediate captioning layers and move toward end-to-end video-to-action models.
Conclusion: Ego2Web is a wake-up call for the AI community. If we want agents that live in our pocket or on our face (via AR), they must learn to reconcile the messy, temporal physical world with the structured digital one. The benchmark is a significant step toward truly embodied AI assistants.
