[ICLR 2025] WebNavigator: Curing "Topological Blindness" via Interaction Graph Retrieval
Abstract

WebNavigator is a novel framework that transforms autonomous web navigation from probabilistic trial-and-error into a deterministic Retrieve-Reason-Teleport workflow. By constructing offline Interaction Graphs, it enables agents to achieve SOTA performance, including a 72.9% success rate on WebArena multi-site tasks, doubling the performance of previous enterprise-level agents.

TL;DR

WebNavigator moves beyond the "reactive" paradigm of web agents (like ReAct) by providing them with a persistent Interaction Graph. By reframing navigation as deterministic retrieval and pathfinding rather than probabilistic guessing, it doubles the success rate on complex multi-site tasks (reaching 72.9%) while minimizing token expenditure and interaction steps.

The Core Insight: Topological Blindness

Current SOTA agents often fail not because they lack "reasoning" but because they are Topologically Blind. Imagine trying to find a specific room in a massive, dark skyscraper using only a flashlight: you can see what is right in front of you (local observations), but you have no map of the building (global topology).

Previous paradigms attempted to fix this via:

  1. Online Search (MCTS/BFS): Hyper-expensive in terms of tokens and time; the agent "reinvents the wheel" for every new task.
  2. World Models: High risk of "hallucinating" page transitions that don't exist in reality.

WebNavigator's solution? Give the agent the map before it starts.

Methodology: The Two-Phase Paradigm

Phase I: Building the Mental Map (Offline)

WebNavigator uses a Heuristic Auto-Exploration Engine. Unlike standard crawlers that only follow static links, this engine interacts with dynamic elements (buttons, menus) to capture the full state space.

  • Adaptive BFS: It uses "Structural Differencing" to avoid re-exploring similar pages, focusing only on new DOM elements.
  • Zero-Token Cost: This phase requires no LLM, saving massive costs during the "crawling" stage.
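The exploration idea can be sketched in a few lines. This is a minimal illustration, not the authors' engine: the toy site, the `structural_signature` helper, and the rule "two pages are similar if their tag skeletons match exactly" are all assumptions made for the example. The paper's Structural Differencing is described only as focusing on new DOM elements, so the real comparison is likely finer-grained.

```python
from collections import deque

# Hypothetical toy site: state -> (DOM tag skeleton, {action: next_state}).
TOY_SITE = {
    "home": (("nav", "ul", "a", "a"),
             {"click:products": "list_p1", "click:about": "about"}),
    "list_p1": (("nav", "table", "tr", "tr", "button"),
                {"click:next": "list_p2", "click:item": "detail"}),
    "list_p2": (("nav", "table", "tr", "tr", "button"),
                {"click:next": "list_p1", "click:item": "detail"}),
    "detail": (("nav", "h1", "form", "button"), {"click:home": "home"}),
    "about": (("nav", "p"), {"click:home": "home"}),
}


def structural_signature(dom_tags):
    """Assumed similarity key: the tag skeleton, ignoring all text content."""
    return tuple(dom_tags)


def explore(site, start):
    """BFS that expands a state only if its structure has not been seen yet."""
    graph = {}            # representative state -> {action: next_state}
    representative = {}   # signature -> first state seen with that structure
    alias = {}            # visited state -> its representative
    queue = deque([start])
    visited = {start}
    while queue:
        state = queue.popleft()
        dom_tags, actions = site[state]
        sig = structural_signature(dom_tags)
        if sig in representative:
            # Structurally similar page: fold it into the known state.
            alias[state] = representative[sig]
            continue
        representative[sig] = state
        alias[state] = state
        graph[state] = dict(actions)
        for nxt in actions.values():
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    # Redirect every edge to the representative of its target.
    return {s: {a: alias[t] for a, t in edges.items()}
            for s, edges in graph.items()}


interaction_graph = explore(TOY_SITE, "home")
```

In this toy run the second listing page is folded into the first, so the graph keeps four states instead of five, and no LLM call is involved at any point.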

Phase II: The Global-View Navigator (Online)

When a task begins, the agent follows a Retrieve-Reason-Teleport workflow:

  1. Retrieve: The agent identifies the goal (e.g., "the customer checkout page") and retrieves the Top-K most similar screenshots from the Interaction Graph using fine-grained Multimodal Retrieval (Jina-v4).
  2. Reason: A multimodal LLM (the Selector) acts as a verifier to pick the single best target node.
  3. Teleport: The system computes the shortest path on the graph from the current node to the target and executes it automatically, taking the agent to the target page in one step from its point of view.
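The three steps above can be sketched end to end on a toy graph. Everything here is a stand-in chosen for illustration: the hand-written 3-dimensional vectors replace real screenshot embeddings (the source names Jina-v4 for that role), `select` replaces the multimodal LLM Selector with "take the top hit", and the graph and function names are hypothetical.

```python
import math
from collections import deque

# Hypothetical interaction graph: node -> {action: next_node}.
GRAPH = {
    "home": {"click:account": "account", "click:cart": "cart"},
    "account": {"click:orders": "orders", "click:home": "home"},
    "cart": {"click:checkout": "checkout", "click:home": "home"},
    "orders": {"click:home": "home"},
    "checkout": {},
}

# Stand-in embeddings; a real system would embed page screenshots.
EMBEDDINGS = {
    "home": [1.0, 0.0, 0.0],
    "account": [0.0, 1.0, 0.0],
    "cart": [0.3, 0.0, 0.7],
    "orders": [0.0, 0.8, 0.2],
    "checkout": [0.0, 0.1, 1.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, k):
    """Retrieve: rank nodes by cosine similarity and keep the top k."""
    ranked = sorted(EMBEDDINGS,
                    key=lambda n: cosine(query_vec, EMBEDDINGS[n]),
                    reverse=True)
    return ranked[:k]


def select(candidates):
    """Reason: placeholder for the LLM Selector; here it trusts the ranking."""
    return candidates[0]


def shortest_path(graph, start, goal):
    """Teleport: BFS over unweighted edges, returning the action sequence."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, actions = queue.popleft()
        if node == goal:
            return actions
        for action, nxt in graph[node].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # target unreachable from the current node


def navigate(query_vec, current="home", k=3):
    target = select(retrieve(query_vec, k))
    return target, shortest_path(GRAPH, current, target)


target, actions = navigate([0.0, 0.0, 1.0])
```

Because the graph is unweighted, breadth-first search is enough to find a shortest action sequence; a weighted graph would call for Dijkstra's algorithm instead.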

Overview of WebNavigator Architecture

Proving the Power of Global Visibility

On the WebArena multi-site benchmark, the "Final Boss" of web navigation, WebNavigator achieved a 72.9% success rate. This is more than double the success rate of high-end enterprise systems such as CUGA.

Case Study: Cross-Domain Synergy

Consider a task: "Find the customer's address in the CMS and then check the route on the Map."

  • Reactive Agents: Often get lost in the CMS or fail to bridge the two domains.
  • WebNavigator: Uses the navigate(domain, query) action to instantly jump to the Customer page in the CMS domain, then switches to the Map domain and teleports to the Route page.
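A rough sketch of how a single `navigate(domain, query)` action might dispatch across sites is shown below. Only the action name and its two parameters come from the text; the `MultiSiteNavigator` class, the per-domain graphs, the page labels, and the word-overlap matching (a stand-in for multimodal retrieval) are assumptions for illustration.

```python
from collections import deque


class MultiSiteNavigator:
    """Hypothetical wrapper exposing one navigate(domain, query) action."""

    def __init__(self, graphs, labels, start_nodes):
        self.graphs = graphs              # domain -> {node: {action: next}}
        self.labels = labels              # domain -> {node: description}
        self.position = dict(start_nodes) # domain -> current node

    def _resolve(self, domain, query):
        # Word-overlap matching stands in for screenshot retrieval.
        words = set(query.lower().split())
        nodes = self.labels[domain]
        return max(nodes,
                   key=lambda n: len(words & set(nodes[n].lower().split())))

    def _path(self, domain, start, goal):
        queue, seen = deque([(start, [])]), {start}
        while queue:
            node, actions = queue.popleft()
            if node == goal:
                return actions
            for action, nxt in self.graphs[domain].get(node, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, actions + [action]))
        return None

    def navigate(self, domain, query):
        target = self._resolve(domain, query)
        actions = self._path(domain, self.position[domain], target)
        if actions is None:
            raise LookupError(f"no path to {target!r} in {domain!r}")
        self.position[domain] = target  # each domain keeps its own position
        return target, actions


GRAPHS = {
    "cms": {"dashboard": {"click:customers": "customers"},
            "customers": {"click:row": "customer_detail"},
            "customer_detail": {}},
    "map": {"map_home": {"click:directions": "route"}, "route": {}},
}
LABELS = {
    "cms": {"dashboard": "admin dashboard overview",
            "customers": "customers list table",
            "customer_detail": "customer address detail"},
    "map": {"map_home": "map search home", "route": "route directions"},
}

agent = MultiSiteNavigator(GRAPHS, LABELS,
                           {"cms": "dashboard", "map": "map_home"})
cms_step = agent.navigate("cms", "customer address")
map_step = agent.navigate("map", "route directions")
```

Keeping one graph and one current position per domain means the switch from CMS to Map is just a second call with a different `domain` argument.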

Comparison of Agent Trajectories

Detailed Performance Breakdown

The table below shows that WebNavigator's gains are largest in environments that are hard to navigate reactively, whether shallow but wide (Reddit) or deep and complex (GitLab, CMS), where global visibility keeps the agent out of navigation traps.

Performance Comparison Table

Analysis: Is the Web "Infinite"?

A common counter-argument is that the web is too large to map. The authors counter this by identifying the Topological Skeleton: while content (products, posts) is effectively unbounded, the interaction logic is compact. They report that most websites have a core structure of fewer than 1,000 unique interaction states. By targeting this skeleton, WebNavigator shows that global planning is both feasible and efficient.
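The intuition behind "unbounded content, compact logic" can be illustrated with a toy count. The example below is not the paper's method for counting interaction states; it simply assumes that pages differing only in a numeric ID share one interaction state, and the URL paths and the `state_template` helper are invented for the illustration.

```python
import re


def state_template(url_path):
    """Replace purely numeric path segments with a placeholder."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", url_path)


# 20,003 concrete pages on a hypothetical shop.
paths = [f"/product/{i}" for i in range(10000)]
paths += [f"/product/{i}/reviews" for i in range(10000)]
paths += ["/cart", "/checkout", "/"]

templates = {state_template(p) for p in paths}
print(len(paths), "pages collapse to", len(templates), "templates")
```

Under this assumption the page count grows with the catalogue, while the number of templates stays fixed at five.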

Conclusion & Future Outlook

WebNavigator shifts the burden of navigation from LLM Reasoning (which is fallible and expensive) to Structured Knowledge Retrieval.

Key Takeaways for Developers:

  • Stop asking LLMs to "guess" the next button.
  • Start indexing your GUI as an Interaction Graph.
  • The future of agents lies in Capability Aggregation: specialized tools like navigate handle the "how" so the LLM can focus on the "what."

Limitations: The framework currently relies on a pre-constructed graph. For extremely volatile websites that change structure hourly, incremental update mechanisms (as discussed in the paper) will be critical.
