[KDD 2026] MobilityBench: Stress-Testing LLM Agents Against the Chaos of Real-World Navigation
Abstract

MobilityBench is a comprehensive benchmark designed to evaluate LLM-based route-planning agents in real-world mobility scenarios. It features 100,000 episodes derived from anonymized Amap user queries and utilizes a deterministic API-replay sandbox to ensure reproducible results across various tasks, including point-to-point and preference-constrained routing.

TL;DR

MobilityBench is a new, large-scale benchmark (100k episodes) focused on evaluating how LLM agents handle the "messy" reality of daily human mobility. By using real-world queries from Amap and a unique deterministic API-replay sandbox, it provides a rigorous framework to measure instruction understanding, complex planning, and tool-use efficiency.

Positioning: This is a critical "reality check" for agents, moving from abstract travel planning to high-fidelity, map-constrained decision making.

Problem & Motivation: The Reproducibility Crisis in Agent Evals

The current state of agent evaluation is plagued by a "moving target" problem. If you test a navigation agent today and again tomorrow using live APIs, the results will differ because of traffic changes or API updates. This makes it impossible to tell if a model's improvement is due to better reasoning or just a "lucky" traffic day.

Furthermore, existing benchmarks often ignore the "Preference Gap". Users don't just ask "How do I get to point B?"; they ask "How do I get to point B while avoiding tolls, stopping for coffee, and using the subway?" Current SOTA models frequently trip over these multi-layered constraints.

Methodology: The Architecture of Reliability

The core innovation of MobilityBench is its Deterministic Replay Sandbox.

  1. Episode-centric Design: Each task consists of a query, spatial context, and a frozen snapshot of API responses.
  2. API-Replay: When an agent calls a tool (e.g., driving_planning), the sandbox intercepts the call and returns a cached response. This "freezes" the world state, ensuring fair comparisons between models such as GPT-4 and Qwen3-235B.
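The replay mechanism can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness; the class, field names, and the snapshot format are assumptions made here for clarity.

```python
import hashlib
import json

class ReplaySandbox:
    """Illustrative deterministic API-replay sandbox (hypothetical names).
    Tool calls are keyed by name + arguments and served from a frozen
    snapshot instead of a live API."""

    def __init__(self, snapshot):
        # snapshot: dict mapping call keys to cached API responses
        self.snapshot = snapshot

    @staticmethod
    def _key(tool, args):
        # Canonical key: tool name + args serialized with sorted keys,
        # hashed for compactness. Same call -> same key, every run.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, tool, **args):
        key = self._key(tool, args)
        if key not in self.snapshot:
            # Out-of-snapshot calls fail loudly instead of hitting a live
            # API, which is what keeps runs reproducible.
            raise KeyError(f"No cached response for {tool}({args})")
        return self.snapshot[key]

# Freeze one driving_planning response, then replay it twice:
frozen = {
    ReplaySandbox._key("driving_planning", {"origin": "A", "dest": "B"}):
        {"duration_min": 25, "distance_km": 12.4}
}
sandbox = ReplaySandbox(frozen)
r1 = sandbox.call("driving_planning", origin="A", dest="B")
r2 = sandbox.call("driving_planning", origin="A", dest="B")
assert r1 == r2  # identical across runs: the world state is frozen
```

Because every response is served from the snapshot, two agents evaluated a month apart see exactly the same "traffic", removing the moving-target problem described above.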

(Figure: MobilityBench workflow)

The authors split tasks into four high-level families:

  • Basic Retrieval: "Where is the nearest gas station?"
  • Route-Dependent Info: "How long is the commute to the airport?"
  • Basic Planning: Point A to Point B.
  • Preference-Constrained Planning: The "Hard Mode" involving specific options like "avoiding highways" or "fewer transfers."
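Under the episode-centric design described above, a task in any of these families can be thought of as one self-contained record. The sketch below is a hypothetical schema; the field names are assumptions for illustration, not the benchmark's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """Hypothetical episode record mirroring the paper's episode-centric
    design: query + spatial context + frozen API snapshot."""
    query: str            # natural-language user request
    task_family: str      # one of the four families above
    spatial_context: dict # e.g., user location, city
    api_snapshot: dict    # frozen tool responses for deterministic replay
    constraints: list = field(default_factory=list)  # preferences, if any

# A "Hard Mode" preference-constrained episode might look like:
ep = Episode(
    query="Take me to the airport by subway, fewer transfers",
    task_family="preference_constrained_planning",
    spatial_context={"city": "Shanghai", "origin": [121.47, 31.23]},
    api_snapshot={},  # would hold the frozen tool responses
    constraints=["mode=subway", "minimize_transfers"],
)
```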

Experiments: ReAct vs. Plan-and-Execute

The study compared two dominant agent paradigms:

  • ReAct: A "Think-Act-Observe" loop.
  • Plan-and-Execute: Create a full plan first, then fire off tools.

Key Insight: ReAct generally achieves a higher Final Pass Rate (FPR) because it can self-correct when a tool returns an unexpected result (e.g., "Road Closed"). However, this robustness comes at a high Efficiency Cost—ReAct consumes ~35% more input tokens due to its growing conversational history.
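The token-cost asymmetry follows directly from the loop structure: ReAct resends the entire transcript every step. A toy version of the loop (illustrative only; the model and tool stubs are hypothetical, not the paper's harness):

```python
def react_loop(llm, tools, query, max_steps=8):
    """Toy ReAct loop. The full history is fed back to the model each
    turn, which is why input-token cost grows with trajectory length."""
    history = [f"Task: {query}"]
    for _ in range(max_steps):
        # Think + act: the model sees the entire history every turn.
        action = llm("\n".join(history))
        if "final" in action:
            return action["final"], history
        # Observe: the tool result is appended, so the agent can
        # self-correct on surprises like "Road Closed".
        obs = tools[action["tool"]](**action["args"])
        history.append(f"Action: {action}")
        history.append(f"Observation: {obs}")
    return None, history

# Stub model: call one tool, then answer.
calls = {"n": 0}
def stub_llm(prompt):
    calls["n"] += 1
    if calls["n"] == 1:
        return {"tool": "driving_planning",
                "args": {"origin": "A", "dest": "B"}}
    return {"final": "Route found: 25 min"}

tools = {"driving_planning": lambda origin, dest: {"duration_min": 25}}
answer, trace = react_loop(stub_llm, tools, "A to B")
```

A Plan-and-Execute agent would instead call the model once for a full plan and fire the tool calls without re-reading the transcript, trading that self-correction ability for a much shorter (cheaper) context.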

(Figure: performance comparison of agent paradigms)

The "Thinking" Multiplier

The authors also tested "Thinking" models (like DeepSeek-R1 or Qwen-Thinking). Enabling an internal chain-of-thought consistently improved performance across the board. For Qwen-30B-A3B, "Thinking" increased the success rate by nearly 6 percentage points, though it significantly increased latency.

(Figure: Thinking vs. non-Thinking performance)

Deep Insight: Why Do Agents Fail?

The multi-dimensional evaluation reveals a specific bottleneck: Preference-Constrained Planning. Even the best models often ignore subtle user instructions (e.g., "avoiding Inner Ring Elevated Road") because they tend to default to the most "likely" path provided by the routing engine, failing to strictly enforce the user's semantic constraints over the tool's output.
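One remedy this failure mode suggests is post-hoc constraint enforcement: rather than trusting the routing engine's default, the agent verifies the returned route against the user's stated constraints before answering. A hedged sketch (the route fields and constraint encoding here are assumptions for illustration):

```python
def violated_constraints(route, constraints):
    """Check a returned route against user constraints; return the
    constraints it breaks. Field names are illustrative."""
    road_names = {seg["road"] for seg in route["segments"]}
    violations = []
    for c in constraints:
        # "avoid_road:<name>" -> the named road must not appear.
        if c.startswith("avoid_road:") and c.split(":", 1)[1] in road_names:
            violations.append(c)
        # "avoid_tolls" -> the route must be toll-free.
        if c == "avoid_tolls" and route.get("toll_yuan", 0) > 0:
            violations.append(c)
    return violations

# The routing engine's "most likely" path uses the forbidden road:
route = {
    "segments": [{"road": "Inner Ring Elevated Road"},
                 {"road": "Yan'an Rd"}],
    "toll_yuan": 0,
}
bad = violated_constraints(
    route, ["avoid_road:Inner Ring Elevated Road", "avoid_tolls"])
# A robust agent would reject this route and re-query the tool with
# explicit avoidance parameters instead of passing it to the user.
```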

Conclusion & Future Look

MobilityBench shows that while we are close to having competent "Retrieval Assistants," we are still far from having reliable "Mobility Agents" that can navigate the nuances of human preference. The release of this benchmark (and its 100k episodes) provides the community with the scale needed to fine-tune models specifically for high-stakes, real-world navigation.

Check out the toolkit at: https://github.com/AMAP-ML/MobilityBench
