KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation


[CVPR 2026] KnowU-Bench: Why Your Mobile Agent Can Navigate, But Still Doesn't "Know" You


Abstract

KnowU-Bench is a novel online evaluation framework for personalized and proactive mobile agents, featuring a reproducible Android emulation environment. It introduces 192 tasks across general, personalized, and proactive categories, utilizing an LLM-driven user simulator to evaluate real-time preference elicitation and decision-making beyond simple instruction following.

TL;DR

While modern Large Multimodal Models (LMMs) have mastered the art of clicking buttons on a smartphone, they are still remarkably "socially inept" as personal assistants. KnowU-Bench is a new benchmark that exposes this gap, shifting the focus from mere instruction following to interactive personalization and proactive assistance. It reveals that even frontier models like Claude Sonnet 4.6 fail more than half the time when instructions are vague or when the task requires knowing when to stay silent.

The "Competency-Intelligence" Gap

Most existing benchmarks (e.g., AndroidWorld) provide explicit goals: "Open Spotify and play Jazz." But real users are messy. They say things like "Order me lunch," expecting the agent to know about their peanut allergy, their $20 budget, and their preference for Tuantuan over Meituan.

The authors identify three fatal flaws in current research:

Static Personalization: Treating user history as a fixed text prompt rather than an evolving interaction.
Lack of Elicitation: Models don't know how to ask, "Hey, do you want the usual coffee, or are we trying something new today?"
Miscalibrated Proactivity: Agents either do nothing when they should help (False Passivity) or, worse, start performing sensitive tasks without consent (Unwarranted Intervention).

Methodology: The Hidden Profile & User Simulator

KnowU-Bench introduces a sophisticated "Two-Agent" setup to simulate human-AI partnership:

The GUI Agent: Has access to visual screenshots and a User Activity Log (historical behavior). It does not see the user's explicit preference profile.
The User Simulator (LLM-driven): Holds the Hidden Profile (habits, social graph, dietary constraints). It acts as the "ground truth" human, responding to the agent's questions and judging its proactive moves.
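
The information split between the two agents can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the class names, the `preferences` field, and the dictionary lookup that stands in for the LLM call are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HiddenProfile:
    """Preferences only the user simulator can read (hypothetical fields)."""
    preferences: dict[str, str] = field(default_factory=dict)


@dataclass
class AgentObservation:
    """What the GUI agent sees: a screenshot and the activity log, never the profile."""
    screenshot: bytes
    activity_log: list[str]


class UserSimulator:
    """Stand-in for the LLM-driven simulator; a dict lookup replaces the LLM call."""

    def __init__(self, profile: HiddenProfile) -> None:
        self._profile = profile

    def answer(self, topic: str) -> str:
        # A real simulator would prompt an LLM with the profile; here we just look it up.
        return self._profile.preferences.get(topic, "No strong preference.")


def build_observation(screenshot: bytes, activity_log: list[str]) -> AgentObservation:
    # The profile is deliberately not a parameter: the agent must infer or ask.
    return AgentObservation(screenshot=screenshot, activity_log=list(activity_log))


profile = HiddenProfile(
    preferences={"lunch_budget": "Keep it under $20.", "allergy": "No peanuts."}
)
simulator = UserSimulator(profile)
obs = build_observation(b"", ["ordered lunch via delivery app", "opened calendar"])
```

The only way for the agent to learn about the allergy in this sketch is to call `simulator.answer("allergy")`, which mirrors the elicitation behavior the benchmark is designed to test.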

Figure 1: Overall architecture. The KnowU-Bench framework couples the Android emulation environment with a profile-grounded User Simulator.

The Proactive Decision Chain

The benchmark measures whether an agent can correctly navigate the Full Proactive Chain:

Trigger: Recognizing a situation (e.g., "User is late for a meeting").
Strategy Selection: Should I execute silently? Ask for consent? Or stay silent?
Restraint: If the user says "No," does the agent stop, or does it stubbornly keep trying (Post-Rejection Violation)?
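
The three stages of the chain can be illustrated with a toy policy and a restraint check. The sketch below is an assumption-laden example: the `Strategy` names, the `sensitive` flag, and the event-string format (`"user_rejected"`, `"agent_action:..."`) are hypothetical and are not taken from the benchmark.

```python
from enum import Enum


class Strategy(Enum):
    EXECUTE_SILENTLY = "execute_silently"
    ASK_CONSENT = "ask_consent"
    STAY_SILENT = "stay_silent"


def select_strategy(triggered: bool, sensitive: bool) -> Strategy:
    """Toy policy: no trigger means stay silent; sensitive actions need consent."""
    if not triggered:
        return Strategy.STAY_SILENT
    return Strategy.ASK_CONSENT if sensitive else Strategy.EXECUTE_SILENTLY


def is_post_rejection_violation(events: list[str]) -> bool:
    """Return True if any agent action is logged after the user's rejection."""
    if "user_rejected" not in events:
        return False
    after_rejection = events[events.index("user_rejected") + 1:]
    return any(e.startswith("agent_action") for e in after_rejection)
```

For example, the trace `["agent_ask", "user_rejected", "agent_action:tap"]` counts as a Post-Rejection Violation under this check, while `["agent_ask", "user_rejected"]` does not.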

Experimental Analysis: A Reality Check

The results are a "wake-up call" for the industry. While models like MAI-UI-8B achieve 100% success on simple General tasks, their performance collapses in Personalized and Proactive settings.

Table 1: Performance comparison. Success Rate (SR) declines sharply from the General split to the Personalized and Proactive splits.

Key Insight 1: Preference Acquisition is the New Bottleneck. For Claude Sonnet 4.6, 66.7% of personalized failures were "Clarification Errors." Models aren't failing because they can't find the "Order" button; they fail because they don't realize they need to ask the user a question first.

Key Insight 2: Role Dependence Matters. Agents perform significantly worse when acting for a "Grandma" role than for a "Researcher" role. This suggests that models carry a training bias toward tech-savvy personas and fail the "digital inclusion" test for less tech-savvy users.

Deep Insight: Proactive Safety

The study categorizes proactive policy success into Act (helping when needed), Silent (not disturbing), and Stop (obeying rejection).

Claude Sonnet 4.6 is the most balanced.
Qwen3.5-397B is "overly cautious"—it excels at staying silent but fails to help even when the routine is obvious.

Figure 2: Proactive metrics. Act, Silent, and Stop rates across different models.
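
One plausible way to compute the three rates from per-episode records is sketched below. The definitions are assumptions for illustration (the paper's exact formulas may differ): Act is measured over episodes where help was warranted, Silent over episodes where it was not, and Stop over episodes where the user rejected the agent. The `Episode` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    should_act: bool                      # ground truth: was help warranted?
    agent_acted: bool                     # did the agent intervene?
    user_rejected: bool = False           # did the user say "no"?
    continued_after_rejection: bool = False


def _rate(hits: int, total: int) -> Optional[float]:
    # None signals "undefined" when there are no episodes of that kind.
    return hits / total if total else None


def proactive_rates(episodes: list[Episode]) -> dict[str, Optional[float]]:
    need_help = [e for e in episodes if e.should_act]
    no_help = [e for e in episodes if not e.should_act]
    rejected = [e for e in episodes if e.user_rejected]
    return {
        "act": _rate(sum(e.agent_acted for e in need_help), len(need_help)),
        "silent": _rate(sum(not e.agent_acted for e in no_help), len(no_help)),
        "stop": _rate(
            sum(not e.continued_after_rejection for e in rejected), len(rejected)
        ),
    }


episodes = [
    Episode(should_act=True, agent_acted=True),
    Episode(should_act=True, agent_acted=False),   # False Passivity
    Episode(should_act=False, agent_acted=False),
    Episode(should_act=False, agent_acted=True),   # Unwarranted Intervention
    Episode(True, True, user_rejected=True, continued_after_rejection=True),
    Episode(True, True, user_rejected=True),
]
rates = proactive_rates(episodes)
```

Under these definitions, an "overly cautious" model shows a high Silent rate alongside a low Act rate, which is the pattern described above for Qwen3.5-397B.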

Critical Analysis & Conclusion

KnowU-Bench raises the bar for mobile agent evaluation. It shows that LLM-as-a-Judge, when combined with rule-based environment checks, is a viable way to evaluate the "soft skills" of AI agents.
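
A hybrid evaluator of this kind can be sketched as a deterministic state check followed by a judge callback. This is an illustrative assumption about how the two signals might be combined (a simple logical AND), not the benchmark's actual scoring code; `keyword_judge` is a hypothetical placeholder where an LLM judge would be called.

```python
from typing import Callable, Mapping


def rule_check(final_state: Mapping[str, str], expected: Mapping[str, str]) -> bool:
    """Deterministic check: every expected key must match the environment state."""
    return all(final_state.get(key) == value for key, value in expected.items())


def hybrid_success(
    final_state: Mapping[str, str],
    expected: Mapping[str, str],
    transcript: str,
    judge: Callable[[str], bool],
) -> bool:
    # Run the cheap rule check first; the judge is skipped if the state is wrong.
    return rule_check(final_state, expected) and judge(transcript)


def keyword_judge(transcript: str) -> bool:
    """Placeholder for an LLM judge: rejects transcripts that mention peanuts."""
    return "peanut" not in transcript.lower()
```

Running the rule check first keeps the evaluation reproducible where it can be, and reserves the judge for aspects such as preference compliance that the environment state alone cannot capture.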

Limitations: The "noisy logs" used in the benchmark are relatively simple (25% irrelevant events). In hardware-constrained real-world scenarios, user history is likely to be much noisier, requiring even more robust retrieval-augmented generation (RAG) strategies.
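
To make the 25% figure concrete, the sketch below injects distractor events into an activity log so that they make up a chosen fraction of the final log. The function name, the event strings, and the insertion scheme are assumptions for illustration and do not describe how the benchmark actually builds its logs.

```python
import random


def inject_noise(
    log: list[str],
    distractors: list[str],
    noise_ratio: float = 0.25,
    seed: int = 0,
) -> list[str]:
    """Insert distractors so they form roughly `noise_ratio` of the final log."""
    if not 0 <= noise_ratio < 1:
        raise ValueError("noise_ratio must be in [0, 1)")
    rng = random.Random(seed)  # seeded for reproducibility
    # Solve n_noise / (len(log) + n_noise) = noise_ratio for n_noise.
    n_noise = round(len(log) * noise_ratio / (1 - noise_ratio))
    noisy = list(log)
    for _ in range(n_noise):
        # Random position keeps the relative order of the original events.
        noisy.insert(rng.randint(0, len(noisy)), rng.choice(distractors))
    return noisy


relevant = ["order:coffee", "order:coffee", "order:tea",
            "order:coffee", "order:coffee", "order:coffee"]
noisy_log = inject_noise(relevant, ["weather:check", "game:open"])
```

Raising `noise_ratio` in a setup like this is one way to probe how quickly an agent's preference inference degrades as the history gets noisier.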

The Takeaway: To transition from "GUI Operators" to "Trustworthy Assistants," agents need more than better OCR or faster inference. They need Epistemic Humility—the ability to recognize what they don't know about a user and the wisdom to ask before they act.
