Implicit Intelligence -- Evaluating Agents on What Users Don't Say

[Applied ML 2026] Implicit Intelligence: Why AI Agents Fail at What You *Don't* Say

Abstract

The paper introduces "Implicit Intelligence," a benchmark designed to evaluate AI agents on their ability to infer and satisfy unstated user requirements. It features "Agent-as-a-World" (AaW), a simulator where LLMs act as deterministic world models using YAML-defined environments, and benchmarks 16 frontier models across 205 complex scenarios.

TL;DR

The next frontier for AI agents isn't better tool-use or longer planning—it's Implicit Intelligence. A new study from Labelbox researchers reveals that even our strongest frontier models (like GPT-5.2-pro) fail over 50% of the time when faced with tasks that require inferring unstated constraints. By introducing the Agent-as-a-World (AaW) framework, they've exposed a critical weakness: agents are excellent "instruction followers" but poor "goal fulfillers."

The "Literalist" Problem: Why SOTA Models are Failing

Most LLM benchmarks (like SWE-bench or WebArena) provide a "Ground Truth" that is fully specified. The agent succeeds if it does exactly what the prompt says. However, in the real world, a request like "Delete old documents to free up space" implicitly means "Don't touch my active projects or files without backups."

The authors argue that current models suffer from "Specification Gaming." They optimize for the literal prompt while blindly violating safety, privacy, and accessibility boundaries that any reasonable human would assume.

Methodology: Agent-as-a-World (AaW)

To test this, the researchers moved away from heavy, hard-coded simulators. Instead, they proposed Agent-as-a-World, where the environment is defined in a YAML file and simulated by a high-consistency LLM (Claude Opus 4.5).

The Framework Architecture

The environment is not just a text prompt; it’s a dynamic system with:

Entities & State: Apps and services with real-time variables.
Execution Rules: Hidden logic (e.g., "Toggling mono audio pauses playback") that the agent must discover.
Evaluation Rubric: Hard, binary checks on the final world state.
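The three components above can be sketched in a few lines of Python. This is only an illustration of the idea: in the paper the environment is written in YAML and simulated by an LLM, whereas here the `Scenario` class, the `act` and `evaluate` methods, and the state keys are hypothetical names, and the hidden rule is hard-coded rather than simulated.

```python
from dataclasses import dataclass
from typing import Any, Callable

State = dict[str, Any]


@dataclass
class Scenario:
    """Hypothetical in-memory stand-in for a YAML-defined environment."""
    state: State                                  # entities and their variables
    rules: list[Callable[[State, str], None]]     # hidden execution rules
    rubric: dict[str, Callable[[State], bool]]    # binary checks on the final state

    def act(self, action: str) -> None:
        """Apply an action, then let every hidden rule react to it."""
        if action == "toggle_mono_audio":
            self.state["mono_audio"] = not self.state["mono_audio"]
        for rule in self.rules:
            rule(self.state, action)

    def evaluate(self) -> dict[str, bool]:
        """Run each rubric check against the final world state."""
        return {name: check(self.state) for name, check in self.rubric.items()}


def mono_pauses_playback(state: State, action: str) -> None:
    # Hidden rule from the article's example: toggling mono audio pauses playback.
    if action == "toggle_mono_audio":
        state["playing"] = False


def build_scenario() -> Scenario:
    return Scenario(
        state={"mono_audio": False, "playing": True},
        rules=[mono_pauses_playback],
        rubric={
            "mono_audio_enabled": lambda s: s["mono_audio"] is True,
            "playback_still_running": lambda s: s["playing"] is True,
        },
    )


scenario = build_scenario()
scenario.act("toggle_mono_audio")   # the literal request, and nothing more
print(scenario.evaluate())
# {'mono_audio_enabled': True, 'playback_still_running': False}
```

An agent that stops after the toggle satisfies the stated request but fails the second check, because it never noticed the side effect of its own action.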

[Figure: Agent-as-a-World architecture]

The challenge lies in "Discoverability." The constraint isn't in the prompt; it's hidden in the environment. For example, if an agent is told to "set an alarm," it should first check if one already exists to avoid duplication—an Implicit Reasoning task.
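The alarm example can be made concrete with a minimal sketch that contrasts a literal agent with one that queries the state first. The `AlarmApp` class and its methods are hypothetical and stand in for whatever tools the benchmark actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmApp:
    """Hypothetical alarm service; the name and methods are illustrative."""
    alarms: list[str] = field(default_factory=list)

    def list_alarms(self) -> list[str]:
        return list(self.alarms)

    def create_alarm(self, time: str) -> None:
        self.alarms.append(time)


def set_alarm_literal(app: AlarmApp, time: str) -> None:
    # Literal reading of "set an alarm": always create a new one.
    app.create_alarm(time)


def set_alarm_with_query(app: AlarmApp, time: str) -> bool:
    """Query the existing state first; create the alarm only if it is missing."""
    if time in app.list_alarms():
        return False  # already set, nothing to do
    app.create_alarm(time)
    return True


literal_app = AlarmApp(alarms=["07:00"])
set_alarm_literal(literal_app, "07:00")
print(literal_app.alarms)   # ['07:00', '07:00']  (duplicate)

careful_app = AlarmApp(alarms=["07:00"])
set_alarm_with_query(careful_app, "07:00")
print(careful_app.alarms)   # ['07:00']
```

Both agents "set an alarm," but only the second leaves the world in the state the user presumably wanted.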

Experimental Battleground: 16 Models, 205 Traps

The researchers tested everything from GPT-4.1 to GPT-5.2-pro and various open-weight models. The results were a wake-up call for the industry.

Key Findings:

The Ceiling is Low: No model passed even half of the scenarios. GPT-5.2-pro led with an SPR (scenario pass rate) of 48.3%.
Scaling != Intelligence: GPT-5 outperformed its successor, GPT-5.1. This suggests that "Implicit Intelligence" may depend on specific fine-tuning or alignment recipes rather than on model size or release order alone.
Open-weight Struggle: Open-weight models like Llama 4 and DeepSeek struggled immensely with Catastrophic Risk, often executing dangerous deletions without a second thought.
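To put the headline number in perspective, here is a small sketch of how a pass rate over binary rubrics could be computed. It assumes that a scenario passes only if every rubric check passes and that SPR is a simple fraction over scenarios; the paper's exact aggregation (for example, averaging over repeated runs) may differ. The function names and the sample results are made up for illustration.

```python
def scenario_passed(checks: dict[str, bool]) -> bool:
    """Assumption: a scenario passes only if every binary check passes."""
    return all(checks.values())


def pass_rate(results: list[dict[str, bool]]) -> float:
    """Fraction of scenarios in which all rubric checks passed."""
    if not results:
        raise ValueError("no scenario results")
    return sum(scenario_passed(r) for r in results) / len(results)


# Made-up results for three scenarios, not data from the paper.
sample = [
    {"mono_audio_enabled": True, "balance_centered": False},
    {"alarm_set": True, "no_duplicate_alarm": True},
    {"space_freed": True, "active_projects_untouched": False},
]
print(round(pass_rate(sample), 3))   # 0.333

# Under the same assumption, 48.3% of 205 scenarios is about 99 passes.
print(round(0.483 * 205))            # 99
```

Under this all-or-nothing scoring, a single missed implicit requirement fails the whole scenario, which is one reason the ceiling looks so low.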

[Table: main results]

Deep Dive: Why Do They Fail?

The paper identifies three systematic failure patterns:

Insufficient Exploration: Agents act on the initial state without "looking around." For instance, they'll change local phone settings for captions while the audio is actually being AirPlayed to a TV.
Incomplete Configuration: When asked to set up "shared AirPods," 89% of models enabled Mono Audio, but only 11% remembered to center the audio balance—a crucial step for a shared experience.
State Preservation: Agents often fail to "revert" temporary changes, like leaving a phone on "Do Not Disturb" long after a meeting has ended.
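The state preservation pattern maps naturally onto a scoped change that is always reverted. Below is a minimal sketch using a Python context manager; the `temporary_setting` helper and the `phone` dictionary are hypothetical and only illustrate the revert step that agents tend to skip.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def temporary_setting(state: dict[str, Any], key: str, value: Any) -> Iterator[None]:
    """Apply a setting for the duration of a block, then restore the old value."""
    previous = state[key]
    state[key] = value
    try:
        yield
    finally:
        # Runs even if the block raises, so the change never outlives its purpose.
        state[key] = previous


phone = {"do_not_disturb": False}

with temporary_setting(phone, "do_not_disturb", True):
    during_meeting = phone["do_not_disturb"]   # True while the meeting runs

after_meeting = phone["do_not_disturb"]        # restored to False
print(during_meeting, after_meeting)           # True False
```

An agent has no `finally` clause to rely on, so it has to remember the original value itself and schedule the revert as part of the task.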

Critical Insight: The "Thinking" Paradox

Does giving models more "Extended Thinking" (CoT/Reasoning) help? Surprisingly, not really. In many cases, more thinking degraded performance. The authors hypothesize that while thinking helps with logic, it doesn't necessarily improve social or contextual intuition. Models might "overthink" and justify a literal interpretation of a prompt rather than pausing to ask, "Is there something I'm missing?"

Conclusion: From Tools to Teammates

The "Implicit Intelligence" benchmark moves the goalposts for AI. Success in 2026 and beyond won't be measured by how many APIs an agent can call, but by its ability to perform Proactive State Queries and Verification of Effective Outcomes.

As the authors conclude: "We must build agents that understand what users mean, not just what they say."

Editor's Note: This research highlights that as LLMs get 'smarter' at math and coding, their 'common sense' in agentic environments remains a fragile frontier.
