How Well Does Agent Development Reflect Real-World Work?


[CMU/Stanford] How Well Does Agent Development Reflect Real-World Work?


Abstract

This paper presents a systematic study on the alignment between AI agent benchmarks and the real-world U.S. labor market by mapping 43 benchmarks (72,342 tasks) to O*NET occupational taxonomies. The researchers reveal significant mismatches, demonstrating that current development is overly programming-centric while underrepresenting high-value sectors like Management and Legal.

Executive Summary

TL;DR: Researchers from CMU and Stanford ran a "reality check" on AI agent development. By mapping 43 major benchmarks against the U.S. O*NET labor database, they found that agent development is heavily skewed toward software engineering and largely ignores large, highly digitized, and economically valuable sectors such as Management and Legal. They also propose a measure of agent autonomy, a "complexity ceiling" beyond which performance drops, and show that agents currently fail quickly once tasks grow beyond simple, self-contained steps.

Positioning: This is a foundational "evaluation of evaluations" (Meta-Evaluation) that shifts the question from "can agents code?" to "do agents matter for the economy?"

The "Convenience Bias" in Agent Research

The core insight of the paper is that agent research is currently driven by methodological convenience rather than economic impact.

Current benchmarks are obsessed with "Computer and Mathematical" tasks. Why? Because code is easy to verify, its rewards are deterministic, and the environment (a terminal or IDE) is clean. However, this domain represents only 7.6% of total U.S. employment.

Meanwhile, domains like Management and Legal—which are over 70% digitized and hold significantly more economic capital—are virtually invisible in current leaderboards. This mismatch creates a "bubble" where agents appear highly capable in a vacuum but lack the skills (like interpersonal coordination or long-horizon information gathering) required for the broader labor market.

Methodology: Mapping Agents to O*NET

To bridge this gap, the authors aligned benchmark tasks (like those from WebArena or SWE-bench) with two taxonomies:

Domain-Based: Job families (e.g., Business, Legal).
Skill-Based: General work activities (e.g., "Interacting with others" vs "Work output").

[Figure: Mapping Framework]

By weighting these mappings with data from the U.S. Bureau of Labor Statistics, they could finally visualize the "White Space" where AI agents should be working but aren't.
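The weighting step can be sketched as a comparison of shares per domain. The snippet below is a minimal illustration, not the authors' pipeline: `representation_gap` is a hypothetical helper, and every number except the 7.6% employment share quoted earlier is a placeholder, not a BLS figure.

```python
from collections import Counter


def representation_gap(task_domains, employment_share):
    """Return {domain: benchmark_share - employment_share}.

    Positive values mean the domain is over-represented in benchmarks
    relative to employment; negative values mark under-covered domains.
    """
    counts = Counter(task_domains)
    total = sum(counts.values())
    gaps = {}
    for domain, share in employment_share.items():
        # Counter returns 0 for domains that no benchmark task maps to.
        benchmark_share = counts[domain] / total if total else 0.0
        gaps[domain] = benchmark_share - share
    return gaps


# One domain label per benchmark task (placeholder data).
task_domains = ["Computer and Mathematical"] * 9 + ["Management"]

# Only the 0.076 value comes from the text; the rest are placeholders.
employment_share = {
    "Computer and Mathematical": 0.076,
    "Management": 0.20,
    "Legal": 0.05,
}

gaps = representation_gap(task_domains, employment_share)
for domain, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{domain}: {gap:+.3f}")
```

Domains with a negative gap are the "White Space" candidates. In practice the weights could be employment, wages, or another labor statistic, depending on what notion of economic value is intended.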

The Autonomy Spectrum: When Do Agents Break?

The paper moves beyond the binary "Success/Failure" metric. The authors define Autonomy as the maximum Task Complexity an agent can handle with an 80%+ success rate.
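One way to read this definition is sketched below. It assumes each evaluated task carries an integer complexity level and a pass/fail outcome, and it treats the ceiling as the highest level reached before the success rate first falls below the threshold. The function name `autonomy_level` and the sample data are hypothetical, and the paper's exact aggregation may differ.

```python
from collections import defaultdict


def autonomy_level(results, threshold=0.8):
    """Return the highest complexity level the agent sustains.

    results: iterable of (complexity_level, succeeded) pairs.
    Levels are scanned in increasing order; the scan stops at the first
    level whose success rate is below `threshold`. Returns None if even
    the lowest level fails the threshold (or if there are no results).
    """
    by_level = defaultdict(list)
    for level, succeeded in results:
        by_level[level].append(bool(succeeded))

    ceiling = None
    for level in sorted(by_level):
        outcomes = by_level[level]
        rate = sum(outcomes) / len(outcomes)
        if rate < threshold:
            break
        ceiling = level
    return ceiling


# Placeholder outcomes: perfect at level 1, 80% at level 2, 20% at level 3.
results = (
    [(1, True)] * 5
    + [(2, True)] * 4 + [(2, False)]
    + [(3, True)] + [(3, False)] * 4
)
print(autonomy_level(results))
```

Stopping at the first failing level is a design choice: it prevents a lucky high-complexity bucket with few samples from inflating the reported ceiling.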

They use Workflow Induction to break down a long trajectory (e.g., 50 clicks/keystrokes) into semantic, goal-directed steps (e.g., "Authenticate phone", "Retrieve API docs").
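The target representation can be illustrated with a toy example. This is not the paper's induction procedure: it assumes every low-level action has already been tagged with a goal label by some upstream annotator (hypothetical here), and it only shows how a long action trace collapses into a short sequence of goal-directed steps.

```python
from itertools import groupby


def collapse_to_steps(actions):
    """Group consecutive actions that share a goal label.

    actions: list of (goal_label, raw_action) pairs in execution order.
    Returns a list of (goal_label, [raw_action, ...]) steps.
    """
    return [
        (goal, [raw for _, raw in group])
        for goal, group in groupby(actions, key=lambda pair: pair[0])
    ]


# Hypothetical trace; the goal labels reuse the examples from the text.
trace = [
    ("Authenticate phone", "click 'Sign in'"),
    ("Authenticate phone", "type verification code"),
    ("Authenticate phone", "click 'Confirm'"),
    ("Retrieve API docs", "open browser"),
    ("Retrieve API docs", "search 'API reference'"),
]

steps = collapse_to_steps(trace)
for goal, raw_actions in steps:
    print(f"{goal}: {len(raw_actions)} low-level actions")
```

Here five raw actions reduce to two semantic steps, so downstream analysis can work with goal-level steps instead of raw clicks and keystrokes.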

[Figure: Workflow Induction Example]

Findings on Autonomy:

The Complexity Ceiling: In almost every domain, success rates plummet once complexity crosses a certain threshold (usually between levels 6 and 10).
Skill Gaps: Agents are surprisingly good at "Work Output" (doing things) but terrible at "Information Input" (finding the right thing to do) and "Interaction" (coordinating with others).
Frameworks Matter: While Claude-3.5 generally outperforms GPT-4o in medium complexity coding, both fail as the task becomes a "long-horizon" problem.

[Figure: Agent Autonomy Results]

Critical Insight: Realism vs. Synthesis

One of the paper’s most striking critiques is the quality of task synthesis. Many modern benchmarks (like ColBench) use LLMs to generate high volumes of tasks. However, the analysis shows these synthesized tasks are often contextually shallow, mapping to fewer real-world domains and skills than human-annotated tasks (like those in TheAgentCompany).
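A simple way to quantify "contextually shallow" is to count how many distinct domains and skills each task maps to. The sketch below assumes tasks have already been annotated with domain and skill labels. The helper `mean_breadth` and both sample task lists are hypothetical and do not reproduce data from ColBench or TheAgentCompany.

```python
def mean_breadth(tasks):
    """Return (mean distinct domains, mean distinct skills) per task.

    tasks: list of dicts with "domains" and "skills" label lists.
    """
    if not tasks:
        return 0.0, 0.0
    n = len(tasks)
    mean_domains = sum(len(set(t["domains"])) for t in tasks) / n
    mean_skills = sum(len(set(t["skills"])) for t in tasks) / n
    return mean_domains, mean_skills


# Placeholder annotations for an LLM-synthesized benchmark.
synthetic_tasks = [
    {"domains": ["Computer and Mathematical"], "skills": ["Work output"]},
    {"domains": ["Computer and Mathematical"], "skills": ["Work output"]},
]

# Placeholder annotations for a human-authored benchmark.
human_tasks = [
    {
        "domains": ["Computer and Mathematical", "Management"],
        "skills": ["Work output", "Interacting with others", "Information input"],
    },
    {"domains": ["Business"], "skills": ["Work output", "Information input"]},
]

print("synthetic:", mean_breadth(synthetic_tasks))
print("human:", mean_breadth(human_tasks))
```

Whether breadth is measured per task, as here, or per benchmark is an assumption of this sketch.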

If we keep training agents on "toy" synthetic tasks that only require a single skill, they will never generalize to the multifaceted nature of a real manager’s or lawyer’s workday.

Future Outlook: Three Principles for Benchmarking

The authors conclude with a manifesto for the next generation of AI benchmarks:

Coverage: Target the "Capital-Heavy" domains (Management, Legal).
Realism: Stop using templates; ground tasks in realistic domain/skill compositions.
Granular Evaluation: Use intermediate checkpoints derived from human workflows rather than just checking the "Final Answer."
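The third principle can be sketched as checkpoint-based scoring. The example assumes the environment exposes a final state dictionary and that each checkpoint is a named predicate over that state. `checkpoint_score`, the state keys, and the checkpoint names are all hypothetical.

```python
def checkpoint_score(state, checkpoints):
    """Score a run by the fraction of intermediate checkpoints it passed.

    state: dict describing what the agent accomplished.
    checkpoints: list of (name, predicate) pairs, predicate(state) -> bool.
    Returns (fraction_passed, names_of_failed_checkpoints).
    """
    failed = [name for name, check in checkpoints if not check(state)]
    if not checkpoints:
        return 0.0, failed
    fraction = (len(checkpoints) - len(failed)) / len(checkpoints)
    return fraction, failed


# Placeholder run: the agent finished two of three workflow steps.
state = {"logged_in": True, "docs_opened": True, "report_sent": False}

checkpoints = [
    ("authenticated", lambda s: s["logged_in"]),
    ("docs retrieved", lambda s: s["docs_opened"]),
    ("report delivered", lambda s: s["report_sent"]),
]

fraction, failed = checkpoint_score(state, checkpoints)
print(f"passed {fraction:.0%} of checkpoints; failed: {failed}")
```

A final-answer check would score this run as a plain failure, while the checkpoint view shows that only the last step broke.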

Conclusion

This work serves as a necessary intervention. It warns that we are currently over-optimizing agents for "Working with Computers" while neglecting "Interacting with Others"—the very skill that permeates most high-value human work. For developers, the takeaway is clear: your agent's success on SWE-bench does not guarantee it can survive 10 minutes in a corporate management environment.
