[MARCH 2026] ARC-AGI-3: The Final Frontier for Agentic Intelligence and the Efficiency Gap
Abstract

ARC-AGI-3 is a pioneering interactive benchmark designed to evaluate agentic intelligence through novel, turn-based abstract environments that require exploration, goal inference, and planning. While humans maintain a 100% success rate, frontier AI models like Gemini 3.1 and GPT-5.4 currently score below 1% under the new Relative Human Action Efficiency (RHAE) metric.

TL;DR

The ARC Prize Foundation has released ARC-AGI-3, an interactive benchmark that moves beyond static pattern matching to evaluate Agentic Intelligence. By requiring agents to explore, infer goals, and plan in novel 2D environments without instructions, it reveals a staggering reality: while humans solve these tasks with 100% reliability, the world's most advanced AI models (Gemini 3.1, GPT-5.4) currently achieve scores under 1%.

The Hard Truth: Moving Beyond "Jagged Intelligence"

For years, the AI industry has relied on "pretraining scaling"—the idea that more data and more compute eventually lead to AGI. However, as the authors of ARC-AGI-3 point out, this has led to a phenomenon called "Jagged Intelligence." Models like GPT-o1 or Gemini 3 demonstrate superhuman reasoning in domains where they have vast knowledge (like coding or math), but they crumble when faced with a simple, abstract grid task they haven’t seen before.

The core motivation for ARC-AGI-3 is that static benchmarks are dead. When a model can generate millions of synthetic tasks to "overfit" on a distribution (as seen with ARC-AGI-2), the only way to measure true fluid intelligence is through Interaction and Efficiency.

Methodology: The Four Pillars of an Agent

ARC-AGI-3 transforms the reasoning challenge into a turn-based game. An agent is dropped into a 64x64 grid with 16 colors and must exercise four functional capabilities:

  1. Exploration: Actively interacting to find hidden information.
  2. Modeling: Building a predictive "world model" of environment mechanics.
  3. Goal-Setting: Deciding what to do when no prompt tells you the objective.
  4. Planning & Execution: Mapping a path to the goal while course-correcting based on feedback.
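The four pillars can be sketched as an interaction loop. Everything below is hypothetical: the paper's actual environment API is not described in this summary, so `TurnBasedEnv` and its toy dynamics are stand-ins used only to show where each pillar slots in.

```python
import random

class TurnBasedEnv:
    """Hypothetical stand-in for an ARC-AGI-3 environment:
    a 64x64 grid of 16 colors, advanced one action at a time."""
    def __init__(self):
        self.grid = [[0] * 64 for _ in range(64)]
        self.turn = 0

    def step(self, action):
        # Toy dynamics for illustration only: the action paints one cell.
        r, c, color = action
        self.grid[r][c] = color % 16
        self.turn += 1
        return self.grid

def agent_loop(env, max_turns=10):
    """Sketch of the four pillars: explore, model, set a goal, plan/execute."""
    observations = []
    for _ in range(max_turns):
        # 1. Exploration: try an action to surface hidden mechanics.
        action = (random.randrange(64), random.randrange(64), random.randrange(16))
        grid = env.step(action)
        # 2. Modeling: record (action, outcome) pairs as a crude world model.
        observations.append((action, grid[action[0]][action[1]]))
        # 3./4. Goal-setting and planning would consume the model here;
        # this sketch only exercises the interaction loop itself.
    return observations

env = TurnBasedEnv()
history = agent_loop(env)
```

Note that the random policy above is exactly the "aimless wandering" the benchmark penalizes; a scoring agent would replace steps 3 and 4 with model-based search.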

The Scoring Innovation: RHAE (Relative Human Action Efficiency)

The benchmark introduces RHAE, a metric that treats intelligence as a resource-scarcity problem. If a human solves a level in 10 moves and an AI takes 100, the AI isn't just "slower"—it is fundamentally less intelligent in that context. The score is squared to penalize brute-force "guessing" and reward intentional, model-based action.
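The summary gives the intuition (human-to-agent action ratio, squared) but not the published formula, so the following is an illustrative reconstruction under that assumption, with the score capped at 100%:

```python
def rhae(human_actions: int, agent_actions: int) -> float:
    """Relative Human Action Efficiency, illustrative form only:
    the ratio of human to agent action counts, squared to punish
    brute-force guessing, capped at 1.0 (100%)."""
    if agent_actions <= 0:
        raise ValueError("agent must take at least one action")
    ratio = human_actions / agent_actions
    return min(ratio, 1.0) ** 2

# The example from the text: a human needs 10 moves, the AI needs 100.
score = rhae(10, 100)  # (10/100)^2 = 1% efficiency
```

Squaring is what makes the metric unforgiving: taking 10x more actions than a human costs you 100x the score, not 10x.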

Figure 3: A graph-based representation of environment states. Intelligent agents identify the shortest path (edges) to the win state (node), while random agents wander aimlessly.
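In graph terms, the "intelligent" baseline in Figure 3 is just shortest-path search over the state space. A minimal sketch, assuming states form an adjacency list (the actual state encoding is not specified in this summary):

```python
from collections import deque

def shortest_path_length(graph, start, win):
    """Breadth-first search over a state graph: returns the minimum
    number of actions (edges) from the start state to the win state,
    or None if the win state is unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == win:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Toy state graph: two routes from "s" to "win", one shorter.
g = {"s": ["a", "b"], "a": ["win"], "b": ["a"]}
length = shortest_path_length(g, "s", "win")  # 2 actions via "a"
```

The catch, of course, is that an agent never sees this graph: it must build it through exploration, which is exactly where RHAE charges for every wasted edge.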

Experiments: The Human-AI Chasm

In March 2026, the baseline testing results provided a humbling reality check for the LLM community:

| Model | RHAE Score |
| :--- | :--- |
| Human Baseline | 100.00% |
| Gemini 3.1 Pro Preview | 0.37% |
| GPT-5.4 (High) | 0.26% |
| Opus 4.6 (Max) | 0.25% |

Despite the massive reasoning power of "o1-style" search, these models lack the agentic autonomy to infer rules from scratch. They are "verifiable domain" specialists; without an external reward signal or a massive training set, their fluid reasoning remains non-existent.

Figure 1: While ARC-AGI-1 and 2 saw rapid progress via test-time training, ARC-AGI-3 resets the clock, showing that we have yet to solve agentic adaptation.

Critical Analysis: Why Aren't LRMs Winning?

The paper identifies several bottlenecks:

  • Context Management: Naively feeding 64x64 grids into an LRM's context window quickly exhausts the "context budget."
  • The Lack of Explanatory Priors: Humans use "Core Knowledge" (objectness, gravity, symmetry) to prune their search space. AI models, despite their size, still struggle to apply these physics-like priors to abstract visual sequences.
  • Zero-Shot Goal Inference: Most AI agents are "Instruction Followers." ARC-AGI-3 requires "Goal Discoverers."

Conclusion: The Path to 2026 and Beyond

ARC-AGI-3 isn't just another benchmark; it’s a gatekeeper. It suggests that the path to AGI isn't just "more layers" or "more tokens," but a fundamental shift toward systems that can autonomously experiment and learn during the execution of a task.

The ARC Prize 2026, with a $2M pool, will be the ultimate battleground. As the authors conclude, ARC-AGI-3 is the only unsaturated general agentic intelligence benchmark left. For the researchers aiming for true AGI, the challenge is clear: stop training on the test, and start building agents that can learn.


Takeaway: If your model needs 1,000,000 examples to learn a rule a child learns in 30 seconds, it isn't "approaching human-level intelligence"—it's just a very large lookup table. ARC-AGI-3 is here to prove it.
