[CVPR 2026] PerceptionComp: Why Your MLLM Fails at "Thinking" in the Real World
Abstract

PerceptionComp is a new expert-level video benchmark (1,114 questions across 279 videos) designed to evaluate long-horizon, perception-centric reasoning in MLLMs. It uses manually annotated, compositional questions that require models to repeatedly revisit video segments to aggregate temporally distributed evidence, with Gemini-3-Flash currently leading at a modest 45.96% accuracy.

TL;DR

PerceptionComp is a brutal new benchmark for video AI. Unlike previous datasets that can be "cheated" with a quick glance or clever guessing, PerceptionComp requires repeated, compositional perception. Even the most advanced models like Gemini-3 and GPT-o3 are hitting a "perception ceiling" in the mid-40% accuracy range, while humans can solve it perfectly—but only if they are allowed to rewatch the video multiple times.

The "Memory vs. Perception" Gap

Most current video benchmarks suffer from a "single-view" problem: if a question can be answered after watching the video once, a model that gets it right likely isn't reasoning; it's just recalling or pattern matching.

The authors of PerceptionComp argue that true video thinking should be "perception-bottlenecked." They intentionally selected videos with high clutter (detected via SAM2) and intense motion (optical flow). They then built questions using Sequential Composition: you can't solve Step 3 unless you accurately perceived a tiny detail in Step 1.
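The selection criterion above can be sketched as a simple ranking. This is a hypothetical illustration, not the paper's pipeline: the actual work runs SAM2 for clutter and optical flow for motion, whereas here both statistics are assumed to be precomputed per video, and the unnormalized weighting is a stand-in.

```python
# Hypothetical sketch: rank candidate videos by perceptual difficulty.
# The paper uses SAM2 segment counts (clutter) and optical-flow magnitude
# (motion); here both are assumed precomputed, and a real pipeline would
# normalize the two scales before weighting them.

def difficulty_score(n_segments: int, mean_flow: float,
                     w_clutter: float = 0.5, w_motion: float = 0.5) -> float:
    """Combine clutter and motion into a single difficulty score."""
    return w_clutter * n_segments + w_motion * mean_flow

def select_videos(stats: dict[str, tuple[int, float]], top_k: int) -> list[str]:
    """Keep the top_k most cluttered, fastest-moving candidate videos."""
    ranked = sorted(stats, key=lambda v: difficulty_score(*stats[v]), reverse=True)
    return ranked[:top_k]

stats = {
    "street_market.mp4": (180, 6.2),  # heavy clutter, strong motion
    "lecture_hall.mp4":  (25, 0.4),   # static, sparse scene
    "soccer_match.mp4":  (90, 9.8),
}
print(select_videos(stats, top_k=2))  # ['street_market.mp4', 'soccer_match.mp4']
```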

[Figure: Overall Architecture]

Method: Breaking the Reasoning Chain

The benchmark uses two core logical structures:

  1. Conjunctive: All sub-conditions (e.g., "red car" + "turning left" + "near the park") must be found to isolate the answer.
  2. Sequential: A multi-hop chain where the model must find Object A to identify Location B, which leads to Action C.
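The two structures can be sketched as minimal data models. This is an illustrative reconstruction, not the benchmark's actual format: `Observation`, the predicate lambdas, and the toy scene-graph dict are all assumptions standing in for whatever a model extracts from video segments.

```python
from dataclasses import dataclass
from typing import Callable

# "Observation" stands in for whatever a model extracts from a video segment.
Observation = dict[str, str]

@dataclass
class ConjunctiveQuestion:
    # All sub-conditions must hold on the SAME observation to isolate the answer.
    conditions: list[Callable[[Observation], bool]]

    def matches(self, obs: Observation) -> bool:
        return all(cond(obs) for cond in self.conditions)

@dataclass
class SequentialQuestion:
    # Each hop consumes the previous hop's answer, so a perception error
    # early in the chain propagates to every later step.
    hops: list[Callable[[str], str]]

    def answer(self, start: str) -> str:
        value = start
        for hop in self.hops:
            value = hop(value)
        return value

# Conjunctive: "red car" + "turning left" + "near the park"
q = ConjunctiveQuestion([
    lambda o: o["color"] == "red",
    lambda o: o["action"] == "turning left",
    lambda o: o["location"] == "near the park",
])
print(q.matches({"color": "red", "action": "turning left", "location": "near the park"}))  # True

# Sequential: Object A -> Location B -> Action C, over a toy scene graph.
scene = {"object A": "location B", "location B": "action C"}
sq = SequentialQuestion([scene.get, scene.get])
print(sq.answer("object A"))  # 'action C'
```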

This design exposes a critical flaw in modern MLLMs: Mid-chain collapse. If a model misidentifies a spatial relation in Step 2, the entire logical edifice crumbles, even if the model's linguistic "reasoning" remains coherent.
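The severity of mid-chain collapse follows from simple arithmetic: if each hop is perceived correctly with independent probability p, whole-chain accuracy decays geometrically with chain length. The 90%-per-step figure below is an assumed illustration, not a number from the paper.

```python
# Illustrative arithmetic (not figures from the paper): whole-chain accuracy
# is the product of per-hop accuracies, assuming independent errors.

def chain_accuracy(p_step: float, n_hops: int) -> float:
    return p_step ** n_hops

for n in (1, 3, 5):
    print(f"{n} hops @ 90% per step: {chain_accuracy(0.9, n):.1%}")
# 1 hop  -> 90.0%
# 3 hops -> 72.9%
# 5 hops -> 59.0%
```

Even a model that is right 90% of the time at each hop answers a five-hop question correctly barely more than half the time, which is why one misread spatial relation sinks an otherwise coherent reasoning chain.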

Experimental Results: A Reality Check for Frontier Models

The results are humbling for the AI community.

  • Proprietary Leaders: Gemini-3-Flash leads with 45.96%, slightly edging out its "Pro" sibling and GPT-o3.
  • Open-Source Gap: Models like Qwen3-VL and InternVL-3.5 hover in the 30-38% range.
  • The Human Gold Standard: While experts hit 100%, humans restricted to a single viewing drop to 18.97%, below the models. This strongly suggests that "thinking" through these videos is impossible without active, iterative re-perception.

[Figure: SOTA Performance Comparison]

Deep Insights: Does More "Thinking" Help?

The paper provides a fascinating analysis of Test-Time Scaling. By increasing the number of "thinking tokens" (deliberation budget) and input frames (perception budget), performance improves across the board.

However, there is a catch: scaling doesn't fix hallucination. In many cases, Gemini-3-Pro generated longer reasoning chains but fixated on irrelevant details or "hallucinated" spatial coordinates, leading to a performance inversion where the faster Gemini-3-Flash actually performed better. The authors call this the "Streamlining Effect": sometimes focus matters more than depth.
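A test-time scaling study like this amounts to a grid sweep over the two budgets. The harness below is a hypothetical sketch: `evaluate` is a stand-in for running a real MLLM on the benchmark, and the `toy` accuracy curve (and its budget values) are invented purely so the sweep runs end to end.

```python
from itertools import product
from typing import Callable

def sweep(evaluate: Callable[[int, int], float],
          frame_budgets: list[int], token_budgets: list[int]) -> dict:
    """Grid over (perception budget, deliberation budget) -> accuracy.

    `evaluate(n_frames, think_tokens)` is a stand-in for sampling n_frames
    from each video, capping reasoning at think_tokens, and scoring.
    """
    return {(f, t): evaluate(f, t) for f, t in product(frame_budgets, token_budgets)}

# Toy stand-in: accuracy grows with both budgets (diminishing returns).
toy = lambda f, t: 0.20 + 0.02 * f ** 0.5 + 0.01 * t ** 0.25
grid = sweep(toy, [8, 32, 128], [256, 1024])
best = max(grid, key=grid.get)
print(best, round(grid[best], 3))  # (128, 1024) 0.483
```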

[Figure: Model Failure Analysis]

Conclusion: The Path Forward

PerceptionComp reveals that the next frontier for Multimodal AI isn't just "more data" or "bigger LLMs." It is Robust Visual Grounding. Models need to learn how to:

  • Maintain consistent "variable binding" (remembering that the 'blue bag' found at 0:10 is the same one at 2:30).
  • Resist "Protagonist Bias" (assuming the main character is always the subject of the question).
  • Correct their own perceptual errors mid-reasoning.
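The first capability, consistent variable binding, can be made concrete with a minimal sketch: once an entity is bound at an early timestamp, later sightings should resolve against that binding rather than being re-grounded from scratch. The `BindingTable` class and its attribute scheme are assumptions for illustration, not a mechanism from the paper.

```python
# Hypothetical sketch of "variable binding" across a long video: the 'blue
# bag' bound at 0:10 must resolve to the same identity when re-seen at 2:30.

class BindingTable:
    def __init__(self) -> None:
        self._bindings: dict[str, dict] = {}

    def bind(self, name: str, attrs: dict, t: float) -> None:
        """Register an entity the first time it is perceived."""
        self._bindings[name] = {"attrs": attrs, "first_seen": t}

    def resolve(self, name: str, attrs: dict, t: float) -> bool:
        """True iff a later sighting is consistent with the original binding."""
        bound = self._bindings.get(name)
        return bound is not None and bound["attrs"] == attrs

table = BindingTable()
table.bind("blue bag", {"color": "blue", "type": "bag"}, t=10.0)          # seen at 0:10
ok = table.resolve("blue bag", {"color": "blue", "type": "bag"}, t=150.0)  # re-seen at 2:30
print(ok)  # True
```

A model that re-perceives the bag as "dark backpack" at 2:30 would fail `resolve`, which is exactly the binding inconsistency the benchmark penalizes.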

For researchers, this benchmark serves as a diagnostic tool to move beyond "vibes-based" evaluation and toward rigorous, perception-centric AI.
