PerceptionComp is a new expert-level video benchmark (1,114 questions across 279 videos) designed to evaluate long-horizon, perception-centric reasoning in MLLMs. It uses manually annotated, compositional questions that require models to repeatedly revisit video segments to aggregate temporally distributed evidence, with Gemini-3-Flash currently leading at a modest 45.96% accuracy.
TL;DR
PerceptionComp is a brutal new benchmark for video AI. Unlike previous datasets that can be "cheated" with a quick glance or clever guessing, PerceptionComp requires repeated, compositional perception. Even the most advanced models like Gemini-3 and GPT-o3 are hitting a "perception ceiling" in the mid-40% accuracy range, while humans can solve it perfectly—but only if they are allowed to rewatch the video multiple times.
The "Memory vs. Perception" Gap
Most current video benchmarks suffer from a "single-view" problem. If a human can answer the question after watching the video once, the question doesn't actually demand iterative perception—so a model that answers it correctly may simply be recalling or pattern matching rather than reasoning.
The authors of PerceptionComp argue that true video thinking should be "perception-bottlenecked." They intentionally selected videos with high visual clutter (measured via SAM2 segmentation) and intense motion (measured via optical flow). They then built questions using Sequential Composition: you can't solve Step 3 unless you accurately perceived a tiny detail in Step 1.
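To make the selection criterion concrete, here is a minimal sketch of how such filtering could work. The `clutter` count is assumed to come from a SAM2 automatic mask generator (objects per sampled frame); the motion score uses OpenCV's Farneback optical flow; both thresholds are illustrative placeholders, not values from the paper.

```python
import cv2
import numpy as np

def motion_score(video_path: str, stride: int = 10) -> float:
    """Mean optical-flow magnitude between sampled frame pairs (Farneback)."""
    cap = cv2.VideoCapture(video_path)
    magnitudes, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
            prev_gray = gray
        idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

def is_candidate(video_path: str, clutter: int,
                 clutter_thresh: int = 30, motion_thresh: float = 2.0) -> bool:
    """Keep only videos that are both cluttered and dynamic.

    `clutter` is assumed to be a per-frame object count from SAM2's
    automatic mask generator; thresholds are hypothetical.
    """
    return clutter >= clutter_thresh and motion_score(video_path) >= motion_thresh
```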

Method: Breaking the Reasoning Chain
The benchmark uses two core logical structures:
- Conjunctive: All sub-conditions (e.g., "red car" + "turning left" + "near the park") must be found to isolate the answer.
- Sequential: A multi-hop chain where the model must find Object A to identify Location B, which leads to Action C.
This design exposes a critical flaw in modern MLLMs: Mid-chain collapse. If a model misidentifies a spatial relation in Step 2, the entire logical edifice crumbles, even if the model's linguistic "reasoning" remains coherent.
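Purely as an illustration, the two structures could be represented with a schema like the one below. The field names, timestamps, and example questions are assumptions for the sake of the sketch, not the benchmark's actual annotation format.

```python
from dataclasses import dataclass, field

@dataclass
class Hop:
    """One perceptual sub-task the model must get right."""
    description: str        # what must be perceived
    segment: tuple          # (start_sec, end_sec) to revisit

@dataclass
class CompositionalQuestion:
    question: str
    mode: str               # "conjunctive" or "sequential"
    hops: list = field(default_factory=list)

# Conjunctive: every sub-condition must hold to isolate the answer.
conjunctive = CompositionalQuestion(
    question="Which vehicle is the red car turning left near the park?",
    mode="conjunctive",
    hops=[
        Hop("find a red car", (10, 25)),
        Hop("confirm it turns left", (10, 25)),
        Hop("confirm it is near the park", (10, 25)),
    ],
)

# Sequential: each hop's output is the next hop's input, so a perception
# error at hop 2 invalidates everything downstream ("mid-chain collapse").
sequential = CompositionalQuestion(
    question="What does the person holding Object A do at Location B?",
    mode="sequential",
    hops=[
        Hop("identify Object A", (5, 20)),
        Hop("use Object A to locate Location B", (40, 60)),
        Hop("report Action C performed at Location B", (90, 110)),
    ],
)
```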
Experimental Results: A Reality Check for Frontier Models
The results are humbling for the AI community.
- Proprietary Leaders: Gemini-3-Flash leads with 45.96%, slightly edging out its "Pro" sibling and GPT-o3.
- Open-Source Gap: Models like Qwen3-VL and InternVL-3.5 hover in the 30-38% range.
- The Human Gold Standard: While experts hit 100%, humans restricted to a single viewing drop to 18.97%—lower than the models! This proves that "thinking" through these videos is impossible without active, iterative re-perception.

Deep Insights: Does More "Thinking" Help?
The paper provides a fascinating analysis of Test-Time Scaling. By increasing the number of "thinking tokens" (deliberation budget) and input frames (perception budget), performance improves across the board.
However, there is a catch: scaling doesn't fix hallucination. In many cases, Gemini-3-Pro generated longer reasoning chains but fixated on irrelevant details or hallucinated spatial coordinates, producing a performance inversion in which the faster Gemini-3-Flash actually scored higher. Call it a "Streamlining Effect": sometimes focus matters more than depth.
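A sweep over both budgets might look like the sketch below. `ask_model` is a hypothetical wrapper around whatever MLLM API is being tested (it is assumed to accept a `thinking_budget` argument); frame sampling uses OpenCV, and the specific budget values are illustrative.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, n_frames: int):
    """Uniformly sample n_frames frames from a video (the perception budget)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def scaling_sweep(video_path: str, question: str, ask_model,
                  frame_budgets=(8, 32, 64), token_budgets=(256, 1024, 4096)):
    """Cross frame counts (perception budget) with thinking-token limits
    (deliberation budget) and collect the model's answers for scoring."""
    results = {}
    for n in frame_budgets:
        frames = sample_frames(video_path, n)
        for budget in token_budgets:
            results[(n, budget)] = ask_model(frames, question,
                                             thinking_budget=budget)
    return results
```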

Conclusion: The Path Forward
PerceptionComp reveals that the next frontier for Multimodal AI isn't just "more data" or "bigger LLMs." It is Robust Visual Grounding. Models need to learn how to:
- Maintain consistent "variable binding" (remembering that the 'blue bag' found at 0:10 is the same one at 2:30).
- Resist "Protagonist Bias" (assuming the main character is always the subject of the question).
- Correct their own perceptual errors mid-reasoning.
For researchers, this benchmark serves as a diagnostic tool to move beyond "vibes-based" evaluation and toward rigorous, perception-centric AI.
