[CVPR 2026] VideoSeek: Toward Human-Like Efficient Video Reasoning via Logic-Guided Seeking
Abstract

VideoSeek is a long-horizon video agent designed for efficient video-language understanding. It replaces exhaustive frame parsing with a "think–act–observe" loop that actively seeks answer-critical evidence using a multi-granular toolkit, achieving SOTA performance on benchmarks like LVBench and Video-MME while using significantly fewer frames.

TL;DR

VideoSeek is a breakthrough long-horizon video agent that mimics human behavior by "seeking" rather than "watching" everything. By leveraging a think–act–observe loop and a multi-granular toolkit, it outperforms massive models and dense-parsing agents on benchmarks like LVBench and Video-MME while consuming as little as 1/300th of the frames used by competitors.

Problem & Motivation: The "Dense Parsing" Bottleneck

In the race to conquer long-form video understanding, the industry has hit a wall: Computational Cost. Most current SOTA models attempt to "brute-force" video understanding by densely sampling frames (e.g., 2 FPS) and feeding thousands of tokens into a context window.

The authors of VideoSeek observe an interesting paradox: Over 80% of questions in long-video benchmarks can be answered by seeing less than 5% of the frames. Humans don't watch an hour-long movie to find a specific detail; we skip, skim, and zoom. VideoSeek was born to bring this "Logic Flow" navigation to AI agents.

Methodology: The Think-Act-Observe Loop

VideoSeek treats video understanding as a trajectory-based reasoning problem. Instead of a single-pass inference, it operates in a loop:

  1. Thought: The LLM (GPT-5) analyzes what it knows so far and what is missing.
  2. Action: It selects a tool from its specialized toolkit to get more evidence.
  3. Observation: The tool returns new visual data, which is added to the "memory" (trajectory).

The Multi-Granular Toolkit

The secret sauce lies in its three-tier tool design:

  • <overview>: Rapid scan of the whole video (e.g., 16 frames) to build a storyline map.
  • <skim>: Low-cost probing of 10-minute intervals to check for relevant events.
  • <focus>: High-resolution (1 FPS) inspection of short clips to find fine-grained evidence (like a specific face or text).
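To make the three granularities concrete, here is a sketch of what each tool's frame selection could look like. The parameters (16 overview frames, 10-minute skim windows, 1 FPS focus) follow the description above, but the function signatures are illustrative assumptions, not the paper's API.

```python
def overview(duration_s: float, n_frames: int = 16) -> list[float]:
    """Uniformly spaced timestamps over the whole video (storyline map)."""
    step = duration_s / n_frames
    return [step * (i + 0.5) for i in range(n_frames)]

def skim(start_s: float, window_s: float = 600.0, n_frames: int = 4) -> list[float]:
    """Sparse probe of one ~10-minute interval to check for relevant events."""
    step = window_s / n_frames
    return [start_s + step * (i + 0.5) for i in range(n_frames)]

def focus(start_s: float, end_s: float, fps: float = 1.0) -> list[float]:
    """Dense 1 FPS sampling of a short clip for fine-grained evidence."""
    n = int((end_s - start_s) * fps)
    return [start_s + i / fps for i in range(n)]
```

The key design point is that cost scales with the tool, not the video: an hour-long video costs 16 frames at the overview tier, and dense sampling is only ever paid for short, already-localized clips.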

Figure 1 (VideoSeek architecture and toolkit): The agentic loop guides the model from a global overview to specific, answer-critical moments.

Experiments: SOTA with Sparse Vision

The results on LVBench and Video-MME are striking.

  • Efficiency: On LVBench (with subtitles), VideoSeek achieved 76.7% accuracy using only 27.2 frames on average. For comparison, the DVD agent used 8,074 frames to achieve a similar 76.0% score.
  • Reasoning Power: Even without subtitles, VideoSeek improved upon GPT-5 by +8.3 points while using only 24% of the frames.
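As a quick sanity check of the efficiency claim, the LVBench numbers above imply roughly a 300x reduction in frames versus the dense-parsing DVD agent (this is our own arithmetic on the reported figures):

```python
# Reported LVBench (with subtitles) frame budgets from the results above.
dvd_frames = 8074        # dense-parsing DVD agent
videoseek_frames = 27.2  # VideoSeek, average frames per question

ratio = dvd_frames / videoseek_frames
print(f"VideoSeek uses ~1/{ratio:.0f} of the frames")  # ~1/297
```

This matches the "as little as 1/300th of the frames" figure quoted in the TL;DR.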

Figure 2 (performance comparison): Accuracy vs. frame usage. VideoSeek (the red triangle) sits in the top-left corner, indicating high accuracy with ultra-low frame consumption.

Critical Insights: Why Does It Work?

  1. Logic Flow is Key: When subtitles are available, VideoSeek's frame usage drops further while performance increases. This suggests the agent uses the "textual storyline" as a map to navigate the visual space.
  2. The "Thinker" Matters: Swapping GPT-5 for smaller models like o4-mini or GPT-4.1 led to massive accuracy drops (up to 15.4 points). An agent is only as good as its ability to plan and judge evidence sufficiency.
  3. Intermediate Reasoning: Analysis shows that the gains aren't just from "choosing better frames" but from the accumulated reasoning within the conversation history, which acts as a structured memory.

Conclusion & Future Outlook

VideoSeek demonstrates that Active Seeking is the future of scalable video AI. By moving away from greedy, dense parsing, we can handle hour-long videos on standard hardware.

Limitations: The authors admit the agent might struggle with "surprising moments" (e.g., a car crash that happens without any logical buildup) because it relies on predictable logic flows. However, as a framework for long-form understanding and multimodal assistants, VideoSeek sets a new standard for efficiency and intelligence.
