[CVPR 2025] EVA: Evolution of the "Active Watcher" in End-to-End Video Intelligence
Abstract

EVA is an efficient reinforcement learning framework for end-to-end video agents that transforms MLLMs from passive recognizers into active observers. By adopting a "planning-before-perception" paradigm and a three-stage training pipeline (SFT, KTO, and GRPO), it achieves a 6–12% improvement over general MLLM baselines while significantly reducing visual token consumption.

Executive Summary

TL;DR: EVA (Efficient Video Agent) is a breakthrough framework that flips the script on video understanding. Instead of feeding a model thousands of frames and asking it to "find the needle," EVA empowers the model to plan its observation strategy first. By utilizing a three-stage RL pipeline (SFT → KTO → GRPO), EVA learns to navigate long videos (up to 6600s) with surgical precision, reducing token usage by over 90% while actually increasing accuracy by 6–12%.

Background Positioning: This work represents a shift from Passive Perception (standard MLLMs) to Active Agency. In the landscape of SOTA video models, EVA moves beyond fixed-sampling baselines and rigid tool-use to achieve a truly autonomous, iterative reasoning loop.

The Core Problem: The Sampling Dilemma

Current Video-LLMs face a paradox:

  1. Dense Sampling: High accuracy but hits the "context wall" and burns massive compute.
  2. Uniform Sampling: Efficient but misses brief, critical events (the "needle in a haystack" problem).

The authors argue the root cause is Perception-First architecture. Models are conditioned to look at what they are given. EVA proposes a Planning-First approach: "Tell me what you’re looking for, and I’ll decide which frames are worth my attention."

Methodology: Planning-Before-Perception

EVA operates on a Markov Decision Process (MDP) through an iterative cycle: Summary $\rightarrow$ Plan $\rightarrow$ Action $\rightarrow$ Reflection.
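
To make the cycle concrete, here is a minimal sketch of such a loop in Python. The `policy` and `tool` interfaces and the `Action` record are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action record; the observation fields mirror the tool
# parameters described below (start_time/end_time, nframes, resize).
@dataclass
class Action:
    kind: str                      # "observe" or "answer"
    answer: Optional[str] = None
    start_time: float = 0.0
    end_time: float = 0.0
    nframes: int = 8
    resize: float = 1.0

def eva_episode(video, question, policy, tool, max_rounds=8):
    """Summary -> Plan -> Action -> Reflection, repeated until the
    policy commits to an answer or the round budget runs out."""
    memory = []                                          # observations so far
    for _ in range(max_rounds):
        summary = policy.summarize(memory, question)     # Summary
        action = policy.plan(summary, question)          # Plan
        if action.kind == "answer":
            return action.answer                         # enough evidence gathered
        frames = tool.sample(video, action.start_time,   # Action
                             action.end_time, action.nframes, action.resize)
        memory.append(policy.reflect(frames, question))  # Reflection
    # Budget exhausted: force a best-effort answer from the current memory.
    return policy.plan(policy.summarize(memory, question), question).answer
```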

1. The Flexible Toolset

Unlike prior agents restricted to temporal windowing, EVA controls three sampling knobs (a sketch of the underlying index math follows the list):

  • start_time / end_time: Precise temporal localization.
  • nframes: Temporal density.
  • resize: Spatial resolution (zooming in for fine details).
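
A minimal sketch of what such a tool might do with those parameters, assuming `video` is an indexable frame source exposing an `fps` attribute; this is illustrative index math, not the paper's actual tool implementation:

```python
import numpy as np

def sample_frames(video, start_time, end_time, nframes, resize=1.0):
    """Pick `nframes` timestamps uniformly inside [start_time, end_time]
    and optionally downscale each frame by `resize`."""
    times = np.linspace(start_time, end_time, num=nframes)
    frames = []
    for t in times:
        frame = video[int(t * video.fps)]          # (H, W, 3) uint8 array
        if resize != 1.0:
            h, w = frame.shape[:2]
            new_h, new_w = int(h * resize), int(w * resize)
            # Nearest-neighbor downscale keeps the sketch dependency-free.
            ys = np.linspace(0, h - 1, new_h).astype(int)
            xs = np.linspace(0, w - 1, new_w).astype(int)
            frame = frame[ys][:, xs]
        frames.append(frame)
    return frames
```

Fewer frames at lower resolution means fewer visual tokens per round, which is where the 90%+ token savings come from.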

2. The Three-Stage Evolution

  • Stage 1: SFT Cold-Start: Teaches the model the "language" of tools and basic interaction formats.
  • Stage 2: KTO (Kahneman-Tversky Optimization): Instead of complex pairwise rankings, KTO uses binary "success/failure" labels to help the model avoid common strategic pitfalls (like "guessing" without enough data).
  • Stage 3: Data-Enhanced GRPO: An online RL phase where the model explores self-generated trajectories (a minimal sketch of the group-relative advantage appears after this list). Crucially, the authors use a "Data-Enhanced" loop, generating new, harder QA pairs based on current model failures to prevent policy stagnation.
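
The core of the standard GRPO formulation (Shao et al., 2024) is that rewards are normalized within a group of rollouts for the same question, so no learned critic is needed. A minimal sketch, with placeholder reward values rather than EVA's exact reward design:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage: normalize each trajectory's reward against the
    group of rollouts sampled for the same question."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts for one QA pair. Above-mean trajectories get a
# positive advantage and are reinforced; below-mean ones are suppressed.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.2])
```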

Figure: Model architecture and training pipeline.

Experimental Performance: Efficiency Meets Power

EVA was tested against heavyweights like Gemini 2.0 and GPT-4o. The results on LSDBench are particularly striking:

  • Baseline (Qwen2.5-VL): 50.1% Acc @ 166k tokens.
  • EVA: 51.0% Acc @ 10.3k tokens.

EVA achieves a roughly 16x reduction in visual token costs while maintaining higher accuracy.
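
A quick sanity check on those numbers, which also grounds the "over 90%" claim in the TL;DR:

$$\frac{166\text{k}}{10.3\text{k}} \approx 16.1 \quad\text{and}\quad 1 - \frac{10.3}{166} \approx 93.8\%\ \text{fewer visual tokens}$$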

Figure: Sampling dilemma benchmark results.

Ablation Insight: Why GRPO Matters

The ablation study (Figure 4) shows that as the model progresses from SFT to GRPO, it learns to be smarter, not faster. GRPO-trained agents use fewer total frames but engage in more interaction rounds—effectively "thinking twice" before committing to an answer.

Figure: Ablation on interaction rounds.

Critical Analysis & Takeaways

Why it works: EVA succeeds because it treats "watching a video" as a resource allocation problem. By explicitly modeling Reflection, the model learns to say "I don't have enough info yet, let me zoom in on the 02:30 mark" rather than hallucinating based on a blurry global average.

Limitations: The framework currently relies on a fixed API for "frame selection." If the tool itself fails to capture motion (e.g., fast-moving tiny objects), the agent is still limited by its "eyes."

Future Outlook: The "Planning-before-perception" philosophy is a blueprint for the next generation of AI agents. Expect this logic to move into web navigation and robotic manipulation, where the cost of "looking everywhere" is simply too high.

References

  • Zhang et al. (2025). "EVA: Efficient Reinforcement Learning for End-to-End Video Agent." SenseTime Research.
