The paper introduces Summary-Driven Reinforcement Learning (SDRL), a single-stage RL framework for Video Multi-modal Large Language Models (MLLMs). It uses a Structured Chain-of-Thought (Summarize → Think → Answer) to ground reasoning in temporal facts, achieving state-of-the-art performance across seven VideoQA benchmarks, including NExT-GQA and VideoMME.
TL;DR
Video-based Multi-modal Large Language Models (MLLMs) often "hallucinate" logic even when they see the right frames, a phenomenon known as thinking drift. Summary-Driven Reinforcement Learning (SDRL) fixes this by forcing the model to follow a rigid Summarize → Think → Answer pipeline. Using a single-stage RL approach with self-supervised consistency checks, it hits SOTA across seven benchmarks while staying more efficient and concise than traditional SFT+RL methods.
The Problem: When Models Forget What They Saw
Despite the power of Chain-of-Thought (CoT), video understanding models struggle with two main issues:
- Thinking Drift: Without constraints, a model's intermediate reasoning can wander away from the actual video content, leading to a "right answer for the wrong reasons" or total failure.
- Temporal Blindness: Standard MLLMs often treat videos as a bag of frames, missing the crucial "what happened before what" logic.
Previous attempts to fix this required Supervised Fine-Tuning (SFT) on thousands of human-annotated reasoning paths. This is not only expensive but also creates "lazy" models that merely mimic the annotated paths rather than exploring better ways to solve the problem.
Methodology: Summarize, Then Think
SDRL replaces the complex SFT+RL pipeline with a streamlined, single-stage RL process. The core intuition is that if a model can first produce a factually accurate Summary of actions, its subsequent Thinking will be naturally grounded.
1. Structured CoT
The model is prompted to follow a specific three-part format (a minimal prompt-and-parsing sketch follows the list):
- Summarize: A chronological list of key events.
- Think: Logical deduction based on that summary.
- Answer: The final output.
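To make the format concrete, below is a minimal Python sketch of a prompt template and parser for this structure. The exact prompt wording and the `<summary>`/`<think>`/`<answer>` tag names are illustrative assumptions; the paper's actual template may differ.

```python
import re

# Hypothetical prompt template; the wording and tag names are assumptions
# chosen to illustrate the Summarize -> Think -> Answer structure.
STRUCTURED_COT_PROMPT = (
    "Watch the video and answer the question.\n"
    "First, summarize the key events in chronological order inside "
    "<summary>...</summary>. Then reason step by step inside "
    "<think>...</think>. Finally, give your answer inside <answer>...</answer>.\n"
    "Question: {question}"
)

def parse_structured_cot(text: str) -> dict:
    """Split a model response into its three structured-CoT fields.

    Any field whose tags are missing comes back as None, which a
    format reward could then penalize during RL.
    """
    fields = {}
    for tag in ("summary", "think", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields
```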
2. Consistency of Vision Knowledge (CVK)
How do we ensure the summary is correct without human labels? SDRL assumes that if you ask a model the same question multiple times (sampling a group), the "correct" paths should all have very similar summaries. CVK penalizes summaries that diverge from the group's "consensus" anchor, effectively rewarding factual stability.
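A minimal sketch of how such a consensus reward could be computed is below, assuming summaries are compared as sentence embeddings with cosine similarity; the paper's exact anchor construction and similarity measure may differ.

```python
import numpy as np

def cvk_rewards(summary_embeddings: np.ndarray) -> np.ndarray:
    """Consistency-of-Vision-Knowledge reward for one sampled group (sketch).

    summary_embeddings: (G, d) array with one sentence embedding per
    sampled rollout's summary (e.g., from a sentence-BERT encoder).

    The "consensus anchor" here is simply the mean embedding of the
    group; each summary is rewarded by its cosine similarity to that
    anchor, so summaries that drift from the consensus score lower.
    """
    anchor = summary_embeddings.mean(axis=0)
    anchor /= np.linalg.norm(anchor) + 1e-8
    norms = np.linalg.norm(summary_embeddings, axis=1, keepdims=True) + 1e-8
    unit = summary_embeddings / norms
    return unit @ anchor  # (G,) cosine similarities, higher = more consistent
```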
3. Dynamic Variety of Reasoning (DVR)
While we want summaries to be consistent, we want the reasoning (the "Think" part) to be diverse to explore different logical paths. SDRL uses an entropy-based reward that encourages diversity when the model is uncertain but stabilizes once the model finds a successful path.
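The sketch below illustrates one way such an entropy-gated bonus could be scheduled; gating by the group's success rate and normalizing by a target entropy are assumptions for illustration, not the paper's exact formula.

```python
import numpy as np

def dvr_weight(think_token_entropies: np.ndarray,
               group_success_rate: float,
               target_entropy: float = 1.0) -> float:
    """Dynamic-Variety-of-Reasoning weight for one rollout (sketch).

    think_token_entropies: per-token policy entropies over this
    rollout's <think> span.
    group_success_rate: fraction of rollouts in the sampled group
    that answered correctly.

    While the group is mostly failing (model uncertain), higher entropy
    in the reasoning span is rewarded to encourage exploring different
    logical paths; as the success rate rises, the bonus decays so the
    model can stabilize on a path that works.
    """
    mean_entropy = float(np.mean(think_token_entropies))
    exploration_gate = 1.0 - group_success_rate  # decays as the group succeeds
    return exploration_gate * min(mean_entropy / target_entropy, 1.0)
```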
Figure 1: The SDRL Framework. Note how CVK and DVR act as "weights" on the tokens during the RL update to balance grounding and exploration.
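One plausible reading of the caption, sketched below under stated assumptions: advantages are group-normalized answer rewards (GRPO-style), and the per-token weight is boosted by the CVK score on summary tokens and by the DVR weight on think tokens. The paper's actual composition of these signals may differ.

```python
import numpy as np

def weighted_token_advantages(base_rewards, cvk, dvr, summary_mask, think_mask):
    """Combine CVK/DVR into per-token advantages (illustrative sketch).

    base_rewards: (G,) answer-correctness rewards for the sampled group.
    cvk, dvr: (G,) per-rollout scores from the two sketches above.
    summary_mask, think_mask: (G, T) 0/1 arrays marking which tokens
    belong to each rollout's <summary> and <think> spans.
    """
    rewards = np.asarray(base_rewards, dtype=float)
    cvk = np.asarray(cvk, dtype=float)
    dvr = np.asarray(dvr, dtype=float)
    # Group-normalized advantage, as in GRPO-style objectives (assumption).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    weights = np.ones_like(summary_mask, dtype=float)          # (G, T)
    weights += cvk[:, None] * summary_mask   # ground the summary span
    weights += dvr[:, None] * think_mask     # diversify the think span
    return adv[:, None] * weights            # per-token weighted advantages
```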
Experimental Triumphs
The researchers tested SDRL against heavyweights like Video-R1 and Qwen2.5-VL. The results were telling:
- Higher Accuracy: On the NExT-GQA reasoning benchmark, SDRL scored 79.3%, a +5.7% jump over the base model.
- Efficiency: Unlike other RL models that become "chatty" (generating long, redundant reasoning), SDRL's outputs are significantly more concise.
Table 1: State-of-the-art results across various benchmarks. SDRL consistently outperforms SFT+RL hybrids.
Why Self-Supervision Wins
One of the most interesting findings was that for larger models (7B), self-supervision performed better than ground-truth human labels. This suggests that forcing a large model to match a specific human summary can actually "over-constrain" it, whereas self-consistency allows the model to leverage its internal pre-trained knowledge more effectively.
Visual Evidence: Grounding in Action
Beyond the qualitative examples, SDRL's summaries also scored markedly higher on BLEU and sBERT similarity than those of previous RL methods, indicating that its summaries stick much closer to the actual video content.
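For readers who want to run this kind of check themselves, here is a small sketch that scores a generated summary with both metrics. The reference text and the `all-MiniLM-L6-v2` encoder checkpoint are assumptions, not the paper's exact evaluation setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

def summary_scores(generated: str, reference: str,
                   encoder: SentenceTransformer):
    """Score a generated summary against a reference description
    with BLEU (n-gram overlap) and sBERT cosine similarity (semantics)."""
    bleu = sentence_bleu(
        [reference.split()], generated.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    sbert = float(util.cos_sim(emb[0], emb[1]))
    return bleu, sbert

# Usage (assumed checkpoint name):
# encoder = SentenceTransformer("all-MiniLM-L6-v2")
# bleu, sbert = summary_scores(model_summary, human_description, encoder)
```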
Figure 2: Notice how SDRL (bottom) provides a structured chronological summary that leads directly to the correct answer, whereas other models struggle with temporal order.
Limitations & Future Outlook
While SDRL is a leap forward, it isn't perfect. For extremely fast, "micro-action" videos (like a magic trick or fast shuffling), the model tends to provide a "coarse" summary that might miss the split-second frames needed for a correct answer. Future iterations may need even denser temporal rewards to capture these atomic movements.
Conclusion
SDRL shows that we don't need million-dollar datasets of "expert reasoning" to make MLLMs smarter. By simply enforcing a Structured CoT and rewarding self-consistency, we can create models that are not only more accurate but also more stable and efficient. This work sets a new standard for training high-performance video agents with minimal supervision.
