[CVPR 2024] Reflect to Inform: Solving "Visual Drift" via Information-Gain-Driven VRE
Abstract

The paper introduces Visual Re-Examination (VRE), a self-evolving training framework designed to enhance the reasoning capabilities of Multimodal Large Language Models (MLLMs). Through an iterative RLVR-SFT pipeline, VRE enables models to autonomously trigger a "look-back" at the original image tokens, achieving results that surpass SFT baselines on benchmarks such as MathVista and V*-Bench.

TL;DR

Multimodal Large Language Models (MLLMs) often suffer from "Visual Drift"—they start sharp but become "visually blind" as the reasoning chain lengthens. This paper introduces Visual Re-Examination (VRE), a framework that teaches models to autonomously pause and re-look at the image when they sense uncertainty. By using a self-iterative RL/SFT loop driven by Information Gain, VRE allows a 7B model to outperform tool-using agents and challenge proprietary giants.

The Problem: Late-Stage Visual Blindness

Even the strongest MLLMs exhibit a curious failure mode: on complex tasks, attention to visual tokens decays as the reasoning chain lengthens. The model begins to "hallucinate" from textual priors rather than image evidence.

Existing solutions usually fall into two camps:

  1. Explicit Tools: injecting high-resolution crops or zooms (high computational cost).
  2. Implicit Refocusing: asking the model to "think again" (often leads to redundant paraphrasing).

The authors discovered that visual sensitivity isn't gone; it's just dormant. The challenge is transforming this sporadic re-look behavior into a robust, learnable policy.

Methodology: The VRE Framework

VRE is built on the philosophy of Self-Evolution. Instead of distilling from a teacher like GPT-4o—which can introduce "perceptual discrepancies"—the model learns from its own reasoning manifold.

Overall Architecture

1. Cold-Start SFT

The model is first taught a structural template: Reasoning -> <reflection> -> Answer. The authors use a Difficulty-Aware Sample Partitioning method to identify "Unstable" queries—those where the model is sometimes right and sometimes wrong. These are the "golden" training signals.
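To make the partitioning concrete, here is a minimal Python sketch, assuming each query is rolled out several times and only queries with an intermediate pass rate are kept. The `sample_answer` / `is_correct` helpers, the rollout count, and the thresholds are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of Difficulty-Aware Sample Partitioning.
# `sample_answer` and `is_correct` are assumed helpers; the rollout
# count and thresholds are placeholders, not the paper's values.

def partition_by_difficulty(model, dataset, n_rollouts=8, low=0.125, high=0.875):
    """Split queries into stable-correct, unstable, and stable-wrong buckets."""
    stable_correct, unstable, stable_wrong = [], [], []
    for image, question, gold in dataset:
        # Sample several independent answers for the same query.
        hits = sum(
            is_correct(sample_answer(model, image, question), gold)
            for _ in range(n_rollouts)
        )
        pass_rate = hits / n_rollouts
        if pass_rate >= high:        # already reliable: little to learn
            stable_correct.append((image, question, gold))
        elif pass_rate <= low:       # likely beyond current capability
            stable_wrong.append((image, question, gold))
        else:                        # sometimes right, sometimes wrong:
            unstable.append((image, question, gold))  # the "golden" signal
    return stable_correct, unstable, stable_wrong
```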

2. Information-Gain-Driven RL (GRPO)

Using Group Relative Policy Optimization, the model is rewarded based on three components (sketched in code after the list):

  • Format Reward: Correct tag usage.
  • Accuracy Reward: Solving the problem correctly.
  • Reflection Reward: The "Secret Sauce." It uses an LLM-based scorer to ensure the <reflection> block actually contains new visual evidence (Information Gain) rather than just repeating prior text.
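
As a rough illustration of how these three signals might combine, here is a Python sketch. The weights, the naive answer check, and the `judge.score_information_gain` call are assumptions for illustration; the paper's exact reward formulation may differ.

```python
import re

# Illustrative reward composition for GRPO training; weights and the
# answer check below are assumptions, not the paper's formulation.

REFLECTION_RE = re.compile(r"<reflection>(.*?)</reflection>", re.DOTALL)

def answers_match(response: str, gold: str) -> bool:
    # Placeholder verifier: gold answer appears in the final line.
    lines = response.strip().splitlines()
    return bool(lines) and gold.strip().lower() in lines[-1].lower()

def vre_reward(response: str, gold: str, judge) -> float:
    # 1) Format reward: response must contain a well-formed <reflection> block.
    match = REFLECTION_RE.search(response)
    format_r = 1.0 if match else 0.0

    # 2) Accuracy reward: final answer matches the verifiable ground truth.
    accuracy_r = 1.0 if answers_match(response, gold) else 0.0

    # 3) Reflection reward: an LLM judge scores (in [0, 1]) whether the
    #    reflection adds NEW visual evidence rather than paraphrasing.
    reflection_r = 0.0
    if match:
        prior_text = response[: match.start()]
        reflection_r = judge.score_information_gain(prior_text, match.group(1))

    # Weighted sum; the weights here are placeholders.
    return 0.2 * format_r + 0.6 * accuracy_r + 0.2 * reflection_r
```

GRPO then normalizes these scalar rewards within each group of responses sampled for the same query, so a reflection earns credit only relative to its sibling rollouts.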

3. Homologous Reconstruction

Crucially, VRE re-feeds the initial reasoning back into the same model. This ensures perception-reasoning alignment, avoiding the "interpreting through another observer's eyes" problem found in multi-model pipelines.
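A minimal sketch of this loop, assuming a hypothetical chat-style `generate` helper: the model's own first-pass reasoning is fed back, together with the image, into the very same weights. The prompt wording is illustrative.

```python
# Hypothetical sketch of homologous reconstruction: the SAME model that
# produced the initial reasoning is asked to re-examine the image.
# `generate` is an assumed inference helper, not a library API.

def reconstruct_with_reflection(model, image, question):
    # Pass 1: the model reasons about the question as usual.
    initial_reasoning = generate(
        model, image,
        prompt=f"Question: {question}\nThink step by step."
    )

    # Pass 2: feed the SAME model its own reasoning and ask it to
    # look back at the image for evidence it may have missed.
    reflection_prompt = (
        f"Question: {question}\n"
        f"Your earlier reasoning:\n{initial_reasoning}\n"
        "Re-examine the image. Inside <reflection>...</reflection>, "
        "note any visual evidence your reasoning missed, then answer."
    )
    return generate(model, image, prompt=reflection_prompt)
```

Because both passes share one set of weights, the reflection stays grounded in the same perceptual space that produced the original chain.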

Experimental Results: Small Model, Big Performance

VRE was tested on Qwen2.5-VL-7B across multiple benchmark dimensions.

Performance Comparison

  • Math & Logic: Gains of +6.6% on WeMath and +3.0% on MathVista.
  • Perception: A massive +7.4% jump on V*-Bench, outperforming tool-augmented models like Thyme and DeepEyesV2.
  • Visual Accuracy: VRE reduces "careless" errors through Recheck Reflection—a final verification step before committing to an answer.

Mechanistic Insights: Why Does It Work?

The authors performed an Attention Analysis to prove the model actually "sees" better.

Attention Analysis

In the base model, attention to visual tokens decays steadily as the response unfolds; under VRE, there is a sharp spike during the reflection phase. This isn't just a general increase in attention: heatmap visualizations show a "search-and-lock" operation in which the model focuses precisely on the missing object (e.g., a dustpan or a specific digit).
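
For readers who want to probe this themselves, below is a rough sketch of the measurement using Hugging Face-style generation outputs. The `visual_token_slice` argument (which key positions are image tokens) and the single-layer choice are simplifying assumptions; real model internals vary.

```python
import torch

# Illustrative attention analysis: how much attention each newly
# generated token pays to the image tokens. Assumes an HF-style model;
# some backends need eager attention to return weights at all.

@torch.no_grad()
def visual_attention_mass(model, inputs, visual_token_slice, max_new_tokens=256):
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        output_attentions=True,
        return_dict_in_generate=True,
    )
    per_step_mass = []
    for step_attn in out.attentions:   # one tuple of layers per new token
        last_layer = step_attn[-1]     # shape: (batch, heads, q_len, k_len)
        # Attention row of the newest token, averaged over heads.
        row = last_layer[0, :, -1, :].mean(dim=0)
        per_step_mass.append(row[visual_token_slice].sum().item())
    return per_step_mass  # expect decay in the base model, a spike in VRE
```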

Conclusion & Future Outlook

VRE demonstrates that self-correction is a viable path for multimodal models. By quantifying Information Gain, the framework prevents models from falling into a "loop of verbosity" and instead forces them to re-ground their logic in visual evidence.

Final Takeaway: The future of MLLM reasoning might not be "more parameters," but rather "smarter introspection."

Limitations: While VRE is highly effective, it still relies on a verifiable ground truth for the RL stage, which might be harder to scale for open-ended creative tasks.
