[PEPO] Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
Abstract

This paper introduces Perception-Exploration Policy Optimization (PEPO), a token-level RL framework that enhances Multimodal Chain-of-Thought (CoT) reasoning. By integrating a perception prior (via hidden-state similarity) and reasoning entropy through a smooth gating mechanism, it achieves SOTA results on benchmarks like Geometry3K (+3.67%) and MathVista, outperforming standard GRPO and DAPO baselines.

TL;DR

Recent advances in Large Vision-Language Models (LVLMs) rely heavily on Reinforcement Learning with Verifiable Rewards (RLVR) to solve complex puzzles and math. However, the industry-standard GRPO (Group Relative Policy Optimization) treats every token in a "Chain-of-Thought" as equally important—whether it's a critical observation of a geometric diagram or a filler word like "therefore."

PEPO (Perception-Exploration Policy Optimization) fixes this by looking under the hood of the model. By measuring how much each token "looks" at the image (Perception) and how "uncertain" the model is (Exploration), it reweights the training signal at the token level. The result? Sharper reasoning, better grounding, and zero extra parameters.


The Problem: The "Flat" Advantage Trap

In standard RLVR, if a model gets a geometry question right, the entire text sequence gets a "pat on the back." If it gets it wrong, every token is penalized.

  • The Issue: This ignores the Multimodal Constraint. A model might fail because it misidentified a line in an image (perceptual failure), not because its logic was wrong.
  • The Limitation of Entropy: Previous attempts to use "Entropy" (uncertainty) to weight tokens only capture linguistic doubt. They don't tell the model if it's actually looking at the visual evidence.

Methodology: Perception Meets Exploration

The core innovation of PEPO is the Perception-Exploration Fusion, which rests on two intuitions:

  1. Perception Prior (The "Anchor"): The authors discovered that "correct" reasoning tokens have higher cosine similarity in their hidden states to the visual tokens of the image. PEPO calculates this Visual Similarity (VS) dynamically.
  2. Exploration Prior (The "Compass"): High-entropy tokens represent decision points where the model explores different reasoning paths.

1. The Smooth Gating Mechanism

Instead of simply adding these two signals, PEPO combines them with a gated operator.

(Figure: Framework Overview)

The gate ensures that Perception remains the dominant signal: Entropy acts only as a modulator for tokens that are already somewhat grounded in the image. This prevents the model from "hallucinating" with high confidence on irrelevant text.
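
Since the paper's exact equation is not reproduced here, the following PyTorch sketch shows one plausible instantiation of perception-dominant gating. The function name `pepo_token_weights`, the max-over-visual-tokens similarity, and the sigmoid gate are illustrative assumptions, not the authors' verbatim formula:

```python
import torch
import torch.nn.functional as F

def pepo_token_weights(text_hidden, visual_hidden, logits, alpha=1.0):
    """Illustrative PEPO-style token weights (not the paper's exact form).

    text_hidden:   (T, d) hidden states of the generated reasoning tokens
    visual_hidden: (V, d) hidden states of the image's visual tokens
    logits:        (T, vocab_size) policy logits at each generated position
    """
    # Perception prior: each reasoning token's best cosine similarity
    # to any visual token of the input image.
    sim = F.cosine_similarity(
        text_hidden.unsqueeze(1),    # (T, 1, d)
        visual_hidden.unsqueeze(0),  # (1, V, d)
        dim=-1,
    )                                # (T, V)
    vs = sim.max(dim=-1).values      # (T,), higher = more visually grounded

    # Exploration prior: per-token policy entropy, normalized to [0, 1].
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    entropy = entropy / torch.log(torch.tensor(float(logits.size(-1))))

    # Smooth gate: entropy only modulates tokens that are already
    # grounded, so perception stays the dominant term.
    gate = torch.sigmoid(alpha * vs)
    return vs.clamp_min(0.0) * (1.0 + gate * entropy)
```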

2. Implementation

PEPO is "plug-and-play." It integrates with GRPO or DAPO by simply multiplying the sequence-level advantage $A^{(i)}$ by the calculated token weight $w_t$.

  • Efficiency: Because it uses existing hidden states, the computational overhead is less than 1%.
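
To make the integration concrete, here is a minimal sketch of how such a weight could be folded into a GRPO-style objective. The group normalization follows the standard GRPO recipe; the interface and the `token_weights` argument (produced by something like the hypothetical `pepo_token_weights` above) are illustrative:

```python
import torch

def weighted_token_advantages(rewards, token_weights, eps=1e-8):
    """Sketch: broadcast the GRPO sequence advantage A^(i) onto each
    token, scaled by the PEPO weight w_t (interface is illustrative).

    rewards:       (G,) verifiable reward for each of G sampled responses
    token_weights: list of G tensors, one (T_i,) weight vector per response
    """
    # Standard GRPO group-relative advantage: normalize rewards
    # within the sampled group of responses.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)  # (G,)

    # PEPO's modification: the flat sequence-level advantage becomes
    # a token-level signal, emphasizing grounded, exploratory tokens.
    return [a * w for a, w in zip(adv, token_weights)]
```

Because the weights reuse hidden states from the existing forward pass, no extra parameters or model branches are introduced.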

Experimental Results: Where It Counts

Across a battery of benchmarks (Geometry3K, RefCOCO, MathVista), PEPO consistently outperformed the vanilla baselines.

1. Geometry and Logic Reasoning

In Geometry3K, PEPO improved the Qwen2.5-VL-3B model by over 4 points on the validation set. Qualitative cases show that PEPO stays "on track" with visual evidence, whereas standard GRPO often ignores the diagram midway through a long derivation.

(Figure: Performance Gains)

2. Visual Grounding & Puzzles

On the LISA-Grounding task, which requires precise localization, PEPO saw a significant jump in IoU (Intersection over Union). This indicates that the token-level weights successfully push the model's internal representations into closer alignment with the relevant visual regions.


Critical Insights: Why it Works

The ablation studies reveal a crucial finding: neither Perception-only nor Exploration-only weighting is sufficient.

  • Perception-only limits the model's reasoning diversity.
  • Exploration-only (Entropy) can lead to "model collapse" in tasks like visual grounding (RefCOCO), where the model gets lost in its own uncertainty.

Training Dynamics: As shown in Fig. 6, PEPO maintains a higher "Visual Similarity" throughout the training process compared to GRPO, effectively training the model to "stare harder" at the relevant parts of the image as it thinks.

(Figure 6: Training Dynamics)


Conclusion

PEPO represents a shift from Outcome-based RL (did you get the answer right?) to Process-based RL (did you look at the right things while thinking?). By leveraging the innate similarity between text and visual hidden states, it provides a "free" way to ground Long Chain-of-Thought reasoning.

Future Outlook: While evaluated on 2B and 3B models, the next frontier for PEPO is scaling to 72B+ parameters and video-based reasoning, where the "credit assignment" problem is even more severe.


Senior Editor's Note: PEPO is a textbook example of how a deep understanding of internal model dynamics can replace expensive external supervision or auxiliary model branches.
