[ICML 2025] Token Reweighting: The Secret to Balancing Perception and Reasoning in MLLMs
Abstract

The paper introduces Token Reweighting (ToR), a plug-and-play strategy for Reinforcement Learning with Verifiable Rewards (RLVR) in Multimodal Large Language Models (MLLMs). It identifies and dynamically reweights "perception-related" and "reasoning-related" tokens during training, achieving state-of-the-art performance on benchmarks like MathVerse and HalluBench.

TL;DR

Reinforcement Learning with Verifiable Rewards (RLVR) has become the gold standard for scaling reasoning in LLMs, but applying it to multimodal models (MLLMs) is tricky. This paper reveals that perception (seeing) and reasoning (thinking) are "coupled" at the token level. The authors propose Token Reweighting (ToR), a strategy that identifies the tokens critical to each capability and reweights their gradients, leading to consistent SOTA gains without extra model parameters.

The Motivation: The "Isolated Optimization" Trap

Most current MLLM training pipelines are split: you either use Chain-of-Thought (CoT) to boost reasoning or use data augmentation to fix perception.

However, an MLLM's response is an interleaved stream of perception and reasoning tokens. If you only optimize the reasoning tokens (high-entropy points), the model produces elegant logic built on hallucinated visual details. If you only optimize the perception tokens (visually sensitive points), the model identifies objects correctly but fails to connect them into a mathematical proof.

The authors' hypothesis is simple: You cannot fix one without the other because they are fundamentally interdependent.

Methodology: Identifying the "Functional" Tokens

The brilliance of ToR lies in its simplicity. It doesn't need external labels; it uses the model's own "intrinsic signals" to find where to focus the gradient.

  1. Reasoning-related Tokens: Identified by High Predictive Entropy. These are the "forks in the road" where the model is uncertain about the next logical step.
  2. Perception-related Tokens: Identified by Visual Sensitivity. If the probability of a token changes drastically when the image is removed, that token is "grounded" in the visual input.
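
Both signals can be estimated from the policy model itself with two forward passes: one on the full multimodal input and one with the image masked out. The sketch below illustrates one plausible way to compute them in PyTorch; the function name, the quantile thresholds, and the use of an absolute log-probability gap as the sensitivity score are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def flag_functional_tokens(logits_with_image: torch.Tensor,
                           logits_without_image: torch.Tensor,
                           token_ids: torch.Tensor,
                           entropy_quantile: float = 0.8,
                           sensitivity_quantile: float = 0.8):
    """Flag reasoning- and perception-related tokens from intrinsic signals.

    Illustrative sketch only (thresholds and sensitivity measure are assumptions).
    logits_with_image    : (T, V) per-step logits from the full multimodal input
    logits_without_image : (T, V) per-step logits with the image masked out
    token_ids            : (T,)   the sampled response tokens
    """
    # Reasoning signal: predictive entropy of the next-token distribution.
    log_probs = F.log_softmax(logits_with_image, dim=-1)            # (T, V)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)            # (T,)

    # Perception signal: how much the chosen token's log-probability shifts
    # when the image is removed ("visual sensitivity").
    log_probs_no_img = F.log_softmax(logits_without_image, dim=-1)
    lp_with = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    lp_without = log_probs_no_img.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    visual_sensitivity = (lp_with - lp_without).abs()               # (T,)

    # Keep the top tokens of each signal via a per-response quantile threshold.
    reasoning_mask = entropy >= torch.quantile(entropy, entropy_quantile)
    perception_mask = visual_sensitivity >= torch.quantile(
        visual_sensitivity, sensitivity_quantile)
    return reasoning_mask, perception_mask
```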

[Figure: ToR Concept & Token Roles]

The ToR Objective

Once identified, these tokens are given specific weights ($\gamma_r$ and $\gamma_p$) in the RLVR loss (like GRPO or DAPO). Instead of treating every token in a long response as equally important, the gradient focuses on these high-leverage points.
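As a rough sketch of what such a token-reweighted objective could look like on top of a GRPO-style loss (the clipping form, the weight function $w_t$, and treating the two token sets as disjoint are assumptions here, not the paper's exact formulation):

$$
\mathcal{J}_{\text{ToR}}(\theta) = \mathbb{E}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} w_t \cdot \min\!\Big(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
w_t =
\begin{cases}
\gamma_r & \text{if token } t \text{ is reasoning-related},\\
\gamma_p & \text{if token } t \text{ is perception-related},\\
1 & \text{otherwise},
\end{cases}
$$

where $r_t(\theta)$ is the per-token importance ratio and $\hat{A}_t$ the group-normalized advantage, as in GRPO.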

[Figure: Optimization Behavior Comparison]

Experiments & SOTA Performance

The authors tested ToR on Qwen2.5-VL (3B and 7B) across various benchmarks.

  • Logic & Math: On MathVerse, ToR-GRPO outperformed vanilla GRPO by over 2 absolute percentage points.
  • Perception: On HalluBench, it significantly reduced hallucinations, showing that better reasoning actually helps stabilize visual grounding.
  • Generalizability: Whether using the standard GRPO or the more advanced DAPO, ToR provided a consistent "lift" across the board.

| Model | MathVerse | WeMath | HalluBench |
| :--- | :---: | :---: | :---: |
| GRPO (Vanilla) | 50.8 | 67.4 | 69.8 |
| ToR-GRPO | 53.0 | 68.9 | 72.4 |

Critical Insight: The Push-Pull Dynamic

A fascinating finding in the appendix is the Push-Pull relationship between reasoning uncertainty and perception strength. In harder problems, the model needs more reasoning supervision, while easier problems benefit more from refined perception. ToR balances this trade-off automatically during the RL process.

Conclusion

ToR proves that we don't necessarily need more data or bigger models to solve multimodal reasoning; we need smarter gradients. By recognizing that perception and reasoning are two sides of the same coin, ToR provides a compute-efficient, plug-and-play path to the next generation of "Reasoning-level" MLLMs.

Limitations: Currently, token identification relies on basic entropy and log-prob differences. Future work could integrate more fine-grained spatial grounding (like SAM) to pinpoint perception tokens even more accurately.
