[UCLA 2026] OpenVLThinkerV2: Mastering Multimodal Reasoning via Gaussian G2RPO and Task-Level Shaping
Abstract

The paper introduces OpenVLThinkerV2, a generalist multimodal reasoning model trained using a novel Reinforcement Learning objective called Gaussian GRPO (G2RPO). By mapping diverse task rewards to a standard normal distribution via 1D Optimal Transport, it achieves SOTA performance across 18 benchmarks, including a 71.6% score on MMMU and 79.5% on MathVista.

TL;DR

OpenVLThinkerV2 is a generalist multimodal model that tackles the instability of Reinforcement Learning (RL) in diverse visual domains. By introducing Gaussian GRPO (G2RPO)—a method that uses 1D Optimal Transport to map any reward distribution to a standard normal curve—the researchers at UCLA have created a model that outperforms GPT-5 and Gemini 2.5 Pro on benchmarks like DocVQA and MathVista.

The "Reward Topology" Nightmare

Training a single model to handle both visual grounding (where rewards are continuous IoU scores) and mathematical reasoning (where rewards are sparse and binary) is an optimization nightmare.

Current state-of-the-art RL objectives like GRPO rely on linear normalization (subtracting the group mean and dividing by the group standard deviation). In a multi-task setting, this leads to:

  1. Inter-task Imbalance: High-variance tasks dominate the gradients, while low-variance tasks are ignored.
  2. Sensitivity to Outliers: "Lucky" responses with abnormally high rewards can inflate the variance, suppressing the learning signal for all other samples.
  3. Entropy Pathologies: Models either "overthink" simple perception tasks (entropy explosion) or become "lazy" and over-confident in reasoning (entropy collapse).
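The outlier problem can be seen directly in the GRPO-style z-score normalization itself. Below is a minimal sketch (function names and the toy reward values are our own, not from the paper) showing how a single abnormally high reward inflates the group variance and flattens the advantages of every other sample:

```python
import statistics

def grpo_advantages(rewards):
    """Linear (z-score) normalization used by GRPO-style objectives:
    subtract the group mean, divide by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# A group of binary math rewards: mostly failures, one success.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0, 1.0]))

# One "lucky" outlier inflates the variance, compressing the differences
# among all other samples and muting their learning signal.
print(grpo_advantages([0.1, 0.2, 0.15, 0.12, 10.0]))
```

In the second group, the genuine spread between the 0.1 and 0.2 rewards is squashed to a few hundredths of an advantage unit, because the denominator is dominated by the outlier.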

Methodology: Gaussian G2RPO

The core innovation is G2RPO. Instead of linear scaling, it uses Optimal Transport (OT) to find a mapping function that transforms the empirical reward distribution of any task into a standard normal distribution.

1. Non-linear Distributional Matching

By utilizing the Cumulative Distribution Function (CDF), G2RPO assigns advantages based on the relative rank of a response within its group rather than its raw reward. Mathematically, this:

  • Caps Outliers: Extreme values are mapped to the tails of the Gaussian, preventing gradient spikes.
  • Ensures Equity: Every task, regardless of its original reward scale, produces the same gradient magnitude.
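In one dimension, the optimal-transport map to a standard normal is simply the quantile transform: replace each reward with the normal quantile of its rank. Here is a self-contained sketch of that idea (the mid-rank offset and stable tie-breaking are our illustrative assumptions, not necessarily the paper's exact recipe):

```python
from statistics import NormalDist

def gaussian_advantages(rewards):
    """Rank-based distributional matching: map each reward to the
    standard-normal quantile of its mid-rank within the group."""
    n = len(rewards)
    norm = NormalDist()  # standard normal, mean 0, sd 1
    # Sort indices by reward to obtain each reward's rank in the group.
    order = sorted(range(n), key=lambda i: rewards[i])  # stable: ties by index
    ranks = [0] * n
    for pos, i in enumerate(order):
        ranks[i] = pos
    return [norm.inv_cdf((r + 0.5) / n) for r in ranks]

# The extreme outlier from before now lands at a bounded tail quantile
# (about +1.28 for a group of 5) instead of blowing up the variance.
print(gaussian_advantages([0.1, 0.2, 0.15, 0.12, 10.0]))

# Tasks with wildly different reward scales (IoU vs. binary correctness)
# yield identical advantage sets whenever their within-group rankings match.
print(gaussian_advantages([1, 2, 3]), gaussian_advantages([10, 200, 3000]))
```

Because only ranks matter, every task's group of n responses produces the same fixed set of advantage values, which is exactly the "equity" property: gradient magnitude no longer depends on the task's native reward scale.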

Model Architecture Figure: Comparison of G2RPO against prior methods. It provides intrinsic robustness and uniform variance across tasks.

2. Task-level Shaping

To address the divergent needs of perception and reasoning, the authors introduced two "shaping" mechanisms:

  • Length Shaping: A trapezoidal reward envelope that encourages the model to generate long "Chain-of-Thought" (CoT) for math, but penalizes "overthinking" (hallucinated verbosity) for OCR and grounding tasks.
  • Entropy Shaping: A margin-based penalty that keeps the model within a "Goldilocks zone" of exploration, preventing the incoherent generation (explosion) or premature convergence (collapse).
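The two shaping mechanisms can be sketched as simple piecewise-linear functions. All breakpoints and margins below are hypothetical illustrative values, not the paper's tuned hyperparameters:

```python
def length_shaping(length, lo, peak_lo, peak_hi, hi):
    """Trapezoidal length envelope (sketch): reward rises from `lo` to the
    [peak_lo, peak_hi] plateau, then decays to zero at `hi`."""
    if length <= lo or length >= hi:
        return 0.0
    if length < peak_lo:
        return (length - lo) / (peak_lo - lo)   # rising edge: too terse
    if length <= peak_hi:
        return 1.0                              # plateau: desired CoT range
    return (hi - length) / (hi - peak_hi)       # falling edge: overthinking

def entropy_shaping(entropy, low_margin, high_margin):
    """Margin-based entropy penalty (sketch): zero inside the target band
    [low_margin, high_margin], linear penalty outside it."""
    if entropy < low_margin:                    # collapse: over-confident
        return low_margin - entropy
    if entropy > high_margin:                   # explosion: incoherent
        return entropy - high_margin
    return 0.0

# An 800-token response sits in the plateau for a math envelope but past
# the cutoff for a grounding envelope, so the same length is rewarded or
# penalized depending on the task.
print(length_shaping(800, lo=64, peak_lo=512, peak_hi=2048, hi=4096))  # math
print(length_shaping(800, lo=4, peak_lo=16, peak_hi=128, hi=256))      # grounding
print(entropy_shaping(0.2, low_margin=0.5, high_margin=2.0))           # collapsing
```

Making the envelope task-dependent is what lets a single policy produce long chains of thought on math while staying terse on OCR and grounding.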

Response Length Dynamics Figure: Length dynamics during training. G2RPO successfully elicits reasoning chains while enforcing concise visual grounding.

Experiments and Results: Slaying the Giants

OpenVLThinkerV2 was evaluated on 18 benchmarks. Despite being an 8B-parameter model, it consistently outperformed proprietary frontier models.

  • Math & Science: Surpassed GPT-4o on MMMU (71.6%) and MathVista (79.5%).
  • Document Understanding: Achieved 911 on OCRBench, outperforming GPT-5 and Gemini 2.5 Pro.
  • Visual Grounding: Reached 93.4% on RefCOCO, proving that it hasn't lost its "low-level" perception skills while learning "high-level" reasoning.

Performance Table Figure: Evaluation across General, Math, and Chart VQA. OpenVLThinkerV2 leads the open-source pack.

Critical Insight: Why G2RPO Matters

The shift from linear normalization to distributional matching via Optimal Transport is a profound change in how we think about RL for LLMs. It acknowledges that "reward" is not an absolute physical value but a relative signal. By forcing the advantage into a Gaussian shape, the researchers have created a universal "exchange rate" for gradients across different task domains.

Conclusion & Future Work

OpenVLThinkerV2 proves that intelligent RL objectives can compensate for smaller model sizes. The G2RPO framework is naturally suited for other heterogeneous domains like SWE-bench (coding) or GUI navigation. The next frontier will likely involve automated hyperparameter search for the length and entropy envelopes to remove the need for empirical "shaping" entirely.

Takeaway: If you are training multi-task RL models, stop standardizing your rewards—start mapping them to a Gaussian.
