The paper introduces an unsupervised self-evolution framework for Multimodal Large Language Models (MLLMs) to enhance reasoning capabilities without human-annotated data. By combining Actor self-consistency with bounded Judge-based modulation and Group Relative Policy Optimization (GRPO), the method achieves SOTA results on mathematical reasoning benchmarks, including a +5.9% accuracy gain on MathVision.
TL;DR
Researchers from OPPO and Tsinghua University have unveiled an unsupervised framework that allows Multimodal Large Language Models (MLLMs) to self-evolve. By shifting away from "Majority Voting" toward a Distributional Actor-Judge paradigm, they achieved significant gains (+5.9% on MathVision) without a single human label, rivaling performance typically reserved for supervised distillation.
The Problem: The "Echo Chamber" of Majority Voting
In the quest for self-improving AI, "Self-Consistency" or "Majority Voting" (MV) has been the go-to unsupervised signal. The logic is simple: if the model generates the same answer five times out of eight, that answer is likely correct.
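As a concrete illustration of that signal, here is a minimal sketch of a majority-voting reward over a group of sampled answers. The function name and the eight-sample example are illustrative, not taken from the paper.

```python
from collections import Counter

def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Binary self-consistency reward: 1.0 for answers matching the most
    frequent (majority) answer in the group, 0.0 for everything else."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Example: 5 of 8 sampled trajectories agree on "42", so those get reward 1.
sampled = ["42", "42", "7", "42", "13", "42", "42", "7"]
print(majority_vote_rewards(sampled))  # [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```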
However, this paper identifies two fatal flaws in MV:
- Systematic Bias: If a model is biased toward a specific type of error, MV reinforces that error, creating a "dominant mode" that kills exploration.
- Entropy Collapse: Binarized rewards (1 for the majority, 0 for others) push the model toward a near-deterministic, low-entropy state, often causing "response-length collapse" where the model stops "thinking" and just outputs the answer.
Methodology: Joint Actor-Judge Modulation
The proposed framework replaces the binary "majority" reward with a calibrated, continuous signal.
1. Two Roles, One Model
The system instantiates two roles from a single model, as sketched in code after the list:
- The Actor: Samples multiple reasoning trajectories for a given image-question pair.
- The Judge: A frozen copy of the model that evaluates the Actor's trajectories based on reasoning quality, answer correctness, and visual grounding.
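A rough sketch of how the two roles might be instantiated from one checkpoint, assuming a PyTorch-style model object; the paper does not provide this code, and the actual Judge scoring would be a prompted generation-and-parsing step rather than anything shown here.

```python
import copy
import torch.nn as nn

def make_actor_and_judge(model: nn.Module):
    """Instantiate both roles from a single checkpoint: the Actor stays
    trainable, while the Judge is a frozen copy used only for evaluation."""
    actor = model                        # trainable policy; samples reasoning trajectories
    judge = copy.deepcopy(model).eval()  # frozen evaluator of those trajectories
    for p in judge.parameters():
        p.requires_grad_(False)          # the Judge only scores; it never receives gradients
    return actor, judge

# Tiny stand-in module just to show the call; a real MLLM would be loaded here.
actor, judge = make_actor_and_judge(nn.Linear(4, 4))
```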
2. Bounded Modulation
Instead of trusting the Judge blindly (which can lead to instability), the authors use the Judge's score to modulate the Actor's self-consistency signal. They use a specific calibration function $g(s)$: $$g(s) = 1 + \lambda_{+} \, \sigma \Big( \frac{s - t_{h}}{\tau_{h}} \Big) - \lambda_{-} \, \sigma \Big( \frac{t_{l} - s}{\tau_{l}} \Big)$$ This ensures that the Judge acts as a "corrective guide" rather than an absolute authority, preventing Judge-noise from derailing the unsupervised loop.
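In code, $g(s)$ is just two shifted sigmoids around a baseline of 1. The threshold, temperature, and weight values below are placeholders, not the paper's settings.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def calibrate(s: float,
              t_h: float = 0.8, tau_h: float = 0.1, lam_pos: float = 0.5,
              t_l: float = 0.3, tau_l: float = 0.1, lam_neg: float = 0.5) -> float:
    """g(s) = 1 + lam_pos * sigma((s - t_h)/tau_h) - lam_neg * sigma((t_l - s)/tau_l).
    Judge scores well above t_h boost the reward (at most 1 + lam_pos),
    scores well below t_l shrink it (at least 1 - lam_neg), and mid-range
    scores leave the self-consistency signal roughly unchanged."""
    return 1.0 + lam_pos * sigmoid((s - t_h) / tau_h) - lam_neg * sigmoid((t_l - s) / tau_l)

print(calibrate(0.95), calibrate(0.5), calibrate(0.05))  # high, neutral, low Judge scores
```

Because $g(s)$ is bounded between $1 - \lambda_{-}$ and $1 + \lambda_{+}$, the Judge can only nudge the signal up or down, which is exactly the "corrective guide, not absolute authority" behavior described above.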

3. Group Distributional Shaping
Rather than optimizing for absolute scores, the framework uses Group Relative Policy Optimization (GRPO). The scores within a group (trajectories for the same question) are converted into relative advantages via a log-sum-exp baseline. This effectively turns the learning objective into a distribution-matching task, where the model learns to shift probability mass from poor trajectories to better ones, rather than collapsing into a single mode.
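The paper's exact advantage formula is not reproduced here; the sketch below assumes a log-mean-exp variant of the baseline, with the group scores invented for illustration.

```python
import torch

def group_relative_advantages(scores: torch.Tensor) -> torch.Tensor:
    """One plausible reading of the log-sum-exp baseline: each trajectory's
    advantage is its score minus a soft-max-style baseline over the group, so
    probability mass is pushed from weaker trajectories toward stronger ones
    without collapsing onto a single winner."""
    # log-mean-exp keeps the baseline on the same scale as the individual scores
    baseline = torch.logsumexp(scores, dim=-1) - torch.log(torch.tensor(float(scores.shape[-1])))
    return scores - baseline

# Calibrated scores for 4 trajectories answering the same question.
scores = torch.tensor([1.4, 1.1, 0.9, 0.6])
print(group_relative_advantages(scores))
```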
Experiments: Superior Gains and Stability
The framework was tested on Qwen2.5-VL-7B across five major mathematical benchmarks.
- Performance: On MathVision, the method reached 30.9%, significantly higher than the base model (25.0%) and MV-based self-training (27.5%).
- Stability: Unlike MV, which saw actor entropy plummet, the Actor-Judge framework maintained healthy entropy, meaning the model continued to explore alternative reasoning paths throughout 20 epochs of training.

Generalization Beyond Math
Interestingly, training on geometry and math datasets generalized to general vision tasks. On ChartQA, the model improved from 85.7% to 87.2%, suggesting that "learning to reason" on math problems strengthens the underlying visual-semantic grounding of the model.
Critical Insight: Why it Works
The "Secret Sauce" is the combination of Self-Consistency (prior) and Judge-Modulation (posterior). In the early stages, Self-Consistency provides a stable anchor. As the policy evolves, the Judge's nuanced scores allow the model to distinguish between "popular but wrong" and "consistent and right" paths.
Modeling the reward as a group-wise distribution also makes the gradient updates smoother, which prevents the model from "overfitting" to its own early guesses, the primary failure mode of most unsupervised RL.
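Putting the pieces together, one plausible reading of the prior-posterior combination is that the Judge's calibrated score multiplicatively modulates the self-consistency prior, here taken as each answer's agreement fraction within the group. The hyperparameters are the same placeholders as before, not the paper's values.

```python
import math
from collections import Counter

def g(s, t_h=0.8, tau_h=0.1, lam_pos=0.5, t_l=0.3, tau_l=0.1, lam_neg=0.5):
    """Bounded calibration from the earlier sketch (placeholder hyperparameters)."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return 1.0 + lam_pos * sig((s - t_h) / tau_h) - lam_neg * sig((t_l - s) / tau_l)

def modulated_rewards(answers, judge_scores):
    """Self-consistency prior (agreement fraction in the group) scaled by the
    Judge's calibrated score g(s) as a posterior correction."""
    counts, n = Counter(answers), len(answers)
    return [(counts[a] / n) * g(s) for a, s in zip(answers, judge_scores)]

# "7" is popular but the Judge scores it poorly; "42" is rarer but well grounded.
# Compared with binary majority voting (1.0 vs 0.0), the reward gap narrows sharply.
answers = ["7", "7", "7", "42"]
judge_scores = [0.2, 0.2, 0.2, 0.9]
print([round(r, 2) for r in modulated_rewards(answers, judge_scores)])
```

GRPO then converts these modulated scores into group-relative advantages, so "popular but wrong" trajectories no longer monopolize the learning signal.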
Limitations & Future Work
The authors acknowledge a "Judge bottleneck." If the Judge and Actor both share the same fundamental misunderstanding of an image (incorrect consensus), the framework can still fail. Future research will focus on how to evolve the Judge itself or introduce external verification (like execution-based rewards) to break parity when the model is "confidently wrong."
Conclusion
This work provides a scalable blueprint for the next generation of multimodal models. As high-quality data becomes a scarce resource, the ability of models to judge themselves and "lift themselves up by their own bootstraps" through distributional reward modeling will be the key to reaching DeepSeek-R1 levels of reasoning in the multimodal domain.
