MoE-GRPO is a reinforcement learning (RL) framework designed to optimize expert selection in Mixture-of-Experts (MoE) Vision-Language Models (VLMs). By formulating routing as a sequential decision-making problem and utilizing Group Relative Policy Optimization (GRPO), it achieves SOTA results on image/video benchmarks, outperforming standard top-K routing by over 2.0% on average.
TL;DR
MoE-GRPO shifts the paradigm of Mixture-of-Experts (MoE) from fixed routing to policy-driven selection. By applying Group Relative Policy Optimization (GRPO) to the expert selection process, the model learns to "search" for the best combination of parameters for every token across layers. This leads to higher diversity, less overfitting, and superior performance across image and video benchmarks—all while maintaining the sparse inference efficiency of MoE.
Background: The Greedy Trap of Top-K Routing
In the world of sparse Transformers, Mixture-of-Experts (MoE) is the gold standard for "scaling without the cost." However, almost all SOTA models (like DeepSeek or InternVL) use a deterministic top-K router. This router makes a greedy local decision: "For this token, pick the experts with the highest scores."
From an RL perspective, this is a classic exploitation-without-exploration problem. The model quickly settles on a few "heavy-lifter" experts and ignores the rest, leading to expert overfitting. MoE-GRPO argues that selecting an expert is an action that contributes to the final reward (the correctness of the answer), and thus should be optimized using Reinforcement Learning.
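For concreteness, here is a minimal sketch of the deterministic top-K gate described above (PyTorch-style; the expert count, tensor shapes, and function name are illustrative, not the paper's configuration):

```python
import torch
import torch.nn.functional as F

def topk_route(router_logits: torch.Tensor, k: int = 2):
    """Greedy top-K routing: for each token, pick the k highest-scoring experts.

    router_logits: [num_tokens, num_experts] raw gate scores.
    Returns expert indices [num_tokens, k] and renormalized gate weights [num_tokens, k].
    """
    scores = F.softmax(router_logits, dim=-1)
    gate_weights, expert_idx = scores.topk(k, dim=-1)               # greedy local decision
    gate_weights = gate_weights / gate_weights.sum(-1, keepdim=True)
    return expert_idx, gate_weights

# Illustrative usage: 4 tokens routed over 8 experts, 2 experts active per token.
expert_idx, gate_weights = topk_route(torch.randn(4, 8), k=2)
```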
Methodology: High-Dimensional Routing as Sequential Decision Making
The core innovation lies in expanding the action space of RL. While standard RLHF/GRPO only cares about the output tokens, MoE-GRPO cares about the pathway taken through the model.
1. Dual-Objective Optimization
The model optimizes two loss functions simultaneously (see the sketch after this list):
- Token-GRPO: Standard RL that ensures the generated text is correct.
- Gate-GRPO: A dense supervision signal that rewards or punishes specific routing decisions at every layer based on the final output's success.
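A highly simplified sketch of how these two objectives could be combined into a single policy-gradient loss. The tensor shapes, the broadcast of the outcome advantage to every routing decision, and the `gate_coef` weighting are assumptions made for illustration, not the paper's exact formulation:

```python
import torch

def group_relative_advantage(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize rewards within a group of rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def moe_grpo_loss(token_logprobs, gate_logprobs, rewards, gate_coef=0.5):
    """Combine a token-level and a gate-level policy-gradient objective.

    token_logprobs: [G, T]       log-probs of generated tokens, G rollouts of length T
    gate_logprobs:  [G, L, T, K] log-probs of the K experts chosen at each of L layers
    rewards:        [G]          scalar outcome reward per rollout
    """
    adv = group_relative_advantage(rewards)                          # [G]
    token_loss = -(adv[:, None] * token_logprobs).mean()             # reward correct text
    gate_loss = -(adv[:, None, None, None] * gate_logprobs).mean()   # credit every routing decision
    return token_loss + gate_coef * gate_loss
```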
2. Modality-Aware Router Guidance
To prevent the RL agent from getting lost in the astronomical number of possible routing combinations, the authors introduce a "Modality-Aware" constraint. By calculating vision-awareness and text-awareness scores for each expert, the model deactivates the bottom 25% of "irrelevant" experts for a given token type (e.g., a text-specialized expert is never wastefully explored for a raw image patch).
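A rough sketch of that constraint, assuming the awareness scores are available as a per-expert vector; the 25% quantile, the `-inf` masking, and the function name are illustrative rather than the paper's implementation:

```python
import torch

def modality_aware_mask(router_logits: torch.Tensor, awareness: torch.Tensor,
                        bottom_frac: float = 0.25) -> torch.Tensor:
    """Deactivate the least modality-relevant experts before routing/exploration.

    router_logits: [num_tokens, num_experts] raw gate scores for tokens of one modality.
    awareness:     [num_experts] vision- (or text-) awareness score per expert.
    """
    k_drop = int(awareness.numel() * bottom_frac)
    drop_idx = awareness.topk(k_drop, largest=False).indices    # bottom 25% by awareness
    masked = router_logits.clone()
    masked[:, drop_idx] = float("-inf")                         # these experts are never explored
    return masked

# Hypothetical usage: vision tokens over 8 experts with made-up awareness scores.
masked_logits = modality_aware_mask(torch.randn(4, 8), torch.rand(8))
```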

Results: Breaking Through the Ceiling
The experiments on InternVL3.5-1B (converted to an 8-expert MoE) show a clean sweep across benchmarks.
- Better Generalization: On MVBench (Video), MoE-GRPO outpaced deterministic fine-tuning by a significant margin.
- Expert Diversity: Visualizations of routing entropy show that MoE-GRPO spreads load across a much wider set of experts than the "winner-takes-all" behavior of standard models; a simple usage-entropy measure is sketched below.
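One simple way to quantify this diversity effect, assuming access to the expert indices selected during inference; this is a generic usage-entropy measure, not necessarily the exact metric used in the paper:

```python
import torch

def routing_entropy(expert_idx: torch.Tensor, num_experts: int) -> float:
    """Entropy of the empirical expert-usage distribution (higher = more diverse routing)."""
    counts = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    p = (counts / counts.sum()).clamp_min(1e-12)
    return float(-(p * p.log()).sum())

# Hypothetical usage: expert ids chosen for a batch of tokens, 8 experts total.
print(routing_entropy(torch.randint(0, 8, (1024,)), num_experts=8))
```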

Task-Level Specialization
One of the most profound findings is that MoE-GRPO induces task-level specialization. By analyzing Jensen-Shannon Divergence (JSD), the authors found that the model learns to route "action recognition" tokens through different experts than "object counting" tokens, even if the low-level visual features are similar.
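For intuition, a minimal JSD computation between two tasks' expert-usage distributions; the histograms below are made-up placeholders, not the paper's measurements:

```python
import torch

def jensen_shannon_divergence(p: torch.Tensor, q: torch.Tensor) -> float:
    """JSD between two expert-usage distributions (each non-negative, summing to 1)."""
    def kl(a, b):
        return (a * (a.clamp_min(1e-12) / b.clamp_min(1e-12)).log()).sum()
    m = 0.5 * (p + q)
    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Hypothetical expert-usage histograms (8 experts) for two tasks; values are illustrative.
p_action_recognition = torch.tensor([.30, .25, .15, .10, .08, .05, .04, .03])
p_object_counting    = torch.tensor([.05, .05, .10, .30, .25, .10, .10, .05])
print(jensen_shannon_divergence(p_action_recognition, p_object_counting))  # larger = more specialization
```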

Critical Insight & Future Outlook
Why does this work? Most MoE models suffer from "representation collapse" where experts become redundant. MoE-GRPO forces the model to explore "sub-optimal" paths during training. When one of these paths leads to a correct answer that the "standard" path missed, the model receives a massive advantage signal, effectively "unlocking" specialized knowledge that greedy routing would have never found.
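To make that advantage signal concrete, here is a hypothetical group of eight rollouts in which only one exploratory routing path is rewarded (numbers are illustrative, using the same group-relative normalization sketched earlier):

```python
import torch

# Hypothetical: 8 rollouts of the same prompt, only one exploratory path answers correctly.
rewards = torch.tensor([0., 0., 0., 0., 0., 0., 0., 1.])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(advantages)  # the lone correct rollout gets ~+2.5; the seven others get ~-0.35
```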
Limitations: Training with 8 rollouts per sample significantly increases training compute, even though inference cost is unchanged. Future work should focus on making this "Routing-RL" more sample-efficient.
Takeaway: We are moving toward a future where every part of the model's architecture—not just its output—is a learnable policy optimized for the final task reward.
