[CVPR 2026] MoE-GRPO: Teaching Vision-Language Models the Art of Expert Selection via Reinforcement Learning
Abstract

MoE-GRPO is a reinforcement learning (RL) framework designed to optimize expert selection in Mixture-of-Experts (MoE) Vision-Language Models (VLMs). By formulating routing as a sequential decision-making problem and applying Group Relative Policy Optimization (GRPO), it achieves state-of-the-art (SOTA) results on image and video benchmarks, outperforming standard top-K routing by over 2.0% on average.

TL;DR

MoE-GRPO shifts the paradigm of Mixture-of-Experts (MoE) from fixed routing to policy-driven selection. By applying Group Relative Policy Optimization (GRPO) to the expert selection process, the model learns to "search" for the best combination of parameters for every token across layers. This leads to higher diversity, less overfitting, and superior performance across image and video benchmarks—all while maintaining the sparse inference efficiency of MoE.

Background: The Greedy Trap of Top-K Routing

In the world of sparse Transformers, Mixture-of-Experts (MoE) is the gold standard for "scaling without the cost." However, almost all SOTA models (like DeepSeek or InternVL) use a deterministic top-K router. This router makes a greedy local decision: "For this token, pick the experts with the highest scores."
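To make the greedy local decision concrete, here is a minimal sketch of deterministic top-K routing for a single token. The function name and the renormalization step are illustrative assumptions, not the paper's code:

```python
import math

def top_k_route(scores, k=2):
    """Greedy top-K routing: keep the k highest-scoring experts for one token.

    scores: raw router logits, one per expert (illustrative, not the paper's code).
    Returns the chosen expert indices and their renormalized gate weights.
    """
    gates = [math.exp(s) for s in scores]
    total = sum(gates)
    gates = [g / total for g in gates]                   # softmax over experts
    ranked = sorted(range(len(gates)), key=lambda i: -gates[i])
    chosen = ranked[:k]                                  # deterministic: no exploration
    z = sum(gates[i] for i in chosen)
    return chosen, [gates[i] / z for i in chosen]        # renormalized top-K gates
```

Because the selection is a pure argmax over scores, the same few experts keep winning once their scores pull ahead, which is exactly the exploitation-without-exploration behavior criticized below.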

From an RL perspective, this is a classic exploitation-without-exploration problem. The model quickly settles on a few "heavy-lifter" experts and ignores the rest, leading to expert overfitting. MoE-GRPO argues that selecting an expert is an action that contributes to the final reward (the correctness of the answer), and thus should be optimized using Reinforcement Learning.

Methodology: High-Dimensional Routing as Sequential Decision Making

The core innovation lies in expanding the action space of RL. While standard RLHF/GRPO only cares about the output tokens, MoE-GRPO cares about the pathway taken through the model.

1. Dual-Objective Optimization

The model optimizes two loss functions simultaneously:

  • Token-GRPO: Standard RL that ensures the generated text is correct.
  • Gate-GRPO: A dense supervision signal that rewards or punishes specific routing decisions at every layer based on the final output's success.
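The glue between the two objectives is GRPO's group-relative advantage: each rollout's reward is normalized against its group, and the resulting scalar can weight both the token log-probabilities (Token-GRPO) and the per-layer gate log-probabilities (Gate-GRPO). A minimal sketch of that normalization, assuming the standard mean/std form (the paper's exact normalization may differ):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward against its group.

    rewards: final-answer rewards for one group of rollouts of the same prompt.
    The same advantage value is broadcast to every token *and* every routing
    decision made during that rollout. Illustrative sketch only.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that answered correctly get a positive advantage, so every routing decision along their pathway is reinforced; incorrect rollouts push their pathways down.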

2. Modality-Aware Router Guidance

To prevent the RL agent from getting lost in the astronomical number of possible routing combinations, the authors introduce a "Modality-Aware" constraint. By calculating vision-awareness and text-awareness scores for each expert, the model deactivates the bottom 25% of "irrelevant" experts for a given token type (e.g., preventing a text expert from being wastefully explored for a raw image patch).
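The pruning step can be sketched as a simple mask over experts, ranked by the relevant awareness score for the current token's modality. Function and variable names are assumptions for illustration:

```python
def modality_mask(awareness_scores, drop_frac=0.25):
    """Deactivate the lowest-scoring fraction of experts for a token type.

    awareness_scores: one score per expert for the current modality
    (vision-awareness for image patches, text-awareness for text tokens).
    Returns a boolean keep-mask; dropped experts are excluded from RL
    exploration. Illustrative sketch, not the paper's implementation.
    """
    n_drop = int(len(awareness_scores) * drop_frac)
    ranked = sorted(range(len(awareness_scores)), key=lambda i: awareness_scores[i])
    dropped = set(ranked[:n_drop])                 # bottom drop_frac of experts
    return [i not in dropped for i in range(len(awareness_scores))]
```

Shrinking the action space this way keeps the exploration budget focused on experts that plausibly matter for the token's modality.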

[Figure: Overall architecture]

Results: Breaking Through the Ceiling

The experiments on InternVL3.5-1B (converted to an 8-expert MoE) show a clean sweep across benchmarks.

  • Better Generalization: On MVBench (Video), MoE-GRPO outpaced deterministic fine-tuning by a significant margin.
  • Expert Diversity: Visualization of routing entropy shows that MoE-GRPO uses a much wider variety of experts compared to the "winner-takes-all" behavior of standard models.

[Figure: Experimental results]

Task-Level Specialization

One of the most profound findings is that MoE-GRPO induces task-level specialization. By analyzing Jensen-Shannon Divergence (JSD), the authors found that the model learns to route "action recognition" tokens through different experts than "object counting" tokens, even if the low-level visual features are similar.
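The JSD comparison above can be reproduced with a few lines: treat each task's aggregate expert-usage frequencies as a probability distribution and measure their divergence. This is the standard JSD definition, not code from the paper:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two expert-usage distributions,
    e.g. routing frequencies for 'action recognition' vs 'object counting'
    tokens. High JSD means the two tasks rely on different experts."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical usage patterns give JSD = 0; fully disjoint expert sets give the maximum value of ln 2 (in nats).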

[Figure: Routing specialization]

Critical Insight & Future Outlook

Why does this work? Most MoE models suffer from "representation collapse" where experts become redundant. MoE-GRPO forces the model to explore "sub-optimal" paths during training. When one of these paths leads to a correct answer that the "standard" path missed, the model receives a massive advantage signal, effectively "unlocking" specialized knowledge that greedy routing would have never found.
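The exploration of "sub-optimal" paths can be sketched as sampling experts from the gate distribution during training instead of always taking the deterministic top-K. The sampler below is an assumption about the mechanics, not the paper's exact procedure:

```python
import random

def sample_route(gates, k=2, seed=None):
    """Exploratory routing: draw k distinct experts in proportion to their
    gate probabilities rather than greedily taking the top-K. If a sampled
    path yields a correct answer the greedy path missed, GRPO assigns it a
    positive advantage and reinforces those routing choices. Sketch only."""
    rng = random.Random(seed)
    idx = list(range(len(gates)))
    w = list(gates)
    chosen = []
    for _ in range(k):
        pick = rng.choices(idx, weights=w, k=1)[0]   # weighted draw
        chosen.append(pick)
        j = idx.index(pick)
        idx.pop(j)                                   # sample without replacement
        w.pop(j)
    return chosen
```

At inference, the policy can fall back to deterministic top-K over the learned gates, preserving sparse-MoE efficiency.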

Limitations: Training with 8 rollouts per sample significantly increases training compute, even though inference cost is unchanged. Future work should focus on making this "Routing-RL" more sample-efficient.

Takeaway: We are moving toward a future where every part of the model's architecture—not just its output—is a learnable policy optimized for the final task reward.
