This paper introduces Faithful GRPO (FGRPO), a constrained reinforcement learning method designed to enhance visual spatial reasoning in Multimodal Language Models (MLMs). By treating logical consistency and visual grounding as hard constraints rather than soft rewards during Group Relative Policy Optimization, FGRPO achieves state-of-the-art results on 7 spatial benchmarks, notably reducing CoT inconsistency from 24.5% to 1.7%.
TL;DR
Reinforcement Learning (RL) has become the gold standard for boosting the reasoning capabilities of Multimodal Language Models. However, a "hidden tax" often arises: models learn to "cheat" by giving the right answer while hallucinating the reasoning path. Faithful GRPO (FGRPO) fixes this by transforming reasoning quality from a "nice-to-have" reward into a hard constraint using Lagrangian dual ascent. The result? A model that is not only more accurate but 15x more consistent.
Problem & Motivation: The "Right Answer, Wrong Reason" Trap
In the race to climb SOTA leaderboards, researchers have noticed a disturbing trend in RL-trained models: Logical Inconsistency. A model might spend five sentences explaining why an object is a "lamp" only to abruptly output "box" in the final <answer> tag.
The authors identify two fatal flaws in current Multimodal Reasoning Models (MRMs):
- Logical Inconsistency: The CoT trace fails to entail the final answer.
- Visual Ungroundedness: Reasoning steps describe objects or spatial relations that simply do not exist in the image.
Standard GRPO fails here because its within-group normalization can wash out signals if every rollout in a group is equally "unfaithful."
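To see why, here is a minimal sketch of the standard within-group advantage (illustrative code, not the paper's implementation):

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standard GRPO advantage: z-score rewards within one rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# If every rollout in the group is equally unfaithful, the faithfulness
# component of the reward is constant and normalization erases it:
faithfulness = np.array([0.0, 0.0, 0.0, 0.0])   # all rollouts hallucinate
print(group_relative_advantage(faithfulness))   # -> [0. 0. 0. 0.], no gradient signal
```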
Methodology: FGRPO and the Power of Constraints
Instead of simply adding a "faithfulness reward" (which models often trade off for accuracy), FGRPO treats consistency and grounding as prerequisites.
1. Verifiable Rewards for Quality
The framework introduces three specific sensors:
- Consistency Reward ($r_{\text{cons}}$): An LLM judge checks whether the conclusion is entailed by the trace.
- Semantic Grounding ($r_{\text{sem}}$): A VLM judge verifies each sentence of the trace against the image.
- Spatial Grounding ($r_{\text{spa}}$): Uses Hungarian matching and CIoU to verify that predicted bounding boxes actually contain the target objects (sketched below).
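Here is a minimal sketch of how the spatial grounding signal could be computed. Hungarian matching and CIoU come from the paper; the function names, the `(x1, y1, x2, y2)` box format, and the `1 - CIoU` matching cost are my assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ciou(a, b, eps=1e-8):
    """Complete IoU between two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter + eps)
    # Penalty 1: squared center distance, normalized by the enclosing-box diagonal.
    rho2 = ((a[0] + a[2] - b[0] - b[2]) ** 2 + (a[1] + a[3] - b[1] - b[3]) ** 2) / 4
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Penalty 2: aspect-ratio consistency.
    v = (4 / np.pi ** 2) * (np.arctan((b[2] - b[0]) / (b[3] - b[1] + eps))
                            - np.arctan((a[2] - a[0]) / (a[3] - a[1] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

def spatial_grounding_reward(pred_boxes, gt_boxes):
    """Hungarian-match predictions to ground truth, score the mean CIoU."""
    cost = np.array([[1 - ciou(p, g) for g in gt_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(cost)
    return float(np.mean([ciou(pred_boxes[r], gt_boxes[c]) for r, c in zip(rows, cols)]))
```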
2. Lagrangian Dual Ascent
The core innovation is the optimization objective. FGRPO maximizes task accuracy subject to satisfying the three quality constraints, each with its own threshold ($\tau_k$). Using Lagrangian relaxation, the model adaptively adjusts the weight ($\lambda_k$) of each constraint. If the model starts hallucinating, the corresponding $\lambda_k$ increases, forcing the policy to prioritize grounding over mere answer-matching.
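Concretely, the constrained problem is to maximize the expected task reward subject to $\mathbb{E}[r_k] \ge \tau_k$ for each quality signal. The sketch below uses the textbook projected dual-ascent update, $\lambda_k \leftarrow \max(0,\ \lambda_k + \eta(\tau_k - \bar r_k))$; the learning rate and threshold values are illustrative, not the paper's:

```python
def dual_ascent_step(lmbda: float, avg_reward: float, tau: float, lr: float = 0.05) -> float:
    """Update one Lagrange multiplier: grow it while the quality
    constraint is violated, shrink it (down to 0) once satisfied."""
    return max(0.0, lmbda + lr * (tau - avg_reward))

# e.g. grounding reward averages 0.4 against a threshold of 0.8:
lam = 1.0
lam = dual_ascent_step(lam, avg_reward=0.4, tau=0.8)  # -> 1.02, constraint weight rises
```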

3. Decoupled Normalization
To prevent rewards on different scales from dominating one another, FGRPO normalizes the task advantage and the constraint advantages independently. This ensures that even small improvements in grounding provide a meaningful gradient for the model.
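A minimal sketch of this decoupled scheme, assuming a per-group z-score and a $\lambda$-weighted sum; the helper name and the exact mixing rule are my assumptions, since the text specifies only that the streams are normalized independently:

```python
import numpy as np

def decoupled_advantages(task_rewards, constraint_rewards, lambdas, eps=1e-8):
    """Z-score the task reward and each constraint reward within the
    group separately, then mix with the current Lagrange multipliers."""
    def z(r):
        r = np.asarray(r, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    adv = z(task_rewards)
    for lam, r in zip(lambdas, constraint_rewards):
        adv = adv + lam * z(r)  # each stream keeps its own scale before mixing
    return adv
```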
Experiments & Results: Accuracy and Trust Together
The authors tested FGRPO on Qwen2.5-VL (3B and 7B) across seven grueling spatial benchmarks like CVBench and MindCube.
- The Reliability Leap: FGRPO reduced the Inconsistency Rate (IR) from a staggering 24.5% to almost zero (1.7%).
- The Accuracy Bonus: Contrary to the belief that constraints hinder performance, FGRPO actually outperformed standard GRPO in final answer accuracy by ~2%.

On complex datasets like MindCube, where standard models hallucinated 57% of the time, FGRPO virtually eliminated inconsistent reasoning while improving grounding scores by over 20 percentage points.
Deep Insight: Why Does It Work?
The most profound takeaway is the "Emergent Curriculum." As shown in the Lagrange multiplier trajectories, the model doesn't try to solve everything at once. It first learns to follow the format, then focuses on making its reasoning consistent with its answers, and finally fine-tunes its visual grounding. By decoupling these signals, FGRPO provides a stable path for the model to become "truthful."
Conclusion & Future Work
Faithful GRPO proves that we don't have to sacrifice trust for performance. By enforcing "verifiable faithfulness," we can create models that are reliable enough for high-stakes visual tasks like navigation or medical imaging.
Limitations: The reliance on an LLM/VLM judge for training rewards adds computational overhead during the RL loop. Future work may focus on distilling these "judges" into more efficient, specialized reward models.
