[CVPR 2026] DCPO: Solving the Over-Confidence Crisis in Reasoning LLMs

The paper introduces DCPO (Decoupled Calibration Policy Optimization), a framework designed to fix the "over-confidence" issue in Large Language Models trained via Reinforcement Learning from Verifiable Rewards (RLVR). It achieves SOTA calibration performance (e.g., 71.6% ECE reduction) while maintaining the reasoning accuracy of competitive baselines like GRPO.

TL;DR

Reinforcement Learning from Verifiable Rewards (RLVR) has been the secret sauce for the "o1-style" reasoning revolution. However, it comes with a hidden cost: Calibration Degeneration. These models don't just get smarter; they get more arrogant, often assigning 99% probability to hallucinated or incorrect math answers. DCPO (Decoupled Calibration Policy Optimization) provides a mathematical and architectural fix by decoupling the "learning to think" process from the "learning to self-assess" process, effectively killing the accuracy-calibration tradeoff.

The Arrogance of Reinforcement Learning

Why do RL-trained models lie with such confidence? The authors provide a sobering theoretical insight: Trajectory-level RL induces mode collapse. In the quest to maximize expected reward, the model's policy is mathematically incentivized to shift all probability mass toward a single "correct" path. This makes the probability distribution overly "sharp."
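A compressed version of that argument, in generic notation (mine, not necessarily the paper's): with a binary verifiable reward, the trajectory-level objective is linear in the trajectory probabilities, so its maximizers are degenerate, near-one-hot policies.

```latex
% Trajectory-level RLVR objective for a prompt x with verifiable reward r(y) in {0, 1}.
% J is linear in the probabilities pi_theta(y|x), so (whenever a reward-1 trajectory
% exists) it is maximized by pushing all mass onto reward-1 trajectories; the policy
% becomes near-deterministic and its confidence saturates at 1, regardless of how
% often those answers are actually correct on held-out problems.
\[
  J(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r(y) \,\big]
            \;=\; \sum_{y} \pi_\theta(y \mid x)\, r(y),
  \qquad
  \arg\max_{\pi}\, J \;\subseteq\; \big\{ \pi : \operatorname{supp}(\pi) \subseteq \{\, y : r(y) = 1 \,\} \big\}.
\]
```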

Even worse, the researchers discovered a Gradient Conflict. As shown in the visual below, the gradient vector for maximizing accuracy and the one for minimizing calibration error point in opposite directions in the Fisher-information space.

Figure 1: Illustration of gradient conflict between policy accuracy maximization and calibration error minimization
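One way to formalize that picture (my notation; the paper's precise metric and statement may differ): let $J_{\text{acc}}$ be the expected-reward objective, $\mathcal{L}_{\text{cal}}$ a proper-scoring calibration loss, and measure angles with the Fisher information matrix $F(\theta)$.

```latex
% Claimed conflict: the accuracy-ascent direction and the calibration-descent
% direction have a negative inner product under the Fisher metric, so a
% first-order step that raises accuracy also raises calibration error.
\[
  \big\langle \nabla_\theta J_{\text{acc}},\; -\nabla_\theta \mathcal{L}_{\text{cal}} \big\rangle_{F(\theta)}
  \;=\; \nabla_\theta J_{\text{acc}}^{\top}\, F(\theta)\, \big( -\nabla_\theta \mathcal{L}_{\text{cal}} \big)
  \;<\; 0,
  \qquad
  F(\theta) \;=\; \mathbb{E}\big[ \nabla_\theta \log \pi_\theta \, \nabla_\theta \log \pi_\theta^{\top} \big].
\]
```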

When you try to optimize both simultaneously (the approach taken by previous SOTA like RLCR), the gradients fight each other, resulting in a model that is either stupid but honest, or smart but lying.

Methodology: The Power of Decoupling

To break this deadlock, DCPO introduces three layers of separation:

  1. Block-wise Verbalization: Instead of hidden logits, the model is prompted to explicitly output a confidence score after the reasoning chain (e.g., <conf> 0.85).
  2. Decoupled Advantage Estimation: The reasoning tokens are judged by correctness. The confidence tokens are judged by a Hybrid Reward that combines the individual sample's outcome with the average accuracy of the entire sampling group; this group-level signal acts as a low-variance anchor that stabilizes training.
  3. Masked Gradient Optimization: This is the "Aha!" moment. During backprop, the gradients from the reasoning reward only update the reasoning tokens, and the gradients from the confidence reward only update the confidence tokens. No cross-contamination means no gradient conflict (see the code sketch after Figure 2 below).

Figure 2: The overall framework of DCPO
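To make the decoupling concrete, here is a minimal PyTorch-style sketch. The function names, the Brier-style confidence score, and the 0.5 mixing weight in the hybrid reward are illustrative assumptions of mine, not the paper's exact formulation; GRPO-style clipping, KL penalties, and batching are omitted.

```python
import torch
import torch.nn.functional as F

def decoupled_advantages(correct: float, group_acc: float, conf_pred: float):
    """Illustrative decoupled advantages (not the paper's exact formulas).
    correct:   1.0 if this sample's final answer is verified correct, else 0.0
    group_acc: mean correctness over all samples drawn for the same prompt
    conf_pred: the confidence value the model verbalized for this sample
    """
    # Reasoning tokens: GRPO-style advantage, centered by the group mean.
    adv_reason = correct - group_acc
    # Confidence tokens: hybrid target mixing the instance outcome with group
    # accuracy (0.5 is a placeholder weight), scored Brier-style against conf_pred.
    target = 0.5 * correct + 0.5 * group_acc
    adv_conf = -(conf_pred - target) ** 2
    return adv_reason, adv_conf

def masked_policy_loss(logits, tokens, adv_reason, adv_conf, conf_mask):
    """Masked policy-gradient loss: each reward stream updates only its own tokens.
    logits:    (T, V) policy logits for one sampled sequence
    tokens:    (T,)   sampled token ids (long)
    conf_mask: (T,)   bool, True at verbalized-confidence tokens
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (T,)
    # Reasoning reward touches only reasoning tokens; confidence reward touches
    # only confidence tokens -- no cross terms, hence no gradient mixing here.
    reason_term = adv_reason * (token_logp * (~conf_mask)).sum()
    conf_term = adv_conf * (token_logp * conf_mask).sum()
    return -(reason_term + conf_term)
```

In a full trainer these per-sequence losses would be averaged over the sampled group and combined with the usual clipping and KL regularization terms.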

Experimental Results: Having Your Cake and Eating It Too

The results on benchmarks like AIME and MATH-500 are striking. Previous methods (RLCR, CCGPSG) consistently traded away 3-5% accuracy to improve calibration. DCPO, however, maintains flat or even slightly improved accuracy relative to standard GRPO while slashing the Expected Calibration Error (ECE) by over 70%.

| Method | Avg Accuracy (%) | Avg ECE (Lower is better) |
| :--- | :--- | :--- |
| GRPO (Base RL) | 60.4 | 0.248 |
| RLCR (Coupled) | 56.5 | 0.139 |
| DCPO (Ours) | 60.8 | 0.128 |

In the reliability diagrams below, we can see DCPO's predicted confidence (blue bars) finally aligning with actual accuracy, whereas standard GRPO is heavily skewed towards the 1.0 confidence bin even for incorrect answers.

Figure 8: Distribution of verbalized confidence predictions
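For reference, ECE here is the standard binned gap between stated confidence and empirical accuracy; a minimal NumPy helper (my own utility, not code from the paper) looks like this:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: |accuracy - confidence| per bin, weighted by bin mass."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return ece

# A model that always says 1.0 confidence but is right only 60% of the time: ECE = 0.4
print(expected_calibration_error([1.0] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))
```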

Critical Analysis & Conclusion

DCPO's greatest contribution is proving that the "accuracy-calibration tradeoff" is not a law of nature, but a flaw in our optimization design. By recognizing that Reasoning and Self-Awareness occupy different functional spaces in the gradient landscape, we can train models that are both highly capable and deeply reliable.

Limitations: DCPO currently relies on "Verifiable Rewards" (binary right/wrong), which are easy to get for math and code but harder for subjective writing tasks. Future work will need to explore how to extend this "decoupling" to Reward Models (RMs) where the ground truth is as fuzzy as the model's confidence.

Takeaway for Practitioners: If you are fine-tuning specialized reasoning models (Medical, Legal, Finance), stop relying on accuracy-only rewards. Adding a decoupled confidence objective along the lines of DCPO is essential for safety and user trust.
