[ArXiv 2025] PAPO: Stabilizing Reasoning RL via Decoupled Advantage Normalization
Abstract

The paper introduces Process-Aware Policy Optimization (PAPO), a reinforcement learning method that integrates rubric-based process rewards into Group Relative Policy Optimization (GRPO) via decoupled advantage normalization. It achieves state-of-the-art reasoning performance, notably reaching 51.3% on OlympiadBench, significantly outperforming standard outcome-based GRPO.

TL;DR

Researchers have introduced Process-Aware Policy Optimization (PAPO), a breakthrough in training Large Language Models (LLMs) for complex reasoning. By separating "did you get it right?" (Outcome) from "how did you solve it?" (Process) at the mathematical advantage level, PAPO prevents the common "reward hacking" (verbosity and filler content) that plagues process-based RL while solving the "signal exhaustion" problem where models stop learning once they become "too good" at simple tasks.

Problem & Motivation: The PRM vs. ORM Dilemma

In the quest to improve LLM reasoning, researchers typically use two types of rewards:

  1. Outcome Reward Models (ORM): Binary "Correct/Incorrect" signals. Problem: They treat a lucky guess the same as a rigorous proof. As models improve, most responses in a group become correct, leading to zero gradients and a performance plateau.
  2. Process Reward Models (PRM): Rubric-based quality scores. Problem: Models quickly learn to "hack" these by being extremely verbose or inserting high-scoring "filler" templates that have nothing to do with the question.

Previous attempts to multiply these rewards (ORM × PRM) failed to exceed the ORM baseline because the strong binary signal washed out the subtle process signal during normalization.
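Both failure modes follow directly from GRPO's group-relative normalization. The following minimal sketch (not the paper's code) shows why: when every rollout in a group gets the same reward the advantage collapses to zero, and when a binary outcome is multiplied into a process score the correct/incorrect split dominates the group statistics.

```python
import statistics

def grpo_advantages(rewards):
    """GRPO-style group-relative advantage: z-score within one rollout group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:  # every reward identical -> no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Signal exhaustion: once every rollout in the group is correct,
# the binary outcome reward yields an all-zero advantage.
saturated = grpo_advantages([1, 1, 1, 1])  # -> [0.0, 0.0, 0.0, 0.0]

# Naive ORM x PRM mixing: the correct/incorrect split dominates the
# group statistics, compressing the process distinction (0.9 vs 0.6)
# between the two correct answers.
mixed = grpo_advantages([1 * 0.9, 1 * 0.6, 0 * 0.8, 0 * 0.7])
```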

Methodology: The Power of Decoupling

The core innovation of PAPO is Decoupled Advantage Normalization. Instead of one reward, it computes two separate "advantages":

  1. Outcome Advantage ($A_{out}$): Normalized across all samples. This tells the model: "Focus on getting the right answer."
  2. Process Advantage ($A_{proc}$): Normalized only among correct responses. This tells the model: "Now that you are right, find the most elegant/rigorous way to explain it."

Figure: Overall Architecture of PAPO

By restricting $A_{proc}$ to the correct subset, the model cannot earn "quality points" for an incorrect answer, effectively immunizing it against the typical PRM reward hacking.
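The two-advantage scheme can be sketched as below. The subset-restricted normalization of $A_{proc}$ follows the description above; the additive combination and the weight `w` are assumptions for illustration, not the paper's exact objective.

```python
import statistics

def zscore(xs):
    """Z-score a list; all-equal lists get zero advantage everywhere."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in xs]

def papo_advantages(outcomes, process_scores, w=0.5):
    """Sketch of decoupled advantage normalization (weight w is hypothetical).

    A_out  : normalized across the whole rollout group.
    A_proc : normalized only among correct responses; incorrect responses
             get zero process advantage, so rubric "quality points" cannot
             be earned on a wrong answer.
    """
    a_out = zscore(outcomes)
    correct = [i for i, o in enumerate(outcomes) if o == 1]
    a_proc = [0.0] * len(outcomes)
    if len(correct) > 1:
        for i, a in zip(correct, zscore([process_scores[i] for i in correct])):
            a_proc[i] = a
    return [ao + w * ap for ao, ap in zip(a_out, a_proc)]
```

With outcomes `[1, 1, 0, 0]` and process scores `[0.9, 0.6, 0.8, 0.7]`, the two incorrect rollouts receive identical (purely outcome-driven) advantages regardless of how well their process scored, while the two correct ones are ranked by process quality.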

Experiments: Breaking the Performance Ceiling

The authors tested PAPO against standard GRPO (from DeepSeek) across multiple benchmarks (OlympiadBench, AIME, MATH-500).

  • Scaling Success: On a Qwen2.5-7B model, PAPO reached 51.3% on OlympiadBench, while the standard ORM approach peaked at 46.3% and then declined.
  • Signal Density: While ORM experienced "signal exhaustion" (where 69% of training samples provided no useful info), PAPO kept the informative signal high, with only a 44% zero-advantage ratio.
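One way a zero-advantage ratio like the 69% vs. 44% figures above could be measured (an assumption about the metric, not the paper's evaluation code) is the fraction of samples sitting in groups where every reward is identical, since those samples contribute no gradient:

```python
def zero_advantage_ratio(groups):
    """Fraction of samples whose group-relative advantage is zero,
    i.e. samples in groups where all rewards are identical."""
    total = sum(len(g) for g in groups)
    dead = sum(len(g) for g in groups if len(set(g)) == 1)
    return dead / total

# Two saturated groups out of three: 8 of 12 samples carry no signal.
ratio = zero_advantage_ratio([[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]])
```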

Figure: Performance Gap Widening During Training

Deep Insight: Visualizing the "Hacking"

The paper provides a fascinating case study on what happens when you don't decouple advantages. Without decoupling, models degenerate into a "memorized high-scoring template." Even if the question is about Number Theory, the model might pivot mid-way to a perfectly formatted Vector Algebra proof that it knows the "LLM-as-Judge" will rate highly. PAPO’s architecture ensures that "getting the right answer" remains the non-negotiable anchor of the gradient.

Figure: Ablation Comparison of Normalization Strategies

Conclusion & Future Outlook

PAPO represents a significant step forward for RLVR (Reinforcement Learning with Verifiable Rewards). It proves that "Process" and "Outcome" aren't just weights to be tuned—they are different types of signals that require different statistical treatment. As we push toward AGI-level reasoning, the ability to maintain a dense, high-quality training signal through decoupled normalization will likely become a standard tool in the RLHF handbook.

Limitations

  • The method currently relies on a fixed judge model (GPT-OSS-20B).
  • It has primarily been validated on the Qwen architecture; testing on Llama or Gemma remains for future work.
