[CVPR 2026] P2O: Bridging the Exploration Gap in LLM Reasoning via Joint Policy and Prompt Optimization
Abstract

The paper introduces P2O (Joint Policy and Prompt Optimization), a framework that integrates Genetic-Pareto (GEPA) prompt evolution with Reinforcement Learning (RLVR/GRPO). By dynamically optimizing prompts for "hard samples" where standard RL fails, P2O achieves state-of-the-art results in mathematical reasoning, specifically improving AIME accuracy by up to 12.3%.

TL;DR

Reinforcement Learning (RL) often fails on "hard samples" where the model cannot find a single correct answer to start learning. P2O (Joint Policy and Prompt Optimization) solves this by evolving custom prompts that "help" the model solve these hard problems during training. Crucially, it then uses Context Distillation to teach the model how to solve them without those prompts, leading to massive gains on benchmarks like AIME (+12.3%).

Analysis of the Exploration Bottleneck: The "Zero-Advantage" Trap

In Reinforcement Learning for reasoning (RLVR), the model learns by trying different paths and getting rewarded for correct answers. However, for complex math problems (Hard Samples), the model's success rate is often exactly 0%.

Mathematically, when successes are zero, the advantage estimate $\hat{A}$ vanishes, and the gradient becomes: $$\nabla_{\theta} J(x) \approx \mathbb{E}\left[(r - b)\,\nabla_{\theta} \log \pi_{\theta}\right] \approx 0$$ This leaves the model "starved" of signals for difficult tasks, causing it to overfit on simple problems and stay trapped in a local optimum.
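A minimal numeric sketch makes the trap concrete. The helper below computes a GRPO-style group-relative advantage (reward minus the group mean, scaled by the group standard deviation); the function name and shapes are illustrative, not the paper's implementation. When every rollout on a hard sample fails, all rewards equal the baseline and every advantage is exactly zero, so the prompt contributes no gradient.

```python
# Illustrative sketch of a GRPO-style group-relative advantage.
import statistics

def group_advantages(rewards):
    """Normalize each rollout's reward against the group baseline."""
    baseline = statistics.mean(rewards)
    spread = statistics.pstdev(rewards) or 1.0  # guard against division by zero
    return [(r - baseline) / spread for r in rewards]

# Mixed outcomes: informative, nonzero advantages.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]

# Hard sample: every rollout fails, r - baseline = 0 everywhere,
# and the policy gradient for this prompt vanishes entirely.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

This is the "zero-advantage" trap in one line: normalization only produces signal when the group contains at least one success.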

Methodology: The P2O Virtuous Cycle

P2O breaks this stalemate by treating the Prompt as a dynamic latent variable. Instead of just changing the model weights, it changes the "question" to unlock the "answer."

1. Evolutionary Prompt Optimization (GEPA)

When P2O identifies a hard sample, it applies the Genetic-Pareto (GEPA) algorithm: a "Reflection LLM" inspects why the model failed and proposes improved prompts (mutations), while a "Pareto Front" of the most effective prompts is maintained to preserve diversity.
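The loop can be sketched as below. This is a deliberately simplified, single-objective stand-in: `reflect()` represents the Reflection LLM, `evaluate()` represents rollout-based scoring, and the top-k pool approximates the Pareto front (the real GEPA front is multi-objective). None of these names come from the paper.

```python
# Hypothetical sketch of a GEPA-style prompt evolution loop.
import random

def evolve_prompts(seed_prompt, reflect, evaluate, generations=5, pool_size=4):
    """Evolve prompts for a hard sample, keeping the best scorers."""
    pool = [(seed_prompt, evaluate(seed_prompt))]
    for _ in range(generations):
        parent, _ = random.choice(pool)       # sample a candidate from the pool
        child = reflect(parent)               # the Reflection LLM proposes a mutation
        pool.append((child, evaluate(child)))
        pool.sort(key=lambda pair: pair[1], reverse=True)
        pool = pool[:pool_size]               # keep only the top performers
    return pool[0][0]

# Toy usage: a "reflection" that appends guidance, scored by a dummy metric.
best = evolve_prompts("Solve step by step.", lambda p: p + " Verify each step.", len)
```

In the full framework the score comes from verifiable rewards on rollouts, and the pool is a Pareto front rather than a single ranked list.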

2. Context Distillation: Internalizing the Gains

This is the "secret sauce." If the model were simply trained with the evolved prompt attached, it would become dependent on that prompt at inference time. Instead, P2O generates a successful trajectory using the Augmented Input ($x + z$) but calculates the gradient update against the Original Input ($x$), so the capability is internalized rather than remaining prompt-conditional.
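A toy numeric illustration of this asymmetry, with the policy reduced to a table of logits over three candidate answer tokens (all contexts, tokens, and the greedy decode are invented for the example, not taken from the paper): the answer is found under the augmented context, but the cross-entropy update is applied to the original context.

```python
# Toy illustration of context distillation: sample under x + z, update under x.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A tabular "policy": per-context logits over 3 answer tokens.
policy = {
    "x":   [0.0, 0.0, 0.0],   # original question: no preference yet
    "x+z": [0.0, 3.0, 0.0],   # evolved prompt z makes token 1 likely
}

# 1. Roll out under the augmented input (greedy decode as a stand-in for
#    sampling); suppose token 1 is the reward-verified answer.
answer = max(range(3), key=lambda t: softmax(policy["x+z"])[t])

# 2. Update pi(answer | x): cross-entropy gradient on the ORIGINAL context.
probs = softmax(policy["x"])
lr = 1.0
for t in range(3):
    grad = probs[t] - (1.0 if t == answer else 0.0)
    policy["x"][t] -= lr * grad

# The bare question now prefers the answer the evolved prompt unlocked.
print(max(range(3), key=lambda t: policy["x"][t]))  # 1
```

The key point is step 2: the gradient flows through $\pi(y \mid x)$, not $\pi(y \mid x + z)$, so the model learns to produce the solution without the crutch.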

Figure 1: The P2O Framework, alternating between Policy Optimization and Prompt Evolution.

Experiments: Conquering the Hardest Problems

The authors tested P2O on Qwen3-4B across several math benchmarks. The most impressive results came from high-difficulty datasets like AIME, which represent the current frontier of LLM reasoning.

Performance Comparison Table

Key Findings:

  • AIME Accuracy: P2O achieved a massive relative improvement, jumping from 46.9% (GRPO) to nearly 60% on AIME24.
  • Self-Ref vs. Teacher-Ref: Interestingly, using the model's own reference to evolve prompts (Self-Ref) was sometimes better than using a stronger "Teacher" (Kimi-K2), suggesting that the best prompts are often those most aligned with the model's internal capability.

Deep Insight: Moving Through the Reward Valley

Standard RL is like a hiker trying to climb a mountain in thick fog; if they don't see a path, they stand still. P2O's optimized prompts act as a "red arrow" (as seen in Figure 1 of the paper), allowing the model to "jump" across a valley of zero rewards to find a secondary, higher peak of performance.

Figure 2: Training dynamics showing P2O maintaining higher rewards and better validation accuracy early in training.

Conclusion and Future Outlook

P2O proves that the limitation of many current models isn't necessarily a lack of "knowledge," but a lack of "accessibility." By optimizing the prompt and the policy jointly, we can unlock latent reasoning paths and distill them into the base model.

Limitations: The evolutionary process for prompts is computationally intensive, requiring many rollouts. Future work might explore more efficient ways to predict "optimal prompts" without a full genetic search.

Takeaway for Practitioners: If your RL model isn't improving on hard tasks, don't just add more data—try optimizing the instructions used during the exploration phase.
