The paper introduces P2O (Joint Policy and Prompt Optimization), a framework that integrates Genetic-Pareto (GEPA) prompt evolution with Reinforcement Learning (RLVR/GRPO). By dynamically optimizing prompts for "hard samples" where standard RL fails, P2O achieves state-of-the-art results in mathematical reasoning, specifically improving AIME accuracy by up to 12.3%.
TL;DR
Reinforcement Learning (RL) often fails on "hard samples" where the model cannot find a single correct answer to start learning. P2O (Joint Policy and Prompt Optimization) solves this by evolving custom prompts that "help" the model solve these hard problems during training. Crucially, it then uses Context Distillation to teach the model how to solve them without those prompts, leading to massive gains on benchmarks like AIME (+12.3%).
Analysis of the Exploration Bottleneck: The "Zero-Advantage" Trap
In Reinforcement Learning with Verifiable Rewards (RLVR), the model learns by sampling different reasoning paths and being rewarded for correct answers. However, on complex math problems (Hard Samples), the model's success rate is often exactly 0%.
Mathematically, when every rollout fails, all rewards equal the baseline, so the advantage estimate $\hat{A}$ vanishes and the policy gradient collapses: $$\nabla_{\theta} J(x) \approx \mathbb{E}\left[ (r - b)\, \nabla_{\theta} \log \pi_{\theta}(y \mid x) \right] \approx 0$$ This leaves the model "starved" of learning signal on difficult tasks, so it overfits the easy problems and stays trapped in a local optimum.
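To make the failure mode concrete, here is a minimal sketch (my own illustration, not the paper's code) of the group-relative advantage used in GRPO-style training: when every rollout for a hard sample earns zero reward, the advantages are identically zero and the gradient carries no signal.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: reward minus the group mean, scaled by the group std."""
    baseline = rewards.mean()
    scale = rewards.std() + eps
    return (rewards - baseline) / scale

# Easy sample: some rollouts succeed, so advantages (and gradients) are non-zero.
print(grpo_advantages(np.array([1.0, 0.0, 1.0, 0.0])))   # -> mixed positive/negative values

# Hard sample: every rollout fails, rewards are identical, advantages collapse to zero.
print(grpo_advantages(np.array([0.0, 0.0, 0.0, 0.0])))   # -> [0, 0, 0, 0]: no learning signal
```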
Methodology: The P2O Virtuous Cycle
P2O breaks this stalemate by treating the Prompt as a dynamic latent variable. Instead of just changing the model weights, it changes the "question" to unlock the "answer."
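One way to make "prompt as a latent variable" concrete (my notation, an assumption about the formalism rather than a quote from the paper) is that the training signal is sought over both the weights and an auxiliary prompt $z$:

$$
\max_{\theta,\, z}\; \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x \oplus z)}\!\left[ r(x, y) \right]
$$

where $x \oplus z$ denotes the original question $x$ augmented with the evolved prompt $z$. The catch is that $z$ is only available during training, so the gains must later be pushed back into $\pi_{\theta}(\cdot \mid x)$, which is where Context Distillation (step 2 below) comes in.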
1. Evolutionary Prompt Optimization (GEPA)
P2O identifies hard samples and uses the Genetic-Pareto (GEPA) algorithm. It uses a "Reflection LLM" to look at why the model failed and suggest better prompts (mutations). It maintains a "Pareto Front" of the most effective prompts to ensure diversity.
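Below is a heavily compressed sketch of how such an evolutionary loop could look. `reflect_llm` and `evaluate` are hypothetical helpers standing in for the Reflection LLM and the rollout-based scorer, and a single top-k list stands in for the full Pareto front.

```python
import random

def evolve_prompts(hard_sample, seed_prompt, reflect_llm, evaluate,
                   generations=5, children=4):
    """Evolutionary prompt search in the spirit of GEPA (simplified sketch).

    `reflect_llm(prompt, failures)` reads failed rollouts and proposes a mutated prompt;
    `evaluate(prompt, sample)` returns (success_rate, failure_traces). Both are assumed helpers.
    """
    # Population of (prompt, success_rate) pairs; the real method keeps a Pareto front
    # across samples -- here we simply keep the top performers on one hard sample.
    population = [(seed_prompt, evaluate(seed_prompt, hard_sample)[0])]
    for _ in range(generations):
        parent, _ = random.choice(population)           # pick a parent to mutate
        _, failures = evaluate(parent, hard_sample)     # collect its failure traces
        for _ in range(children):
            child = reflect_llm(parent, failures)       # reflection proposes a mutation
            score, _ = evaluate(child, hard_sample)
            population.append((child, score))
        # Keep a small, diverse set of the best prompts (stand-in for the Pareto front).
        population = sorted(population, key=lambda p: p[1], reverse=True)[:children]
    return population
```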
2. Context Distillation: Internalizing the Gains
This is the "secret sauce." If you simply train with the augmented prompt, the model becomes dependent on it at inference time. Instead, P2O generates a successful trajectory using the Augmented Input ($x + z$) but computes the gradient update against the Original Input ($x$), so the capability is internalized into the base policy and survives without the helper prompt.
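A minimal sketch of that asymmetry, assuming a PyTorch model with a Hugging-Face-style `model(input_ids).logits` interface (an assumption for illustration): the response tokens were sampled from the policy conditioned on $x + z$, but the loss scores them conditioned on $x$ alone.

```python
import torch
import torch.nn.functional as F

def context_distillation_loss(model, orig_ids, response_ids, advantage):
    """Advantage-weighted update that scores a rollout under the ORIGINAL prompt.

    `response_ids` must come from a rollout generated with the augmented input (x + z);
    only the scoring below drops the evolved prompt z.
    """
    # Condition on the original question x only (the evolved prompt z is NOT included).
    input_ids = torch.cat([orig_ids, response_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0]                    # [seq_len, vocab_size]

    # Log-probability of each response token given the original prompt and prior tokens.
    start = orig_ids.shape[-1]
    resp_logits = logits[start - 1 : -1]                   # predictions for response positions
    token_logp = F.log_softmax(resp_logits, dim=-1).gather(
        -1, response_ids.unsqueeze(-1)
    ).squeeze(-1)

    # Policy-gradient-style loss: push up trajectories that succeeded on the hard sample.
    return -(advantage * token_logp.sum())
```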
Figure 1: The P2O Framework alternating between Policy Optimization and Prompt Evolution.
Experiments: Conquering the Hardest Problems
The authors tested P2O on Qwen3-4B across several math benchmarks. The most impressive results came from high-difficulty datasets like AIME, which represent the current frontier of LLM reasoning.

Key Findings:
- AIME Accuracy: P2O jumps from 46.9% (GRPO) to nearly 60% on AIME24, an absolute gain of over 12 points.
- Self-Ref vs. Teacher-Ref: Interestingly, using the model's own reference to evolve prompts (Self-Ref) was sometimes better than using a stronger "Teacher" (Kimi-K2), suggesting that the best prompts are often those most aligned with the model's internal capability.
Deep Insight: Moving Through the Reward Valley
Standard RL is like a hiker trying to climb a mountain in thick fog; if they don't see a path, they stand still. P2O's optimized prompts act as a "red arrow" (as seen in Figure 1 of the paper), allowing the model to "jump" across a valley of zero rewards to find a secondary, higher peak of performance.
Figure 2: Training Dynamics showing P2O maintaining higher rewards and better validation accuracy early on.
Conclusion and Future Outlook
P2O shows that the limitation of many current models isn't necessarily a lack of "knowledge," but a lack of "accessibility." By optimizing the prompt and the policy jointly, we can unlock latent reasoning paths and distill them into the base model.
Limitations: The evolutionary process for prompts is computationally intensive, requiring many rollouts. Future work might explore more efficient ways to predict "optimal prompts" without a full genetic search.
Takeaway for Practitioners: If your RL model isn't improving on hard tasks, don't just add more data—try optimizing the instructions used during the exploration phase.
