The paper introduces POISE (Policy Optimization through Iterative Search and Evidence), an autonomous closed-loop framework in which LLM agents discover, implement, and refine reinforcement learning (RL) algorithms for language models. Starting from a GRPO baseline, POISE evolved 64 candidate algorithms, ultimately discovering VM-AV-GRPO, which delivers a +4.6-point overall improvement and boosts AIME25 pass@32 from 26.7% to 43.3%.
TL;DR
Researchers from Fudan University have unveiled POISE (Policy Optimization through Iterative Search and Evidence), a framework that transforms LLMs from coding assistants into autonomous scientists. POISE doesn't just tune hyperparameters; it invents new RL mechanisms. By autonomously proposing, training, and analyzing 64 candidate algorithms, it discovered variants that significantly outperform the state-of-the-art (SOTA) DeepSeek-style GRPO baseline, particularly on complex mathematical reasoning.
Background & Positioning
In the current LLM landscape, Reinforcement Learning from Human Feedback (RLHF) and Group Relative Policy Optimization (GRPO) are the backbone of reasoning models. However, algorithm design itself still hinges on researcher intuition, which remains the bottleneck. POISE represents a shift toward Automated Research (AI for Science), specifically targeting the highly sensitive and stochastic domain of policy optimization.
Problem & Motivation: The "Black Box" of RL Research
Why is RL algorithm design so hard to automate? Unlike simple code search, RL components (loss functions, advantage estimators, KL penalties) are tightly coupled with training dynamics.
- Coupled Dynamics: A small change in normalization can lead to entropy collapse 2,000 steps later.
- Sparse Rewards: In hard math (e.g., AIME), most samples fail. Standard algorithms struggle to extract any signal from these "all-fail" groups (see the sketch after this list).
- Evidence Decay: In manual research, the "why" behind a failed experiment is often lost. POISE solves this by treating every failure as Epistemic Evidence.
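
To make the sparse-reward problem concrete, here is a minimal sketch of the standard GRPO group-relative advantage (the textbook formula, not the paper's code): with binary rewards, an "all-fail" group has zero mean and zero spread, so every advantage vanishes and no gradient signal survives.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO: normalize each rollout's reward by its group's statistics."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group of 8 rollouts on a hard AIME-style problem where every attempt fails:
all_fail = np.zeros(8)
print(grpo_advantages(all_fail))  # [0. 0. ... 0.] -> no learning signal at all

# A single rare success still yields a signal, but one scaled by a noisy
# empirical std estimated from just 8 samples:
one_hit = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])
print(grpo_advantages(one_hit))   # correct sample ~ +2.65, the rest ~ -0.38
```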
Methodology: The Epistemic Evolutionary Loop
POISE operates through a three-phase closed loop that mimics the scientific method.
1. Proposal Generation (Phase I)
Using a reflection-augmented evolutionary solver, POISE selects "parent" algorithms based not just on their score but on their "descendant potential": a Bayesian criterion that favors lineages likely to yield future breakthroughs (a hypothetical sketch follows).
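
The article doesn't reproduce the exact selection rule, so the following is a hypothetical sketch of what a "descendant potential" criterion could look like: a parent's own score plus a Beta-smoothed estimate of how often its lineage has produced improvements. All names (`Candidate`, `descendant_potential`) and the specific bonus term are illustrative assumptions, not POISE's API.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    score: float  # benchmark score of this algorithm
    child_scores: list[float] = field(default_factory=list)  # its descendants' scores

def descendant_potential(c: Candidate, prior: float = 0.5, strength: float = 2.0) -> float:
    """Hypothetical Bayesian-flavored parent score: own result plus a smoothed
    probability that this lineage yields improvements over the parent."""
    improvements = sum(s > c.score for s in c.child_scores)
    lineage = (improvements + prior * strength) / (len(c.child_scores) + strength)
    return c.score + lineage

parents = [Candidate(0.42, [0.45, 0.48]), Candidate(0.44, [0.40])]
best = max(parents, key=descendant_potential)  # picks the promising lineage
```

Under this toy rule the first parent wins (0.42 + 0.75 = 1.17 vs. 0.44 + 0.33 ≈ 0.77) despite its lower raw score, which is the behavior the "descendant potential" idea is after.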
2. Implementation & Verification (Phase II)
The system generates executable Python code, ensuring it fits into a standardized training pipeline (based on VERL). Crucially, it includes a verification loop that checks whether the code actually implements the intended mathematical logic (one possible form of such a check is sketched below).
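
The article doesn't detail how the verification loop works, but one simple realization is a property-style check: run the generated function and a reference implementation of the intended formula on random inputs and reject on any disagreement. Everything below (`reference_advantages`, `verify`, the stand-in `generated_advantages` argument) is an illustrative assumption.

```python
import numpy as np

def reference_advantages(r: np.ndarray) -> np.ndarray:
    # The intended math: group-mean-centered, std-normalized advantages.
    return (r - r.mean()) / (r.std() + 1e-6)

def verify(generated_advantages, trials: int = 100) -> bool:
    """Reject the agent's generated code if it disagrees with the intended formula."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        r = rng.binomial(1, rng.uniform(0.0, 1.0), size=8).astype(float)
        if not np.allclose(generated_advantages(r), reference_advantages(r)):
            return False
    return True

assert verify(reference_advantages)  # the reference trivially passes against itself
```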
3. Reflective Analysis (Phase III)
After training, an LLM agent analyzes the learning curves (entropy, reward, length) and writes a natural-language "diagnosis." This reflection is stored in a genealogical archive so that future generations can learn from past mistakes (a sketch of one possible archive entry follows).
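The article doesn't specify the archive's schema; the record below is a hypothetical sketch of what one genealogical entry might carry so that later proposal rounds can retrieve both what happened and why. All field names are my guesses, not the paper's.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ArchiveEntry:
    algo_id: str              # e.g. "gen37-av-grpo-variant" (illustrative)
    parent_id: Optional[str]  # lineage pointer into the genealogical archive
    curves: dict              # logged training curves: entropy, reward, length
    score: float              # benchmark result after the training run
    diagnosis: str            # LLM-written post-mortem in natural language,
                              # e.g. "entropy collapsed after step 1800"
```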

Core Breakthroughs: Identifying New Mechanisms
POISE discovered several unique mechanisms that human researchers had overlooked:
- Analytic-Variance Scaling (AV-GRPO): Instead of normalizing with noisy batch statistics, it scales advantages by the theoretical variance of the reward distribution. This allows the model to amplify the "rare signals" of a correct answer in a sea of failures (see the combined sketch after this list).
- Validity Masking (VM-AV-GRPO): It identifies that "format-correct but reasoning-wrong" samples often contaminate the learning signal. By masking them out, it creates a strict gradient hierarchy: Invalid << Valid-Wrong < Valid-Correct.
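
A minimal sketch of how these two mechanisms could compose, reconstructed from the description above rather than taken from the paper: assuming binary 0/1 rewards, the analytic standard deviation of a group with success rate p is sqrt(p(1-p)); a smoothing prior keeps that scale finite even for all-fail groups, and samples flagged invalid are excluded from the statistics and pinned below every valid sample. The `smooth` and `invalid_adv` parameters, and the exact masking rule, are illustrative assumptions.

```python
import numpy as np

def vm_av_advantages(rewards: np.ndarray, valid: np.ndarray,
                     smooth: float = 1.0, invalid_adv: float = -2.0) -> np.ndarray:
    """Sketch of validity-masked, analytic-variance-scaled advantages."""
    r_valid = rewards[valid]
    # Smoothed success rate: even an all-fail group gets a finite analytic scale.
    p = (r_valid.sum() + smooth) / (len(r_valid) + 2 * smooth)
    analytic_std = np.sqrt(p * (1 - p))        # Bernoulli std, not noisy batch std
    adv = np.full(len(rewards), invalid_adv)   # Invalid: pinned lowest
    adv[valid] = (r_valid - p) / analytic_std  # Valid-Wrong < Valid-Correct
    return adv

rewards = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])
valid = np.array([True, True, True, False, False, True, True, True])
print(vm_av_advantages(rewards, valid))
# correct ~ +1.73, valid-wrong ~ -0.58, invalid = -2.0: the strict hierarchy holds
```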
Experimental Results
The results on mathematical reasoning benchmarks are striking. The discovered VM-AV-GRPO variant didn't just marginally improve performance; it pulled far ahead of the baseline on the hardest tasks.

On AIME25, pass@32 jumped from 26.7% to 43.3%. Process-level analysis (Figure 2) shows that the evolved variants maintain much healthier entropy levels and more stable reward trajectories than standard GRPO.

The "AI Scientist" in Action: Targeted Steering
Perhaps the most impressive feat was steering. Given the natural-language directive "Find an algorithm that is both accurate and concise," POISE discovered DACE-GRPO.
- Performance: a +3.9-point overall gain.
- Efficiency: a 29.1% reduction in response length. It achieved this by inventing "correctness-first efficiency shaping": penalizing length only for correct answers, which prevents the model from becoming "short and stupid" (a sketch follows this list).
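
A hedged sketch of correctness-first efficiency shaping as described above (the token budget, coefficient, and cap are illustrative values, not from the paper): length is penalized only once the answer is correct, so a wrong answer can never raise its reward by getting shorter.

```python
def shaped_reward(correct: bool, n_tokens: int,
                  target_len: int = 512, lam: float = 0.2) -> float:
    """Illustrative correctness-first shaping: only correct answers pay a length cost."""
    if not correct:
        return 0.0  # wrong answers keep reward 0: no pressure to be "short and stupid"
    overage = max(0, n_tokens - target_len) / target_len
    return 1.0 - min(lam * overage, 0.5)  # cap keeps correct answers clearly rewarded

print(shaped_reward(True, 1024))  # correct but verbose: 0.8
print(shaped_reward(False, 64))   # wrong and short: still 0.0
```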
Critical Analysis & Conclusion
Takeaway
POISE proves that the "design patterns" of RL—signal decoupling, conditional normalization, and regime-aware scaling—can be discovered autonomously. It marks a transition where the scientist’s role is to define the objective space, while the AI explores the mechanism space.
Limitations
- Compute Cost: 64 full training runs (each on 8x A100 GPUs) put this out of reach for most researchers.
- Breadth: Currently limited to mathematical reasoning; its "creativity" in open-ended dialogue remains untested.
Future Outlook
POISE serves as a blueprint for the future of AI labs. As compute costs decrease, we can expect "Evolutionary RL" to become the standard for optimizing long-context reasoning models, potentially discovering mathematical loss functions that surpass current human understanding.
