WisPaper
[Fudan University] POISE: The Era of the AI Scientist - Automating the Discovery of RL Algorithms
Abstract

The paper introduces POISE (Policy Optimization through Iterative Search and Evidence), an autonomous closed-loop framework where LLM agents discover, implement, and refine reinforcement learning (RL) algorithms for language models. Starting from a GRPO baseline, POISE evolved 64 candidate algorithms, ultimately discovering VM-AV-GRPO, which achieves a +4.6 Overall improvement and boosts AIME25 pass@32 from 26.7% to 43.3%.

TL;DR

Researchers from Fudan University have unveiled POISE (Policy Optimization through Iterative Search and Evidence), a framework that transforms LLMs from coding assistants into autonomous scientists. POISE doesn't just tune hyperparameters; it invents new RL mechanisms. By autonomously iterating through 64 generations, it discovered algorithms that significantly outperform the current state-of-the-art (SOTA) DeepSeek-style GRPO baseline, particularly in complex mathematical reasoning.

Background Positioning

In the current LLM landscape, Reinforcement Learning from Human Feedback (RLHF) and Group Relative Policy Optimization (GRPO) are the backbone of reasoning models. However, "researcher intuition" remains the bottleneck. POISE represents a shift toward Automated Research (AI for Science), specifically targeting the highly sensitive and stochastic domain of policy optimization.

Problem & Motivation: The "Black Box" of RL Research

Why is RL algorithm design so hard to automate? Unlike simple code search, RL components (loss functions, advantage estimators, KL penalties) are tightly coupled with training dynamics.

  1. Coupled Dynamics: A small change in normalization can lead to entropy collapse 2,000 steps later.
  2. Sparse Rewards: In hard math (e.g., AIME), most samples fail. Standard algorithms struggle to extract a signal from these "all-fail" groups.
  3. Evidence Decay: In manual research, the "why" behind a failed experiment is often lost. POISE solves this by treating every failure as Epistemic Evidence.
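The sparse-reward failure in point 2 is easy to see numerically: under plain within-group normalization, a group where every rollout fails produces zero advantage for every sample, so the group contributes no gradient at all. A minimal sketch of standard GRPO-style normalization (not the paper's code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize binary rewards within
    one sampled group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A hard AIME-style prompt where every rollout fails: the centered
# rewards are all zero, so this group contributes no learning signal.
all_fail = grpo_advantages([0, 0, 0, 0, 0, 0, 0, 0])
print(all_fail)  # [0. 0. 0. 0. 0. 0. 0. 0.]
```

The same collapse happens for all-correct groups, which is why hard prompts (mostly failures) and easy prompts (mostly successes) both starve the optimizer under naive group normalization.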

Methodology: The Epistemic Evolutionary Loop

POISE operates through a three-phase closed loop that mimics the scientific method.

1. Proposal Generation (Phase I)

A reflection-augmented evolutionary solver selects "parent" algorithms not only by their own score but by their "descendant potential": a Bayesian estimate that favors lineages likely to yield future breakthroughs.
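The summary does not give the exact selection rule, but one common way to realize a Bayesian "descendant potential" is Thompson sampling over a Beta posterior of each lineage's improvement rate. A hypothetical sketch (the lineage names and win/loss bookkeeping are illustrative, not POISE's actual state):

```python
import random

def select_parent(lineages, rng=random.Random(0)):
    """Thompson-sampling sketch: each lineage records how many of its
    descendants improved on their parent (wins) vs. did not (losses).
    Sample a success probability from each Beta posterior and pick
    the lineage with the highest draw."""
    best, best_draw = None, -1.0
    for name, (wins, losses) in lineages.items():
        draw = rng.betavariate(wins + 1, losses + 1)  # Beta(w+1, l+1) posterior
        if draw > best_draw:
            best, best_draw = name, draw
    return best

lineages = {"GRPO": (1, 5), "AV-GRPO": (4, 2), "len-penalty": (0, 6)}
print(select_parent(lineages))
```

The appeal of sampling (rather than greedily picking the best mean) is that young lineages with few descendants retain wide posteriors, so promising-but-unproven branches still get explored.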

2. Implementation & Verification (Phase II)

The system generates executable Python code that plugs into a standardized training pipeline (based on VERL). Crucially, a verification loop checks whether the code actually implements the intended mathematical logic.
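One lightweight way such a verification loop can work is to compare the generated implementation against the intended formula on random inputs before spending GPU-hours on a full run. This is a hypothetical sketch; `verify_impl` and the candidate functions below are my own illustration, not POISE's harness:

```python
import numpy as np

def verify_impl(candidate_fn, reference_fn, trials=100, seed=0):
    """Pre-flight check: evaluate the generated advantage function and
    the intended formula on random binary-reward groups and flag any
    numerical disagreement."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        rewards = rng.integers(0, 2, size=8).astype(float)
        if not np.allclose(candidate_fn(rewards), reference_fn(rewards), atol=1e-5):
            return False
    return True

intended = lambda r: r - r.mean()                  # the math the agent proposed
generated_ok = lambda r: r - sum(r) / len(r)       # faithful implementation
generated_buggy = lambda r: r / (r.sum() + 1e-6)   # subtly wrong code

print(verify_impl(generated_ok, intended))     # True
print(verify_impl(generated_buggy, intended))  # False
```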

3. Reflective Analysis (Phase III)

After training, an LLM agent analyzes the learning curves (entropy, reward, length) and generates a natural language "diagnosis." This reflection is stored in a genealogical archive, allowing future generations to learn from past mistakes.
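To make the diagnosis step concrete, here is a toy rule-based stand-in for the LLM agent: it reads the training curves and emits natural-language notes that a later generation can condition on. The thresholds and wording are my assumptions, not the paper's:

```python
def diagnose(entropy, reward):
    """Toy heuristic diagnosis of a finished run from its logged curves.
    POISE uses an LLM agent for this; simple rules stand in here."""
    notes = []
    if entropy[-1] < 0.3 * entropy[0]:
        notes.append("entropy collapse: policy lost exploration")
    if reward[-1] <= reward[0]:
        notes.append("reward stagnation: mechanism gave no net gain")
    return notes or ["healthy run"]

# Entropy fell by >70% while reward still improved:
print(diagnose(entropy=[1.2, 0.8, 0.1], reward=[0.2, 0.5, 0.6]))
# ['entropy collapse: policy lost exploration']
```

The key design point is that the output is text, not a scalar fitness value, so the genealogical archive preserves the "why" behind each failure rather than just the score.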

[Figure: POISE Framework Architecture]

Core Breakthroughs: Identifying New Mechanisms

POISE discovered several unique mechanisms that human researchers had overlooked:

  • Analytic-Variance Scaling (AV-GRPO): Instead of using noisy batch statistics, it uses the theoretical reward distribution variance. This allows the model to amplify the "rare signals" of a correct answer in a sea of failures.
  • Validity Masking (VM-AV-GRPO): It identifies that "format-correct but reasoning-wrong" samples often contaminate the learning signal. By masking these, it creates a strict gradient hierarchy: Invalid << Valid-Wrong < Valid-Correct.
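A loose sketch of how the two mechanisms might combine, based only on the description above. The function, the pooled difficulty estimate `p_hat`, and the fixed penalty used to push invalid samples below valid-wrong ones are all my assumptions, not the paper's code:

```python
import numpy as np

def vm_av_advantages(correct, valid, p_hat, eps=1e-6):
    """Illustrative VM-AV-GRPO-style advantages for binary rewards.

    - Validity masking: invalid (format-broken) rollouts get a reward
      below every valid one, enforcing the gradient hierarchy
      Invalid << Valid-Wrong < Valid-Correct.
    - Analytic-variance scaling: normalize by the theoretical Bernoulli
      std sqrt(p(1-p)) using a pooled success-rate estimate `p_hat`,
      instead of the noisy per-batch std.
    """
    correct = np.asarray(correct, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    reward = np.where(~valid, -1.0, correct)     # invalid(-1) < wrong(0) < correct(1)
    std = np.sqrt(p_hat * (1 - p_hat)) + eps     # analytic, not batch, std
    return (reward - p_hat) / std

# One correct rollout among eight, two of them format-invalid:
adv = vm_av_advantages(correct=[1, 0, 0, 0, 0, 0, 0, 0],
                       valid=[1, 1, 1, 1, 1, 1, 0, 0],
                       p_hat=0.1)
```

With a low `p_hat` (a hard prompt), the lone correct sample gets a large positive advantage, which is exactly the "rare signal amplification" the bullet above describes.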

Experimental Results

The results on mathematical reasoning benchmarks are striking. The discovered VM-AV-GRPO variant didn't just marginally improve performance; it dramatically outperformed the baseline on the hardest tasks.

[Table: Experimental Results]

On AIME25, the pass@32 jumped from 26.7% to 43.3%. Process-level analysis (Figure 2) shows that these evolved variants maintain much healthier entropy levels and more stable reward trajectories compared to the standard GRPO.

[Figure: Training Dynamics Comparison]

The "AI Scientist" in Action: Targeted Steering

Perhaps the most impressive feat was steering: given the natural-language directive "Find an algorithm that is both accurate and concise," POISE discovered DACE-GRPO.

  • Performance: +3.9 Overall gain.
  • Efficiency: 29.1% reduction in response length. It achieved this by inventing "correctness-first efficiency shaping": penalizing length only for correct answers, which prevents the model from becoming "short and stupid."
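The shaping idea is simple to state in code. A minimal sketch, assuming binary correctness rewards; the target length and penalty coefficient are illustrative, not the discovered constants:

```python
def shaped_reward(correct, length, target_len=1024, lam=0.1):
    """Correctness-first efficiency shaping (illustrative constants):
    a length penalty applies only to correct answers, so the model is
    never rewarded for being short but wrong."""
    r = 1.0 if correct else 0.0
    if correct and length > target_len:
        r -= lam * (length - target_len) / target_len  # linear overshoot penalty
    return r

print(shaped_reward(True, 512))    # 1.0  (correct and concise: full reward)
print(shaped_reward(True, 2048))   # 0.9  (correct but verbose: mild penalty)
print(shaped_reward(False, 100))   # 0.0  (wrong: no penalty, no incentive to shrink)
```

Because wrong answers are never length-penalized, the gradient toward brevity only exists inside the already-correct regime, which is what keeps accuracy from degrading.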

Critical Analysis & Conclusion

Takeaway

POISE proves that the "design patterns" of RL—signal decoupling, conditional normalization, and regime-aware scaling—can be discovered autonomously. It marks a transition where the scientist’s role is to define the objective space, while the AI explores the mechanism space.

Limitations

  1. Compute Cost: 64 full training runs (each on 8x A100 GPUs) are out of reach for most researchers.
  2. Breadth: Currently limited to mathematical reasoning; its "creativity" in open-ended dialogue remains untested.

Future Outlook

POISE serves as a blueprint for the future of AI labs. As compute costs decrease, we can expect "Evolutionary RL" to become the standard for optimizing long-context reasoning models, potentially discovering mathematical loss functions that surpass current human understanding.
