The paper introduces SAGE-GRPO (Stable Alignment via Exploration), a novel reinforcement learning framework designed to align video generation models with human preferences. By formulating video RL as a manifold-constrained exploration problem, it achieves SOTA alignment results on HunyuanVideo1.5, significantly improving visual quality and temporal consistency.
TL;DR
Post-training alignment for video generation has long been the "wild west" of Generative AI, plagued by temporal jitters and artifacts. SAGE-GRPO (Stable Alignment via Exploration) tames this by ensuring that reinforcement learning exploration never leaves the "safe zone" of the pre-trained video manifold. By introducing a curvature-corrected SDE and a dual-anchor trust region, it achieves massive gains in motion quality and semantic consistency.
The "Off-Manifold" Problem: Why Video RL Fails
In Large Language Models (LLMs), GRPO is a proven alignment tool. When it is applied to video through Flow Matching models, however, a critical failure mode appears. To generate the diverse rollouts that RL needs, researchers convert the deterministic ODE sampler into a stochastic SDE sampler.
The problem? Previous methods (like FlowGRPO) use first-order approximations for noise injection. In the high-dimensional space of video, this "linear" noise is too blunt—it pushes the generated frames far away from the valid data manifold, resulting in "noisy" rollouts that confuse the reward model. If the samples are bad, the reward is meaningless, and the model never learns.
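For concreteness, here is the general shape of that conversion. This is a reconstruction of the Flow-GRPO-style SDE for rectified flow from the broader literature, not a formula quoted from this paper, and the noise scale $\sigma_t = a\sqrt{t/(1-t)}$ is one common choice rather than a given:

$$\mathrm{d}x_t = \Big[v_\theta(x_t, t) + \frac{\sigma_t^2}{2t}\big(x_t + (1-t)\,v_\theta(x_t, t)\big)\Big]\,\mathrm{d}t + \sigma_t\,\mathrm{d}w_t$$

An Euler discretization of this SDE injects Gaussian noise with standard deviation $\sigma_t\sqrt{\Delta t}$, i.e. it sizes each step's variance with a first-order rectangle rule. That shortcut is exactly the "linear" noise described above, and it is what SAGE-GRPO replaces.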
Micro-Level: Precision Noise Injection
SAGE-GRPO fixes this at the mathematical foundation. Instead of the coarse first-order (rectangle-rule) variance estimate used by prior methods, the authors derive a Precise Manifold-Aware SDE.
The Logarithmic Correction
By integrating diffusion coefficients over each step and adding a logarithmic curvature correction $\log(\frac{1-\sigma_{t+1}}{1-\sigma_t})$, the exploration noise stays tangent to the flow trajectory. This keeps the "exploration ellipsoid" tight and manifold-consistent.
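A small numerical sketch makes the correction concrete. Assume, as above, the Flow-GRPO-style diffusion schedule $g(\sigma)^2 = a^2\,\sigma/(1-\sigma)$ (our assumption, not a value from the paper); integrating it exactly over a step reproduces the logarithmic term quoted above:

```python
import math

def var_first_order(sig_t: float, sig_next: float, a: float = 0.7) -> float:
    """Rectangle-rule estimate used by first-order samplers:
    Var ≈ g(σ_t)² · Δσ, with g(σ)² = a²·σ/(1-σ) (assumed schedule)."""
    return a**2 * sig_t / (1.0 - sig_t) * (sig_t - sig_next)

def var_exact(sig_t: float, sig_next: float, a: float = 0.7) -> float:
    """Exact integral of g(σ)² over [σ_next, σ_t]:
    ∫ a²·σ/(1-σ) dσ = a² · [log((1-σ_next)/(1-σ_t)) - Δσ],
    where the log factor is the curvature correction highlighted above."""
    dsig = sig_t - sig_next
    return a**2 * (math.log((1.0 - sig_next) / (1.0 - sig_t)) - dsig)

# The rectangle rule consistently over-sizes the injected noise:
for sig in (0.9, 0.5, 0.1):
    print(f"σ={sig}: first-order={var_first_order(sig, sig - 0.05):.4f}  "
          f"exact={var_exact(sig, sig - 0.05):.4f}")
```

Under this schedule the first-order estimate over-sizes every step's variance, and over-sized noise is exactly the kind of perturbation that walks samples off the manifold.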
Figure: Note how the blue ellipsoid (SAGE) stays aligned with the manifold $\mathcal{M}$, while the red sphere (FlowGRPO) drifts into "illegal" data space.
Gradient Norm Equalization
Diffusion models suffer from a massive gradient imbalance: gradients vanish at high noise ($t \to 1$) and explode at low noise ($t \to 0$). SAGE-GRPO introduces a Gradient Norm Equalizer that normalizes optimization pressure, ensuring that structural updates (high noise) and textural details (low noise) are learned with equal priority.
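The post does not spell out the equalizer's internals, so the following is a minimal sketch of the standard pattern such a component would follow: track a running estimate of loss magnitude per noise-level bucket and divide each sample's loss by its bucket's estimate, so no part of the schedule dominates. All names (`GradNormEqualizer`, `reweight`) and constants are illustrative assumptions.

```python
import torch

class GradNormEqualizer:
    """Per-noise-level loss re-weighting (illustrative sketch, not the
    paper's exact equalizer). An EMA of loss magnitude per noise bucket
    approximates the gradient-norm imbalance; dividing by it equalizes
    optimization pressure across high-noise (structure) and low-noise
    (texture) timesteps."""

    def __init__(self, n_buckets: int = 10, momentum: float = 0.99, eps: float = 1e-8):
        self.norms = torch.ones(n_buckets)  # EMA of per-bucket loss magnitude
        self.momentum = momentum
        self.eps = eps
        self.n_buckets = n_buckets

    def reweight(self, loss: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # loss: per-sample losses; t: per-sample noise levels in [0, 1)
        bucket = (t.clamp(0.0, 1.0 - 1e-6) * self.n_buckets).long()
        scale = self.norms[bucket]
        with torch.no_grad():  # update each bucket's EMA with current magnitudes
            for b, l in zip(bucket.tolist(), loss.detach().abs().tolist()):
                self.norms[b] = self.momentum * self.norms[b] + (1 - self.momentum) * l
        return (loss / (scale + self.eps)).mean()
```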
Macro-Level: The Dual Trust Region
Even with perfect per-step noise, a model can "drift" over thousands of updates, eventually losing its ability to generate realistic videos (the classic stability-plasticity dilemma).
SAGE-GRPO implements a Position-Velocity Controller strategy (a code sketch follows the list):
- Velocity Control (Step-wise KL): Limits how much the policy can change in a single step, ensuring smooth transitions.
- Position Control (Periodical Moving Anchor): Every $N$ steps, a reference anchor is updated. This prevents the model from wandering too far from the initial manifold while allowing enough "plasticity" to maximize the reward.
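In objective form, a minimal sketch of that controller might look as follows; the coefficients, names, and the k3-style KL estimator are our illustrative assumptions, not values from the paper:

```python
import torch

def kl_k3(ref_logp: torch.Tensor, logp: torch.Tensor) -> torch.Tensor:
    """k3 KL estimator common in GRPO-style training: E[exp(Δ) - Δ - 1]
    with Δ = ref_logp - logp; non-negative and low variance."""
    delta = ref_logp - logp
    return (delta.exp() - delta - 1.0).mean()

def dual_trust_region_loss(logp, old_logp, anchor_logp, advantages,
                           beta_step: float = 0.04, beta_anchor: float = 0.01):
    """GRPO-style policy-gradient term plus the two controls:
    - velocity: KL against the previous policy (bounds per-update change);
    - position: KL against a periodically refreshed anchor (bounds total drift)."""
    ratio = (logp - old_logp.detach()).exp()
    pg_loss = -(ratio * advantages).mean()
    return (pg_loss
            + beta_step * kl_k3(old_logp.detach(), logp)        # velocity control
            + beta_anchor * kl_k3(anchor_logp.detach(), logp))  # position control

# Position control in practice: every N optimizer steps, move the anchor to
# the current policy (the "periodical moving anchor"):
#   if step % N == 0:
#       anchor_model.load_state_dict(policy_model.state_dict())
```

The two penalties play different roles: the step-wise KL keeps each update small, while the anchor KL, because the anchor only moves every $N$ steps, bounds how far the policy can wander in total without freezing it in place.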
Experimental Results: Setting a New Standard
The researchers tested SAGE-GRPO on the powerful HunyuanVideo1.5 model. Compared to current baselines like DanceGRPO and FlowGRPO, SAGE-GRPO showed dominant performance across the board.
Table: SAGE-GRPO (Dual Mov KL) achieves an Overall score of 0.8066 in the alignment-focused setting, nearly doubling the performance of previous SOTA methods.
Qualitative Impact
Visually, this translates to videos that follow complex prompts (like "a soldier's features etched with fatigue") with far greater emotional depth and fewer temporal artifacts.
Visual comparison: SAGE-GRPO (bottom right) preserves complex textures and lighting better than fixed-anchor or unconstrained versions.
Conclusion: A Manifold-First Future
SAGE-GRPO proves that the secret to successfully aligning complex generative models isn't just "more data" or "better rewards"—it's geometric awareness. By forcing the RL agent to stay within the boundaries of what the model already "knows" (the manifold), we can finally achieve stable, high-fidelity video alignment.
Future Outlook: This manifold-aware approach is likely to be adapted for other continuous-flow domains, such as audio generation and robotic trajectory planning, where stability is paramount.
