[Google 2026] VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
Abstract

VGGRPO (Visual Geometry GRPO) is a post-training framework that aligns video diffusion models for 4D world-consistent generation. It introduces a Latent Geometry Model (LGM) and leverages Group Relative Policy Optimization (GRPO) to achieve SOTA camera stability and geometric coherence, significantly outperforming existing alignment methods on dynamic scenes.

TL;DR

VGGRPO is a post-training framework designed to fix the "hallucinated physics" of video diffusion models. By "stitching" a geometry foundation model directly into the VAE latent space, it enables efficient, 4D-aware reinforcement learning via GRPO. The result is a model that generates videos with markedly better camera stability and 3D consistency, even in highly dynamic real-world scenes, while computing rewards roughly 25% faster than previous RGB-based methods.

Problem & Motivation: The "Shaky Cam" of AI Video

While models like Sora or LTX-2 generate stunning visuals, they often fail the "physics test." Cameras jitter, buildings warp as the view shifts, and objects lose their 3D structure, a failure mode known as geometric drift.

Previous attempts to solve this faced a "Triple Constraint":

  1. Efficiency: Computing rewards in RGB space requires decoding latents, which is slow and memory-intensive.
  2. Robustness: RGB-based rewards are brittle and sensitive to VAE artifacts.
  3. Dynamics: Most geometric constraints (like epipolar geometry) assume the scene is static, and they break down as soon as objects move.

The authors' central insight: If the latent space already contains the structural information, why decode it?

Methodology: The Latent Geometry Model (LGM)

The architecture of VGGRPO is built on two pillars: the LGM for perception and GRPO for alignment.

1. Model Stitching in the Latent Space

Instead of using a standard geometry model that takes images as input, the authors "stitch" the latent space of the video VAE to a Geometry Foundation Model (like Any4D). They learn a 3D Convolutional Connector that maps VAE latents to the intermediate transformer features of the geometry model. This allows the system to "see" 3D structure (depth, poses, point maps) directly from the compressed latents.
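To make the stitching idea concrete, here is a minimal numpy sketch of such a connector. The paper describes a 3D Convolutional Connector; for brevity this uses a pointwise (1×1×1) convolution, which is just a per-position linear map over channels. All shapes and names (`C_LAT`, `C_GEO`, `PointwiseConnector`) are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical shapes -- none of these values come from the paper.
C_LAT, C_GEO = 16, 768   # VAE latent channels -> geometry transformer width
T, H, W = 8, 30, 45      # latent time / height / width

rng = np.random.default_rng(0)

class PointwiseConnector:
    """Minimal stand-in for the paper's 3D convolutional connector.

    A real implementation would stack Conv3d layers; a 1x1x1 convolution
    is a per-position linear map over channels, which is enough to show
    the latent -> geometry-feature translation.
    """
    def __init__(self, c_in, c_out):
        self.weight = rng.standard_normal((c_out, c_in)) * (c_in ** -0.5)
        self.bias = np.zeros(c_out)

    def __call__(self, latent):
        # latent: (C_in, T, H, W) -> tokens: (T*H*W, C_out)
        x = latent.reshape(latent.shape[0], -1).T   # (T*H*W, C_in)
        return x @ self.weight.T + self.bias

vae_latent = rng.standard_normal((C_LAT, T, H, W))
connector = PointwiseConnector(C_LAT, C_GEO)
geo_tokens = connector(vae_latent)  # would be fed to the frozen geometry model
print(geo_tokens.shape)  # (10800, 768)
```

The key design point is that the geometry foundation model stays frozen; only the connector learns to translate compressed latents into the feature space the geometry model already understands.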

(Figure: Overall Architecture)

2. Dual Latent-Space Rewards

Using the LGM, the framework calculates two critical rewards during training:

  • Camera Motion Smoothness ($r_{motion}$): Discourages sudden spikes in camera acceleration and angular velocity.
  • Geometry Reprojection Consistency ($r_{geo}$): Projects the 3D point cloud predicted in one frame into another; if the depths don't match, the model is penalized.
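The two rewards can be sketched as follows. This is a simplified reconstruction under stated assumptions: camera centers as a (T, 3) array, a pinhole projection with known intrinsics, and depth comparison at rounded pixel locations. Function names, shapes, and the exact penalty forms are illustrative, not the paper's definitions.

```python
import numpy as np

def motion_smoothness_reward(positions):
    """r_motion sketch: penalize camera acceleration spikes.

    positions: (T, 3) camera centers predicted by the latent geometry
    model. Second finite differences approximate acceleration; a full
    version would also penalize angular velocity.
    """
    accel = np.diff(positions, n=2, axis=0)          # (T-2, 3)
    return -np.linalg.norm(accel, axis=-1).mean()

def reprojection_reward(points_a, depths_b, pose_ab, K, eps=1e-6):
    """r_geo sketch: project frame-a points into frame b and compare the
    projected depth against frame b's predicted depth map.

    points_a: (N, 3) 3D points in frame a's camera coordinates
    depths_b: (H, W) predicted depth map for frame b
    pose_ab:  (4, 4) rigid transform from frame a to frame b
    K:        (3, 3) camera intrinsics (all assumed names/shapes)
    """
    pts_h = np.concatenate([points_a, np.ones((len(points_a), 1))], axis=1)
    pts_b = (pose_ab @ pts_h.T).T[:, :3]             # points in frame b
    uv = (K @ pts_b.T).T
    z = uv[:, 2:3]
    px = np.round(uv[:, :2] / np.maximum(z, eps)).astype(int)
    H, W = depths_b.shape
    ok = ((px[:, 0] >= 0) & (px[:, 0] < W) &
          (px[:, 1] >= 0) & (px[:, 1] < H) & (z[:, 0] > 0))
    if not ok.any():
        return 0.0
    err = np.abs(depths_b[px[ok, 1], px[ok, 0]] - z[ok, 0])
    return -err.mean()

# A constant-velocity camera has zero acceleration, so r_motion ~ 0.
smooth = np.stack([np.linspace(0, 1, 10)] * 3, axis=1)
print(motion_smoothness_reward(smooth))
```

Both rewards are dense and differentiable-friendly signals computed entirely from the LGM's latent-space predictions, so no frame ever needs to be decoded to RGB during reward computation.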

Experiments: Breaking the Static-Scene Barrier

The researchers tested VGGRPO on both static and dynamic benchmarks. While traditional methods like Epipolar-DPO perform well on static houses, they collapse when tracking a snowboarder. VGGRPO, powered by 4D-aware priors, maintains structural integrity across both.

Key Performance Metrics:

  • Inference Efficiency: Latent-space rewards are 24.5% faster and use ~8GB less VRAM than RGB-based alignment.
  • Win Rates: On the Wan2.2-5B backbone, VGGRPO achieved a 68.42% win rate in Motion Quality against the base model.
  • Generalization: Unlike Supervised Fine-Tuning (SFT), which often causes models to lose their "creative spark," VGGRPO preserved or improved VBench aesthetic scores.

(Figure: Qualitative Comparison)

Deep Insight: Why This Matters

The shift from pixel-space alignment to latent-space alignment is a major milestone. By treating the latent space as a first-class citizen for geometric reasoning, VGGRPO shows that we don't need to choose between efficiency and world-consistency.

Furthermore, the use of GRPO (Group Relative Policy Optimization) removes the need for a "Critic" network, making the alignment of multi-billion parameter video models feasible on standard academic/corporate clusters.
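The critic-free part of GRPO is easy to see in code: each prompt's sampled rollouts form a group, and advantages are computed by normalizing rewards within the group rather than against a learned value network. A minimal sketch (the group size and reward values are illustrative):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Critic-free GRPO advantage: normalize each rollout's reward
    against its own group instead of a learned value baseline.

    group_rewards: (num_prompts, group_size) scalar rewards, e.g. a
    weighted sum of r_motion and r_geo for each sampled video.
    """
    r = np.asarray(group_rewards, dtype=float)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    return (r - mean) / (std + eps)

# Two prompts, four sampled videos each (made-up rewards).
rewards = np.array([[0.2, 0.5, 0.1, 0.6],
                    [0.9, 0.8, 0.7, 1.0]])
adv = grpo_advantages(rewards)
print(adv.round(2))
```

Because the baseline is the group mean, the memory cost of a separate critic network, which would itself be a multi-billion parameter video model, disappears entirely.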

Limitations & Future Work

While VGGRPO handles dynamic scenes, it still relies on the quality of the underlying "Geometry Foundation Model." If the foundation model misinterprets a complex reflection as a 3D object, the reward signal will be flawed. Future work will likely involve "Closed-loop" training where the geometry model and the video generator are co-evolved.

Takeaway for Practitioners

If you are building Video-to-3D pipelines or Embodied AI simulators, VGGRPO is the current gold standard for ensuring your "world model" doesn't melt when the camera moves.
