Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation


[CVPR 2024] Mesh-Pro: Revolutionizing 3D Mesh Generation with Asynchronous Online RL and ARPO


Abstract

Mesh-Pro is an end-to-end 3D mesh generation framework that introduces the first asynchronous online Reinforcement Learning (RL) system tailored for 3D assets. It utilizes a novel Advantage-guided Ranking Preference Optimization (ARPO) algorithm and a diagonal-aware tokenization scheme to achieve State-of-the-Art (SOTA) performance in producing artist-style quadrilateral-dominant meshes.

TL;DR

Mesh-Pro marks a significant leap in 3D content creation by shifting from static offline fine-tuning to a high-efficiency asynchronous online RL framework. By introducing ARPO (Advantage-guided Ranking Preference Optimization) and a novel diagonal-aware tokenization, it generates artist-style quadrilateral meshes that are topologically sound, geometrically faithful, and optimized for downstream animation and UV unwrapping.

Problem & Motivation: The "Idle GPU" and "Generalization" Traps

In the world of LLMs, RLHF is the standard for alignment. However, in 3D mesh generation, we face two unique "walls":

The Synchronous Bottleneck: 3D meshes have wildly different token lengths. In a standard synchronous RL setup, the trainer must wait for the longest rollout to finish. This leads to massive GPU idle time and makes online training prohibitively slow.
The Generalization Gap: Previous SOTA methods like QuadGPT used offline DPO (Direct Preference Optimization). While stable, DPO models rewards only implicitly. It often fails to generalize to "out-of-distribution" geometries because it never models the underlying reward distribution explicitly.
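To make the synchronous bottleneck concrete, here is a minimal arithmetic sketch. It assumes every rollout worker generates tokens at the same rate and that a synchronous step ends only when the longest rollout finishes. The function name `idle_fraction` and the token lengths are hypothetical and are not taken from the paper.

```python
def idle_fraction(rollout_lengths):
    """Fraction of total worker-time spent idle in one synchronous step.

    Assumes one rollout per worker and equal generation speed, so a
    rollout's duration is proportional to its token length.
    """
    longest = max(rollout_lengths)
    busy_time = sum(rollout_lengths)
    # Every worker is held until the longest rollout completes.
    total_time = longest * len(rollout_lengths)
    return 1.0 - busy_time / total_time


# Hypothetical token lengths for four meshes of very different complexity.
lengths = [500, 1000, 2000, 8000]
print(f"idle fraction: {idle_fraction(lengths):.2%}")  # idle fraction: 64.06%
```

With these made-up lengths, roughly two thirds of the available generation time is wasted, and the waste grows as the spread between the shortest and longest mesh grows.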

Mesh-Pro's authors realized that to produce meshes that look like they were made by a human artist, the model needs to "explore" the space of possible topologies dynamically and be rewarded for structural integrity.

Methodology: The Core Innovations

1. Asynchronous Online RL Framework

To solve the efficiency problem, Mesh-Pro decouples the Rollout Workers (which generate data) from the Trainer Workers (which update the policy).

Continuous Flow: Rollout workers use the latest available policy to fill a replay buffer.
No Waiting: Trainers sample from this buffer continuously. This architecture makes Mesh-Pro 3.75× faster than synchronous methods.
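The decoupling described above can be sketched with a producer/consumer pattern. This is a toy illustration under stated assumptions, not the paper's system: threads stand in for GPU workers, a FIFO `queue.Queue` stands in for the replay buffer (a real buffer would likely sample rather than pop in order), and incrementing a version counter stands in for a gradient update. All names (`rollout_worker`, `trainer`, `run_async`) are hypothetical.

```python
import queue
import random
import threading


def rollout_worker(worker_id, policy, buffer, n_rollouts, seed):
    """Generate rollouts with whatever policy version is current, never waiting on peers."""
    rng = random.Random(seed)
    for _ in range(n_rollouts):
        version = policy["version"]      # snapshot of the latest available policy
        length = rng.randint(10, 1000)   # stand-in for a variable-length mesh sequence
        buffer.put({"worker": worker_id, "policy_version": version, "length": length})


def trainer(policy, buffer, total_rollouts, batch_size):
    """Consume rollouts as soon as they arrive and update the policy after each batch."""
    consumed = 0
    while consumed < total_rollouts:
        take = min(batch_size, total_rollouts - consumed)
        batch = [buffer.get() for _ in range(take)]  # blocks only until data exists
        consumed += len(batch)
        policy["version"] += 1           # stand-in for one optimizer step
    return consumed


def run_async(n_workers=4, n_rollouts=8, batch_size=4):
    policy = {"version": 0}
    buffer = queue.Queue()
    workers = [
        threading.Thread(target=rollout_worker, args=(i, policy, buffer, n_rollouts, i))
        for i in range(n_workers)
    ]
    for w in workers:
        w.start()
    consumed = trainer(policy, buffer, n_workers * n_rollouts, batch_size)
    for w in workers:
        w.join()
    return consumed, policy["version"]


consumed, version = run_async()
print(consumed, version)
```

Each stored rollout records the policy version that produced it, which is the information an off-policy correction or staleness filter would need. Whether and how Mesh-Pro applies such a correction is not covered in this summary.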

Figure: Asynchronous framework.

2. ARPO: The Best of Both Worlds

ARPO is the mathematical heartbeat of this paper. It seeks to bridge the gap between DPO (fast, but with weak generalization) and GRPO, Group Relative Policy Optimization (explicit rewards, but slow to converge).

Ranking Logic: It uses a Plackett-Luce ranking model to ensure stable convergence.
Advantage Guidance: It explicitly weights updates using an advantage function $A_i$. High-quality "dominance" samples get more weight, which pushes the model toward artist-level topology faster than the RL baselines it is compared against.
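The two ingredients named above can be sketched in a few lines. The Plackett-Luce negative log-likelihood below is the standard formulation. The advantage weighting is an assumption made purely for illustration: this summary does not give ARPO's exact loss, so the group-normalized advantage and the `max(A_i, 0)` weight are hypothetical choices, as are the function names. The `scores` would typically be some policy-derived quantity such as a log-probability ratio against a reference model, which is also an assumption here.

```python
import math


def group_advantages(rewards):
    """Group-normalized advantages: (r_i - mean) / std over one group of samples."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


def plackett_luce_nll(scores, ranking, weights=None):
    """Negative log-likelihood of `ranking` (best first) under a Plackett-Luce model.

    At each stage the top remaining item is chosen with softmax probability
    over the items not yet ranked. `weights`, if given, scales each stage by
    the weight of the item chosen at that stage (illustrative, not from the paper).
    """
    nll = 0.0
    for k in range(len(ranking) - 1):  # the final stage has probability 1
        remaining = ranking[k:]
        chosen = ranking[k]
        m = max(scores[j] for j in remaining)
        log_z = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        w = 1.0 if weights is None else weights[chosen]
        nll -= w * (scores[chosen] - log_z)
    return nll


def advantage_weighted_ranking_loss(scores, rewards):
    """Hypothetical combination: rank by reward, weight stages by positive advantage."""
    adv = group_advantages(rewards)
    ranking = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    weights = [max(a, 0.0) for a in adv]
    return plackett_luce_nll(scores, ranking, weights)


rewards = [1.0, 0.5, 0.0]
print(advantage_weighted_ranking_loss([2.0, 1.0, 0.0], rewards))  # scores agree with rewards
print(advantage_weighted_ranking_loss([0.0, 1.0, 2.0], rewards))  # scores contradict rewards
```

The loss is lower when the scores order the samples the same way the rewards do, and the stage for the highest-advantage sample contributes most, which is the qualitative behavior the text describes.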

3. Diagonal-Aware Tokenization

Prior methods forced the model to decide if a face was a triangle or a quad before generating vertices. Mesh-Pro uses a "generate-then-decide" strategy. It generates three vertices, then uses a special flag in the fourth position to determine if it's a triangle or how a quad should be split diagonally. This preserves a canonical ordering and prevents geometric artifacts.
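A toy sketch of the "generate-then-decide" idea follows. The token layout is hypothetical: this summary does not specify the actual vocabulary, so the flag names (`TRI`, `QUAD_D02`, `QUAD_D13`), the placement of the fourth vertex after the flag, and the use of integer vertex ids in place of quantized coordinates are all assumptions made for illustration.

```python
# Hypothetical flag tokens emitted in the fourth slot of every face.
TRI = "TRI"            # the three vertices already form a complete triangle
QUAD_D02 = "QUAD_D02"  # quad, split along the diagonal v0-v2
QUAD_D13 = "QUAD_D13"  # quad, split along the diagonal v1-v3


def encode_face(vertices, diagonal=None):
    """Emit three vertices first, then the flag that decides the face type."""
    if len(vertices) == 3:
        return list(vertices) + [TRI]
    if len(vertices) == 4:
        if diagonal not in ((0, 2), (1, 3)):
            raise ValueError("a quad needs diagonal (0, 2) or (1, 3)")
        flag = QUAD_D02 if diagonal == (0, 2) else QUAD_D13
        # In this sketch the fourth vertex follows the flag.
        return list(vertices[:3]) + [flag, vertices[3]]
    raise ValueError("only triangles and quads are supported")


def decode_faces(tokens):
    """Read three vertices, then branch on the flag in the fourth position."""
    faces, i = [], 0
    while i < len(tokens):
        v0, v1, v2, flag = tokens[i:i + 4]
        if flag == TRI:
            faces.append({"kind": "tri", "triangles": [(v0, v1, v2)]})
            i += 4
        else:
            v3 = tokens[i + 4]
            if flag == QUAD_D02:
                tris = [(v0, v1, v2), (v0, v2, v3)]
            else:
                tris = [(v0, v1, v3), (v1, v2, v3)]
            faces.append({"kind": "quad", "triangles": tris})
            i += 5
    return faces


tokens = encode_face([0, 1, 2]) + encode_face([2, 1, 3, 4], diagonal=(1, 3))
print(tokens)
print(decode_faces(tokens))
```

The point of the layout is that the first three tokens of a face look identical for triangles and quads, so the model commits to the face type only after the shared prefix has been generated.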

Figure: Tokenization method.

Experiments & Results: SOTA Performance

Mesh-Pro was tested against strong baselines like MeshAnything v2, Mesh-RFT, and QuadGPT.

Geometric Integrity: Thanks to a new Ray-based Reward, the "Broken Ratio" (meshes with holes or non-manifold surfaces) dropped significantly.
Artist Quality: In User Studies, human experts chose Mesh-Pro's outputs as the most "artist-like" due to its structured edge flow (quad rings and quad lines).
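For intuition about what a "Broken Ratio" measures, here is a minimal sketch based on edge counting. It is not the paper's Ray-based Reward, whose details are not given in this summary. It assumes a mesh counts as broken if any edge is used by a number of faces other than two (one face means a boundary or hole, three or more means a non-manifold edge), which treats intentionally open surfaces as broken and does not detect non-manifold vertices. Function names are hypothetical.

```python
from collections import Counter


def edge_face_counts(faces):
    """Count how many faces use each undirected edge."""
    counts = Counter()
    for face in faces:
        n = len(face)
        for k in range(n):
            a, b = face[k], face[(k + 1) % n]
            counts[(min(a, b), max(a, b))] += 1
    return counts


def is_broken(faces):
    """True if any edge is a boundary edge (1 face) or non-manifold (3+ faces)."""
    return any(c != 2 for c in edge_face_counts(faces).values())


def broken_ratio(meshes):
    """Share of meshes in a collection that fail the edge check."""
    return sum(is_broken(m) for m in meshes) / len(meshes)


# A closed tetrahedron, and the same shape with one face missing (a hole).
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
holed = tetra[:3]
print(broken_ratio([tetra, holed]))  # 0.5
```

Faces may be triangles or quads, since each face is treated as a closed loop of vertex ids.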

Figure: Qualitative comparison.

The ablation studies (Table 2 in the paper) confirm that removing either the Asynchronous mechanism or the Advantage guidance leads to a sharp decline in topological quality (Quad Ratio) and user preference.

Critical Analysis & Conclusion

Why it works

The strength of Mesh-Pro lies in its system-level optimization. By treating 3D generation as a high-throughput RL task rather than just a supervised sequence task, it allows the model to learn "aesthetic" traits (like clean edge flow) that are hard to capture through simple cross-entropy loss on a static dataset.

Limitations

Group Size: ARPO is currently limited by GPU memory, preventing very large group comparisons.
Face Count Control: The model doesn't yet allow users to specify a target polygon count (e.g., "give me a low-poly version of this chair").

Future Outlook

The shift towards asynchronous online RL may prove to be the "AlphaGo moment" for 3D generation. As we move towards generative models for gaming and robotics, the ability to align models with physical and aesthetic rewards, rather than just mimicking training data, will be a defining factor of the next generation of 3D AI.
