MoRight: Motion Control Done Right

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

MoRight: Motion Control Done Right

[NVIDIA 2025] MoRight: Disentangling Camera Control and Causal Reasoning in Video Generation

总结

问题

方法

结果

要点

摘要

MoRight is a unified controllable video generation framework that achieves disentangled camera-object control and motion causality reasoning. By utilizing a dual-stream DiT-based architecture, it enables users to specify object trajectories in a canonical static view while independently adjusting target camera viewpoints, achieving SOTA performance on benchmarks like DynPose-100K and WISA.

TL;DR

MoRight is a breakthrough in controllable video generation that finally separates How the camera moves from How objects interact. Unlike previous models that just "track pixels," MoRight understands Motion Causality—it can predict that pushing a ball will make it roll (Forward Reasoning) or figure out what hand movement caused a cloth to move (Inverse Reasoning). It uses a dual-stream architecture to keep camera and object controls independent, setting a new SOTA for physically grounded video synthesis.

The Problem: The "Pixel Tracking" Trap

Current SOTA models (like WanMove or Motion Prompting) operate essentially as high-end "pixel-warpers." They face two critical hurdles:

The Entanglement Problem: In these models, a trajectory is just a path on a 2D screen. If you change the camera angle, the entire trajectory changes. You cannot simply say "move the object left" and "zoom in" separately without the math breaking down.
The "Puppet" Problem: They lack physical intuition. If a user draws a trajectory for a hand, the model might move the hand but leave the object it's touching static. It doesn't realize that Action leads to Consequence.

Methodology: The Dual-Stream & Causal Engine

MoRight's architecture is built on the Wan2.1-14B backbone but introduces a clever Dual-Stream configuration.

1. Disentangled Camera-Object Control

Instead of one video stream, MoRight processes two:

Canonical Stream: Generates the video from a fixed, static viewpoint. Here, the object motion is "pure" and unambiguous.
Target Stream: Generates the video from the user's requested camera path. The magic happens in the Temporal Cross-View Attention blocks, where the Target Stream "looks" at the Canonical Stream to understand the object's physical displacement, then applies its own camera transformation.

Model Architecture Figure 1: The dual-stream DiT architecture allows motion to be anchored in a canonical view while the target view handles camera dynamics.

2. Motion Causality (Active vs. Passive)

To teach the model physics, the authors curated data by labeling objects as Active (the "input" like a hand) or Passive (the "result" like a moving teapot). During training, they use Motion Dropout: sometimes they hide the hand's path and ask the model to generate it based on the teapot's movement, and sometimes they do the opposite.

This enables two powerful modes:

Forward Reasoning: You draw a "push" and the model predicts the "slide."
Inverse Reasoning: You draw the desired path of a ball, and the model hallucinates the hand or force that must have caused it.

Motion Causality Reasoning Figure 2: MoRight can reason about consequences (Forward) and underlying causes (Inverse).

Experiments & Evaluation

MoRight was tested on the DynPose-100K (dynamic cameras), WISA (physical interactions), and a custom Cooking benchmark.

Performance: It outperformed baselines in "Physical Commonsense" (PC) and "Semantic Adherence" (SA). In human studies, it was preferred over existing SOTA methods (ATI, WanMove) across all categories: Controllability, Motion Realism, and Photorealism.
Refining the Input: Unlike other models that require "privileged information" (knowing the future 3D path), MoRight works with simple 2D strokes on the first frame.

Experiment Results Figure 3: Qualitative results show how MoRight maintains object consistency even as the camera orbits or zooms.

Critical Insight & Future Outlook

The most impressive part of MoRight isn't just the "pretty pixels"; it's the shift towards Interactivity. By decomposing motion into active and passive components, NVIDIA is moving closer to a "Foundation World Model" that can be used by robots to simulate "What if I do X?"

Limitations: The model can still hallucinate (e.g., extra hands appearing) or fail in extremely high-speed, chaotic ego-motion scenes. However, as an I2V (Image-to-Video) control framework, it sets the current gold standard for how we should think about 3D space in a 2D diffusion world.

Conclusion

MoRight proves that for video generation to be useful for AI agents and creators, we must respect the laws of physics and the geometry of cameras. By disentangling these factors, we gain a tool that doesn't just animate—it reasons.

发现相似论文

试试这些示例

Find recent papers that attempt to solve the entanglement of camera and object motion in video generation using 3D-aware latent representations or dual-stream architectures.
Which paper first proposed the concept of "Active vs. Passive" motion decomposition in generative modeling, and how does MoRight improve upon its causal grounding?
Explore studies that have integrated MoRight's cross-view attention mechanism or motion causality reasoning into Embodied AI frameworks for robot action planning.

[NVIDIA 2025] MoRight: Disentangling Camera Control and Causal Reasoning in Video Generation

1. TL;DR

2. The Problem: The "Pixel Tracking" Trap

3. Methodology: The Dual-Stream & Causal Engine

3.1. 1. Disentangled Camera-Object Control

3.2. 2. Motion Causality (Active vs. Passive)

4. Experiments & Evaluation

5. Critical Insight & Future Outlook

6. Conclusion