RealWonder is the first real-time (13.2 FPS) system for physical action-conditioned video generation from a single image. It uses physics simulation as an intermediate bridge to translate 3D forces and robotic actions into visual representations (optical flow/RGB) for a distilled video generator.
TL;DR
RealWonder is an "Interactive World Model" that lets users apply 3D forces, torques, and robotic actions to a single image and see the results in real time. By using a physics engine as a "translator" between the physical and visual worlds, it streams photorealistic, action-consistent video at 13.2 FPS.
The "Action-Visual" Mismatch
Modern Video Diffusion Models (VDMs) are masters of pixels but "physics-blind." When you tell a standard VDM to "push a boat to the right," it often defaults to the most likely data pattern (the boat moving forward) or ignores the force's magnitude entirely.
The authors identify two core bottlenecks:
- The Tokenization Problem: 3D forces are continuous and unbounded. Unlike text or discrete camera poses, they don't fit neatly into traditional transformer tokens.
- The Data Scarcity Problem: There are almost no large, high-quality datasets that pair raw physical force vectors (in newtons) with real-world video.
Methodology: Physics as the Universal Intermediate
RealWonder's key insight is to stop trying to make the neural network learn physics from scratch. Instead, it uses a Physics Simulator as a bridge.
1. 3D Scene Reconstruction
The pipeline starts by lifting a 2D image into 3D. It segments objects using SAM 2, estimates depth, and uses a Vision-Language Model (VLM) to guess material properties (e.g., is this cloth, liquid, or a rigid brick?).
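To make the "lifting" step concrete, here is a minimal sketch of how a depth map and a segmentation mask can be turned into per-object 3D points. It assumes a simple pinhole camera; the intrinsics (`fx`, `fy`, `cx`, `cy`), the function names, and the toy inputs are illustrative assumptions, not details taken from RealWonder.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) array of camera-space points.

    Assumes a pinhole camera; the intrinsics are hypothetical inputs.
    """
    h, w = depth.shape
    # Pixel grid: u runs along columns (x), v runs along rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def object_points(points, mask):
    """Select the 3D points that belong to one segmented object."""
    return points[mask]

# Toy inputs: a flat scene 2 m away and a 2x2 object mask.
depth = np.full((4, 6), 2.0)
mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:4] = True

points = unproject_depth(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
obj = object_points(points, mask)
```

In a full pipeline the mask would come from a segmenter such as SAM 2 and the depth from a monocular depth estimator; the resulting points are what a simulator could treat as an object.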
2. The Physics Bridge
When a user applies an action (like a robot gripper closing), the Genesis physics engine calculates the movement. This "invisible" 3D movement is projected back into 2D as:
- Optical Flow (Ft): Showing where pixels should move.
- Coarse RGB Preview (Ṽt): Providing structural cues like occlusions.
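The projection from simulated 3D motion to 2D flow can be sketched as follows. This is a simplified illustration, assuming a pinhole camera and per-point (sparse) flow; a real renderer would rasterize the motion into a dense flow map and handle occlusion. The function names and intrinsics are hypothetical.

```python
import numpy as np

def project(points, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=-1)

def flow_from_motion(points_t, points_t1, fx, fy, cx, cy):
    """Per-point 2D flow between two simulation steps, in pixels."""
    before = project(points_t, fx, fy, cx, cy)
    after = project(points_t1, fx, fy, cx, cy)
    return after - before

# Two points 2 m from the camera, pushed 0.1 m to the right by the simulator.
points_t = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]])
points_t1 = points_t + np.array([0.1, 0.0, 0.0])

flow = flow_from_motion(points_t, points_t1, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

With a focal length of 100 px and a depth of 2 m, a 0.1 m sideways push becomes a 5 px horizontal shift, which is the kind of signal the video generator is conditioned on.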

3. Real-Time Distilled Generator
To make this interactive, the authors distilled a heavy video model into a 4-step causal student. By using Flow-based Noise Warping, they inject the physics-derived motion directly into the diffusion process, ensuring the AI "paints" the objects exactly where the physics engine says they should be.
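To give an intuition for noise warping, the sketch below moves a Gaussian noise field along a flow map using a nearest-pixel forward scatter, and fills any pixel that nothing lands on with fresh noise. This is a deliberately simplified stand-in, not the authors' warping scheme; `warp_noise` and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def warp_noise(noise, flow, rng):
    """Move an (H, W) noise field along an (H, W, 2) flow of (dx, dy) offsets.

    Simplified nearest-pixel forward scatter (an illustrative assumption).
    """
    h, w = noise.shape
    # Start from fresh noise so uncovered (disoccluded) pixels stay Gaussian.
    warped = rng.standard_normal((h, w))
    ys, xs = np.mgrid[0:h, 0:w]
    # Destination of every source pixel, rounded to the nearest pixel.
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    # Each source value is copied to at most one destination.
    warped[ty[valid], tx[valid]] = noise[ys[valid], xs[valid]]
    return warped

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 5))
flow = np.zeros((4, 5, 2))
flow[..., 0] = 1.0  # every pixel moves one step to the right

warped = warp_noise(noise, flow, rng)
```

Because the noise pattern travels with the flow, the denoiser starts each frame from noise that is already aligned with the simulated motion.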
Experimental Performance
The system was tested across diverse materials: rigid bodies, fluids, granular sand, and deformable cloth.
| Metric | CogVideoX-I2V | Tora (Drag-based) | RealWonder (Ours) |
| :--- | :--- | :--- | :--- |
| FPS | 0.225 | 0.107 | 13.2 |
| Phys. Plausibility | 0.624 | 0.578 | 0.705 |

As shown above, while baselines like Tora struggle to interpret 3D direction correctly (often moving boats forward instead of sideways), RealWonder strictly follows the simulated trajectory while the generative model adds realistic water splashes and lighting effects.
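To put the reported FPS figures in perspective, the short calculation below converts them into per-frame latency and relative speedup. It uses only the three FPS values reported in the comparison; the variable names are illustrative.

```python
# FPS values as reported in the comparison.
fps = {"CogVideoX-I2V": 0.225, "Tora": 0.107, "RealWonder": 13.2}

# Time to produce one frame, in milliseconds.
latency_ms = {name: 1000.0 / value for name, value in fps.items()}

# How many times faster RealWonder is than each baseline.
speedup = {
    name: fps["RealWonder"] / value
    for name, value in fps.items()
    if name != "RealWonder"
}

for name, value in speedup.items():
    print(f"RealWonder vs {name}: {value:.1f}x faster")
```

That works out to roughly 76 ms per frame for RealWonder, versus about 4.4 s for CogVideoX-I2V and 9.3 s for Tora, or speedups of about 59x and 123x respectively.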
Deep Insights: Why It Works
The "magic" of RealWonder is its Inductive Bias. By forcing the video model to condition on optical flow derived from a real physics engine, the model doesn't have to "guess" the laws of gravity or collision—it only has to focus on making the movement look photorealistic (adding shadows, textures, and fluid dynamics).
The SDEdit-based RGB conditioning further ensures that even if the physics simulation is "low-poly" or coarse, the final video remains visually grounded in the original high-resolution image.
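The general SDEdit idea can be sketched in a few lines: partially noise a coarse guide image, then denoise from that intermediate level rather than from pure noise. The sketch below assumes a linear blend between image and noise and uses an identity function as a placeholder denoiser; the schedule, the `strength` value, and all names are assumptions, not details from the paper.

```python
import numpy as np

def sdedit_init(coarse_rgb, strength, rng):
    """Blend the coarse preview with Gaussian noise; strength lies in [0, 1].

    The linear blend is an assumed schedule, chosen for simplicity.
    """
    noise = rng.standard_normal(coarse_rgb.shape)
    return (1.0 - strength) * coarse_rgb + strength * noise

def sdedit_sample(coarse_rgb, denoise_step, strength=0.6, num_steps=4, rng=None):
    """Start from a partially noised preview and denoise in a few steps."""
    if rng is None:
        rng = np.random.default_rng()
    x = sdedit_init(coarse_rgb, strength, rng)
    # Noise levels from `strength` down to 0, one interval per step.
    levels = np.linspace(strength, 0.0, num_steps + 1)
    for t_cur, t_next in zip(levels[:-1], levels[1:]):
        x = denoise_step(x, t_cur, t_next)
    return x

def identity_denoiser(x, t_cur, t_next):
    """Hypothetical placeholder for a learned denoiser."""
    return x

coarse = np.linspace(0.0, 1.0, 12).reshape(2, 2, 3)
rng = np.random.default_rng(0)
untouched = sdedit_sample(coarse, identity_denoiser, strength=0.0, rng=rng)
noised = sdedit_sample(coarse, identity_denoiser, strength=0.6, rng=rng)
```

A lower `strength` keeps the output closer to the coarse preview, while a higher one gives the generator more freedom to repaint details; with a trained denoiser in place of the placeholder, that trade-off controls how much of the simulator's coarse look survives.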
Conclusion & Future Work
RealWonder represents a shift in World Modeling: moving from "black-box" end-to-end learning toward a hybrid architecture where simulators handle the logic and neural networks handle the appearance.
Limitations: The system's accuracy still depends on the initial 3D reconstruction. If the depth estimation is wrong, the physics will look "floaty." However, as 3D reconstruction models (like LRM or DUSt3R) improve, RealWonder's results should improve with them, which suggests a path toward truly interactive, physically accurate AI agents.
For more details, visit the project page: liuwei283.github.io/RealWonder
