RealWonder: Real-Time Physical Action-Conditioned Video Generation


[arXiv 2026] RealWonder: Bridging the Gap Between 3D Physics and Real-Time Video Generation


Abstract

RealWonder is the first real-time (13.2 FPS) system for physical action-conditioned video generation from a single image. It uses physics simulation as an intermediate bridge to translate 3D forces and robotic actions into visual representations (optical flow/RGB) for a distilled video generator.

TL;DR

RealWonder is an "Interactive World Model" that lets users apply 3D forces, torques, and robotic actions to a single image and see the results in real time. By using a physics engine as a "translator" between the physical and visual worlds, it achieves photorealistic, action-consistent video streaming at 13.2 FPS.

The "Action-Visual" Mismatch

Modern Video Diffusion Models (VDMs) are masters of pixels but "physics-blind." When you tell a standard VDM to "push a boat to the right," it often defaults to the most likely data pattern (the boat moving forward) or ignores the force's magnitude entirely.

The authors identify two core bottlenecks:

The Tokenization Problem: 3D forces are continuous and unbounded. Unlike text or discrete camera poses, they don't fit neatly into traditional transformer tokens.
The Data Scarcity Problem: There are almost no large, high-quality datasets that pair raw physical force vectors (in newtons) with real-world video.

Methodology: Physics as the Universal Intermediate

RealWonder's key insight is to stop trying to make the neural network learn physics from scratch. Instead, it uses a Physics Simulator as a bridge.

1. 3D Scene Reconstruction

The pipeline starts by lifting a 2D image into 3D. It segments objects using SAM 2, estimates depth, and uses a Vision-Language Model (VLM) to guess material properties (e.g., is this cloth, liquid, or a rigid brick?).
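The "lifting" step can be pictured as unprojecting each pixel of the estimated depth map into a 3D point. The snippet below is a minimal sketch of that idea under a standard pinhole camera model. The function name `unproject_depth`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the tiny depth map are all illustrative assumptions, not details taken from the paper, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def unproject_depth(depth: List[List[float]], fx: float, fy: float,
                    cx: float, cy: float) -> List[List[Point3D]]:
    """Lift every pixel (u, v) with depth z to a camera-space 3D point.

    Pinhole model: X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z.
    """
    points = []
    for v, row in enumerate(depth):
        points.append([((u - cx) * z / fx, (v - cy) * z / fy, z)
                       for u, z in enumerate(row)])
    return points


# Hypothetical 2x2 depth map (in metres) and made-up camera intrinsics.
depth_map = [[2.0, 2.0],
             [4.0, 4.0]]
cloud = unproject_depth(depth_map, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
```

In a full pipeline, the segmentation masks would then decide which of these points belong to which object, and the guessed material would decide how the simulator treats them.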

2. The Physics Bridge

When a user applies an action (like a robot gripper closing), the Genesis physics engine calculates the movement. This "invisible" 3D movement is projected back into 2D as:

Optical Flow (F_t): Shows where pixels should move.
Coarse RGB Preview (Ṽ_t): Provides structural cues such as occlusions.
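To make the projection step concrete, here is a minimal sketch of how the simulated 3D motion of a single point could be turned into a 2D optical-flow vector: project the point before and after the simulation step, then take the pixel difference. The helper names `project` and `flow_from_motion` and the camera intrinsics are hypothetical. The article does not specify how the flow is actually computed, so this is only one plausible reading.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def project(point: Vec3, fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def flow_from_motion(before: Vec3, after: Vec3, fx: float, fy: float,
                     cx: float, cy: float) -> Vec2:
    """Optical flow of one point: its pixel position after minus before."""
    u0, v0 = project(before, fx, fy, cx, cy)
    u1, v1 = project(after, fx, fy, cx, cy)
    return (u1 - u0, v1 - v0)


# Hypothetical example: a point 2 m from the camera slides 0.1 m to the right.
flow = flow_from_motion((0.0, 0.0, 2.0), (0.1, 0.0, 2.0),
                        fx=100.0, fy=100.0, cx=32.0, cy=32.0)
```

With these made-up intrinsics the point moves about 5 pixels to the right and not at all vertically. Repeating this for every visible simulated point would give a dense flow field.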

Figure: System Architecture

3. Real-Time Distilled Generator

To make this interactive, the authors distilled a heavy video model into a 4-step causal student. By using Flow-based Noise Warping, they inject the physics-derived motion directly into the diffusion process, ensuring the AI "paints" the objects exactly where the physics engine says they should be.
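The intuition behind warping noise with flow can be sketched in a few lines: each noise sample is carried along its flow vector to a new pixel, and any pixel left empty receives fresh Gaussian noise. This is a deliberately simplified, nearest-pixel illustration and not the paper's actual warping scheme, which the article does not describe. The function name `warp_noise` and the toy inputs are assumptions.

```python
import random
from typing import List, Optional, Tuple

Flow = Tuple[float, float]  # (du, dv) displacement in pixels


def warp_noise(noise: List[List[float]], flow: List[List[Flow]],
               rng: random.Random) -> List[List[float]]:
    """Forward-warp a noise grid along a flow field (nearest pixel).

    Each source sample moves to (u + du, v + dv). Samples that leave the
    frame are dropped, and if two samples land on the same pixel the later
    one wins. Unfilled pixels get fresh N(0, 1) noise.
    """
    h, w = len(noise), len(noise[0])
    warped: List[List[Optional[float]]] = [[None] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            du, dv = flow[v][u]
            tu, tv = u + round(du), v + round(dv)
            if 0 <= tu < w and 0 <= tv < h:
                warped[tv][tu] = noise[v][u]
    return [[rng.gauss(0.0, 1.0) if x is None else x for x in row]
            for row in warped]


# Hypothetical 1x4 noise row where everything moves one pixel to the right.
rng = random.Random(0)
noise = [[0.1, 0.2, 0.3, 0.4]]
flow = [[(1.0, 0.0)] * 4]
warped = warp_noise(noise, flow, rng)
```

Because every output value is either a moved sample or a fresh draw, no sample is duplicated. The noise keeps its Gaussian statistics while its spatial layout follows the motion.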

Experimental Performance

The system was tested across diverse materials: rigid bodies, fluids, granular sand, and deformable cloth.

| Metric | CogVideoX-I2V | Tora (Drag-based) | RealWonder (Ours) |
| :--- | :--- | :--- | :--- |
| FPS | 0.225 | 0.107 | 13.2 |
| Phys. Plausibility | 0.624 | 0.578 | 0.705 |
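The FPS row is easier to read as speedup factors and per-frame latency. The short calculation below derives both from the reported numbers only. The variable names are illustrative.

```python
# Frames per second as reported in the table above.
fps = {"CogVideoX-I2V": 0.225, "Tora": 0.107, "RealWonder": 13.2}

# Speedup of RealWonder over each baseline (ratio of frame rates).
speedup = {name: fps["RealWonder"] / value
           for name, value in fps.items() if name != "RealWonder"}

# Average time budget per generated frame, in milliseconds.
ms_per_frame = {name: 1000.0 / value for name, value in fps.items()}

for name, factor in speedup.items():
    print(f"RealWonder is about {factor:.0f}x faster than {name}")
print(f"RealWonder spends about {ms_per_frame['RealWonder']:.0f} ms per frame")
```

This works out to roughly 59x over CogVideoX-I2V and 123x over Tora, with about 76 ms per frame for RealWonder versus several seconds per frame for the baselines.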

Figure: Experimental Results

As shown above, while baselines like Tora struggle to interpret 3D direction correctly (often moving boats forward instead of sideways), RealWonder strictly follows the simulated trajectory while the generative model adds realistic water splashes and lighting effects.

Deep Insights: Why It Works

The "magic" of RealWonder is its inductive bias. By forcing the video model to condition on optical flow derived from a real physics engine, the model doesn't have to "guess" the laws of gravity or collision. It only has to focus on making the movement look photorealistic (adding shadows, textures, and fine fluid detail such as splashes).

The SDEdit-based RGB conditioning further ensures that even if the physics simulation is "low-poly" or coarse, the final video remains visually grounded in the original high-resolution image.
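The core move in SDEdit-style conditioning is to start denoising from a partially noised copy of a guide image instead of from pure noise, so coarse structure survives while fine detail is regenerated. The sketch below shows only that noising step, using a generic variance-preserving mix. The function name `sdedit_start`, the `strength` parameter, and the noise schedule are assumptions for illustration. The article does not state which formulation RealWonder uses, and the denoiser itself is omitted.

```python
import math
import random
from typing import List


def sdedit_start(coarse: List[float], strength: float,
                 rng: random.Random) -> List[float]:
    """Blend a coarse guide frame with Gaussian noise.

    strength = 0 returns the guide unchanged; strength = 1 returns pure
    noise. Values in between keep coarse structure but leave room for the
    generator to repaint detail. Mixing weights are sqrt(1 - strength) and
    sqrt(strength), so unit-variance inputs stay unit-variance.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    keep = math.sqrt(1.0 - strength)
    mix = math.sqrt(strength)
    return [keep * x + mix * rng.gauss(0.0, 1.0) for x in coarse]


# Hypothetical flattened coarse preview (pixel values scaled to [-1, 1]).
coarse_frame = [0.5, -0.25, 0.0, 1.0]
noisy_start = sdedit_start(coarse_frame, strength=0.6, rng=random.Random(0))
```

A lower `strength` trusts the coarse simulator render more, while a higher one gives the generator more freedom to fix its artifacts.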

Conclusion & Future Work

RealWonder represents a shift in World Modeling: moving from "black-box" end-to-end learning toward a hybrid architecture where simulators handle the logic and neural networks handle the appearance.

Limitations: The system's accuracy still depends on the initial 3D reconstruction. If the depth estimate is wrong, the simulated motion will look "floaty." However, as 3D reconstruction models (like LRM or DUSt3R) improve, RealWonder's results should improve with them, offering a clear path toward truly interactive, physically accurate AI agents.

For more details, visit the project page: liuwei283.github.io/RealWonder
