[CVPR 2026] PhysInOne: The "ImageNet Moment" for AI Physical World Models
Abstract

PhysInOne is a large-scale synthetic dataset for visual physics learning and reasoning, featuring 2 million videos spanning 153,810 dynamic 3D scenes. It covers 71 fundamental physical phenomena across mechanics, optics, fluid dynamics, and magnetism, establishing a new benchmark for physics-grounded world models.

TL;DR

PhysInOne is an unprecedented synthetic dataset containing 2 million videos designed to teach AI the laws of physics. Covering mechanics, optics, fluids, and magnetism, it provides the scale necessary to move beyond visually "hallucinated" videos toward physically grounded world simulators. By fine-tuning models on this suite, researchers achieved significant gains in motion fidelity, though complex 3D reasoning remains an open challenge.

The Motivation: Why AI Still Fails "Physics 101"

Despite the brilliance of models like Sora or Kling, they frequently fail at basic physical intuition: a ball might pass through a wall, or a fluid might not exhibit buoyancy. The root cause is data scarcity: existing physics datasets are either too small or oversimplified (e.g., just cubes and spheres). To build a true "World Simulator," we need a dataset that is to physics what ImageNet was to object recognition—massive, diverse, and precisely annotated.

Methodology: Building a Universe in a Box

The creators of PhysInOne didn't just record videos; they built a multi-engine simulation pipeline to ensure "Physical Ground Truth."

1. The Simulation Stack

Instead of relying on a single engine, the authors combined a specialized trifecta (a toy dispatch sketch follows the list):

  • Chaos Physics (UE5): For rigid body dynamics and complex mechanical constraints.
  • Taichi (MPM): For high-fidelity deformable objects and granular substances (sand, plasticine).
  • Doriflow (SPH): For realistic fluid dynamics (Newtonian and non-Newtonian).
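To make the division of labor concrete, here is a minimal sketch of how scene objects might be routed to the three solvers. The class, function, and material labels are illustrative assumptions, not the authors' actual pipeline code.

```python
# Hypothetical routing of scene objects to simulation backends, mirroring the
# trifecta above. Names and categories are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    material: str  # "rigid", "deformable", "granular", "fluid", ...

def pick_solver(obj: SceneObject) -> str:
    """Map a material category to the engine that would simulate it."""
    if obj.material == "rigid":
        return "Chaos Physics (UE5)"   # rigid bodies, mechanical constraints
    if obj.material in ("deformable", "granular"):
        return "Taichi (MPM)"          # plasticine, sand, elastic solids
    if obj.material == "fluid":
        return "SPH fluid solver"      # Newtonian / non-Newtonian liquids
    raise ValueError(f"Unknown material: {obj.material}")

scene = [SceneObject("crate", "rigid"),
         SceneObject("sand_pile", "granular"),
         SceneObject("water", "fluid")]
for obj in scene:
    print(obj.name, "->", pick_solver(obj))
```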

2. Multi-Physics Complexity

Real-world scenes are rarely "single-variable." PhysInOne emulates this by combining 71 basic phenomena into 3,284 unique activities, ranging from simple gravity to triple-physics interactions (e.g., wind hitting a car rolling up a slope with friction).
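As a toy illustration of this composition, the sketch below encodes the slope example as an "activity" built from three base phenomena. All field names and parameter values are hypothetical, not the dataset's actual schema.

```python
# Illustrative (not from the paper) encoding of a multi-physics activity
# composed from base phenomena, with ground-truth physical parameters attached.
from dataclasses import dataclass, field

@dataclass
class Activity:
    phenomena: list[str]                 # base phenomena being combined
    objects: list[str]                   # participating assets
    params: dict[str, float] = field(default_factory=dict)  # ground-truth physics

car_on_slope = Activity(
    phenomena=["gravity", "friction", "wind_force"],   # triple-physics interaction
    objects=["car", "slope"],
    params={"slope_angle_deg": 15.0, "friction_coeff": 0.6, "wind_speed_mps": 8.0},
)
print(len(car_on_slope.phenomena), "phenomena combined:", car_on_slope.phenomena)
```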

Figure 1: Overall pipeline. The PhysInOne dataset spans 71 phenomena, involving 2,231 objects across 528 backgrounds.

A New Metric: Physical Motion Fidelity (PMF)

Standard metrics like FVD evaluate how "real" a video looks to the eye, but they miss how "right" the motion is. The authors introduce PMF, a Fourier-based motion metric (a rough sketch follows the list below).

  • Insight: By comparing the energy spectra of a generated video vs. a reference, PMF measures motion dynamics while being invariant to simple pixel shifts or color changes.
  • Why it matters: It forces the model to be evaluated on its kinematic accuracy rather than its ability to render pretty textures.
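The following is a rough stand-in for such a spectrum comparison, not the authors' exact PMF formula: it compares the temporal energy spectra of frame-to-frame changes, which makes it blind to a constant brightness or color offset. Note it is framed here as a distance (lower = closer dynamics), whereas the paper reports a score where higher is better.

```python
# Rough, assumption-laden sketch of a Fourier-based motion-fidelity comparison.
# This is NOT the paper's PMF definition; it only illustrates the spectrum idea.
import numpy as np

def motion_spectrum(video: np.ndarray) -> np.ndarray:
    """video: (T, H, W) grayscale frames. Returns the temporal power spectrum
    of frame-to-frame intensity changes, averaged over all pixels."""
    diffs = np.diff(video, axis=0)                  # motion signal; constant offsets cancel
    spectrum = np.abs(np.fft.rfft(diffs, axis=0)) ** 2
    return spectrum.mean(axis=(1, 2))               # mean energy per temporal frequency

def pmf_like_distance(generated: np.ndarray, reference: np.ndarray) -> float:
    """L1 distance between normalized motion spectra (lower = closer dynamics)."""
    g, r = motion_spectrum(generated), motion_spectrum(reference)
    g, r = g / (g.sum() + 1e-8), r / (r.sum() + 1e-8)
    return float(np.abs(g - r).sum())

# A globally brightened copy has identical motion dynamics, so it scores ~0.
rng = np.random.default_rng(0)
ref = rng.random((16, 32, 32))
print(pmf_like_distance(ref + 0.1, ref))   # ~0.0: invariant to a global color offset
```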

Experimental Results: The Gap in 3D Reasoning

The paper benchmarks four critical tasks: Video Generation, Future Prediction, Property Estimation, and Motion Transfer.

Key Finding 1: Fine-tuning Works

Fine-tuning models like SVD or Wan2.2 on PhysInOne drastically increased their PMF scores (e.g., Wan2.2 improved from 2.04 to 2.97). This suggests that models can inherit physical priors if the data is representative enough.

Table 1: Quantitative improvements in PMF and human ratings across different video models.

Key Finding 2: The "3D Novel View" Wall

While models could predict future frames for "seen" camera angles, their performance plummeted for novel viewpoints. This reveals that current AI doesn't yet have a 3D-consistent internal model of physics—it is still largely performing 2D pattern matching.

Figure 2: Qualitative examples of long-term future frame prediction. Trained views remain stable, while complex dynamics still lead to artifacts.

Critical Analysis & Conclusion

PhysInOne is a massive leap forward. By providing 2 million videos with exact physical parameters, it removes the "data bottleneck" for world models.

Takeaways for Practitioners:

  1. Synthetic Data is Key: High-quality physics data cannot be easily scraped from the web (which is full of "magic" and editing). Simulation is the only way to get true supervision.
  2. Multimodal Physics: The inclusion of text descriptions (e.g., "collision displaced the box") allows Large Multimodal Models (LMMs) to link verbal descriptions to physical outcomes.
  3. The Road Ahead: The failure of current models on "Property Estimation" (guessing the Young's modulus or viscosity from video) shows that AI is still far from being a "virtual scientist."

PhysInOne establishes the benchmark; now it's up to the community to build the architectures that can actually solve it.
