Phantom is a physics-infused video generation framework that jointly models visual content and latent physical dynamics. By augmenting the Wan2.2-TI2V backbone with a dedicated physics branch using V-JEPA2 embeddings, it achieves SOTA performance in physical plausibility across benchmarks like VideoPhy and Physics-IQ.
TL;DR
While modern AI video models can generate visually stunning clips, they often fail "Physics 101": balls stop mid-air, and liquids appear out of nowhere. Phantom addresses this by adding a "Physics Brain" (a dedicated latent branch) to the generative process. By jointly predicting how physical states evolve alongside pixels, it achieves large gains in physical consistency (+50.4% relative on VideoPhy) without sacrificing the high-fidelity aesthetics of models like Wan2.2.
The "World Model" Illusion: Why Scaling Isn't Enough
We often hear that large-scale video models are becoming "world simulators." However, recent research (Motamed et al., 2025) suggests these models are mostly memorizing patterns rather than understanding rules. They are excellent at texture but mediocre at trajectory.
The root of the problem lies in the Next-Frame Prediction objective. It forces the model to minimize pixel-wise loss, which prioritizes local appearance over global physical logic. If a ball hits the floor, the model knows what a ball looks like, but it doesn't "know" it has momentum that must be conserved, leading to the unnatural stops we see in current SOTA models.
Methodology: The Dual-Branch Brain
Phantom’s core insight is that physics should be a first-class citizen in the architecture, not an afterthought.
1. The Physics Latent Space
Instead of using a rigid physics engine (which would limit the model to simple shapes), Phantom uses V-JEPA2 embeddings. These are "physics-aware" representations learned through self-supervised tasks that inherently capture object permanence and collisions.
2. Dual-Branch Flow Matching
The architecture (see Figure 2) consists of:
- Visual Branch: A frozen Wan2.2 transformer that maintains high-quality image synthesis.
- Physics Branch: A parallel transformer that predicts how the V-JEPA2 embeddings evolve.
- Cross-Modal Coupling: The two branches talk to each other via bidirectional cross-attention (Vis-Attention and Phy-Attention).
Figure 2: The architecture couples a pretrained video generator with a dedicated physics branch. Note the dual cross-attention layers that allow information exchange between visual cues and physical reasoning.
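The bidirectional coupling can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the shared token width `d`, the residual update, and the names `cross_attention`, `coupled_block`, `vis_from_phy`, and `phy_from_vis` are all assumptions made for illustration. The source does not say which direction "Vis-Attention" and "Phy-Attention" each refer to, so the sketch names the directions explicitly instead.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` tokens read from `context` tokens."""
    q = queries @ w_q                          # (n_q, d)
    k = context @ w_k                          # (n_c, d)
    v = context @ w_v                          # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_q, n_c)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

def coupled_block(vis_tokens, phy_tokens, params):
    """Hypothetical coupling step: each branch attends to the other, then adds a residual."""
    vis_update, _ = cross_attention(vis_tokens, phy_tokens, *params["vis_from_phy"])
    phy_update, _ = cross_attention(phy_tokens, vis_tokens, *params["phy_from_vis"])
    return vis_tokens + vis_update, phy_tokens + phy_update

# Toy inputs: 6 visual tokens and 4 physics tokens, both of width 8 (assumed sizes).
rng = np.random.default_rng(0)
d = 8
vis = rng.normal(size=(6, d))
phy = rng.normal(size=(4, d))
params = {
    "vis_from_phy": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "phy_from_vis": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
}

new_vis, new_phy = coupled_block(vis, phy, params)
_, attn = cross_attention(vis, phy, *params["vis_from_phy"])
print(new_vis.shape, new_phy.shape, attn.shape)
```

Each branch keeps its own token count and only exchanges information through the attention weights, which is the property the caption highlights. A real model would use multi-head attention, normalization layers, and learned projections.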
Experiments: Real-World Physics in Action
The researchers tested Phantom on "hard" physics scenarios: bouncing balls, pouring liquids, and deforming objects.
Quantitative SOTA
Phantom doesn't just look better; it measures better. On the VideoPhy benchmark, Phantom raised the Physical Commonsense (PC) score from the baseline's 25.2 to 37.9, a 50.4% relative improvement.
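The 50.4% figure is a relative gain over the baseline, not a difference in score points. A short check of the arithmetic, using only the two scores quoted above (the helper name `relative_improvement` is illustrative):

```python
def relative_improvement(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0

# VideoPhy Physical Commonsense scores quoted in the text.
pc_gain = relative_improvement(25.2, 37.9)
print(f"{pc_gain:.1f}%")  # prints 50.4%
```

In absolute terms the gain is 12.7 points (37.9 minus 25.2).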
Qualitative Mastery
| Scenario | Base Model Failure | Phantom Success |
| :--- | :--- | :--- |
| Dropping a Ball | Ball hits the floor and abruptly loses all momentum. | Ball bounces naturally according to real-world dynamics. |
| Pouring Juice | Liquid "teleports" to the glass bottom before pouring starts. | Glass stays empty until the liquid stream reaches it. |
| Viscous Flow | Liquid falls into a "void" without building up layers. | Fluid builds up folds and waves, respecting viscosity. |
Figure 1: Comparison between Wan2.2-TI2V and Phantom. Phantom (right) correctly models the bouncing momentum and fluid timing that the base model misses.
Deeper Insight: Recursive Loss-Weight Scheduling
One technical hurdle the authors faced was that the physics branch's gradients were much larger than the visual branch's, which often caused training to crash. They solved this with Recursive Loss-Weight Scheduling, a cyclic approach that keeps the physics loss from overwhelming the visual generator while still ensuring the physics branch learns the dynamics.
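The exact schedule is not given here, so the following is only a sketch of what a cyclic loss weight could look like. The cosine ramp, the restart at each cycle boundary, and every name and default value (`cyclic_physics_weight`, `cycle_len`, `w_min`, `w_max`, `total_loss`) are assumptions for illustration, not the authors' formulation.

```python
import math

def cyclic_physics_weight(step, cycle_len=1000, w_min=0.05, w_max=1.0):
    """Hypothetical schedule: cosine ramp from w_min toward w_max, restarting every cycle."""
    phase = (step % cycle_len) / cycle_len            # position in the cycle, in [0, 1)
    ramp = 0.5 * (1.0 - math.cos(math.pi * phase))    # rises smoothly from 0 toward 1
    return w_min + (w_max - w_min) * ramp

def total_loss(loss_visual, loss_physics, step):
    """Combine the two branch losses, down-weighting physics early in each cycle."""
    return loss_visual + cyclic_physics_weight(step) * loss_physics

# Inspect the weight at a few points across two cycles.
for step in (0, 250, 500, 999, 1000):
    print(step, round(cyclic_physics_weight(step), 3))
```

The idea this illustrates is that the physics term is repeatedly scaled down and then allowed to grow, so its larger gradients never dominate the update for long. Whether the real method ramps, decays, or adapts the weight to gradient norms is not stated in this summary.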
Conclusion & The Path to Force-Conditioning
Phantom represents a significant step toward true world models. Beyond generating videos, the authors showed it can also handle Force-Prompting: simulating how an object should move when a specific "point force" is applied in a specific direction.
The Takeaway: If we want AI to interact with the physical world (e.g., in robotics), we can't just feed it more pixels. We need to teach it the latent grammar of the universe. Phantom provides the technical blueprint for doing exactly that.
Limitations
While Physics and Commonsense scores soared, the "Diversity" metric saw a dip. This suggests a classic trade-off: as a model becomes more "correct" (physically constrained), it loses some of the "hallucinated creative variety" found in unconstrained models. Future work will likely focus on balancing this physical grounding with creative agency.
