[2026] ABot-PhysWorld: Aligning Video Generation with the Laws of Physics for Robotics

ABot-PhysWorld is a 14B Diffusion Transformer-based world foundation model designed for robotic manipulation. It introduces a novel physics-alignment framework using Diffusion-DPO and spatial action injection, achieving state-of-the-art performance on the PBench and the newly proposed EZSbench for physically plausible video generation.

TL;DR

ABot-PhysWorld is a 14B Diffusion Transformer (DiT) built as a "World Foundation Model" for robotic manipulation that respects physical constraints. General-purpose video models can produce "hallucinated" dynamics, such as objects passing through each other. ABot-PhysWorld instead combines a specialized Diffusion-DPO alignment stage with Parallel Spatial Action Injection so that generated robot manipulations are both visually convincing and physically feasible.

Positioning: This work moves beyond "visual realism" to "physical plausibility," establishing a new SOTA on the PBench and the new EZSbench for embodied AI.

The "Hallucination" Problem in Physical AI

Current SOTA video models like Sora or Veo 3.1 are masterpieces of visual synthesis, yet they are "physics-blind." Because they are trained on massive internet datasets via maximum likelihood estimation (MLE), they treat a video of a robot arm passing through a table as just another sequence of pixels.

The authors identify two key gaps:

  1. Data Mismatch: General datasets lack the "causal" signals of friction, mass, and collision response inherent in robotic manipulation.
  2. Objective Mismatch: Likelihood-based training doesn't penalize unphysical behavior—it just tries to match the pixel distribution.

Methodology: Engineering a Physics-Aware Brain

1. Data Curation & Physics-Aware Captioning

The team curated 3 million clips from major embodied datasets (Agibot, OXE, etc.). Crucially, they moved beyond simple "robot picks up apple" captions. Using a multi-level annotation system, they describe macroscopic intent, microscopic trajectories, and causal physical modeling (e.g., "the apple drops due to gravity after the gripper opens").
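
To make the three annotation levels concrete, here is a minimal sketch of how one such caption could be represented. The class name, field names, and the `to_prompt` helper are hypothetical, as are the first two example sentences. The summary only names the three levels and quotes the gravity example.

```python
from dataclasses import dataclass


@dataclass
class PhysicsAwareCaption:
    """Hypothetical container for the three caption levels named above."""
    intent: str          # macroscopic intent: what the robot is trying to do
    trajectory: str      # microscopic trajectory: how the arm and gripper move
    causal_physics: str  # causal physical modeling: why the scene changes

    def to_prompt(self) -> str:
        # Join the levels in coarse-to-fine order into one text prompt.
        return " ".join([self.intent, self.trajectory, self.causal_physics])


caption = PhysicsAwareCaption(
    intent="The robot places the apple into the bowl.",
    trajectory="The gripper moves above the bowl and then opens.",
    causal_physics="The apple drops due to gravity after the gripper opens.",
)
prompt = caption.to_prompt()
print(prompt)
```

Keeping the levels as separate fields, rather than one free-form string, would make it straightforward to drop or reweight a level during training, although the summary does not say whether the authors do this.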

2. Physical Preference Alignment (Diffusion-DPO)

To fix unphysical generations, the authors adopted Direct Preference Optimization (DPO) for diffusion.

  • The Proposer: A Qwen3-VL 32B model analyzes a starting frame and generates a "physical checklist" (e.g., "Does the gripper penetrate the object?").
  • The Scorer: A Gemini 3 Pro model evaluates multiple candidate videos against this checklist to identify "winners" (physically sound) and "losers" (physically broken).
  • The Training: The 14B DiT is then fine-tuned using LoRA to increase the probability of generating "winner" trajectories.
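
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation. The checklist verdicts are made-up booleans, `select_preference_pair` is a hypothetical helper, and the loss follows the commonly used Diffusion-DPO form, which compares denoising errors of the trained model against a frozen reference model. The summary does not give the exact objective or the value of `beta`.

```python
import math


def select_preference_pair(checklist_results):
    """Pick (winner, loser) candidate ids from per-item pass/fail verdicts.

    checklist_results maps a candidate id to a non-empty list of booleans,
    one per checklist item. Returns None when every candidate has the same
    pass rate, because a tie carries no preference signal.
    """
    pass_rate = {cid: sum(v) / len(v) for cid, v in checklist_results.items()}
    winner = max(pass_rate, key=pass_rate.get)
    loser = min(pass_rate, key=pass_rate.get)
    if pass_rate[winner] == pass_rate[loser]:
        return None
    return winner, loser


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def diffusion_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta=1.0):
    """Diffusion-DPO style loss for one (winner, loser) pair.

    err_w / err_l: squared denoising error of the model being trained
    (here, the LoRA-adapted DiT) on the winner / loser video.
    err_w_ref / err_l_ref: the same errors under the frozen reference model.
    beta is an assumed scale; the summary does not state the value used.
    """
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    return -log_sigmoid(-beta * margin)


# Made-up scorer verdicts for three candidate videos and three checklist items.
results = {
    "video_a": [True, True, False],
    "video_b": [True, True, True],
    "video_c": [False, False, True],
}
pair = select_preference_pair(results)
print(pair)
print(diffusion_dpo_loss(0.5, 1.0, 1.5, 1.0))
```

The loss falls below `log(2)` once the trained model denoises the winner better than the reference does while denoising the loser worse, which is the direction the fine-tuning is meant to push.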

Model Architecture Figure: The Two-Stage Training Pipeline—from SFT to Physics-Aligned DPO.

3. Action-Conditioned Control

To make the model a true "Action-to-Video" (A2V) generator, the authors project 7D robotic actions onto 2D Action Maps. These maps are fed into the model through Parallel Context Blocks. By freezing the main DiT and only training these side-blocks, the model learns to follow a specific controller without losing its pre-trained understanding of the world.
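
As a rough illustration of projecting an action onto a 2D map, the sketch below uses a pinhole camera model. The details are assumptions, since the summary does not specify how the maps are rendered. The 7D action is taken to be `(x, y, z, roll, pitch, yaw, gripper)` in the camera frame, the intrinsics `fx, fy, cx, cy` are hypothetical, orientation is ignored, and the gripper opening (in [0, 1]) is encoded as the pixel value.

```python
def project_point(xyz, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    x, y, z = xyz
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def render_action_map(action, height, width, fx, fy, cx, cy):
    """Render one 7D action as a single-channel height x width map.

    Background pixels are 0.0. The pixel under the end-effector holds
    0.5 + 0.5 * gripper, so a closed gripper (0.0) is still visible as 0.5.
    """
    x, y, z, _roll, _pitch, _yaw, gripper = action
    u, v = project_point((x, y, z), fx, fy, cx, cy)
    action_map = [[0.0] * width for _ in range(height)]
    col, row = int(round(u)), int(round(v))
    if 0 <= row < height and 0 <= col < width:
        action_map[row][col] = 0.5 + 0.5 * gripper
    return action_map


# End-effector on the optical axis, 1 m away, gripper fully open.
amap = render_action_map(
    (0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0),
    height=12, width=16, fx=10.0, fy=10.0, cx=8.0, cy=6.0,
)
print(amap[6][8])
```

Because the map has the same spatial layout as the video frames, it can be passed to side blocks as an image-like conditioning signal, which matches the spatial-injection idea described above.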

Action Injection Figure: The Parallel Context Block architecture for cross-embodiment action injection.

Experiments: Surpassing the Benchmarks

The model was evaluated on PBench and on EZSbench (Embodied Zero-Shot Benchmark), a new benchmark proposed in this work. EZSbench is particularly difficult because it pairs unseen robots with unseen tasks and environments.

| Model | Domain Score (Physics) | Quality Score (Visual) |
| :--- | :---: | :---: |
| Sora v2 Pro | 0.7626 | 0.7679 |
| Veo 3.1 | 0.8350 | 0.7740 |
| ABot-PhysWorld (Ours) | 0.9306 | 0.7676 |

Key Finding: Sora v2 Pro and Veo 3.1 match or slightly exceed ABot-PhysWorld on the visual Quality Score (0.7679 and 0.7740 versus 0.7676), but ABot-PhysWorld leads by a wide margin on the physics Domain Score (0.9306 versus 0.8350 for Veo 3.1 and 0.7626 for Sora v2 Pro). Qualitative results show that while other models exhibit "magnetic grasping" or "phantom objects," ABot-PhysWorld maintains rigid object geometry and realistic contact dynamics.

Qualitative Results Figure: Comparison of physical consistency. Note how ABot-PhysWorld avoids the object penetration seen in baselines.

Critical Insight & Future Outlook

ABot-PhysWorld makes a strong case that Post-Training Alignment (DPO) is as critical for World Models as it was for LLMs. Visual "Likelihood" is not enough for Embodied AI; we need "Physical Preference."

Limitations: The model currently operates on fixed viewpoints. The next frontier is Multi-view Consistency and scaling to even more complex, deformable object manipulations (like folding clothes or pouring liquids) where physical laws are even harder to model.

Takeaway for Practitioners: If you are building a simulator for VLA policies, stop relying on general T2V models. Structured action injection and DPO-based physics grounding are the required "ingredients" for a reliable world model.
