DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

[ArXiv 2025] DiT4DiT: Transforming Video Generation into the Ultimate Foundation for Robot Control

Abstract

DiT4DiT is an end-to-end Video-Action Model (VAM) that couples a Video Diffusion Transformer with an Action Diffusion Transformer using a dual flow-matching objective. It achieves SOTA performance on robot manipulation benchmarks like LIBERO (98.6% success) and RoboCasa (50.8% success) by treating video generation as a scaling proxy for physical dynamics.

TL;DR

DiT4DiT shifts the paradigm of robotic learning from "Language-Action" to "Video-Action." By coupling a Video Diffusion Transformer (Video DiT) with an Action DiT, the model learns the "laws of physics" through video generation and uses that intuition to guide robot movement. The result? A 10x boost in sample efficiency and state-of-the-art success rates on LIBERO and RoboCasa.

Background: The "Physical Blindness" of Static VLAs

Most current Vision-Language-Action (VLA) models (like RT-2 or OpenVLA) are built on top of Large Language Models (LLMs) trained on static internet images. While these models are "smart" in a semantic sense, they are "physically blind." They don't inherently understand how a cup moves through space or how a drawer slides open. All that temporal and physical knowledge must be learned from scratch using expensive, labeled robot trajectory data.

DiT4DiT poses a provocative question: What if we used video generation as the foundation? Since video models learn to predict the next frame, they must implicitly understand gravity, collisions, and continuity.

Methodology: The Power of Dual Flow-Matching

The core innovation is the Dual-DiT architecture. Instead of just predicting a future frame and then deciding what to do, DiT4DiT extracts "features from the process of imagining the future."

1. The Tri-Timestep Scheme

Training a joint model is tricky because video generation needs to explore all "noise levels" to be a good generator, but an action policy needs a stable, consistent visual signal. DiT4DiT solves this with three distinct timesteps:

$\tau_{v}$ (Video): Uniformly sampled to train the video generator.
$\tau_{f}$ (Feature): A fixed, deterministic timestep used to extract stable features for the action model.
$\tau_{a}$ (Action): Beta-distributed sampling to focus the action model on critical control phases.
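
To make the three timesteps concrete, here is a minimal sketch of how one training step might draw them. The fixed feature timestep value and the Beta shape parameters are assumptions chosen for illustration; the summary above does not state them, and `sample_timesteps` is a hypothetical helper name.

```python
import random

# Assumed values for illustration only; the source does not state them.
FEATURE_TIMESTEP = 0.5            # hypothetical fixed feature timestep
BETA_ALPHA, BETA_BETA = 1.5, 1.0  # hypothetical Beta shape parameters


def sample_timesteps(rng: random.Random) -> dict:
    """Draw one (video, feature, action) timestep triple for a training step."""
    tau_v = rng.uniform(0.0, 1.0)                   # video: uniform over noise levels
    tau_f = FEATURE_TIMESTEP                        # feature: fixed and deterministic
    tau_a = rng.betavariate(BETA_ALPHA, BETA_BETA)  # action: Beta-distributed
    return {"tau_v": tau_v, "tau_f": tau_f, "tau_a": tau_a}


rng = random.Random(0)
batch = [sample_timesteps(rng) for _ in range(4)]
for sample in batch:
    print(sample)
```

The point of the separation is visible in the output: the video and action timesteps vary from sample to sample, while the feature timestep stays constant, so the action model always reads features from the same noise level.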

Overall Architecture

2. Intermediate Denoising Features

The authors discovered that the best features for control aren't at the very end of the video generation (pixel-perfect) or at the beginning (pure noise). By "hooking" into Layer 18 of the Video DiT, the model captures a mid-level abstraction that contains enough physical context for a robot to act without being bogged down by pixel-level details.
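
The "hooking" idea can be sketched without any deep learning framework: run the full stack of layers, but keep a copy of the activations at one chosen depth. Everything below is a toy stand-in. The layers are simple scale-and-shift functions rather than transformer blocks, the total depth of 28 is a hypothetical number, and whether "Layer 18" is counted from 0 or 1 is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(scale: float) -> Layer:
    """Stand-in for a transformer block: scales and shifts each element."""
    return lambda x: [scale * v + 0.01 for v in x]


def forward_with_tap(layers: List[Layer], x: Vector,
                     tap_index: int) -> Tuple[Vector, Optional[Vector]]:
    """Run all layers and return (final_output, activations_at_tap_index)."""
    tapped: Optional[Vector] = None
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == tap_index:
            tapped = list(x)  # copy so later layers cannot alter it
    return x, tapped


NUM_LAYERS = 28  # hypothetical depth; the text only names Layer 18
TAP_LAYER = 18   # layer named in the text (0-based indexing is an assumption)

layers = [make_toy_layer(1.0 + 0.01 * i) for i in range(NUM_LAYERS)]
noisy_latent = [0.1, -0.2, 0.3]
final_out, mid_features = forward_with_tap(layers, noisy_latent, TAP_LAYER)
```

In this sketch `mid_features` plays the role of the representation handed to the action model, while `final_out` corresponds to the video model's own prediction. In a framework such as PyTorch the same pattern is typically implemented with a forward hook on the chosen block.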

Experimental Results: Efficiency and Zero-Shot Mastery

DiT4DiT was tested on the LIBERO and RoboCasa benchmarks, as well as a real-world Unitree G1 humanoid.

Sample Efficiency: DiT4DiT achieved better results with 1,000 trajectories than other models did with 10,000.
High Precision: In the "Arrange Flower" task (inserting a thin stem into a vase), DiT4DiT achieved a 75% success rate, while the LLM-based GR00T-N1.5 sat at only 25%.
Zero-Shot Generalization: When presented with unseen objects (e.g., swapping a plastic cup for a glass one) or changing the number of objects, DiT4DiT remained robust because it understood the spatial relationship of the task, not just the pixels.

Performance Comparison

Why It Works: Joint Training Induction

The secret sauce is joint training. Optimizing the video loss and the action loss together "regularizes" the video model's latent space so that it is useful for actions. t-SNE visualizations show that the joint training objective induces a smooth temporal flow (Early -> Middle -> Late phases) in the feature space, which is absent when the models are trained separately.
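
As a rough illustration of what "optimizing the video loss and action loss together" means, the sketch below computes a flow-matching loss for each stream and sums them. It assumes the common straight-path convention (interpolate from noise to data, regress the velocity `data - noise`) and an `action_weight` of 1.0; neither is confirmed by the summary above. The toy "model" always predicts zero velocity, and the conditioning of the action model on video features is omitted to keep the example short.

```python
import random
from typing import Callable, List

Vector = List[float]


def flow_matching_loss(predict_velocity: Callable[[Vector, float], Vector],
                       data: Vector, noise: Vector, t: float) -> float:
    """MSE between predicted and target velocity on the straight path noise -> data."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target = [d - n for n, d in zip(noise, data)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(data)


def joint_loss(video_loss: float, action_loss: float,
               action_weight: float = 1.0) -> float:
    """Weighted sum of both losses (the weight is a hypothetical parameter)."""
    return video_loss + action_weight * action_loss


rng = random.Random(0)
video_latent = [0.5, -0.5, 0.25, 0.0]  # toy stand-in for a video latent
action_chunk = [0.1, 0.2]              # toy stand-in for an action vector
video_noise = [rng.gauss(0.0, 1.0) for _ in video_latent]
action_noise = [rng.gauss(0.0, 1.0) for _ in action_chunk]


def zero_model(x_t: Vector, t: float) -> Vector:
    """Untrained toy predictor: always outputs zero velocity."""
    return [0.0] * len(x_t)


l_video = flow_matching_loss(zero_model, video_latent, video_noise, t=0.3)
l_action = flow_matching_loss(zero_model, action_chunk, action_noise, t=0.7)
total = joint_loss(l_video, l_action, action_weight=1.0)
```

Because `total` depends on both terms, a gradient step on it would update the shared video representation using the action error as well, which is the mechanism the regularization argument relies on.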

Ablation Studies

Critical Analysis & Future Outlook

DiT4DiT is effective, but it currently runs at 6 Hz on a single RTX 4090. That is sufficient for many tasks, yet slower than pure VLA models (13 Hz). However, because the LLM features used for language conditioning are static, they can be cached to increase speed.
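
The caching argument is easy to see in a toy control loop: the instruction does not change between steps, so its features only need to be computed once per episode. In the sketch below, `encode_instruction` and `control_step` are hypothetical stand-ins (the real encoder is an LLM, not a character hash), and the last two lines simply convert the reported rates into per-step latency budgets.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=128)
def encode_instruction(instruction: str) -> Tuple[float, ...]:
    """Hypothetical stand-in for the language encoder; returns a toy embedding."""
    return tuple(float(ord(c) % 7) for c in instruction)


def control_step(instruction: str, observation: float) -> float:
    """One control tick: reuse cached text features, process the new observation."""
    text_features = encode_instruction(instruction)
    return observation + sum(text_features) / len(text_features)


instruction = "put the cup in the drawer"
outputs = [control_step(instruction, obs) for obs in (0.0, 0.1, 0.2)]

# The encoder ran once; the other two steps were served from the cache.
cache_stats = encode_instruction.cache_info()

# Per-step latency budget implied by the reported control rates.
budget_ms_6hz = 1000.0 / 6.0    # about 167 ms per action
budget_ms_13hz = 1000.0 / 13.0  # about 77 ms per action
```

How much of the gap between 6 Hz and 13 Hz caching would close depends on how much of each step is spent in the language encoder, which the summary does not report.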

The Bigger Picture: This work supports the claim that video generation is a scalable proxy for robot policy learning. We may no longer need millions of labeled "Action" data points if we have billions of unlabeled "Video" data points. By learning how the world looks and moves, robots can finally learn how to act.

Conclusion

DiT4DiT proves that the future of generalist robots lies in unified world models. By treating "imagination" (video generation) and "execution" (action prediction) as two sides of the same coin, we move closer to robots that can truly navigate the complexity of the human world with minimal supervision.
