DiffusionAnything is a unified image-space diffusion policy that bridges meter-scale navigation and centimeter-scale pre-grasping motion planning. By employing multi-scale FiLM conditioning and trajectory-aligned depth reasoning, it achieves state-of-the-art performance in zero-shot generalization to novel scenes from only 5 minutes of self-supervised data per task.
Executive Summary
TL;DR: DiffusionAnything is a compact, end-to-end diffusion policy that enables a single robot model to handle both wide-area navigation (meters) and precise pre-grasping (centimeters). By replacing heavy semantic reasoning with lightweight multi-scale FiLM conditioning and trajectory-aligned depth, it achieves robust zero-shot generalization while running at 10 Hz on standard onboard hardware.
Positioning: This work moves beyond the "modular vs. VLA" debate. It provides a middle ground: a unified, geometry-informed policy that is as flexible as a foundation model but as efficient as a task-specific controller.
The Core Challenge: The Scale Gap
In robotics, navigation and manipulation have historically lived in different "worlds." Navigation is about global routing and meter-scale obstacle avoidance, while manipulation demands centimeter-level dexterity.
- The Modular Problem: Cascading separate models leads to "hand-off" failures and high latency.
- The VLA Problem: Large models like RT-2 or GR00T require massive compute and often "hallucinate" in novel spaces because they lack explicit geometric understanding.
DiffusionAnything asks: Can we use a single Diffusion UNet to solve both, simply by telling the model the 'scale' and 'focus' of the current task?
Methodology: Context-Aware Diffusion
The authors' secret sauce is Contextual Modulation. Instead of just feeding an image into a Diffusion model, they inject a context vector $\mathbf{c}$ that includes (a minimal sketch of how such a vector might be packed follows the list):
- Task Mode: Navigation vs. Pre-grasping.
- Depth Scale: Is 1 unit in the model a meter or a centimeter?
- Spatial Attention: Where should the robot look (the floor or the apple)?
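To make this concrete, here is a minimal PyTorch sketch of how such a context vector might be assembled. The field names, the one-hot task encoding, and the log-scaled depth feature are my assumptions for illustration, not the paper's released interface.

```python
import torch

# Hypothetical packing of the context vector c; field layout is an assumption.
def build_context(task_mode: str, depth_scale_m: float,
                  focus_xy: tuple[float, float]) -> torch.Tensor:
    """Pack task mode, depth scale, and a spatial focus point into one vector."""
    mode_onehot = torch.tensor(
        [1.0, 0.0] if task_mode == "navigation" else [0.0, 1.0]
    )
    # Log scale so meters (1.0) and centimeters (0.01) are comparably spaced.
    scale_feat = torch.log10(torch.tensor([depth_scale_m]))
    focus = torch.tensor(focus_xy)  # normalized image coordinates in [0, 1]
    return torch.cat([mode_onehot, scale_feat, focus])  # shape (5,)

c_nav   = build_context("navigation",   1.00, (0.5, 0.8))  # attend to the floor
c_grasp = build_context("pre-grasping", 0.01, (0.6, 0.4))  # attend to the apple
```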
Multi-Scale FiLM Architecture
The architecture uses Feature-wise Linear Modulation (FiLM) at different scales of the UNet. For navigation, the model prioritizes coarse scales to understand the global layout. For manipulation, it amplifies fine-scale features to pinpoint object edges.
Fig 1: The Context-aware cross-task diffusion policy. Notice how the same weights are modulated to change behavior.
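Below is a minimal sketch of what FiLM conditioning at one UNet scale could look like. The layer sizes and the linear mapping from $\mathbf{c}$ to per-channel (gamma, beta) pairs are assumptions on my part; the paper's exact block design may differ.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """One UNet scale whose features are affinely modulated by the context c."""
    def __init__(self, channels: int, ctx_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # One (gamma, beta) pair per feature channel, predicted from c.
        self.to_film = nn.Linear(ctx_dim, 2 * channels)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv(x))
        gamma, beta = self.to_film(c).chunk(2, dim=-1)
        # Broadcast over spatial dims: same weights, behavior set by context.
        return gamma[:, :, None, None] * h + beta[:, :, None, None]

coarse = FiLMBlock(channels=256, ctx_dim=5)  # low-res features: global layout
fine   = FiLMBlock(channels=64,  ctx_dim=5)  # high-res features: object edges

x = torch.randn(1, 64, 96, 96)   # fine-scale feature map
c = torch.randn(1, 5)            # batched context vector (cf. build_context above)
y = fine(x, c)                   # same shape, behavior modulated by c
```

Because the affine parameters come entirely from $\mathbf{c}$, the same convolutional weights can emphasize coarse layout features in navigation mode and fine edge features in pre-grasping mode.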
Trajectory-Aligned Depth
One of the most elegant optimizations here is Trajectory-Aligned Depth. Predicting a full depth map is slow. Instead, DiffusionAnything predicts depth only at the specific waypoints along the trajectory it intends to follow. This focuses the "computational budget" exactly where it matters for collision checking.
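A rough sketch of the idea, assuming waypoints are projected into normalized image coordinates and the feature map is bilinearly sampled at those points (both assumptions on my part, not the authors' published code):

```python
import torch
import torch.nn.functional as F

def depth_at_waypoints(feat: torch.Tensor,         # (B, C, H, W) UNet features
                       waypoints_uv: torch.Tensor,  # (B, K, 2) in [0, 1] image coords
                       depth_head: torch.nn.Module  # small head: C -> 1
                       ) -> torch.Tensor:
    """Query depth only at K intended waypoints instead of decoding a dense map."""
    grid = waypoints_uv * 2.0 - 1.0                 # grid_sample expects [-1, 1]
    grid = grid.unsqueeze(2)                        # (B, K, 1, 2)
    sampled = F.grid_sample(feat, grid, align_corners=False)  # (B, C, K, 1)
    sampled = sampled.squeeze(-1).permute(0, 2, 1)             # (B, K, C)
    return depth_head(sampled).squeeze(-1)          # (B, K) depth per waypoint

feat = torch.randn(1, 64, 30, 40)
waypoints = torch.rand(1, 8, 2)                     # 8 candidate waypoints
head = torch.nn.Linear(64, 1)                       # tiny per-waypoint depth head
d = depth_at_waypoints(feat, waypoints, head)       # shape (1, 8)
```

The cost then scales with the number of waypoints rather than with image resolution, which is where the latency saving comes from.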
Experiments: Real-World Superiority
The model was tested on a Unitree G1 humanoid. The training was remarkably efficient: only 5 minutes of video data per task was needed, thanks to a self-supervised pipeline (AnyTraverse) that automatically labels traversability and goals.
Key Performance Metrics:
- Navigation: 100% success rate in autonomous goal selection (0.29 m accuracy).
- Pre-grasping: 4.71 cm mean error, a 6x improvement in precision over the navigation mode using the same model weights.
- Efficiency: 2.0 GB VRAM usage at 10 Hz—making it truly "onboard-ready."
Fig 2: Goal prediction accuracy comparison showing the model's ability to switch from coarse (navigation) to fine (manipulation) precision.
Zero-Shot Generalization
When compared to GR00T n1.6, DiffusionAnything excelled in novel scenes where the VLA model failed. The authors argue this is because DiffusionAnything learns geometric principles (where is the floor? how far away is the object?) rather than just memorizing visual patterns.
Critical Analysis & Future Outlook
Why it works: By conditioning on "depth scale," the model effectively learns a relative coordinate system that adapts to the task. It doesn't need to choose between being a "navigator" or a "manipulator"; it is a "spatial reasoner" that adjusts its resolution.
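As a toy illustration of that point (my framing, not code from the paper): the network can emit motion in task-relative units, and the conditioned depth scale maps the same output to very different physical displacements.

```python
normalized_step = 0.3                   # hypothetical network output in task-relative units

step_nav_m   = normalized_step * 1.00   # navigation mode: 1 unit = 1 m  -> 0.30 m
step_grasp_m = normalized_step * 0.01   # pre-grasp mode:  1 unit = 1 cm -> 0.003 m
```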
Limitations:
- The model currently lacks force feedback (haptic sensing).
- It relies on the quality of the AnyTraverse supervisor for initial training.
Future Impact: This research paves the way for truly generalist robots that can be deployed in a home and learn a new task (like opening a specific drawer) in under 15 minutes of observation, all while running on a single GPU.
Takeaway: If you want a robot to generalize, don't just give it more data—give it better geometric inductive biases.
