LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

[CVPR 2024] LiLo-VLA: Breaking the Long-Horizon Barrier with Linked Object-Centric Policies

Abstract

LiLo-VLA is a modular robotics framework that combines classical motion planning with Vision-Language-Action (VLA) models to achieve long-horizon manipulation. By decoupling global "reaching" from local "interaction" and using an object-centric VLA, it achieves a 69% success rate on complex long-horizon tasks, significantly outperforming monolithic baseline models like Pi0.5.

Executive Summary

Long-horizon robotic manipulation (tasks like cleaning a kitchen or preparing a meal) has long been the "holy grail" of robotics. While Vision-Language-Action (VLA) models have shown promise on atomic skills, they often break down when those skills are chained together. LiLo-VLA (Linked Local VLA) introduces a modular paradigm that separates where to go (Global Transport) from what to do (Local Interaction).

By leveraging classical motion planners for reaching and a specialized, object-centric VLA for interaction, this framework achieves zero-shot generalization to novel task sequences and extreme temporal scalability (up to 16 steps), outperforming state-of-the-art monolithic models like Pi0.5 by over 40% in success rate.

The Problem: Why End-to-End VLAs Fail at Depth

Current VLA paradigms face two fundamental bottlenecks:

Combinatorial Explosion: Training a model to handle every possible sequence of actions requires an astronomical amount of demonstration data.
Cascading Failures: End-to-end models often overfit to global visual features. A slight misalignment in step 1 leads to a failure in step 2, with no inherent mechanism to "reset" or recover.
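A rough way to see why cascading failures hurt long chains: if each step succeeds with the same probability and failures are independent with no recovery, the chance of finishing the whole task shrinks geometrically with its length. The sketch below is only an illustration; the independence assumption and the 0.95 per-step rate are hypothetical and do not come from the paper.

```python
def chain_success(per_step_success: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps and no recovery."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must lie in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


# Hypothetical per-step success rate of 0.95 (not a figure from the paper).
for steps in (1, 4, 8, 16):
    print(f"{steps:2d} steps -> {chain_success(0.95, steps):.3f}")
```

Under these assumptions, even a policy that is reliable on each atomic skill finishes a 16-step chain less than half the time, which is the motivation for an explicit reset-and-retry mechanism.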

Most models also struggle with Observation Space Shift (OSS), where task-irrelevant changes to the scene (such as a mug being moved in the background) pull the policy's attention away from the primary task.

Methodology: Decoupling for Robustness

LiLo-VLA solves these issues through a dual-module architecture:

1. The Reaching Module (Global Transport)

Instead of forcing the neural network to learn basic geometric navigation, LiLo-VLA uses MPLib (a motion planning library) to compute collision-free paths to the vicinity of a target object. To bridge the gap between planning and execution, the system perturbs the initial state during training, teaching the VLA to tolerate small inaccuracies in where the planner leaves the arm.
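Initial state perturbation can be pictured as jittering the nominal approach pose before each training episode, so the interaction policy sees slightly different starting points. The following is a minimal sketch under that reading; the `Pose` type, the `perturb_start_pose` helper, and the offset limits are all hypothetical and are not taken from the paper or from MPLib.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Simplified end-effector pose (hypothetical): position in metres, yaw in radians."""
    x: float
    y: float
    z: float
    yaw: float


def perturb_start_pose(
    approach: Pose,
    rng: random.Random,
    max_pos_offset: float = 0.02,  # assumed limit, metres
    max_yaw_offset: float = 0.10,  # assumed limit, radians
) -> Pose:
    """Return a start pose sampled uniformly in a small box around the approach pose."""
    def jitter(limit: float) -> float:
        return rng.uniform(-limit, limit)

    return Pose(
        x=approach.x + jitter(max_pos_offset),
        y=approach.y + jitter(max_pos_offset),
        z=approach.z + jitter(max_pos_offset),
        yaw=approach.yaw + jitter(max_yaw_offset),
    )


# One perturbed start pose per training episode.
rng = random.Random(0)
nominal = Pose(x=0.50, y=0.00, z=0.30, yaw=0.0)
starts = [perturb_start_pose(nominal, rng) for _ in range(3)]
```

Uniform noise in a fixed box is just one simple choice; the actual distribution and magnitudes used by the authors are not stated here.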

2. The Interaction Module (Object-Centric VLA)

Once the arm is in position, the "Interaction Module" takes over. Key innovations here include:

Wrist-Only Perception: Deliberately ignoring third-person cameras to avoid global visual distractions.
Visual Masking & Random Erasing: Non-target objects are masked out, forcing the policy to focus exclusively on the object of interest.
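The two augmentations above can be illustrated on a toy grayscale image stored as nested lists. This is a sketch only: the function names, the fill value, and the rectangle-size rule are assumptions, and the random-erasing step follows the generic "blank out a random rectangle" recipe and not any detail confirmed by the paper.

```python
import random
from typing import List

Image = List[List[int]]   # rows of grayscale pixel values
Mask = List[List[bool]]   # True where the pixel belongs to the target object


def mask_non_target(image: Image, target_mask: Mask, fill: int = 0) -> Image:
    """Keep target pixels and overwrite everything else with `fill`."""
    return [
        [px if keep else fill for px, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, target_mask)
    ]


def random_erase(image: Image, rng: random.Random,
                 max_frac: float = 0.5, fill: int = 0) -> Image:
    """Return a copy with one randomly placed rectangle overwritten by `fill`."""
    height, width = len(image), len(image[0])
    erase_h = rng.randint(1, max(1, int(height * max_frac)))
    erase_w = rng.randint(1, max(1, int(width * max_frac)))
    top = rng.randint(0, height - erase_h)
    left = rng.randint(0, width - erase_w)
    out = [row[:] for row in image]  # copy so the input is left untouched
    for r in range(top, top + erase_h):
        for c in range(left, left + erase_w):
            out[r][c] = fill
    return out


# Toy 4x4 wrist image whose 2x2 centre is the target object.
image = [[9] * 4 for _ in range(4)]
target = [[1 <= r <= 2 and 1 <= c <= 2 for c in range(4)] for r in range(4)]
masked = mask_non_target(image, target)
augmented = random_erase(masked, random.Random(0))
```

In a real pipeline the target mask would come from a segmentation or pose-based projection step, which is outside the scope of this sketch.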

Fig 1: The LiLo-VLA architecture decoupling transport and interaction phases.

3. Closed-Loop Recovery

If a skill fails (e.g., the robot misses a grasp), the system doesn't just stop. It triggers a recovery loop: the Reaching Module resets the arm to the "approach pose," and the Interaction Module retries the skill. If an object is dropped, the system semantically backtracks to the last "Pick" action.
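The recovery behaviour described above can be sketched as a small executor loop. Everything here is hypothetical scaffolding: the `run_plan` function, the callback signatures, the `(verb, object)` skill encoding, and the global failure budget are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Skill = Tuple[str, str]  # (verb, object), e.g. ("pick", "mug")


def last_pick_index(plan: List[Skill], step: int) -> int:
    """Index of the most recent "pick" at or before `step`, or `step` if none exists."""
    for i in range(step, -1, -1):
        if plan[i][0] == "pick":
            return i
    return step


def run_plan(
    plan: List[Skill],
    reach: Callable[[Skill], None],     # planner moves the arm to the approach pose
    interact: Callable[[Skill], bool],  # local policy; returns True on success
    object_dropped: Callable[[], bool],
    max_failures: int = 5,              # assumed budget so the loop always terminates
) -> bool:
    """Execute skills in order, retrying failures and backtracking after a drop."""
    step = 0
    failures = 0
    while step < len(plan):
        reach(plan[step])               # also serves as the reset before a retry
        if interact(plan[step]):
            step += 1
            continue
        failures += 1
        if failures > max_failures:
            return False
        if object_dropped():
            step = last_pick_index(plan, step)  # redo the grasp first
        # otherwise fall through and retry the same skill
    return True
```

A caller would pass in its own planner, policy, and drop detector as the three callbacks; keeping them as plain functions makes the control flow easy to test without a robot.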

Experimental Results: Pushing the Limits

The researchers tested LiLo-VLA on LIBERO-Long++ and a custom Ultra-Long suite.

Zero-Shot Compositionality: When skill orders were permuted (Variant tasks), monolithic models like Pi0.5 collapsed to a 0% success rate because they "memorized" sequences. LiLo-VLA maintained an 85% success rate.
Extreme Scalability: In a 16-step "Table Organization" task, LiLo-VLA achieved a 43-53% success rate, while baselines failed to complete even the first few steps correctly.

Table 1: Quantitative comparison showing LiLo-VLA's dominance in long-horizon tasks.

Real-World Validation

Deployed on a Franka Emika Panda, LiLo-VLA was evaluated on 8 distinct tasks. Even when the sequence was changed or the layout was cluttered with distractors, it held an 85% average success rate, which supports the method's real-world viability.

Fig 2: Real-world rollout showing successful multi-step manipulation with distractors.

Critical Insights & Conclusion

LiLo-VLA demonstrates that modularity is a feature, not a bug. By specializing the VLA's "receptive field" to the object level and offloading the "easy" math of path planning to classical algorithms, we can build agents that:

Are Data Efficient: Learning $N$ skills requires $O(N)$ data, not $O(N!)$ sequence variations.
Are Robust: Recovery mechanisms prevent a single error from ending a task.
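To get a feel for the gap between $O(N)$ and $O(N!)$, the short sketch below counts one demonstration set per skill against one per ordering of $N$ distinct skills. The function names are hypothetical, and treating every full ordering as a separate training case is a simplifying assumption used only to illustrate the scaling.

```python
import math


def per_skill_datasets(num_skills: int) -> int:
    """Modular setting: one demonstration set per skill."""
    return num_skills


def per_sequence_datasets(num_skills: int) -> int:
    """Monolithic setting (simplified): one demonstration set per ordering of all skills."""
    return math.factorial(num_skills)


for n in (4, 8, 16):
    print(f"N={n:2d}: per-skill={per_skill_datasets(n):2d}, "
          f"per-sequence={per_sequence_datasets(n):,}")
```

For 8 skills this already gives 40,320 orderings against 8 per-skill datasets.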

Limitations: The system still relies on external perception models (like FoundationPose) for 6D poses. If the pose estimator fails due to extreme occlusion or transparency, the Reaching Module will target the wrong location. Future work on Active Perception, where the robot moves to find a better view, is a natural next step for this framework.
