SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

[CVPR 2026] SaPaVe: Giving Robots a "Moving Eye" for Active Perception and Complex Manipulation

Abstract

SaPaVe is an end-to-end Vision-Language-Action (VLA) framework designed for autonomous robots to perform active perception and manipulation in cluttered, dynamic scenes. It achieves State-of-the-Art (SOTA) performance by surpassing existing models like π0 and GR00T-N1, reaching up to a 31.25% higher success rate in real-world active manipulation tasks.

TL;DR

SaPaVe (Semantic active Perception and active-View execution) is a VLA framework that lets robots actively move their "heads" to find objects before manipulating them. By decoupling camera control from arm actions and training on a new 200k-sample dataset, it addresses the "blind spots" of fixed-view policies, reaching up to a 31.25% higher success rate than leading models like π0 and GR00T-N1 in real-world active manipulation tasks.

Background & Motivation: The "Fixed View" Trap

Most modern Vision-Language-Action (VLA) models act like a person wearing a neck brace—they can see what's directly in front of them, but they can't turn their heads to look for a hidden bowl or a handle tucked behind a cabinet door.

The authors identify two fatal flaws in prior work:

Semantic Gap: Turning high-level commands like "look inside the bottom drawer" into precise camera angles is difficult.
Execution Fragility: Even if a robot moves its camera, the shift in pixels confuses the manipulation policy, which was likely trained on static, perfect views.

Methodology: The Bottom-Up Rebirth

SaPaVe introduces a sophisticated architecture that treats "looking" and "doing" as separate but synchronized skills.

1. Decoupled Action Heads & Camera Adapter

Instead of forcing the robot to learn every joint movement (including the head) in one messy vector, SaPaVe splits the output. A specialized Camera Adapter (using LoRA) focuses on the "where to look" semantic mapping, while the Decoupled Action Head handles the high-DoF (26-DoF) arms.
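To make the split concrete, here is a minimal sketch of two separate output heads reading the same feature vector, with the camera branch carrying a LoRA-style low-rank update. This is an illustration under stated assumptions, not the authors' implementation: the feature size, the 2-DoF pan/tilt camera, the LoRA rank, and the plain linear layers are all hypothetical, and the article does not say where in the real model the adapter sits. Only the 26-DoF arm output comes from the text above.

```python
import random

rng = random.Random(0)

FEAT_DIM = 16    # hypothetical size of the shared backbone feature
CAMERA_DOF = 2   # assumption: pan and tilt of the head camera
ARM_DOF = 26     # from the article: dimension of the arm action
LORA_RANK = 4    # hypothetical LoRA rank


def rand_matrix(rows, cols, std=0.02):
    """Small random matrix stored as a list of rows."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, in_dim, out_dim, rank, alpha=8.0):
        self.W = rand_matrix(out_dim, in_dim)            # frozen base weight
        self.A = rand_matrix(rank, in_dim)               # trainable down-projection
        self.B = [[0.0] * rank for _ in range(out_dim)]  # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


class DecoupledPolicy:
    """Two heads on one feature vector: camera command and arm command."""

    def __init__(self):
        self.camera_head = LoRALinear(FEAT_DIM, CAMERA_DOF, LORA_RANK)
        self.arm_head = rand_matrix(ARM_DOF, FEAT_DIM)

    def __call__(self, features):
        return {
            "camera": self.camera_head(features),
            "arm": matvec(self.arm_head, features),
        }


policy = DecoupledPolicy()
features = [rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]
action = policy(features)
print(len(action["camera"]), len(action["arm"]))  # 2 26
```

Because `B` starts at zero, the adapter initially leaves the frozen weight's output unchanged, which is the usual LoRA initialisation. Keeping the two outputs in separate heads means the camera branch can be trained or swapped without touching the arm branch.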

2. Universal Spatial Knowledge Injection

To prevent the robot from getting "dizzy" when the camera moves, the authors inject 3D geometric data (depth, intrinsics) directly into the action denoising process. This ensures the robot maintains a persistent 3D understanding of the workspace, even as the pixels shift.
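The article does not describe how depth and intrinsics are encoded before injection. One common way to combine them, shown here purely as an assumption, is to back-project each depth pixel into a 3D point in the camera frame with the standard pinhole model. The function name and the toy values below are hypothetical.

```python
def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-frame 3D points (pinhole model).

    depth: list of rows, depth[v][u] in metres.
    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Returns a list of rows of (X, Y, Z) tuples.
    """
    points = []
    for v, row in enumerate(depth):
        points.append(
            [((u - cx) * z / fx, (v - cy) * z / fy, z) for u, z in enumerate(row)]
        )
    return points


# Toy 2x3 depth map: a flat surface 2 m in front of the camera.
depth = [[2.0, 2.0, 2.0],
         [2.0, 2.0, 2.0]]
points = backproject(depth, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
print(points[0][0])  # (-1.0, -0.5, 2.0)
print(points[1][2])  # (1.0, 0.5, 2.0)
```

Points expressed this way depend on the camera's geometry rather than on raw pixel positions, which is one plausible reason 3D inputs help a policy stay stable when the view changes.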

Figure 1: The SaPaVe Architecture featuring the Two-Stage training strategy and Decoupled Action Heads.

3. The Two-Stage Training Strategy

Stage 1 (Perception Alignment): Train the robot to "look" using the ActiveViewPose-200K dataset.
Stage 2 (Active Manipulation): Fine-tune the entire system on ActiveManip-Bench, a new benchmark designed specifically to test robots on tasks where the object is occluded or out-of-view.
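The two stages can be sketched as a small schedule that records which dataset each stage uses and which parts of the model are updated. The dataset names and the "fine-tune the entire system" setting for Stage 2 come from the description above; the parameter grouping and the choice to train only the camera adapter in Stage 1 are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical grouping of model parameters.
ALL_GROUPS = frozenset({"backbone", "camera_adapter", "action_head"})


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str
    trainable: frozenset


STAGES = [
    # Assumption: Stage 1 updates only the camera adapter.
    Stage("perception_alignment", "ActiveViewPose-200K", frozenset({"camera_adapter"})),
    # From the article: Stage 2 fine-tunes the entire system.
    Stage("active_manipulation", "ActiveManip-Bench", ALL_GROUPS),
]


def trainable_flags(stage):
    """Map each parameter group to whether this stage updates it."""
    return {group: group in stage.trainable for group in sorted(ALL_GROUPS)}


for stage in STAGES:
    print(stage.name, stage.dataset, trainable_flags(stage))
```

In a real training loop, flags like these would typically decide which parameters receive gradient updates in each stage.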

Results: Outperforming the Giants

The performance gains are most visible in "Out-of-View" tasks—scenarios where the robot starts facing the wrong way.

Perception Accuracy: Despite having only 2B parameters, SaPaVe's camera control outperformed Gemini-2.5-Pro by 11.6%.
Manipulation Success: On the ActiveManip-Bench, SaPaVe achieved an average success rate of 75.2%, while traditional fixed-view VLAs like GR00T-N1 plummeted to near zero on out-of-view tasks.
Real-World Prowess: In actual hardware tests, SaPaVe maintained an 85% success rate in occluded pick-and-place tasks, vastly outperforming π0 (45%).
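When reading these numbers, note that a gap between two success rates can be quoted either in percentage points or as a relative improvement, and the figures above do not say which is meant. A short sketch of both readings, using the occluded pick-and-place rates quoted above (85% vs. 45%); the helper names are hypothetical.

```python
def absolute_gain(ours, baseline):
    """Difference in percentage points."""
    return ours - baseline


def relative_gain(ours, baseline):
    """Improvement relative to the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0


sapave, pi0 = 85.0, 45.0
print(absolute_gain(sapave, pi0))            # 40.0
print(round(relative_gain(sapave, pi0), 1))  # 88.9
```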

Table 1: Real-world performance comparison of SaPaVe, π0, and GR00T-N1.

Deep Insight: Why It Works

The "Secret Sauce" is the bottom-up strategy. By training the camera control first (Stage 1), the model builds a "visual common sense" about where things are likely to be hidden based on language. When Stage 2 begins, the manipulation policy doesn't have to learn to see; it just has to learn to act on the now-reliable visual stream provided by the active head.

Conclusion & Future Look

SaPaVe makes a strong case that Active Perception is not just an "extra feature" but a requirement for general-purpose robotics. The introduction of ActiveViewPose-200K and ActiveManip-Bench gives the community much-needed tools to move beyond the "near-optimal fixed view" era.

Future Outlook: The authors suggest the next step is combining this with Mobile Manipulation—giving the robot not just a moving neck, but moving legs (base) to find objects across different rooms.
