[Project Gaze] Transforming VLA Models from Passive Observers to Active Perceivers
Abstract

The paper introduces a gaze-regularized training framework for Vision-Language-Action (VLA) models, specifically targeting robotic manipulation. By aligning a transformer's internal attention with human-like visual patterns through KL divergence, the method achieves SOTA results on benchmarks like LIBERO-Spatial, improving success rates from 85.9% to 95.5%.

TL;DR

Researchers have introduced a gaze-regularized training framework that significantly boosts the performance of Vision-Language-Action (VLA) models. By steering the model's internal attention to mirror human eye-movement patterns during training, they achieved a 9.6-percentage-point increase in success rate on key robotic benchmarks and faster convergence, all while keeping the model architecture identical for real-time deployment.

The Problem: Passive Perception in Robotics

Most modern VLA models (like OpenVLA or Pi-0) process video frames "passively." They ingest the entire image and hope the internal Transformer layers eventually figure out which pixels matter for the task. This is fundamentally different from how humans operate. When you pick up a medicine bottle, your eyes scan for the label and track the contact point before your hand even moves.

Current models suffer from:

  • Inefficient Training: Learning "where to look" and "how to move" simultaneously is a heavy computational burden.
  • Brittleness: Without a focused "attention prior," models are easily distracted by background clutter or lighting changes.
  • Black-box Decision Making: It is often unclear why a model failed, which makes these systems hard to trust in safety-critical tasks.

Methodology: Human Gaze as a Soft Inductive Bias

The researchers' key insight is that human gaze encodes intent, planning, and execution. Instead of requiring robots to wear eye-trackers, they use a pretrained Gaze Estimation model to generate "synthetic gaze" for existing datasets.

1. The Gaze Prior Pipeline

They use the Global-Local Correlation (GLC) network to produce heatmaps that capture both momentary fixations and anticipatory shifts. These heatmaps are temporally aggregated and mapped onto the same patch grid used by the Vision Transformer (ViT) backbone.
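As a concrete illustration, here is a minimal sketch of how per-frame gaze heatmaps could be aggregated onto a ViT patch grid. The function name, the exponential decay, and the grid size are assumptions for illustration, not the paper's exact aggregation scheme:

```python
import torch
import torch.nn.functional as F

def gaze_to_patch_prior(heatmaps: torch.Tensor, grid: int = 16, decay: float = 0.7) -> torch.Tensor:
    """Aggregate per-frame gaze heatmaps into a prior over ViT patches.

    heatmaps: (T, H, W) gaze heatmaps from the estimator, most recent last
    grid:     side length of the ViT patch grid (e.g. 16 -> 256 patches)
    decay:    exponential weight favoring the most recent fixations
    """
    T = heatmaps.shape[0]
    # Most recent frame gets weight 1; older frames decay geometrically.
    weights = decay ** torch.arange(T - 1, -1, -1, dtype=heatmaps.dtype)
    agg = (weights[:, None, None] * heatmaps).sum(dim=0)   # (H, W)
    # Average-pool the aggregated map down to the patch grid.
    prior = F.adaptive_avg_pool2d(agg[None, None], grid)   # (1, 1, grid, grid)
    prior = prior.flatten()
    return prior / prior.sum()  # normalize into a distribution over patches
```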

2. Attention Alignment

During training, a new loss term is added: a KL divergence that quantifies the mismatch between the model's internal cross-attention (where the language instruction "looks" at the image) and the human gaze distribution.
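Concretely, the combined objective takes the form $\mathcal{L} = \mathcal{L}_{\text{action}} + \lambda \, D_{\mathrm{KL}}(p_{\text{gaze}} \,\|\, p_{\text{attn}})$. Below is a minimal PyTorch sketch of such an alignment term; the direction of the KL and all names are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def gaze_alignment_loss(attn: torch.Tensor, gaze: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL(gaze || attn) between a gaze prior and the model's cross-attention.

    attn: (B, P) attention over P image patches (each row sums to 1)
    gaze: (B, P) gaze prior mapped onto the same patch grid
    """
    attn = attn.clamp_min(eps)
    gaze = gaze.clamp_min(eps)
    gaze = gaze / gaze.sum(dim=-1, keepdim=True)  # renormalize after clamping
    # Penalizes attention that puts little mass where humans looked.
    return (gaze * (gaze.log() - attn.log())).sum(dim=-1).mean()

def total_loss(action_loss, attn, gaze, lam=0.1):
    # lam is the regularization scale lambda studied in the ablations;
    # the term is dropped at inference, so the architecture is unchanged.
    return action_loss + lam * gaze_alignment_loss(attn, gaze)
```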

Model Architecture Figure: The Gaze-Regularized VLA Framework. Gaze distributions shape the internal attention during training, leaving a lightweight, gaze-free model for inference.

Experimental Results: Precision and Speed

The results on the LIBERO-Spatial benchmark—a suite requiring precise localization—were striking.

  • Success Rate: The regularized model hit 95.5%, nearly 10 percentage points above the 85.9% baseline.
  • Convergence: Gains were visible as early as 10k-20k training steps, suggesting that the gaze prior acts as a "shortcut" for the model to find relevant features.
  • Robustness: When the researchers introduced sensor noise or lighting changes, the "gaze-aware" model held its performance significantly better than the standard VLA.

Performance Comparison Table: Comparison across LIBERO suites. Note the consistent improvement (up to 11.8 points in multi-task settings).

Deep Insight: Why Does Soft Regularization Win?

An interesting finding in the ablation studies concerns the regularization scale $\lambda$: a low value performed best.

Why? Because human gaze is a hint, not a law. If the model is forced to match human gaze exactly (high $\lambda$), it loses the ability to discover machine-specific optimization patterns. Human gaze works best as a soft nudge toward the right object, leaving the model free to refine its precise motor control based on the actual action outcome.

Visual Evidence: Interpretability

One of the biggest wins for this method is interpretability. In the gaze-regularized model, the attention maps are sharply focused on the objects of interest (e.g., the handle of a drawer), whereas the baseline attention is often diffuse and "noisy."

Attention Visualization Figure: The gaze-regularized model (right) shows a highly localized focus on the target object compared to the baseline (left).
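Such overlays can be reproduced with a generic recipe: upsample the patch-level attention back to image resolution and alpha-blend it over the frame. This is a sketch under those assumptions, not code from the paper:

```python
import torch
import torch.nn.functional as F

def attention_overlay(attn: torch.Tensor, image: torch.Tensor, grid: int = 16, alpha: float = 0.4) -> torch.Tensor:
    """Upsample patch attention to image resolution and blend it over the frame.

    attn:  (grid*grid,) attention weights over image patches
    image: (3, H, W) tensor with values in [0, 1]
    """
    H, W = image.shape[-2:]
    heat = attn.reshape(1, 1, grid, grid)
    heat = F.interpolate(heat, size=(H, W), mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # scale to [0, 1]
    # A diffuse baseline washes out here; a focused map lights up the target object.
    return (1 - alpha) * image + alpha * heat.squeeze(0)
```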

Conclusion & Future Work

By bridging the gap between human cognitive patterns and robotic action, this work provides a blueprint for making VLAs more efficient and trustworthy. Because it is a training-only modification, it can be "retrofitted" onto almost any existing VLA architecture (demonstrated in the paper on both Pi-0 and OpenVLA).

Future directions include integrating real-time eye-tracking from expert demonstrations to move beyond synthetic gaze and exploring how these priors help in extreme multi-modality (tactile + vision).

Technical Takeaway: Gaze-regularization allows models to internalize the "where to look" strategy, effectively offloading the perceptual search burden from the action-decoding phase.
