FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation

FAVLA: Synchronizing Semantic Logic with Physical Reflexes via Force-Adaptive VLA

Abstract

FAVLA is a force-adaptive Vision-Language-Action (VLA) model designed for contact-rich manipulation tasks, utilizing a "fast-slow" architecture to decouple semantic reasoning from reactive control. By predicting future force variations, it adaptively scales its inference frequency, outperforming baselines by 13.8% in success rate and achieving State-of-the-Art (SOTA) performance in high-precision assembly.

TL;DR

Robotic manipulation is often a game of two speeds: the "slow" deliberation of vision and language, and the "fast" reflexes required by physical contact. FAVLA (Force-Adaptive VLA) introduces a dual-speed architecture that treats high-frequency force feedback as a first-class citizen. By predicting future force volatility, the model dynamically ramps up its control frequency during critical contact moments, leading to a 13.8% boost in success rates and a significant reduction in damaging impact forces.

The "Frequency Gap" in Robotic Foundation Models

Current Vision-Language-Action (VLA) models, such as π0, face a fundamental bottleneck: sensor mismatch. Cameras typically run at 15-30 Hz, while force/torque sensors provide data at 1000 Hz. Standard pipelines downsample everything to the slowest rate. In contact-rich tasks like gear assembly or USB insertion, this creates a "blind spot" where the robot cannot react to a snag or a slip until it's too late.
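The blind spot can be made concrete with a small sketch. It assumes a synthetic force trace sampled at 1000 Hz with a 10 ms contact spike, and a policy that only sees samples aligned with a 30 Hz clock. The spike shape, the force values, and the function names are illustrative assumptions, not data or code from the paper.

```python
FORCE_HZ = 1000   # force/torque sensor rate quoted in the text
CAMERA_HZ = 30    # upper end of the camera range quoted in the text


def make_force_trace(duration_ms=100, spike_start_ms=40, spike_len_ms=10,
                     spike_newtons=15.0, rest_newtons=1.0):
    """Return one synthetic force reading (in N) per millisecond, i.e. 1000 Hz."""
    spike_end_ms = spike_start_ms + spike_len_ms
    return [spike_newtons if spike_start_ms <= t < spike_end_ms else rest_newtons
            for t in range(duration_ms)]


def downsample(trace, src_hz, dst_hz):
    """Keep only the samples that line up with the slower clock."""
    step = src_hz / dst_hz
    kept = []
    k = 0
    while round(k * step) < len(trace):
        kept.append(trace[round(k * step)])
        k += 1
    return kept


trace = make_force_trace()
seen_by_policy = downsample(trace, FORCE_HZ, CAMERA_HZ)

print("samples in raw trace:", len(trace))
print("samples seen at 30 Hz:", len(seen_by_policy))
print("true peak force (N):", max(trace))
print("peak seen at 30 Hz (N):", max(seen_by_policy))
```

In this toy trace the 30 Hz samples land at roughly 0 ms, 33 ms, and 67 ms, so the spike between 40 ms and 50 ms never reaches the policy at all.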

Prior work treated force tokens as just another input, but the authors of FAVLA argue that semantic understanding and corrective control operate on different manifolds and time scales.

Methodology: The Fast-Slow Fusion

FAVLA’s architecture is split into two distinct yet communicating systems:

Slow VLM Backbone: This component uses a large VLM (based on PaliGemma) to process images, language, and historical force data. It generates a KV cache that serves as a "semantic summary" of the scene.
Fast Action Expert (AE): A smaller, agile transformer that consumes the KV Cache and the latest high-frequency force signals to refine action chunks.
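The division of labor between the two systems can be sketched as a control loop in which the expensive pass runs occasionally and the cheap pass runs on every step against the cached result. This is a minimal sketch under stated assumptions: `slow_backbone`, `fast_action_expert`, `SlowContext`, the `slow_every` ratio, and the toy correction rule are hypothetical placeholders, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class SlowContext:
    """Stand-in for the cached semantic summary (the KV cache in the text)."""
    step_created: int
    summary: str


def slow_backbone(step, image, instruction, force_history):
    """Hypothetical placeholder for the expensive VLM pass."""
    return SlowContext(step_created=step, summary=f"{instruction} @ step {step}")


def fast_action_expert(context, latest_force):
    """Hypothetical placeholder for the small transformer.

    Toy rule: back off in proportion to the latest measured force.
    """
    return {"context_step": context.step_created,
            "correction": -0.1 * latest_force}


def run(force_stream, slow_every=4):
    """Refresh the slow context every `slow_every` steps; act on every step."""
    context = None
    log = []
    for step, force in enumerate(force_stream):
        if step % slow_every == 0:
            context = slow_backbone(step, image=None,
                                    instruction="insert gear",
                                    force_history=force_stream[:step])
        log.append(fast_action_expert(context, force))
    return log


log = run([0.0, 0.5, 2.0, 6.0, 1.0, 0.2])
for entry in log:
    print(entry)
```

Each fast step reuses the most recent slow context but always reads the newest force sample, which is the decoupling the two components above describe.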

The Force Adapter and Variance Head

Instead of simple concatenation, FAVLA uses a Force Adapter to inject force features directly into multiple layers of the Action Expert via cross-attention. Simultaneously, a Force Variance Head predicts the expected volatility of force in the near future.
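To illustrate the difference between concatenation and cross-attention injection, the sketch below lets each action token attend over a set of force features and adds the result back as a residual. It is a deliberately simplified assumption: single-head attention with no learned projections, plain Python lists instead of tensors, and hypothetical function names. The `force_variance` helper computes a plain population variance over a window, a simple stand-in for the kind of volatility the Force Variance Head is said to predict; the source does not specify how that head is built or trained.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def inject_force(action_tokens, force_features):
    """Residual update: each action token plus its attention over force features."""
    attended = cross_attention(action_tokens, force_features, force_features)
    return [[a + b for a, b in zip(tok, att)]
            for tok, att in zip(action_tokens, attended)]


def force_variance(window):
    """Population variance of a window of scalar force readings."""
    mean = sum(window) / len(window)
    return sum((x - mean) ** 2 for x in window) / len(window)


tokens = [[1.0, 0.0], [0.0, 1.0]]
force_feats = [[0.5, 0.5], [2.0, -1.0], [0.0, 0.0]]
updated = inject_force(tokens, force_feats)
print(updated)
print(force_variance([0.0, 2.0, 0.0, 2.0]))
```

In a real model the same injection would be repeated in several layers of the Action Expert, with learned query, key, and value projections.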

[Figure: Overall Architecture]

Force-Adaptive Inference: Reactivity on Demand

The standout feature of FAVLA is its adaptive frequency schedule.

Free-space motion: The system runs at a base frequency to conserve compute.
Imminent Contact: As the variance head predicts a force spike, the AE execution frequency ($n_t$) increases.

This mimics the human physical intuition of slowing down and paying closer attention just as we are about to plug in a delicate connector.
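One way to picture the schedule is as a clipped mapping from predicted force variance to the number of Action Expert executions per slow cycle. The source does not give the actual mapping, so everything here is an assumption for illustration: the linear interpolation, the thresholds `var_low` and `var_high`, the bounds `n_base` and `n_max`, and the reading of $n_t$ as a per-cycle execution count.

```python
def ae_steps(predicted_variance, n_base=1, n_max=8, var_low=0.5, var_high=4.0):
    """Map predicted force variance to a number of AE executions (n_t).

    Below var_low the base rate is used; above var_high the rate saturates;
    in between the count grows linearly. All constants are hypothetical.
    """
    if predicted_variance <= var_low:
        return n_base
    if predicted_variance >= var_high:
        return n_max
    frac = (predicted_variance - var_low) / (var_high - var_low)
    return n_base + round(frac * (n_max - n_base))


# Free-space motion (low variance) through imminent contact (high variance).
predicted = [0.1, 1.0, 2.0, 3.0, 5.0]
schedule = [ae_steps(v) for v in predicted]
for v, n in zip(predicted, schedule):
    print(f"predicted variance {v:.1f} -> n_t = {n}")
```

A clipped mapping keeps compute bounded in both directions: the base rate sets the floor during free-space motion, and the cap prevents a noisy variance estimate from demanding an unbounded number of fast steps.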

[Figure: Inference Strategy]

Experimental Validation: Precision without Destruction

The model was tested on a dual-arm robot performing high-precision industrial tasks.

SOTA Comparison

FAVLA outperformed strong baselines like ForceVLA and TA-VLA across the board. In the Gear Assembly task, a maneuver requiring millimeter-level precision, FAVLA achieved a 93.3% success rate.

Force Regulation

One of the most critical metrics in industrial settings is the Peak Contact Force. Excessive force triggers safety stops or breaks parts. FAVLA reduced peak forces in Gear Assembly from 12.0 N (standard π0) down to 7.7 N, suggesting that high-rate closed-loop control allows for a "gentler" touch.
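For a sense of scale, the two peak-force figures quoted above can be turned into a relative reduction. The helper name is a hypothetical convenience; the only inputs are the 12.0 N and 7.7 N values from the text.

```python
def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved."""
    return (baseline - improved) / baseline


baseline_peak_n = 12.0  # standard pi0, Gear Assembly (from the text)
favla_peak_n = 7.7      # FAVLA, Gear Assembly (from the text)

reduction = relative_reduction(baseline_peak_n, favla_peak_n)
print(f"absolute drop: {baseline_peak_n - favla_peak_n:.1f} N")
print(f"relative drop: {reduction:.1%}")
```

That works out to a drop of about 4.3 N, or roughly 36% relative to the baseline peak.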

[Figure: Results Comparison]

Critical Insight & Future Outlook

FAVLA’s results suggest that frequency awareness is as important as modality awareness. By allowing the model to "stare harder" (increase frequency) at the force data when it expects a collision, it bridges the gap between high-level reasoning and the messy reality of physics.

Limitations: Currently, the model relies on a specific force/torque sensor setup. Future iterations could explore whether "pseudo-force" signals derived from vision-based tactile sensors could achieve similar results, further lowering the hardware barrier for industrial deployment.

Summary of Takeaways

Dynamic Frequency: Fixed-rate execution is a bottleneck for reactive robotics.
Temporal Decoupling: Isolate expensive VLM reasoning from cheap, high-rate AE corrections.
Force Safety: Higher reactivity leads directly to lower peak impact forces, essential for industrial safety.
