FAVLA is a force-adaptive Vision-Language-Action (VLA) model designed for contact-rich manipulation tasks, utilizing a "fast-slow" architecture to decouple semantic reasoning from reactive control. By predicting future force variations, it adaptively scales its inference frequency, outperforming baselines by 13.8% in success rate and achieving State-of-the-Art (SOTA) performance in high-precision assembly.
TL;DR
Robotic manipulation is often a game of two speeds: the "slow" deliberation of vision and language, and the "fast" reflexes required by physical contact. FAVLA (Force-Adaptive VLA) introduces a dual-speed architecture that treats high-frequency force feedback as a first-class citizen. By predicting future force volatility, the model dynamically ramps up its control frequency during critical contact moments, leading to a 13.8% boost in success rates and a significant reduction in damaging impact forces.
The "Frequency Gap" in Robotic Foundation Models
Current Vision-Language-Action (VLA) models, such as π0, face a fundamental bottleneck: sensor mismatch. Cameras typically run at 15-30 Hz, while force/torque sensors provide data at 1000 Hz. Standard pipelines downsample everything to the slowest rate. In contact-rich tasks like gear assembly or USB insertion, this creates a "blind spot" where the robot cannot react to a snag or a slip until it's too late.
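To make the blind spot concrete, here is a minimal sketch of what naive downsampling does to a short contact spike. The rates come from the figures above; the force trace, the 5 ms spike, and the function names are hypothetical and chosen only for illustration.

```python
# Rates quoted in the text: camera at up to 30 Hz, force/torque sensor at 1000 Hz.
CAMERA_HZ = 30
FORCE_HZ = 1000


def samples_per_frame(force_hz, camera_hz):
    """Force readings that arrive between two consecutive camera frames."""
    return force_hz / camera_hz


def downsample(signal, factor):
    """Naive decimation: keep every factor-th sample and drop the rest."""
    return signal[::factor]


# Hypothetical 100 ms force trace at 1 kHz (values in newtons).
trace = [0.5] * 100
trace[40:45] = [9.0] * 5  # a 5 ms contact spike (made-up values)

# Decimating to roughly camera rate keeps only indices 0, 33, 66, 99.
kept = downsample(trace, FORCE_HZ // CAMERA_HZ)

print(samples_per_frame(FORCE_HZ, CAMERA_HZ))  # about 33 readings per frame
print(max(trace), max(kept))                   # the spike is absent from `kept`
```

In this toy trace the spike falls entirely between two retained samples, so a controller that only sees the decimated signal never observes it.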
Prior work treated force tokens as just another input, but the FAVLA authors argue that semantic understanding and corrective control operate on different manifolds and time scales.
Methodology: The Fast-Slow Fusion
FAVLA’s architecture is split into two distinct yet communicating systems:
- Slow VLM Backbone: This component uses a large VLM (based on PaliGemma) to process images, language, and historical force data. It generates a KV Cache—a "semantic summary" of the scene.
- Fast Action Expert (AE): A smaller, agile transformer that consumes the KV Cache and the latest high-frequency force signals to refine action chunks.
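The division of labor between the two systems can be sketched as a two-rate loop. This is a toy illustration under stated assumptions, not the authors' implementation: `slow_backbone`, `fast_action_expert`, `SemanticCache`, the refresh interval, and the proportional back-off rule are all hypothetical placeholders for the real networks.

```python
from dataclasses import dataclass


@dataclass
class SemanticCache:
    """Stand-in for the KV cache produced by the slow backbone."""
    step_created: int
    summary: str


def slow_backbone(step, instruction, force_history):
    # Placeholder for the expensive VLM forward pass (hypothetical).
    return SemanticCache(step, f"{instruction} ({len(force_history)} force samples seen)")


def fast_action_expert(cache, latest_force, limit=5.0, gain=0.1):
    # Placeholder for the small transformer: reuse the cached summary and
    # back off in proportion to any force above a made-up limit.
    correction = -gain * max(0.0, latest_force - limit)
    return cache.step_created, correction


def run(force_stream, slow_every):
    """Refresh the cache every `slow_every` ticks; run the fast expert on every tick."""
    cache, slow_calls, outputs = None, 0, []
    for t, force in enumerate(force_stream):
        if t % slow_every == 0:
            cache = slow_backbone(t, "insert plug", force_stream[:t])
            slow_calls += 1
        outputs.append(fast_action_expert(cache, force))
    return slow_calls, outputs


slow_calls, outputs = run([0.0] * 6 + [8.0] * 4, slow_every=5)
print(slow_calls)   # slow passes over 10 ticks
print(outputs[7])   # tick 7 reuses the cache built at tick 5
```

The point of the structure is that the expensive pass runs only occasionally, while every tick still gets a correction based on the newest force reading.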
The Force Adapter and Variance Head
Instead of simple concatenation, FAVLA uses a Force Adapter to inject force features directly into multiple layers of the Action Expert via cross-attention. Simultaneously, a Force Variance Head predicts the expected volatility of force in the near future.
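As a rough sketch of the two ideas, the snippet below implements a single-head cross-attention step in which action tokens query force features, plus one plausible regression target for a variance head. Both are simplifications: the real adapter presumably uses learned projections at several layers, and the text does not specify how the variance target is defined, so `future_force_variance` and its `horizon` parameter are assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(action_tokens, force_tokens):
    """Action tokens (queries) attend over force tokens (keys and values).

    Simplification: no learned projections, one head, residual connection.
    """
    d = len(force_tokens[0])
    out = []
    for q in action_tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in force_tokens])
        context = [sum(w * v[i] for w, v in zip(weights, force_tokens)) for i in range(d)]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


def future_force_variance(forces, t, horizon):
    """Assumed training target: population variance of the next `horizon` readings."""
    window = forces[t + 1 : t + 1 + horizon]
    if not window:
        return 0.0
    mean = sum(window) / len(window)
    return sum((f - mean) ** 2 for f in window) / len(window)


fused = cross_attend([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
calm = future_force_variance([2.0] * 6, t=0, horizon=4)
spiky = future_force_variance([1.0, 1.0, 1.0, 1.0, 5.0], t=0, horizon=4)
print(fused, calm, spiky)
```

A steady force window yields a target near zero, while a window containing a contact spike yields a large one, which is the kind of signal a variance head would need to learn.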

Force-Adaptive Inference: Reactivity on Demand
The standout feature of FAVLA is its adaptive frequency schedule.
- Free-space motion: The system runs at a base frequency to conserve compute.
- Imminent Contact: As the variance head predicts a force spike, the AE execution frequency increases.
This mimics the human physical intuition of slowing down and paying closer attention just as we are about to plug in a delicate connector.
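One simple way to picture such a schedule is a clipped linear map from predicted force variance to Action Expert frequency. The mapping, the function name, and every number below (base and maximum rates, variance thresholds) are hypothetical; the text does not state the rule FAVLA actually uses.

```python
def adaptive_frequency(pred_variance, base_hz=10.0, max_hz=50.0,
                       var_low=0.1, var_high=2.0):
    """Map predicted force variance to an AE execution frequency.

    Below `var_low` the base rate is used (free-space motion); above
    `var_high` the rate saturates at `max_hz`; in between it is interpolated.
    All defaults are illustrative assumptions.
    """
    if pred_variance <= var_low:
        return base_hz
    if pred_variance >= var_high:
        return max_hz
    frac = (pred_variance - var_low) / (var_high - var_low)
    return base_hz + frac * (max_hz - base_hz)


# Hypothetical variance predictions as the gripper approaches contact.
predictions = [0.0, 0.05, 0.5, 1.05, 3.0]
schedule = [adaptive_frequency(v) for v in predictions]
print(schedule)
```

A clipped ramp keeps the rate bounded on both ends, so compute stays predictable in free space and the controller cannot be asked to run faster than the hardware allows.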

Experimental Validation: Precision without Destruction
The model was tested on a dual-arm robot performing high-precision industrial tasks.
SOTA Comparison
FAVLA outperformed strong baselines like ForceVLA and TA-VLA across the board. In the Gear Assembly task—a maneuver requiring millimeter-level precision—FAVLA achieved a 93.3% success rate.
Force Regulation
One of the most critical metrics in industrial settings is the Peak Contact Force. Excessive force triggers safety stops or breaks parts. FAVLA reduced peak forces in Gear Assembly from 12.0 N (standard π0) to 7.7 N, a reduction of roughly 36%, which suggests that high-rate closed-loop control allows for a "gentler" touch.

Critical Insight & Future Outlook
FAVLA’s results suggest that frequency awareness matters as much as modality awareness. By allowing the model to "stare harder" (increase frequency) at the force data when it expects contact, it bridges the gap between high-level reasoning and the messy reality of physics.
Limitations: Currently, the model relies on a specific force/torque sensor setup. Future iterations could explore whether "pseudo-force" signals derived from vision-based tactile sensors could achieve similar results, further lowering the hardware barrier for industrial deployment.
Summary of Takeaways
- Dynamic Frequency: Fixed-rate execution is a bottleneck for reactive robotics.
- Temporal Decoupling: Isolate expensive VLM reasoning from cheap, high-rate AE corrections.
- Force Safety: Higher reactivity leads directly to lower peak impact forces, essential for industrial safety.
