[Realtime-VLA V2] Learning to Run VLAs: Fast, Smooth, and Accurate
Abstract

Realtime-VLA V2 is a system-level framework designed to accelerate Vision-Language-Action (VLA) models for robotic tasks. It introduces a suite of calibration, temporal optimization, and learning-based speed adaptation techniques to achieve "faster-than-demonstration" execution, reaching speeds comparable to human operation.

TL;DR

Running a VLA model fast on paper is a solved problem; running it fast on a physical robot without it shaking itself to pieces or failing a task is the real challenge. Realtime-VLA V2 identifies that the gap between demonstration speed and hardware limits isn't just a "model" problem—it's a system synchronization and trajectory shaping problem. By combining precise latency calibration, Quadratic Programming (QP) for smoothing, and a human-taught speed adaptation model, the authors reach near-human execution speeds in complex tasks.

Background: The Speed Bottleneck

In the world of imitation learning, we are limited by our teachers. Human teleoperators move slowly due to cognitive load and feedback lag. When we try to "speed up" the resulting VLA policies, two things happen:

  1. Hardware Instability: High-acceleration commands cause jerky motions in low-stiffness arms (such as those driven by quasi-direct-drive, QDD, motors), leading to visual blur and mechanical failure.
  2. Control Lag: The robot's actual position lags behind the commanded position (the motion-latency gap, t_motion), causing the VLA to see a world that doesn't match its previous action tokens.

1. Finding the "Missing Milliseconds" (Calibration)

The authors argue that the VLA model assumes an ideal world where "seeing is doing." In reality, there is a cascade of delays: camera readout, exposure time, proprioception lag, and tracking lag.

System Delay Model

By using a high-FPS camera to record a "time-position plot" of the robot swaying in front of a screen (the lower-right track-bar setup above), they measure these delays precisely (some as high as 150 ms for motion). They then apply pre-amplification: deliberately exaggerated commands that force the lagging robot to actually follow the intended path.

2. The Core Architecture: Smooth is Fast

The workflow separates the Server (VLA inference) from the Client (Robot control).

Temporal Optimization (The Server Side)

Instead of simply playing back the VLA's action chunk faster, they solve a quadratic program over the chunk's timing: it minimizes the change in "step duration" while penalizing high acceleration. The result is a globally fast trajectory in which no single joint is asked to "snap" instantaneously.
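A toy version of this re-timing can be written as a penalty objective over per-step durations. The paper poses it as a QP; this sketch instead uses finite-difference gradient descent with made-up weights, purely to show the trade-off: shrink total time while keeping the duration profile smooth and accelerations bounded.

```python
import numpy as np

def retime(q, tau0, a_max=4.0, tau_min=0.01,
           w_time=1.0, w_smooth=10.0, w_acc=100.0,
           lr=1e-3, iters=500):
    """Shrink per-step durations `tau` so the chunk plays back faster,
    while penalizing jumps in duration and over-limit accelerations.
    All weights here are invented; the paper solves this as a QP."""
    q = np.asarray(q, float)                  # waypoint positions
    tau = np.asarray(tau0, float).copy()      # per-step durations
    tau_hi = tau.copy()                       # never slow down past nominal

    def cost(t):
        v = np.diff(q) / t                    # per-step velocities
        a = np.diff(v) / t[1:]                # per-step accelerations
        return (w_time * t.sum()
                + w_smooth * np.sum(np.diff(t) ** 2)
                + w_acc * np.sum(np.maximum(0.0, np.abs(a) - a_max) ** 2))

    eps = 1e-6
    for _ in range(iters):
        g = np.zeros_like(tau)
        for i in range(len(tau)):             # finite-difference gradient
            t2 = tau.copy()
            t2[i] += eps
            g[i] = (cost(t2) - cost(tau)) / eps
        tau = np.clip(tau - lr * g, tau_min, tau_hi)
    return tau
```

On a straight-line chunk the optimizer simply compresses every step toward the floor; near a sharp corner the acceleration penalty keeps the local durations longer.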

Spatial Optimization (The Client Side)

Running at the hardware's servo frequency (e.g., 500 Hz), a local MPC (Model Predictive Control) planner ensures the robot stays on the smoothed path.
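Before any tracking controller runs, the client has to resample the smoothed chunk onto the servo clock. A minimal sketch, with plain linear interpolation standing in for the MPC planner:

```python
import numpy as np

def servo_setpoints(waypoints, tau, servo_hz=500):
    """Resample a smoothed action chunk onto the hardware servo clock.
    A real client runs an MPC here; this sketch just linearly
    interpolates the chunk at the servo rate."""
    t_wp = np.concatenate(([0.0], np.cumsum(tau)))     # waypoint timestamps
    t_servo = np.arange(0.0, t_wp[-1], 1.0 / servo_hz) # 500 Hz tick times
    return np.interp(t_servo, t_wp, waypoints)
```

Each returned setpoint is what the low-level loop would track on one 2 ms servo tick.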

Pre-amplification Effect Figure: The red "Command" is intentionally exaggerated so the blue "Actual" tracks the green "Model Target" perfectly despite hardware lag.

3. Learning the "Throttle": Human-in-the-Loop

Not every part of a task should be fast. Folding a shirt's sleeves or inserting a PCB into a 0.2 mm fixture requires slowing down.

The authors found that predicting failure rates via a Q-function was prone to overfitting. Instead, they used a Human-in-the-loop "Throttle" method. A human watches the VLA run and dynamically adjusts the speed. A regression model then learns this "speed profile," allowing the robot to automatically blast through movements in open space and tip-toe during delicate contact.
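The speed-profile regression can be as simple as fitting the human's chosen multiplier against a state feature. The feature below (gripper-to-contact distance) and all the numbers are hypothetical, used only to show the shape of the method:

```python
import numpy as np

# Hypothetical throttle log: while watching the VLA run, the human picked
# a speed multiplier; we pair it with gripper-to-contact distance (m).
dist  = np.array([0.30, 0.25, 0.15, 0.08, 0.03, 0.01])
speed = np.array([3.0,  2.8,  2.0,  1.2,  0.6,  0.4])

# Fit speed ~ w * dist + b by ordinary least squares.
A = np.column_stack([dist, np.ones_like(dist)])
w, b = np.linalg.lstsq(A, speed, rcond=None)[0]

def throttle(d, lo=0.3, hi=3.0):
    """Predict a speed multiplier for distance `d`, clamped to safe bounds."""
    return float(np.clip(w * d + b, lo, hi))
```

The learned profile recovers the intended behavior: large multipliers in open space, small ones near contact.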

Figure: Human speed adaptation.

4. Are We at the Hardware Limit? (Upperbound Analysis)

The paper introduces a "Roofline" style analysis for robotics. By segmenting trajectories into Motion Bounded (limited by motor physics) and Control Bounded (limited by the VLA's prediction horizon and system lag), they show that for tasks like shirt folding, the robot is already hitting the hardware jerk and acceleration limits.
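The segmentation behind this analysis amounts to a per-step check against the hardware limits. A sketch, where the acceleration limit and the 90% threshold are illustrative values rather than the paper's:

```python
import numpy as np

def classify_segments(v, dt, a_max=5.0, frac=0.9):
    """Label each trajectory step as 'motion' bounded (acceleration near
    the hardware limit) or 'control' bounded (headroom left, so latency
    or the prediction horizon is the bottleneck)."""
    a = np.abs(np.diff(v)) / dt              # per-step acceleration magnitude
    return np.where(a >= frac * a_max, "motion", "control")
```

If most steps come back "motion", further speedup needs better hardware; if "control", it needs a shorter system loop.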

Conclusion & Insights

The takeaway from Realtime-VLA V2 is that expertise consists of knowing when to be fast and of the system-level engineering that makes fast motion smooth. By treating the robot as a dynamic system rather than a static output of a neural network, the authors bridge the gap between slow demonstrations and high-performance real-world deployment.

Limitations

  • Data Collection: The human-in-the-loop approach still requires expert supervision for data collection.
  • Hardware Specificity: Calibrations are specific to the Airbot Play/RealSense setup and must be redone for different hardware stacks.
