FASTER: Rethinking Real-Time Flow VLAs

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

FASTER: Rethinking Real-Time Flow VLAs

FASTER: Rethinking Real-Time Flow VLAs for Instant Robotic Reaction

总结

问题

方法

结果

要点

摘要

The paper introduces FASTER (Fast Action Sampling for ImmediaTE Reaction), a plug-and-play acceleration framework for flow-based Vision-Language-Action (VLA) models. It achieves a 10x reduction in the denoising latency for immediate actions by using a Horizon-Aware Schedule, enabling real-time responsiveness in dynamic tasks like table tennis without architectural changes or extra training.

TL;DR

Deploying Vision-Language-Action (VLA) models in the physical world often hits a wall: latency. While many methods focus on making robot motions "smooth," they ignore how fast the robot starts moving once an event occurs. FASTER introduces a Horizon-Aware Schedule that allows flow-based VLAs (like $π_{0.5}$ ) to generate the immediate next action in just one sampling step instead of ten, achieving a 10x acceleration for the first action and enabling high-speed tasks like table tennis on consumer GPUs.

The Perception-Execution Gap: Why Smoothness Isn't Readiness

Current VLA deployment typically uses "Action Chunking"—the model predicts a sequence of future actions (a "chunk") at once. Even with asynchronous inference (where the robot calculates the next chunk while moving), there is a fundamental bottleneck: the Constant Timestep Schedule.

Standard models require the entire multi-step denoising process (e.g., 10 iterations of an Euler solver) to finish before any action is sent to the robot. This creates a high Time to First Action (TTFA). In dynamic environments, if a ball is flying toward a robot, waiting 300ms for a "perfect chunk" means the robot has already missed its window to react.

The Core Insight: Near-Term Actions are "Straight"

The authors conducted a pilot study on the "straightness" of flow paths in VLA models. They discovered a crucial physical intuition: near-term actions are more certain. Because the immediate future is tightly constrained by the current observation, the mathematical "flow" from noise to the action is nearly a straight line.

Pilot Study: Straightness vs. Action Index Fig 1: Quantitative evidence that early actions follow straighter paths and can be estimated accurately with fewer steps.

Methodology: Horizon-Aware Scheduling (HAS)

Instead of giving every action in the 50-frame chunk the same 10 steps, FASTER uses a Horizon-Aware Schedule (HAS).

Index-Dependent Timesteps: The denoising timestep $a u$ is now a vector. The first action reaches "completion" (clean state) much earlier in the sampling process than the 50th action.
Streaming & Early Stopping: Actions are dispatched to the robot via a streaming interface the moment they are ready. If the robot only needs the next 4 actions to maintain continuity, the server can early-stop the AE (Action Expert) iterations once those 4 are finalized, skipping the computation for the distant future.

FASTER Architecture vs Baseline Fig 2: Baseline (top) requires N steps for all actions. FASTER (bottom) completes early actions in minimal steps while continuing to refine the future.

Real-World Performance: Dynamic Table Tennis

The true test of FASTER was a table tennis task where the robot must hit a fast-moving ball.

Synchronous Inference: Completly failed (too slow).
Naive Asynchronous: Often missed because the racket didn't start moving early enough to build velocity.
FASTER: On both RTX 4090 and the budget RTX 4060, FASTER allowed the robot to initiate its swing early, resulting in powerful and accurate hits.

Experimental Results Fig 3: FASTER consistently achieves higher completion scores across various tasks and hardware, particularly on resource-constrained edge devices.

Conclusion & Critical Analysis

FASTER is a "plug-and-play" solution that addresses the often-ignored metric of Response Time in VLAs.

Pros: No extra training parameters, no architecture change, and significant gains on consumer hardware.
Cons: While near-term actions are stable, very aggressive one-step sampling can marginally reduce the quality of long-horizon trajectories in highly complex, non-linear sequences.

Ultimately, this work suggests that for embodied AI, a fast, "good enough" reaction is often better than a slow, "perfect" prediction. It paves the way for generalist robot models to operate on the edge without needing server farms to hit a ping-pong ball.

发现相似论文

试试这些示例

Search for recent papers that utilize index-dependent or time-varying denoising schedules in diffusion or flow-matching policies for robotics.
How does the "Diffusion Forcing" framework differ from the "Horizon-Aware Schedule" in terms of handling sequential temporal dependencies during inference?
Explore if "Time to First Action" (TTFA) has been adopted as a standard benchmark metric in other real-time embodied AI frameworks or VLA evaluation suites.

FASTER: Rethinking Real-Time Flow VLAs for Instant Robotic Reaction

1. TL;DR

2. The Perception-Execution Gap: Why Smoothness Isn't Readiness

3. The Core Insight: Near-Term Actions are "Straight"

4. Methodology: Horizon-Aware Scheduling (HAS)

5. Real-World Performance: Dynamic Table Tennis

6. Conclusion & Critical Analysis