When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

[RSS 2025] When to Act, Ask, or Learn: Taming Overconfident VLMs for Robust Robot Steering

Abstract

The paper introduces Uncertainty-Aware Policy Steering (UPS), a framework that calibrates Vision-Language Model (VLM) verifiers to adapt robot behaviors at deployment time. By integrating conformal prediction and a Bayesian intent model, UPS enables robots to decide whether to execute an action, ask for linguistic clarification, or request human intervention for policy retraining.

TL;DR

Deep generative policies (like Diffusion Policy) give robots diverse skills, but they don't always know when they are out of their depth. Uncertainty-Aware Policy Steering (UPS) is a new framework that uses calibrated Vision-Language Models (VLMs) as a "brain" that monitors the "body." It doesn't just pick an action; it statistically decides when to act (high confidence), ask (ambiguous instructions), or learn (when the policy is physically incapable).

The Problem: The Overconfidence of "Silicon Brains"

Modern robotics often uses a "verify-and-steer" approach: a base policy generates multiple possible action samples, and a VLM selects the best one based on a text prompt. However, VLMs suffer from agreement bias and overconfidence.

Semantic Ambiguity: If a user says "place the cup in a bin" (and there are two bins), an uncalibrated VLM might confidently pick the wrong one.
Physical Incapability: If the robot's low-level policy simply doesn't know how to perform a task, the VLM might still force-pick the "least bad" (but still failing) sample instead of admitting defeat.

Methodology: UPS – The Triple Threat to Uncertainty

1. Interleaved Imagination & Narration

To let the VLM "see" the future, UPS uses a Latent World Model (Dreamer-v3). It interleaves action chunks from the policy with the world model's predictions to generate long-horizon "mental videos." These videos are then translated into textual narrations (e.g., "The robot places the cup in the left bin").

[Figure: Outcome Prediction & Narration]
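
The interleaving of policy action chunks and world-model predictions can be sketched as a simple rollout loop. This is a toy illustration only: `imagine_rollout`, `narrate`, `toy_policy`, and `toy_world_model` are hypothetical names, the latent state is reduced to a single float, and the narration step is a hard-coded rule rather than a learned model.

```python
from typing import Callable, List

State = float             # toy latent state (assumption: real latents are vectors)
ActionChunk = List[float]  # a short sequence of low-level actions


def imagine_rollout(
    state: State,
    policy: Callable[[State], ActionChunk],
    world_model: Callable[[State, ActionChunk], State],
    num_chunks: int,
) -> List[State]:
    """Alternate between policy proposals and world-model predictions."""
    trajectory = [state]
    for _ in range(num_chunks):
        chunk = policy(state)              # policy proposes the next action chunk
        state = world_model(state, chunk)  # world model predicts the resulting latent
        trajectory.append(state)
    return trajectory


def narrate(trajectory: List[State]) -> str:
    """Stand-in for the narration step: describe the imagined final state."""
    side = "left" if trajectory[-1] < 0 else "right"
    return f"The robot places the cup in the {side} bin"


# Hypothetical stand-ins for the learned policy and world model.
def toy_policy(state: State) -> ActionChunk:
    return [-0.1, -0.1]  # always nudges toward the left


def toy_world_model(state: State, chunk: ActionChunk) -> State:
    return state + sum(chunk)  # trivially integrates the actions


traj = imagine_rollout(0.0, toy_policy, toy_world_model, num_chunks=3)
narration = narrate(traj)
print(narration)
```

Because each imagined state feeds the next policy call, the rollout can extend well past a single action chunk, which is what makes the "mental video" long-horizon.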

2. Bayesian Intent Factorization

Instead of asking a VLM "Is this action good?", UPS factorizes the problem:

P(θ|L): What are the possible hidden intents behind this vague instruction? (e.g., "the user might be left-handed or right-handed")
P(y|ℓ, θ): Given an intent, how likely is this action to succeed?

Factorizing the question this way keeps the VLM from collapsing onto a single, likely-wrong answer.
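
One way to read the two terms is as the pieces of a sum over intents. The sketch below assumes that L and ℓ both denote the language instruction and that the terms are combined by the law of total probability; the paper may combine them differently. The function name and all numbers are illustrative, reusing the two-bin example from the problem statement.

```python
from typing import Dict


def marginal_success(
    intent_prior: Dict[str, float],
    success_given_intent: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """P(y | instruction) = sum over theta of P(y | instruction, theta) * P(theta | instruction)."""
    actions = next(iter(success_given_intent.values())).keys()
    return {
        action: sum(
            intent_prior[intent] * success_given_intent[intent][action]
            for intent in intent_prior
        )
        for action in actions
    }


# "Place the cup in a bin" with two bins: both intents are equally plausible.
intent_prior = {"left_bin": 0.5, "right_bin": 0.5}

# Made-up per-intent success estimates for two candidate action samples.
success_given_intent = {
    "left_bin": {"place_left": 0.9, "place_right": 0.1},
    "right_bin": {"place_left": 0.1, "place_right": 0.9},
}

scores = marginal_success(intent_prior, success_given_intent)
print(scores)
```

With these numbers both candidates end up with a score of about 0.5, so the ambiguity in the instruction stays visible in the scores instead of being hidden behind one confident pick.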

3. Conformal Prediction (CP) for Statistical Safety

UPS applies Conformal Prediction to create a "prediction set" of actions. If the set contains one action, the robot acts. If it contains multiple, it asks a clarifying question. If it contains a special "None of the Above" token, it triggers Residual Policy Learning to collect human help and update the model.
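
The act / ask / learn rule can be illustrated with standard split conformal prediction. This sketch uses the common nonconformity score 1 - p, which is not necessarily the score function UPS uses. The calibration scores and probabilities are made up, and the handling of edge cases (an empty set falls back to asking, and the "none" token takes priority over everything else) is an assumption.

```python
import math
from typing import Dict, List

NONE = "none_of_the_above"  # hypothetical name for the special token


def conformal_threshold(cal_scores: List[float], alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return float("inf")  # too few calibration points: keep every option
    return sorted(cal_scores)[k - 1]


def prediction_set(probs: Dict[str, float], qhat: float) -> List[str]:
    """Keep every option whose nonconformity score 1 - p is within the threshold."""
    return [option for option, p in probs.items() if 1.0 - p <= qhat]


def decide(pred_set: List[str]) -> str:
    """Map a prediction set to a behavior (edge-case handling is an assumption)."""
    if NONE in pred_set:
        return "learn"  # policy looks incapable: request human help
    if len(pred_set) == 1:
        return "act"    # a single confident option
    return "ask"        # several plausible options (or none): clarify


# Made-up calibration scores: 1 - p(true option) on held-out examples.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
qhat = conformal_threshold(cal_scores, alpha=0.25)  # 6th smallest score: 0.6

clear = {"place_left": 0.90, "place_right": 0.05, NONE: 0.05}
ambiguous = {"place_left": 0.45, "place_right": 0.45, NONE: 0.10}
incapable = {"place_left": 0.20, "place_right": 0.20, NONE: 0.60}

for name, probs in [("clear", clear), ("ambiguous", ambiguous), ("incapable", incapable)]:
    print(name, decide(prediction_set(probs, qhat)))
```

The threshold is fit once on calibration data, so the size of the prediction set at deployment is what carries the uncertainty signal: one option, several options, or the "none" token.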

Experiments & Results: Efficiency in Action

The authors tested UPS on a Franka Panda robot in a "cup sorting" task.

Higher Accuracy: In ambiguous tasks, the UPS success rate reached 85%, a 30% jump over uncalibrated steering (Forewarn).
Lower Human Cost: UPS only asks for physical help when it is truly incapable. Its intervention rate was nearly 3x lower than traditional DAgger methods.

[Figure: Uncertainty Quantification Results]

Detailed Insight: The Value of "None of the Above"

The most striking contribution is the robot's ability to recognize incapability. By including a "none" option in the calibrated set, the system bridges the gap between high-level reasoning and low-level motor control. When the robot says, "I can't find a single way to do this," it initiates a residual learning loop that fine-tunes the policy without "catastrophic forgetting" of its original skills.
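
The general idea behind residual policy learning is to keep the base policy frozen and learn only a small correction on top of it. The toy sketch below shows that idea with scalar actions and a single learned offset; `ResidualPolicy`, its update rule, and the numbers are hypothetical and far simpler than whatever UPS trains in practice.

```python
from typing import Callable


class ResidualPolicy:
    """Frozen base policy plus a learned additive correction (toy, scalar actions)."""

    def __init__(self, base_policy: Callable[[float], float], lr: float = 0.5):
        self.base_policy = base_policy  # never modified after construction
        self.offset = 0.0               # the only learned parameter
        self.lr = lr

    def act(self, obs: float) -> float:
        return self.base_policy(obs) + self.offset

    def update(self, obs: float, human_action: float) -> None:
        """Move the correction toward the human-provided action."""
        error = human_action - self.act(obs)
        self.offset += self.lr * error


def base(obs: float) -> float:
    return 2.0 * obs  # stand-in for the pretrained policy


policy = ResidualPolicy(base)
for _ in range(20):
    # A human demonstrates that 2.5, not 2.0, is the right action at obs = 1.0.
    policy.update(obs=1.0, human_action=2.5)

print(policy.act(1.0), base(1.0))
```

Since updates only touch the correction term, the base policy's parameters are left exactly as they were, which is the intuition behind adapting without overwriting the original skills.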

Conclusion & Future Outlook

UPS represents a shift from "black-box" execution to calibrated autonomy. By providing a mathematical framework for a robot to say "I'm not sure" or "I need more training," we move closer to robots that can safely operate in complex, unpredictable human environments.

Limitations: The system currently assumes the world model's "imaginations" are accurate. Future iterations will likely need to account for "imagination uncertainty" to handle even more chaotic real-world physics.

For more technical details, check out the project page: https://jessie-yuan.github.io/ups/
