[CoRL 2025] ROBOMETER: Scaling Reward Models by Learning from Failure
Abstract

ROBOMETER is a scalable general-purpose robotic reward model that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Trained on the massive RBM-1M dataset, it achieves state-of-the-art performance in reward alignment and significantly improves downstream robot learning, outperforming baselines by up to 4.5x in success rates.

TL;DR

ROBOMETER is a general-purpose robotic reward model that masters the art of comparison. By training on RBM-1M (a 1-million trajectory dataset), it moves beyond simple "progress tracking" to "preference reasoning." It doesn't just know what success looks like—it understands exactly how and why a failure occurred, leading to a 32% improvement in trajectory discrimination and up to 4.5x better policy success rates in the real world.

The Core Insight: Why Comparisons Matter

Most robotic reward models are trained like students memorizing a textbook: they look at expert demonstrations and try to assign a number from 0 to 1 to every frame. This works for experts, but what happens when the robot trips, drops an object, or stalls? For a failure, is the "progress" 0.2 or 0.5? It’s ambiguous and hard to label.

The authors of ROBOMETER realize that while absolute numbers are hard, comparisons are easy. Even if we don't know the exact score of two failed attempts, we can usually tell which one was "less bad." By combining intra-trajectory progress with inter-trajectory preferences, ROBOMETER creates a calibrated global scale that spans experts, suboptimals, and total failures.

Methodology: The Dual-Objective Architecture

ROBOMETER leverages the Qwen-3-VL-4B Vision-Language Model as its backbone. The architecture is ingeniously modified to handle two videos simultaneously during training to facilitate direct comparisons.

1. The Dual Loss Function

  • Progress Loss ($\mathcal{L}_{prog}$): Anchors the reward magnitude on expert data using a categorical (C51-style) distribution over bins.
  • Preference Loss ($\mathcal{L}_{pref}$): A binary classifier that looks at two trajectories and decides which one is better. This allows the model to ingest over 1 million trajectories, including those without explicit reward labels.
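
To make the two objectives concrete, here is a minimal PyTorch-style sketch of how a progress loss and a preference loss could be combined. The head names (`reward_logits`, `pref_logit`), the bin count, and the loss weighting are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

NUM_BINS = 51  # C51-style categorical support over progress in [0, 1]
BIN_CENTERS = torch.linspace(0.0, 1.0, NUM_BINS)

def progress_loss(reward_logits, progress_targets):
    """Cross-entropy between the predicted progress distribution and binned targets.

    reward_logits:    (T, NUM_BINS) per-frame logits over progress bins
    progress_targets: (T,) ground-truth progress in [0, 1] (expert frames only)
    """
    target_bins = (progress_targets * (NUM_BINS - 1)).round().long()
    return F.cross_entropy(reward_logits, target_bins)

def expected_progress(reward_logits):
    """Decode a dense scalar reward per frame as the expectation over bins."""
    return reward_logits.softmax(dim=-1) @ BIN_CENTERS  # shape (T,)

def preference_loss(pref_logit, pref_label):
    """Binary classification: is trajectory A preferred over trajectory B?

    pref_logit: scalar logit produced from the jointly encoded (A, B) pair
    pref_label: 1.0 if A is preferred, 0.0 otherwise
    """
    return F.binary_cross_entropy_with_logits(pref_logit, pref_label)

def total_loss(reward_logits, progress_targets, pref_logit, pref_label, w_pref=1.0):
    # The progress loss anchors the absolute reward scale on expert frames; the
    # preference loss lets unlabeled, suboptimal, and failed clips contribute.
    return progress_loss(reward_logits, progress_targets) + w_pref * preference_loss(pref_logit, pref_label)
```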

2. Strategic Data Augmentation

To ensure the model understands "what not to do," the team used three sampling strategies:

  • Different Expertise: Comparing an expert demo against a known failure.
  • Instruction Negatives: Comparing "picking a red bowl" vs "picking a blue bowl" to enforce language grounding.
  • Video Rewind: Artificially creating failures by reversing segments of expert videos, forcing the model to penalize "undoing" progress.
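
The video-rewind negative can be built from an expert clip alone. Below is a minimal sketch under the assumption that a trajectory is simply a list of frames; the random cut point and the exact splicing are illustrative, not the paper's recipe.

```python
import random

def video_rewind(frames, min_forward=2):
    """Synthesize a failure from an expert clip by 'undoing' progress.

    frames: list of frames (e.g. image arrays) from a successful trajectory.
    Returns a clip that plays forward to a random cut point, then replays
    the same segment in reverse, so net progress falls back toward the start.
    """
    cut = random.randint(min_forward, len(frames) - 1)
    forward = frames[:cut]
    rewound = forward[::-1][1:]  # reverse the prefix, skipping the duplicate frame at the cut
    return forward + rewound

# The original expert clip is preferred over its rewound counterpart, yielding
# a free inter-trajectory preference pair with no human labeling required.
```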

Figure: Model architecture. The model processes two trajectory videos, separated by a split token, to produce both dense per-frame rewards and a global preference.

Experimental Battleground: SOTA-Crushing Results

The model was tested on RBM-EVAL-OOD, a set of 976 trajectories across 6 robot embodiments that were completely unseen during training.

| Metric | Baselines (Avg) | ROBOMETER | Improvement |
| :--- | :--- | :--- | :--- |
| Value Order Correlation | 0.82 | 0.95 | High Alignment |
| Kendall-τa (Ranking) | 0.47 | 0.66 | +40% relative |
| Failure Detection (F1) | 0.33 | 0.81 | Massive Gap |
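
As a rough illustration of the ranking metric, a Kendall rank correlation can be computed between the model's trajectory scores and a ground-truth quality ordering. Note that SciPy implements the τb/τc variants rather than the τa reported above, and the scores below are made-up placeholders.

```python
from scipy.stats import kendalltau

# Hypothetical ground-truth quality ranks and reward-model scores for five
# trajectories of the same task (higher = better in both columns).
true_quality    = [5, 4, 3, 2, 1]
predicted_score = [0.92, 0.81, 0.77, 0.40, 0.15]

tau, p_value = kendalltau(true_quality, predicted_score)  # tau-b variant
print(f"Kendall tau: {tau:.2f}")  # 1.00 here, since the two orderings agree perfectly
```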

Downstream Impact: Teaching the Robot

ROBOMETER isn't just a passive observer; it's a coach. When plugged into Online RL, it improved a base policy's success rate from 20% to 85% in under 45 minutes. In Data Filtering, it was used to find the best "play" data to train a policy, outperforming standard retrieval methods (like SigLIP) by 4.5x.
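
Mechanically, reward-model data filtering reduces to scoring candidate clips and keeping the best ones for policy training. A minimal sketch, where `score_trajectory` is a hypothetical stand-in for a call to the reward model:

```python
def filter_by_reward(trajectories, score_trajectory, keep_fraction=0.2):
    """Keep the highest-scoring fraction of candidate trajectories.

    trajectories:     iterable of candidate clips (e.g. unlabeled play data)
    score_trajectory: callable returning a scalar reward-model score per clip
    keep_fraction:    fraction of the dataset to retain for policy training
    """
    scored = sorted(trajectories, key=score_trajectory, reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return scored[:keep]
```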

Figure: Automatic online RL success rates. ROBOMETER (blue) enables much faster and more stable learning than the previous discrete reward model (RoboReward).

Critical Analysis & Takeaways

The most striking takeaway is that human cognition thrives on relative judgment, and AI models for robotics should too. By including failed data, ROBOMETER avoids the "false positive" trap where an agent thinks it's succeeding just because it's near an object.

Limitations & Future Work

  • Temporal Resolution: Currently, the model subsamples only ~8 frames per trajectory. This might miss high-frequency events like a momentary grasp slip.
  • Physics Blindness: As a purely visual model, it can't "feel" contact forces. Integrating tactile data or VQA-style reasoning about physical constraints is the next frontier.

Conclusion

ROBOMETER represents a significant step towards foundational reward models. By proving that we can scale reward learning using unlabeled, suboptimal, and failed data, it paves the way for robots that can learn from their own mistakes in the wild, without needing a human to label every frame.
