ROBOMETER is a scalable general-purpose robotic reward model that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Trained on the massive RBM-1M dataset, it achieves state-of-the-art performance in reward alignment and significantly improves downstream robot learning, outperforming baselines by up to 4.5x in success rates.
TL;DR
ROBOMETER is a general-purpose robotic reward model that masters the art of comparison. By training on RBM-1M (a 1-million trajectory dataset), it moves beyond simple "progress tracking" to "preference reasoning." It doesn't just know what success looks like—it understands exactly how and why a failure occurred, leading to a 32% improvement in trajectory discrimination and up to 4.5x better policy success rates in the real world.
The Core Insight: Why Comparisons Matter
Most robotic reward models are trained like students memorizing a textbook: they look at expert demonstrations and try to assign a number from 0 to 1 to every frame. This works for experts, but what happens when the robot trips, drops an object, or stalls? For a failure, is the "progress" 0.2 or 0.5? It’s ambiguous and hard to label.
The authors of ROBOMETER recognized that while absolute scores are hard, comparisons are easy. Even if we don't know the exact score of two failed attempts, we can usually tell which one was "less bad." By combining intra-trajectory progress with inter-trajectory preferences, ROBOMETER creates a calibrated global scale that spans expert demonstrations, suboptimal attempts, and total failures.
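This "comparisons over absolutes" intuition is the same one behind Bradley-Terry-style preference learning. As a minimal sketch (not the authors' code), a scalar score per trajectory becomes a preference probability via a sigmoid of the score difference:

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry: P(A preferred over B) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Even with no absolute ground truth, a "less bad" failure
# (higher score) is preferred with probability > 0.5.
p = preference_probability(0.4, 0.1)
```

Because only the score difference matters, two failures with unknown absolute quality still yield a clean training signal from the label "A was less bad than B."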
Methodology: The Dual-Objective Architecture
ROBOMETER uses the Qwen-3-VL-4B vision-language model as its backbone, modified to process two videos simultaneously during training so that trajectories can be compared directly.
1. The Dual Loss Function
- Progress Loss ($\mathcal{L}_{prog}$): Anchors the reward magnitude on expert data using a categorical (C51-style) distribution over bins.
- Preference Loss ($\mathcal{L}_{pref}$): A pairwise objective that compares two trajectories and predicts which one is better. This allows the model to ingest over 1 million trajectories, including those without explicit reward labels.
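A toy rendering of the two objectives, assuming NumPy and a hypothetical weighting `lam`. The real model predicts a C51-style distribution over progress bins from a VLM; this sketch reduces that to plain cross-entropy over logits:

```python
import numpy as np

def progress_loss(logits: np.ndarray, target_bin: int) -> float:
    """C51-style: cross-entropy over discrete progress bins (expert frames)."""
    z = logits - logits.max()                       # stabilized softmax
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target_bin])

def preference_loss(score_a: float, score_b: float, a_preferred: bool) -> float:
    """Pairwise logistic loss on the score difference of two trajectories."""
    p_a = 1.0 / (1.0 + np.exp(-(score_a - score_b)))
    return float(-np.log(p_a if a_preferred else 1.0 - p_a))

def total_loss(logits, target_bin, score_a, score_b, a_preferred, lam=1.0):
    # `lam` is a hypothetical trade-off weight, not a value from the paper.
    return progress_loss(logits, target_bin) + lam * preference_loss(
        score_a, score_b, a_preferred
    )
```

The key property: the progress term anchors absolute magnitudes on expert data, while the preference term trains on any pair, labeled or not.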
2. Strategic Data Augmentation
To ensure the model understands "what not to do," the team used three sampling strategies:
- Different Expertise: Comparing an expert demo against a known failure.
- Instruction Negatives: Comparing "picking a red bowl" vs "picking a blue bowl" to enforce language grounding.
- Video Rewind: Artificially creating failures by reversing segments of expert videos, forcing the model to penalize "undoing" progress.
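Of the three strategies, Video Rewind is the easiest to sketch. A deterministic toy version (the paper presumably randomizes the cut point): play the expert frames up to a cut, then reverse back toward the start so progress visibly rises and then falls:

```python
def video_rewind(frames, cut):
    """Synthesize a failure from an expert clip: advance to `cut`,
    then play the prefix backwards so progress is visibly undone."""
    forward = frames[:cut]
    backward = forward[-2::-1]  # reverse, skipping the duplicated peak frame
    return forward + backward

# An expert clip [0..4] cut at 4 becomes a rise-then-fall trajectory.
rewound = video_rewind([0, 1, 2, 3, 4], cut=4)  # [0, 1, 2, 3, 2, 1, 0]
```

Any reward model that scores the rewound clip as highly as the original is failing to penalize undone progress, which is exactly the signal this augmentation injects.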
Architecture diagram: the model processes two trajectory videos separated by a split token to produce both dense rewards and a global preference.
Experimental Battleground: SOTA-Crushing Results
The model was tested on RBM-EVAL-OOD, a set of 976 trajectories across 6 robot embodiments that were completely unseen during training.
| Metric | Baselines (Avg) | ROBOMETER | Improvement |
| :--- | :--- | :--- | :--- |
| Value Order Correlation | 0.82 | 0.95 | High alignment |
| Kendall-τa (Ranking) | 0.47 | 0.66 | +40% relative |
| Failure Detection (F1) | 0.33 | 0.81 | Massive gap |
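For reference, the Kendall-τa ranking metric in the table is a generic rank-correlation statistic: the fraction of concordant minus discordant pairs. A small self-contained implementation (not the paper's evaluation code):

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Ranges from -1 (fully reversed) to +1 (identical ordering)."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A score of 0.66 therefore means ROBOMETER orders roughly five pairs correctly for every one it gets backwards, versus near coin-flip pair ordering for the baselines at 0.47.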
Downstream Impact: Teaching the Robot
ROBOMETER isn't just a passive observer; it's a coach. When plugged into Online RL, it improved a base policy's success rate from 20% to 85% in under 45 minutes. In Data Filtering, it was used to find the best "play" data to train a policy, outperforming standard retrieval methods (like SigLIP) by 4.5x.
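The data-filtering use case reduces to "score everything, keep the top slice." A minimal sketch with a hypothetical `reward_fn` standing in for ROBOMETER's trajectory scores:

```python
def filter_by_reward(trajectories, reward_fn, keep_fraction=0.25):
    """Score each trajectory with `reward_fn` (a stand-in for a learned
    reward model) and keep the highest-scoring slice for policy training."""
    scored = sorted(trajectories, key=reward_fn, reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

# Toy usage: integers as "trajectories", identity as the reward model.
best = filter_by_reward(list(range(8)), lambda t: t, keep_fraction=0.25)
```

The contrast with retrieval methods like SigLIP is that the filter ranks by predicted task success rather than by visual or textual similarity to a query.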
Online RL success curves: ROBOMETER (blue) enables much faster and more stable learning than previous discrete reward models (RoboReward).
Critical Analysis & Takeaways
The most striking takeaway is that human cognition thrives on relative judgment, and AI models for robotics should too. By including failed data, ROBOMETER avoids the "false positive" trap where an agent thinks it's succeeding just because it's near an object.
Limitations & Future Work
- Temporal Resolution: Currently, the model subsamples only ~8 frames per trajectory. This might miss high-frequency events like a momentary grasp slip.
- Physics Blindness: As a purely visual model, it can't "feel" contact forces. Integrating tactile data or VQA-style reasoning about physical constraints is the next frontier.
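To see why ~8 frames can hide high-frequency events, consider uniform subsampling: on a 100-frame clip, consecutive sampled frames are roughly 14 frames apart, so a grasp slip lasting a few frames can fall entirely between samples. A sketch of generic uniform subsampling (the paper's exact scheme isn't specified here):

```python
def subsample_frames(frames, k=8):
    """Pick k evenly spaced frames from first to last; anything that
    happens strictly between the chosen indices is invisible downstream."""
    n = len(frames)
    if n <= k:
        return list(frames)
    idx = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    return [frames[i] for i in idx]
```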
Conclusion
ROBOMETER represents a significant step towards foundational reward models. By proving that we can scale reward learning using unlabeled, suboptimal, and failed data, it paves the way for robots that can learn from their own mistakes in the wild, without needing a human to label every frame.
