WisPaper
WisPaper
Scholar Search
Scholar QA
Pricing
TrueCite
Odysseus: Scaling VLMs to 100+ Turn Decision-Making via Stabilized RL
Summary
Problem
Method
Results
Takeaways
Abstract

The paper introduces Odysseus, an open training framework that scales Vision-Language Models (VLMs) to handle long-horizon (100+ turns) decision-making in games like Super Mario Land. By combining a lightweight turn-level CNN critic with an adapted PPO algorithm, the authors achieve a 3x to 6x improvement in game progress over frontier models like GPT-4 and the Qwen-VL base model.

TL;DR

While Vision-Language Models (VLMs) excel at reasoning, they often fail as "agents" in complex, long-turn environments. Odysseus changes this by proving that with the right Reinforcement Learning (RL) recipe—specifically an adapted PPO with a lightweight CNN critic—VLMs can master games requiring 100+ turns. Odysseus outperforms proprietary giants like GPT-4 by over 300% in game progress while maintaining general-domain reasoning.

The "Horizon" Problem: Why VLMs Get Lost

Current AI agents are mostly "short-sighted." In standard benchmarks, tasks are finished in 20-30 steps. Scaling to 100+ steps (like a full level of Super Mario Land) introduces the Credit Assignment nightmare: which specific action among 100 turns actually led to success or failure?

Prior works tried to solve this with:

  1. Imitation Learning (SFT): Hard to scale because human data is expensive.
  2. Critic-free RL (GRPO/Reinforce++): These often fail in long-horizon games because they can't accurately estimate the value of a state without a dedicated critic.

Overview of Odysseus Performance

Methodology: The "Secret Sauce" of Odysseus

The researchers identified that you don't need a massive 8B parameter model to tell you if Mario is in a "good" or "bad" position.

1. The Lightweight Turn-Level Critic

Instead of using a second VLM as a critic (which doubles VRAM usage), Odysseus uses a simple CNN critic originally designed for classical Atari RL. This critic looks at the game frame and provides a scalar value. This makes PPO training feasible on consumer-grade hardware while providing the stable baseline needed for long-term rewards.

2. Positive-Advantage Filtering

The team found that samples with negative advantages (actions that performed worse than expected) often destabilized the VLM’s transformer weights. By clipping these at zero, they focused the model on climbing the gradient of "what worked."

Model Architecture and Adapted PPO

3. Auto-Curriculum

In multi-level training, agents often spend too much time on easy levels (where they survive longer) and neglect hard ones. Odysseus uses inverse trajectory weighting, automatically shifting training focus toward levels where the agent dies quickly, ensuring balanced mastery.

VLM-based RL vs. Classical RL: The Prior Knowledge Advantage

A fascinating insight from the paper is the comparison between Odysseus and a classic CNN-based RL agent trained from scratch.

  • Classical RL: Needs "action engineering" (manually limiting buttons) to even start learning.
  • VLM-based RL (Odysseus): Uses the VLM's inherent understanding of "right," "jump," and "enemy" to explore efficiently.

As a result, Odysseus is 2x more sample-efficient. It doesn't just learn to press buttons; it uses its world knowledge to understand why it should press them.

Performance Comparison Graph

Results & Generalization

Odysseus achieved massive gains across all five training levels. More importantly, it showed Zero-shot Generalization:

  • In-game: Handled unseen levels of Super Mario Land.
  • Cross-game: Successfully played Super Mario Bros (NES), a different game with similar mechanics.
  • General Intelligence: Unlike many RL agents that become "idiot savants," Odysseus retained its scores on benchmarks like MMMU and RealWorldQA.

Conclusion: A Blueprint for Embodied AI

Odysseus demonstrates that we don't necessarily need bigger models for better agents; we need smarter RL training loops. By offloading temporal reasoning to a lightweight critic and leveraging the rich visual priors of VLMs, we can build agents capable of sustained, complex interactions in the real (or virtual) world.

Key Takeaway: The future of embodied agents lies in hybrid architectures where "System 2" reasoning (VLM) is grounded by a "System 1" value estimator (CNN Critic).

Find Similar Papers

Try Our Examples

  • Search for recent papers that utilize lightweight external critics to stabilize the reinforcement learning of large multimodal models in long-context environments.
  • Identify the origin of "positive-advantage filtering" in reinforcement learning and how it has been applied to mitigate optimization instability in transformer-based agents.
  • Explore research applying VLM-based RL frameworks like Odysseus to other complex, long-horizon simulators such as Minecraft, NetHack, or real-world robotics simulators.
Contents
Odysseus: Scaling VLMs to 100+ Turn Decision-Making via Stabilized RL
1. TL;DR
2. The "Horizon" Problem: Why VLMs Get Lost
3. Methodology: The "Secret Sauce" of Odysseus
3.1. 1. The Lightweight Turn-Level Critic
3.2. 2. Positive-Advantage Filtering
3.3. 3. Auto-Curriculum
4. VLM-based RL vs. Classical RL: The Prior Knowledge Advantage
5. Results & Generalization
6. Conclusion: A Blueprint for Embodied AI