PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

[PokeGym] Escaping the Pixel Trap: Why 3D Open-Worlds are the Ultimate Test for VLMs

Abstract

PokeGym is a visually-driven, long-horizon benchmark for Vision-Language Models (VLMs) instantiated in a complex 3D open-world RPG, Pokémon Legends: Z-A. It features 30 automated tasks spanning navigation and interaction, requiring agents to operate solely on raw RGB pixels while achieving success verified by independent memory scanning.

TL;DR

PokeGym is a new 3D benchmark built on Pokémon Legends: Z-A that pushes Vision-Language Models (VLMs) to their limits. By enforcing pixel-only inputs and providing automated, memory-based scoring, it reveals a startling truth: the smartest AI models are often physically "clumsy," getting stuck behind simple fences because they lack 3D spatial intuition, even when they know they are trapped.

The "Privileged Information" Problem

Most current "embodied" AI benchmarks cheat. They give the agent hidden data, such as exact coordinates or object IDs, which bypasses the most difficult part of being an agent: vision. When you remove the training wheels and force an agent to look at raw RGB frames, the performance of even the most advanced models (like the GPT-5 series or Gemini-3-Pro) drops sharply.

The authors of PokeGym argue that true embodied competence requires four things: long-horizon planning, realistic 3D visuals, pixel-only observation, and scalable automated evaluation. Existing benchmarks like ALFRED or MineDojo usually trade one for another. PokeGym aims to deliver all four at once by building on a modern 3D game while keeping the internal game state hidden from the AI.

Methodology: Probing the VLM Brain

PokeGym isn't just a leaderboard; it’s a diagnostic tool. It uses three levels of "Instruction Granularity" to see where the AI's cognitive chain breaks:

Visual-Guided: Includes visual anchors (e.g., "look for the red house"). Tests Visual Grounding.
Step-Guided: Provides sub-goals but no visual cues. Tests Semantic Reasoning.
Goal-Only: Only gives the final objective. Tests Autonomous Exploration.
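The three levels above can be pictured as progressively stripping information from one task description. The sketch below is only an illustration of that idea: `TaskSpec`, `Granularity`, `build_instruction`, and the example task text are hypothetical names and values, not PokeGym's actual prompt format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Granularity(Enum):
    VISUAL_GUIDED = "visual_guided"  # sub-goals plus visual anchors
    STEP_GUIDED = "step_guided"      # sub-goals only
    GOAL_ONLY = "goal_only"          # final objective only


@dataclass
class TaskSpec:
    # Hypothetical task container; field names are assumptions.
    goal: str
    steps: List[str]
    visual_anchors: List[str]  # one anchor per step


def build_instruction(task: TaskSpec, level: Granularity) -> str:
    """Render the same task at one of the three granularity levels."""
    if level is Granularity.GOAL_ONLY:
        return task.goal
    lines = [task.goal]
    for i, step in enumerate(task.steps):
        if level is Granularity.VISUAL_GUIDED:
            # Attach the visual cue so the agent can ground the step.
            lines.append(f"{i + 1}. {step} ({task.visual_anchors[i]})")
        else:
            lines.append(f"{i + 1}. {step}")
    return "\n".join(lines)
```

Rendering the same `TaskSpec` at all three levels and comparing success rates is what lets a failure be attributed to grounding, reasoning, or exploration.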

Figure: Overall architecture of PokeGym.

The architecture runs the game in the Ryujinx emulator and extracts frames directly from GPU textures to avoid capture lag, while an independent process scans memory for array-of-bytes (AOB) signatures to check whether the agent has reached the goal.
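To make the AOB idea concrete, here is a minimal sketch of a signature scan over an in-memory byte buffer. It assumes the common convention of space-separated hex bytes with `??` as a wildcard; the function names and the sample signature are hypothetical, and this is not PokeGym's actual scanner, which would read the memory of a live emulator process.

```python
from typing import List, Optional


def parse_signature(signature: str) -> List[Optional[int]]:
    """Turn 'DE AD ?? EF' into [0xDE, 0xAD, None, 0xEF]; None is a wildcard."""
    return [None if tok in ("?", "??") else int(tok, 16)
            for tok in signature.split()]


def aob_scan(memory: bytes, signature: str) -> int:
    """Return the offset of the first match of the signature, or -1 if absent."""
    pattern = parse_signature(signature)
    n = len(pattern)
    for start in range(len(memory) - n + 1):
        # A position matches when every non-wildcard byte agrees.
        if all(p is None or memory[start + i] == p
               for i, p in enumerate(pattern)):
            return start
    return -1
```

Because the signature tolerates wildcard bytes, it can locate a value whose absolute address changes between runs, which is what makes this kind of check usable for automated scoring without exposing game state to the agent.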

Key Insight: The Metadata of Failure

The most profound discovery in the paper is the Metacognitive Divergence in failures. The researchers split deadlocks, the cases where a model gets stuck, into two types:

Unaware Deadlocks: Weaker models (like earlier Qwen versions) are stuck against a wall but hallucinate that they are making progress. They lack 3D state estimation.
Aware Deadlocks: Advanced models (like GPT-5.2) know they are stuck. Their internal reasoning says: "I am hitting a wall." However, they lack the spatial intuition to execute a "step back and rotate" maneuver, instead flailing in a high-entropy loop.
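One way to picture the distinction is a simple heuristic: flag a deadlock when the agent's position stops changing, then check whether its recent reasoning mentions being stuck. The sketch below is an assumption-laden illustration, not the paper's classification procedure; the keyword list, window size, threshold, and the idea that an evaluator has access to positions and reasoning text are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical phrases that suggest the model has noticed it is stuck.
STUCK_KEYWORDS = ("stuck", "blocked", "hitting a wall", "not moving")


def is_deadlocked(positions: Sequence[Tuple[float, float]],
                  window: int = 5, eps: float = 0.1) -> bool:
    """True if the last `window` positions all lie within `eps` of the first of them."""
    if len(positions) < window:
        return False
    recent = positions[-window:]
    x0, y0 = recent[0]
    return all(math.hypot(x - x0, y - y0) < eps for x, y in recent)


def classify_deadlock(positions: Sequence[Tuple[float, float]],
                      reasoning: List[str], window: int = 5) -> str:
    """Return 'none', 'aware', or 'unaware' for the most recent steps."""
    if not is_deadlocked(positions, window=window):
        return "none"
    recent_text = " ".join(reasoning[-window:]).lower()
    if any(keyword in recent_text for keyword in STUCK_KEYWORDS):
        return "aware"    # stuck, and the reasoning says so
    return "unaware"      # stuck, but the reasoning claims progress
```

A keyword match is a crude proxy for metacognition, so a real analysis would need a more careful reading of the reasoning traces.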

Figure: Correlation between success rate and ineffective moves.

The chart above shows a strong negative correlation: as "Ineffective Moves" (collisions) increase, the success rate plummets. This suggests that low-level physical friction, the "embodied tax," is a bigger bottleneck than high-level logic.
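For readers who want to run this kind of check on their own runs, a correlation like this is typically measured with the Pearson coefficient. The sketch below is a plain-Python version; whether the paper uses Pearson specifically is an assumption, and no values from the paper are reproduced here.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)
```

Here `xs` would hold the ineffective-move count per model and `ys` the matching success rate; a value near -1 corresponds to the pattern described above.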

Experimental Results: The Leaderboard

The results show that proprietary models still dominate, with Gemini-3-Pro and GPT-5.2 sharing the top spot. However, "Mixed" tasks (combining navigation and combat) remain nearly impossible for most models, with success rates often dropping below 10% for Goal-Only instructions.

PokeGym Tasks and Budgets

Deep Insight & Conclusion

PokeGym reveals that the current path to "Generalist Agents" cannot rely solely on scaling LLM reasoning. We are seeing a "Physical-Cognitive Gap." A model can be a PhD-level scientist in text, but an infant in a 3D environment.

The Takeaway: To solve PokeGym, we don't need smarter planners; we need models with Spatial Intuition. We need architectures that understand depth, occlusion, and collision as primary concepts, not just as pixels to be captioned. The future of robotics and embodied AI likely lies in bridging this gap between high-level language and low-level geometry.
