ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning
Abstract

ThinkJEPA is a VLM-guided Joint-Embedding Predictive Architecture (JEPA) that integrates long-horizon semantic reasoning from a Vision-Language Model (Qwen3-VL) with dense latent-dynamics forecasting. It achieves state-of-the-art performance in 3D hand-manipulation trajectory prediction on the EgoDex and EgoExo4D benchmarks.

TL;DR

ThinkJEPA bridges the gap between fine-grained physical dynamics and high-level semantic reasoning. By integrating a Vision-Language Model (VLM) "thinker" into a Joint-Embedding Predictive Architecture (V-JEPA), the framework achieves superior 3D trajectory forecasting. It essentially gives the "eyes" of a dense motion predictor the "brain" of a reasoning-heavy VLM.

Positioning: This work is a significant "System-2" enhancement to world models, moving beyond local extrapolation toward knowledge-aware forecasting.

The Problem: Perception Without Understanding

Current latent world models (like V-JEPA) are excellent at predicting the next few frames of a representation. However, they are "myopic"—they often lack the external knowledge to understand what is being manipulated or why certain events occur.

On the flip side, VLMs (like Qwen-VL or LLaVA) understand global context but are handicapped by:

  1. Sparse Sampling: They only look at a few frames to save compute.
  2. Language Bottleneck: Their features are optimized for text generation, often discarding precise spatial/interaction details.
  3. Physics Blindness: They can describe a task in words while failing to predict the exact 3D coordinates of a hand.

Methodology: The Dual-Temporal Strategy

The core innovation of ThinkJEPA lies in its Dual-Temporal Perception Field. Instead of choosing between dense frames or long horizons, it uses both; a sampling sketch follows the list below.

1. The Dual Pathway

  • The JEPA Branch: Takes a short window of dense frames (e.g., 32 frames) to capture high-frequency motion.
  • The VLM Thinker Branch: Takes a sparse, uniformly sampled set of frames spanning the entire video. It provides the "big picture" (e.g., "The person is picking up a screwdriver").
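To make the dual-temporal split concrete, here is a minimal sampling sketch in PyTorch. The function name, window length, and frame count are illustrative assumptions rather than the paper's exact settings:

```python
import torch

def sample_dual_temporal(video: torch.Tensor,
                         dense_window: int = 32,
                         sparse_frames: int = 8):
    """Split one video into the two temporal views described above.

    video: (T, C, H, W) tensor of decoded frames.
    Returns a dense clip for the JEPA branch and a sparse, uniformly
    spaced set of frames for the VLM thinker branch.
    """
    T = video.shape[0]

    # JEPA branch: a contiguous high-frame-rate window (random start)
    # that preserves high-frequency hand motion.
    start = torch.randint(0, max(T - dense_window, 1), (1,)).item()
    dense_clip = video[start:start + dense_window]

    # VLM branch: a few frames spread uniformly over the *whole* video,
    # giving the thinker long-horizon context at low compute cost.
    idx = torch.linspace(0, T - 1, sparse_frames).long()
    sparse_clip = video[idx]

    return dense_clip, sparse_clip
```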

2. Hierarchical Pyramid Extraction

Rather than just taking the last layer of the VLM, the authors argue that intermediate layers contain better visual reasoning signals. They extract features from a pyramid of layers (e.g., layers 0, 4, 8... 27) and inject them into the JEPA predictor using FiLM (Feature-wise Linear Modulation).
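To illustrate the mechanics, here is a minimal FiLM sketch in PyTorch. The module names, the mean-pooling, and the (1 + gamma) parameterization are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FiLMFromVLM(nn.Module):
    """Modulates JEPA predictor tokens with a conditioning vector
    pooled from one VLM layer's hidden states (FiLM: scale-and-shift)."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # One linear head predicts per-channel scale (gamma) and shift (beta).
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * dim)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor):
        # tokens: (B, N, dim) predictor tokens; cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # (1 + gamma) keeps the block near identity at initialization.
        return tokens * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

def extract_pyramid(hidden_states, layer_ids=(0, 4, 8, 12, 16, 20, 24, 27)):
    """Pool one conditioning vector per selected VLM layer.

    hidden_states: per-layer (B, L, cond_dim) tensors, e.g. from a
    Hugging Face model called with output_hidden_states=True.
    """
    return [hidden_states[i].mean(dim=1) for i in layer_ids]
```

One plausible wiring (an assumption here) pairs each pooled vector with a FiLM block at a matching depth of the predictor, so shallow VLM layers modulate shallow predictor blocks and deeper layers modulate deeper ones.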

ThinkJEPA Model Architecture

Experiments & Results: Crushing the Baselines

The model was tested on EgoDex (dexterous manipulation) and EgoExo4D (skilled activity).

Key Quantitative Wins

  • Accuracy Boost: On EgoDex, ThinkJEPA reached 59.6% accuracy, significantly higher than the 47.1% of a standard V-JEPA predictor.
  • Robust Rollouts: In long-horizon scenarios (recursive rollout), the V-JEPA predictor's error compounds quickly. ThinkJEPA remains stable because the VLM guidance acts as a "semantic anchor" (see the rollout sketch below).
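To see why rollouts are the stress test, here is a minimal sketch of recursive latent rollout: each predicted state is fed back as input, so per-step errors compound unless an external signal keeps the trajectory anchored. The predictor interface is a hypothetical stand-in:

```python
import torch

@torch.no_grad()
def recursive_rollout(predictor, context, vlm_cond, horizon: int):
    """Autoregressive latent rollout.

    predictor: callable (latent tokens, conditioning) -> next latents
    (hypothetical interface). With vlm_cond held fixed, the semantic
    context constrains every step; without it, drift accumulates.
    """
    trajectory = []
    state = context                          # (B, N, D) latent tokens
    for _ in range(horizon):
        state = predictor(state, vlm_cond)   # one latent step ahead
        trajectory.append(state)
    return torch.stack(trajectory, dim=1)    # (B, horizon, N, D)
```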

Quantitative Comparison Table

Visual Evidence

Qualitative results show that ThinkJEPA avoids "temporal collapse" (where the model predicts the hand will stay still or overlap itself). The VLM-guided trajectories are smoother and better aligned with the actual physical task.

Qualitative Trajectory Visualization

Critical Analysis & Conclusion

Takeaways

  • Guidance over Standalone: The paper demonstrates that VLMs work best not as the engine of a world model, but as its navigator.
  • Intermediate Representations Matter: Moving beyond the "last-layer" default in LLMs is crucial for geometric and physical prediction tasks.

Limitations

  • Computational Cost: Running a full VLM (like Qwen3-VL Thinking) even on sparse frames is expensive. The caching strategy helps (see the sketch after this list), but real-time inference may still be a challenge for robotics.
  • Data Scarcity: While it generalizes better than V-JEPA, it still relies on egocentric datasets, which are limited in variety compared to general internet data.
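On the caching point above: since the sparse-frame input to the thinker is fixed per video, its features can be precomputed once offline and reloaded during training. A minimal sketch, assuming vlm is any callable returning the pyramid features:

```python
import torch
from pathlib import Path

def cached_vlm_features(video_id: str, frames, vlm, cache_dir="vlm_cache"):
    """Run the expensive VLM forward pass at most once per video.

    vlm: hypothetical callable mapping sparse frames -> pyramid features.
    Cached tensors are keyed by video_id, so repeated epochs only pay
    for the cheap JEPA branch.
    """
    path = Path(cache_dir) / f"{video_id}.pt"
    if path.exists():
        return torch.load(path)
    feats = vlm(frames)                       # expensive forward pass
    path.parent.mkdir(parents=True, exist_ok=True)
    torch.save(feats, path)
    return feats
```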

Future Work: We can expect this "Thinker-Predictor" architecture to evolve into more complex embodied agents where the VLM handles the "Reasoning" (System 2) and a JEPA/Policy handles the "Intuition/Reflex" (System 1).
