[CVPR 2025] ViHOI: Bridging the 3D Interaction Gap with Visual Priors from 2D World Knowledge
Abstract

The paper introduces ViHOI, a novel diffusion-based framework for 3D Human-Object Interaction (HOI) synthesis that utilizes 2D visual priors extracted via a Large Vision-Language Model (VLM). By incorporating reference images as conditioning signals, it achieves state-of-the-art performance and significantly outperforms existing methods in generalizing to unseen objects.

TL;DR

Synthesizing realistic 3D Human-Object Interactions (HOI) is notoriously difficult because text alone cannot describe complex physical constraints. ViHOI solves this by using a Large Vision-Language Model (VLM) to extract "interaction priors" from 2D images. By leveraging synthesized images during inference, the model achieves unprecedented generalization to unseen objects, effectively teaching AI to understand how to interact with things by looking at pictures.

The Problem: Text is a Poor Blueprint for Physics

Imagine being told to "pick up a box" without knowing if it's a jewelry box or a massive shipping crate. Existing HOI models face this exact "one-to-many" mapping problem. Textual prompts like "lift the chair" lack:

  • Geometry: The specific shape and size of the object.
  • Affordance: Where the hands should naturally touch.
  • Spatial Dynamics: The relative scale between the human and the environment.

Prior attempts to fix this used LLMs to expand text or added explicit contact maps, but these often feel "disconnected" from the overall body motion, leading to the dreaded "floating object" effect or hands passing through solid wood.

Methodology: Deep-Diving into the Vision-Aware Generator

ViHOI introduces a paradigm shift: Image-as-Motion-Prior. The architecture is divided into two major engines: a VLM-based Prior Extractor and a Vision-aware HOI Generator.

1. Decoupled Prior Extraction (The "Brain")

The authors utilize Qwen2.5-VL as their engine. Instead of taking the final output, they "tap into" different layers of the model (see the sketch after this list):

  • Spatial-Visual Prior: Extracted from an early layer (the 3rd) to capture raw geometric detail.
  • Semantic Control: Extracted from a late layer (the 12th) to keep the motion aligned with the textual intent.
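
To make the layer-tapping concrete, here is a minimal sketch using the Hugging Face transformers API. The layer indices (3 and 12) come from the post; the checkpoint choice, prompt format, and the decision to return raw hidden states without pooling are illustrative assumptions, not the authors' released code.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed checkpoint, for illustration
processor = AutoProcessor.from_pretrained(model_id)
vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def extract_priors(image: Image.Image, text: str):
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": text},
    ]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(vlm.device)
    out = vlm(**inputs, output_hidden_states=True)
    # hidden_states[0] is the input embedding; hidden_states[k] follows block k.
    spatial_visual = out.hidden_states[3]   # early layer: geometry-heavy features
    semantic = out.hidden_states[12]        # late layer: text-aligned semantics
    return spatial_visual, semantic
```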

2. The Q-Former Adapter (The "Filter")

VLM embeddings are massive. Directly feeding them to a diffusion model is like trying to drink from a firehose. ViHOI uses a Q-Former-based adapter—a learnable bottleneck that distills these high-dimensional signals into compact "interaction tokens."
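
Below is a minimal PyTorch sketch of what such a Q-Former-style bottleneck can look like, in the spirit of BLIP-2's design: a small set of learnable queries cross-attends to the long VLM feature sequence and emits a fixed number of compact tokens. All dimensions, the query count, and the depth are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    def __init__(self, vlm_dim=3584, token_dim=512, num_queries=32,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens: the fixed-size "interaction token" slots.
        self.queries = nn.Parameter(torch.randn(1, num_queries, token_dim) * 0.02)
        self.proj_in = nn.Linear(vlm_dim, token_dim)  # squeeze the VLM width first
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(
                d_model=token_dim, nhead=num_heads,
                dim_feedforward=4 * token_dim, batch_first=True,
            )
            for _ in range(num_layers)
        ])

    def forward(self, vlm_feats):              # (B, L, vlm_dim), L can be very long
        mem = self.proj_in(vlm_feats)          # (B, L, token_dim)
        q = self.queries.expand(vlm_feats.size(0), -1, -1)
        for layer in self.layers:
            q = layer(q, mem)                  # queries self-attend, then cross-attend to mem
        return q                               # (B, num_queries, token_dim) interaction tokens
```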

Figure 2: The overall ViHOI pipeline. Training uses rendered ground-truth (GT) images, while inference uses images synthesized by a T2I model such as Nano Banana.

Experiments: Solving the "Unseen Object" Challenge

One of the most impressive feats of ViHOI is its performance on the 3D-FUTURE dataset, which consists of complex furniture categories the model never saw during training.

| Method | FID ↓ | MPJPE ↓ | Contact F1 ↑ |
| :--- | :--- | :--- | :--- |
| CHOIS (SOTA) | 0.77 | 15.43 | 0.70 |
| ViHOI (Ours) | 0.68 | 14.97 | 0.75 |
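
As a side note on the Contact F1 metric above: the post does not spell out the exact protocol, but a common recipe in HOI evaluation thresholds body-vertex-to-object distances into binary contact labels and scores them as a detection problem. A minimal sketch, assuming a 5 cm threshold (the paper's actual threshold is not stated here):

```python
import numpy as np

def contact_f1(pred_dists, gt_dists, thresh=0.05):
    """F1 over binary contact labels derived from per-vertex distances (meters)."""
    pred = pred_dists < thresh                  # predicted contact mask
    gt = gt_dists < thresh                      # ground-truth contact mask
    tp = np.logical_and(pred, gt).sum()         # correctly predicted contacts
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```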

The results show that even when the reference images are "hallucinated" by a Text-to-Image (T2I) model, the high-level semantic cues in those images are enough for ViHOI to generate physically plausible motions.

Figure 4: Qualitative comparison. Competitors like MDM and ROG suffer from object drift and penetration, whereas ViHOI maintains stable contact and realistic trajectories.

Why It Works: The "Implicit Synergy"

The genius of ViHOI lies in its inference strategy. By using a T2I generator (like Nano Banana) to create reference images for a prompt, the system taps into the "World Knowledge" of models trained on billions of images. This knowledge includes how a human should look when lifting a heavy monitor versus a light trash can—priors that are virtually impossible to encode in a standard 3D motion dataset.
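
A compact sketch of that inference flow, reusing extract_priors and QFormerAdapter from the earlier snippets. The diffusers call is a stand-in for whatever T2I generator is used (the post mentions Nano Banana, whose API is not shown here), and motion_diffusion_sample is a hypothetical placeholder for the conditioned motion generator.

```python
from diffusers import DiffusionPipeline

# Any off-the-shelf T2I model works as the "world knowledge" source;
# Stable Diffusion here is an illustrative stand-in.
t2i = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")

def synthesize_hoi(prompt: str, qformer, motion_diffusion_sample):
    ref_image = t2i(prompt).images[0]                     # hallucinated reference image
    spatial_visual, semantic = extract_priors(ref_image, prompt)
    tokens = qformer(spatial_visual)                      # compact interaction tokens
    # motion_diffusion_sample is hypothetical: the HOI diffusion model that
    # denoises a motion sequence conditioned on the extracted priors.
    return motion_diffusion_sample(text=prompt, cond_tokens=tokens, semantic=semantic)
```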

Conclusion & Future Outlook

ViHOI proves that 2D visual foundations are a "cheat code" for 3D physics. By decoupling spatial-visual and semantic priors from different layers of a VLM, the authors have created a plug-and-play module that can upgrade existing motion diffusion models.

Limitations: The model currently lacks fine-grained finger coordination (dexterous manipulation) because existing HOI datasets focus primarily on full-body motion. Adding high-fidelity hand-tracking priors would be the next logical step for this research.

Takeaway: If you want your 3D agents to interact with the world like humans do, stop telling them what to do—show them.
