DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary


[CVPR 2025] DISPLAY: Shifting from Dense Constraints to Sparse Guidance for Masterful Human-Object Interactions


Abstract

DISPLAY is a novel human-centric video generation framework that achieves directable and physically consistent Human-Object Interaction (HOI) by using sparse motion guidance (wrist joints and shape-agnostic bounding boxes). Built on the Wan2.1-14B DiT-based flow matching model, it achieves state-of-the-art results in visual quality and interaction fidelity (CA score: 0.891).

TL;DR

Generating videos where humans interact naturally with objects has long been a "valley of death" for AI. Most models either deform the object or fail to make the hands actually "touch" the item. DISPLAY (Directable human-object Interaction video generation via SParse motion guidance and muLti-task AuxiliarY) solves this by abandoning cluttered dense poses in favor of sparse wrist points and bounding boxes, combined with an attention mechanism that "stresses" the object's importance.

The "Representation Asymmetry" Trap

Why do current SOTA models fail at HOI? The authors identify a fundamental flaw: representation asymmetry. In most pose-guided models, the human hand is defined by intricate 21-keypoint skeletons or 3D meshes, while the object is often just a vague latent blob. When the diffusion model trains, it over-optimizes for the dense hand signals and treats the object as an afterthought. This leads to clipping (fingers passing through solid matter) and "melting" objects.

Methodology: The Power of Less

DISPLAY takes a "less is more" approach. By using only the wrist coordinates (the end-effector) and a shape-agnostic bounding box, the model is forced to learn the relationship between the two rather than just filling in pixels.
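To make the idea of sparse guidance concrete, here is a minimal sketch of how one frame of wrist points and a bounding box could be rendered into a small conditioning map. The function name `rasterize_sparse_guidance`, the two-channel layout, and the disc radius are assumptions made for illustration; the post does not say how DISPLAY actually encodes its guidance.

```python
import numpy as np

def rasterize_sparse_guidance(wrists, box, height, width, radius=2):
    """Render one frame of sparse guidance into a 2-channel map (hypothetical encoding).

    wrists: list of (x, y) pixel coordinates, one per visible wrist.
    box:    (x0, y0, x1, y1) object bounding box in pixels, or None.
    Returns a float32 array of shape (2, height, width):
    channel 0 marks wrist points, channel 1 marks the filled box.
    """
    guidance = np.zeros((2, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for wx, wy in wrists:
        # Small disc around each wrist so the point survives downsampling.
        inside = (xs - wx) ** 2 + (ys - wy) ** 2 <= radius ** 2
        guidance[0][inside] = 1.0
    if box is not None:
        x0, y0, x1, y1 = box
        # Filled rectangle: carries location and extent, but no object shape.
        guidance[1, y0:y1, x0:x1] = 1.0
    return guidance

frame = rasterize_sparse_guidance(
    wrists=[(10, 12)], box=(14, 8, 24, 20), height=32, width=32
)
print(frame.shape, frame[0].sum(), frame[1].sum())
```

The point of the sketch is how little information the map carries: a couple of dots and a rectangle per frame, with nothing about finger pose or object silhouette.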

1. The Architecture: ControlNet-style Injection

The framework utilizes a frozen Wan2.1-14B backbone. A "Condition Branch" (cloned DiT layers) processes the sparse guidance and injects it into the main generation flow.
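A toy sketch of ControlNet-style injection may help here. Everything in it is a stand-in: `block` is a single residual layer rather than a DiT block, and the zero-initialized projection `zero_proj` follows the original ControlNet recipe, which the post does not confirm for DISPLAY. The sketch only shows the wiring: a frozen path, a cloned path that sees the guidance, and a residual injection.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def block(x, weight):
    """Stand-in for one transformer block: a residual linear layer with tanh."""
    return x + np.tanh(x @ weight)

# Frozen backbone weights, and a clone that the condition branch would train.
backbone_w = rng.normal(scale=0.1, size=(dim, dim))
condition_w = backbone_w.copy()
# Zero-initialized projection (assumption): the branch adds nothing at first.
zero_proj = np.zeros((dim, dim))

def forward(tokens, guidance_tokens, proj):
    hidden = block(tokens, backbone_w)                   # frozen path
    cond = block(tokens + guidance_tokens, condition_w)  # cloned path sees guidance
    return hidden + cond @ proj                          # residual injection

tokens = rng.normal(size=(4, dim))
guidance_tokens = rng.normal(size=(4, dim))
out_start = forward(tokens, guidance_tokens, zero_proj)
```

With the projection at zero, the output equals the frozen backbone's output, so training starts from the pretrained behavior and the condition branch gains influence only as the projection is learned.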

[Figure: Model Architecture]

2. Object-Stressed Attention (OSA)

To find the object in the "noise," the researchers introduced Object-Stressed Attention. It scales the attention scores that involve object tokens ($x_{obj}$), relative to the rest of the scene ($x_{else}$), using a hyperparameter $\alpha$. This ensures the model pays extra attention to the contact boundaries where the hand meets the object.

$$
\text{Object-Stressed-Attention} = \mathrm{softmax}\left(\frac{1}{d}\begin{bmatrix} \alpha^{2} x_{obj} x_{obj}^{\top} & \alpha x_{obj} x_{else}^{\top} \\ \alpha x_{else} x_{obj}^{\top} & x_{else} x_{else}^{\top} \end{bmatrix}\right) \dots
$$
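The block structure of that formula can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' code: the function name `object_stressed_scores` and the boolean mask `is_object` are invented here, tokens are used directly in place of separate query and key projections, and only the softmax weights are computed (the elided remainder of the formula is left out). The sketch keeps the $1/d$ scaling as written; standard scaled dot-product attention divides by $\sqrt{d}$ instead.

```python
import numpy as np

def object_stressed_scores(x, is_object, alpha):
    """Softmax attention weights with object tokens stressed by alpha.

    x:         (n, d) token features.
    is_object: (n,) boolean mask, True for object tokens.
    alpha:     stress factor; alpha = 1 recovers unweighted attention.
    """
    d = x.shape[1]
    scale = np.where(is_object, alpha, 1.0)
    # Outer product gives alpha^2 for object-object pairs, alpha for
    # object-else pairs, and 1 for else-else pairs.
    logits = (x @ x.T) * np.outer(scale, scale) / d
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
is_object = np.array([True, True, False, False, False, False])
weights = object_stressed_scores(x, is_object, alpha=1.5)
```

Multiplying the logits by the outer product of per-token scales is equivalent to scaling the object tokens by `alpha` before the dot product, which is why the object-object block picks up `alpha ** 2`.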

Experiments: Superior Fidelity and Control

The quantitative results show a clear lead. DISPLAY isn't just better at looks (FID); it's significantly better at Contact Agreement (CA) and Object Fidelity (O-CLIP).

[Figure: Quantitative Results]

In qualitative testing (below), note how DISPLAY (bottom row) maintains the structural integrity of the object compared to HunyuanCustom and VACE, which tend to blur the interaction zone.

[Figure: Qualitative Comparison]

Deep Insights: Why It Works

The brilliance of DISPLAY lies in its Multi-Task Auxiliary Training. High-quality HOI data (like someone picking up a specific phone) is rare. By training on a mix of:

HOI-annotated clips (50 hours)
General human motion (100 hours)

the model develops a "physical intuition" for how a human moves, and then applies that intuition to the sparse object boxes. The Bernoulli-sampled masking during training further ensures the model can handle missing frames, enabling smooth long-video generation (up to 1 minute) without the dreaded "drift."
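As a rough illustration of Bernoulli-sampled masking, the sketch below draws an independent keep/drop flag for every frame and zeroes out the dropped ones. The helper names, the `keep_prob` value, and the choice to mask per-frame guidance maps are all assumptions; the post does not say which tensors are masked or with what probability.

```python
import numpy as np

def sample_frame_mask(num_frames, keep_prob, rng):
    """Draw an independent Bernoulli(keep_prob) keep/drop flag per frame."""
    return rng.random(num_frames) < keep_prob

def mask_frames(frames, keep):
    """Zero out dropped frames; frames has shape (num_frames, ...)."""
    shape = (-1,) + (1,) * (frames.ndim - 1)
    return frames * keep.reshape(shape)

rng = np.random.default_rng(0)
frames = np.ones((16, 2, 8, 8), dtype=np.float32)  # e.g. per-frame guidance maps
keep = sample_frame_mask(len(frames), keep_prob=0.7, rng=rng)
masked = mask_frames(frames, keep)
print(int(keep.sum()), "of", len(frames), "frames kept")
```

Because each training sample sees a different random subset of frames, a model trained this way has to produce plausible motion even where guidance is absent, which is the property the post credits for stable long-video generation.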

Summary & Limitations

DISPLAY opens new doors for e-commerce (virtual product demos) and entertainment. While it currently struggles with non-rigid objects (like a person squeezing a plush toy), its success with rigid objects and sparse guidance sets a new standard for controllable video synthesis.

Takeaway: If you want a model to understand physics, stop over-constraining it with skeletons. Give it a goal (the wrist) and a target (the box), and let the attention mechanism do the heavy lifting.

