DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

[CVPR 2025] DyaDiT: Beyond Audio-to-Motion — Shaping Socially Aware Digital Humans

Abstract

DyaDiT is a multi-modal Diffusion Transformer designed to generate socially consistent dyadic (two-person) conversational gestures. It leverages a novel Orthogonalization Cross Attention (ORCA) module and a motion dictionary to synthesize upper-body movements that align with speech, partner dynamics, and social contexts like relationship types and personality traits.

TL;DR

DyaDiT is a new multi-modal Diffusion Transformer that generates human gestures for dyadic conversations. Unlike prior models that just "watch the audio," DyaDiT understands the social fabric of the interaction—considering who the speakers are (friends vs. strangers) and how they behave (extraverted vs. neurotic). By introducing a novel audio-disambiguation module (ORCA), it achieves state-of-the-art realism and diversity.

Background: The "Social Gap" in AI Gestures

We are entering an era of Large Language Models (LLMs) that sound human, but their avatars often look like stiff puppets. The core issue? Most AI gesture models treat human interaction as a simple mapping of Audio -> Movement.

In reality, conversation is a dyadic dance. We react to our partner, we interrupt, and our movements are filtered through our personalities and relationships. Existing models struggle with:

Audio Entanglement: When two people talk over each other, models get confused about who is responding to whom.
Social Blindness: A person talking to their boss moves differently than when talking to a spouse, yet current models produce the same "generic" gestures for both.

Methodology: The Architecture of Interaction

DyaDiT addresses these gaps through three innovative components:

1. ORCA: Solving the "Interruption" Problem

The Orthogonalization Cross Attention (ORCA) module is the secret sauce for handling dyadic audio. It uses a projection mechanism to strip away redundant information between the two audio streams. This ensures the model identifies specific reactive cues—like an intake of breath or a change in pitch—without getting "muddied" by the other person's voice.
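The projection idea can be illustrated with a minimal NumPy sketch. It is not the paper's ORCA implementation: it assumes per-frame audio features of shape (T, D) and simply subtracts from the speaker's features their projection onto the partner's features, so that what remains is orthogonal to the partner stream. The function name, shapes, and random inputs are hypothetical.

```python
import numpy as np

def remove_shared_component(a, b, eps=1e-8):
    """Per frame, subtract from `a` its projection onto `b`.

    a, b: arrays of shape (T, D) holding per-frame audio features
    (hypothetical shapes, chosen only for illustration).
    """
    # Projection coefficient for every frame: (a . b) / (b . b)
    dot_ab = np.sum(a * b, axis=-1, keepdims=True)
    dot_bb = np.sum(b * b, axis=-1, keepdims=True)
    coeff = dot_ab / (dot_bb + eps)
    # What is left of `a` is orthogonal to `b` in each frame
    return a - coeff * b

rng = np.random.default_rng(0)
speaker = rng.normal(size=(4, 8))   # stand-in for the speaker's audio features
partner = rng.normal(size=(4, 8))   # stand-in for the partner's audio features

speaker_only = remove_shared_component(speaker, partner)
# Per-frame inner product with the partner stream, close to zero after projection
residual = np.sum(speaker_only * partner, axis=-1)
```

In a full model, the orthogonalized features would presumably feed the cross-attention keys and values. That wiring is not described in this summary and is left out here.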

Figure 2: The DyaDiT pipeline, showing the fusion of audio, social context, and partner motion.

2. Social-Context Conditioning

The model utilizes FiLM (Feature-wise Linear Modulation) to inject relationship types (Family, Dating, Stranger, Friend) and Big Five personality scores into the Transformer blocks. This allows the diffusion process to "steer" the motion towards being more expressive (high extraversion) or perhaps more reserved.
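FiLM itself is a simple operation: a conditioning vector is mapped to a per-channel scale (gamma) and shift (beta) that modulate the hidden features. The sketch below shows that mechanism in NumPy under stated assumptions. The one-hot relationship encoding, the order and [0, 1] range of the Big Five scores, and the single linear projection are all hypothetical and are not taken from the paper.

```python
import numpy as np

RELATIONSHIPS = ["Family", "Dating", "Stranger", "Friend"]

def build_condition(relationship, big_five):
    """Concatenate a one-hot relationship label with five personality scores."""
    one_hot = np.zeros(len(RELATIONSHIPS))
    one_hot[RELATIONSHIPS.index(relationship)] = 1.0
    return np.concatenate([one_hot, np.asarray(big_five, dtype=float)])

def film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(cond) * hidden + beta(cond)."""
    gamma = cond @ w_gamma + b_gamma   # per-channel scale, shape (D,)
    beta = cond @ w_beta + b_beta      # per-channel shift, shape (D,)
    return gamma * hidden + beta       # broadcast over the T time steps

rng = np.random.default_rng(0)
T, D = 6, 16
# Hypothetical scores; the trait order here is an assumption
cond = build_condition("Friend", [0.9, 0.4, 0.5, 0.6, 0.2])
hidden = rng.normal(size=(T, D))      # stand-in for Transformer block activations
w_gamma = rng.normal(scale=0.1, size=(cond.size, D))
w_beta = rng.normal(scale=0.1, size=(cond.size, D))

out = film(hidden, cond, w_gamma, np.ones(D), w_beta, np.zeros(D))
```

Setting the scale bias to one and the shift bias to zero means that with zero projection weights the layer is an identity, a common choice so that conditioning starts out as a small perturbation.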

3. Motion Dictionary (MD)

To avoid "average" or "blurry" movements, a learnable dictionary of motion primitives provides a prior for what natural gestures look like, enhancing the stylistic richness of the output.
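One plausible way to read "a learnable dictionary of motion primitives" is as a bank of vectors that motion tokens attend over. The sketch below follows that reading and is only an assumption: this summary does not say how the paper queries or updates its dictionary. The dictionary size, the feature width, and the residual addition are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dictionary_lookup(tokens, dictionary):
    """Enrich motion tokens with a soft mixture of dictionary entries.

    tokens:     (T, D) motion features
    dictionary: (K, D) motion primitives (learned in a real model)
    """
    scores = tokens @ dictionary.T / np.sqrt(tokens.shape[-1])  # (T, K)
    weights = softmax(scores, axis=-1)                          # each row sums to 1
    return tokens + weights @ dictionary, weights               # residual update

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))
dictionary = rng.normal(size=(32, 16))   # 32 primitives, a hypothetical size

enriched, weights = dictionary_lookup(tokens, dictionary)
```

Because every token is pulled toward a mixture of concrete primitives rather than toward the dataset mean, a prior of this kind could counteract the averaged, blurry motion described above.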

Experiments & Results: Better than Reality?

The authors tested DyaDiT on the Seamless Interaction Dataset, a massive 182-hour corpus of natural human behavior.

Quantitative Dominance: DyaDiT achieved the lowest Fréchet Distance (FD) among the compared methods, meaning its generated motion distribution is the closest to that of real human motion.
User Preference: In A/B tests, users preferred DyaDiT over the previous SOTA (ConvoFusion) by a landslide (73.9%).
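For readers unfamiliar with the FD metric cited above, here is a minimal sketch of the standard Fréchet distance between two Gaussians fitted to feature sets. The feature extractor the paper uses is not described in this summary, so random vectors stand in for motion features. The function and variable names are hypothetical.

```python
import numpy as np

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to x and y, each of shape (N, D)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative in theory
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))                 # stand-in for real motion features
close = rng.normal(size=(500, 8))                # same distribution as `real`
shifted = rng.normal(loc=2.0, size=(500, 8))     # mean shifted away from `real`

fd_close = frechet_distance(real, close)
fd_shifted = frechet_distance(real, shifted)
```

A lower value means the two feature distributions have closer means and covariances, which is why `fd_close` comes out well below `fd_shifted`.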

Table 1: DyaDiT consistently outperforms baselines in both realism (FD) and diversity.

Surprisingly, DyaDiT even slightly outperformed Ground Truth data in user studies. This is likely because real-world motion capture often contains "jitter" or sensor noise, whereas the Diffusion Transformer acts as a natural regularizer, producing "idealized," smooth, and highly expressive versions of human movement.

Critical Insight: The Future of Digital Humans

DyaDiT makes a strong case that context is king. By moving away from audio-only motion mapping and toward socially aware synthesis, this research paves the way for AI avatars that don't just speak, but truly interact.

Limitations: Currently, the model is limited to upper-body gestures. As the authors look toward the future, extending this to full-body coordination and facial micro-expressions will be the final frontier for passing the "visual Turing Test" in digital interactions.

Conclusion

DyaDiT represents a shift from purely audio-driven generation to socially conditioned generation. It's a blueprint for the next generation of NPCs and AI assistants: agents that know who they are talking to and move accordingly.
