UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation


[CVPR 2026] UniTalking: Bridging the Gap in Unified Audio-Video Talking Portrait Generation

Abstract

UniTalking is a unified, end-to-end framework for generating high-fidelity talking portraits with tightly synchronized audio and video. Built on a symmetric dual-stream Multi-Modal Diffusion Transformer (MM-DiT) and Flow Matching, it achieves state-of-the-art open-source performance in lip-sync accuracy and personalized voice cloning.

Executive Summary

TL;DR: UniTalking is a 10-billion-parameter unified diffusion framework that generates synchronized talking portraits (video + speech) from text and reference images/audio. By moving away from "cascaded" models (where video is generated to follow pre-existing audio) to a "joint" latent space where both modalities are generated simultaneously, it achieves state-of-the-art lip-sync and visual fidelity among open-source models.

Background: Within the AIGC landscape, high-end models like Google’s Veo and OpenAI’s Sora have set a high bar for audio-visual coherence but remain closed. UniTalking represents a massive leap for the open-source community, providing a reproducible path to high-fidelity, synchronized digital human synthesis.

Problem & Motivation: The "Cascaded" Failures

Most existing "Talking Head" models are cascaded: they take an existing audio file and then try to "drive" a face to match it. This often leads to:

Temporal Drifting: The lips slightly lag or lead the sound.
Lack of Cohesion: The facial expressions don't naturally "feel" like they are producing the sound.
The "Proprietary Barrier": While models like Sora2 handle this well, the academic community lacked a unified, end-to-end architecture that treats visemes (visual mouth shapes) and phonemes (auditory units) as two sides of the same coin.

Methodology: The Power of Symmetry

The core innovation of UniTalking is the Symmetric Dual-Stream MM-DiT.
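The abstract names Flow Matching as the training objective without giving details. A minimal Python sketch of one common formulation follows, assuming a straight-line path between Gaussian noise and the data latent and a single timestep shared by the audio and video latents. Both choices, and all function names, are illustrative assumptions rather than the paper's confirmed setup.

```python
import random

def flow_matching_pair(x0, x1, t):
    """Return (x_t, target_velocity) on a straight path from noise x0 to data x1."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def flow_matching_loss(predict_velocity, x1_audio, x1_video, rng):
    """Mean squared error between predicted and target velocity.

    Assumption: audio and video latents are concatenated and share one
    timestep t, so both modalities are denoised together.
    """
    x1 = x1_audio + x1_video
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
    t = rng.random()                          # timestep in [0, 1)
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)
```

Here `predict_velocity` stands in for the network: any callable that maps the noisy latent and the timestep to a velocity of the same length.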

1. Joint Attention Mechanism

Instead of using cross-attention (where video "looks" at audio), UniTalking concatenates audio and video tokens into a single sequence. Self-attention over that sequence must then model, within the same block, the dependency between a specific audio event (the "pop" of a 'P' sound) and the visual closing of the lips.
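The concatenate-then-attend idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it uses a single head, omits the learned (and, in a dual-stream design, per-modality) query/key/value projections, and all names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def joint_self_attention(audio_tokens, video_tokens):
    """Single-head self-attention over the concatenated [audio; video] sequence.

    Simplification: queries, keys and values are the tokens themselves,
    so only the attention pattern is illustrated.
    """
    tokens = audio_tokens + video_tokens      # one shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(row[j] * tokens[j][d] for j in range(len(tokens)))
                        for d in range(dim)])
    return outputs, weights

# Toy inputs: audio token 0 resembles video token 0 (index 2 in the joint sequence).
audio = [[4.0, 0.0], [0.0, 4.0]]
video = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
out, attn = joint_self_attention(audio, video)
```

In this toy case, row `attn[0]` places more weight on the matching video token than on the non-matching ones, which is the kind of audio-to-video dependency a single joint sequence makes directly available.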

[Figure: Architecture of UniTalking]

2. Anisotropic Position Embeddings

To help the model understand that audio and video share a time axis but different "spatial" properties, the authors used a specialized Rotary Positional Embedding (RoPE). They fixed the spatial position for audio tokens while allowing temporal rotation, effectively forcing the model to focus on temporal synchronization.
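One way to picture this is with three position axes (time, height, width), each rotating its own pair of channels. The sketch below is an assumption-laden illustration: the channel layout, the frequencies, the choice of 0 as the fixed spatial coordinate, and the shared time grid for audio and video are all hypothetical, not taken from the paper.

```python
import math

def rotate_pair(x, y, angle):
    """Standard 2-D rotation used by RoPE on one channel pair."""
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def video_position(frame, row, col):
    return (frame, row, col)

def audio_position(step):
    # Spatial coordinates are pinned to a constant; only time varies.
    return (step, 0, 0)

def apply_rope(vec, position, freqs=(1.0, 0.5, 0.25)):
    """vec has 6 channels: one 2-D pair per axis (time, height, width)."""
    out = []
    for axis, (pos, freq) in enumerate(zip(position, freqs)):
        x, y = vec[2 * axis], vec[2 * axis + 1]
        out.extend(rotate_pair(x, y, pos * freq))
    return out

vec = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
audio_vec = apply_rope(vec, audio_position(3))
video_vec = apply_rope(vec, video_position(3, 1, 2))
```

With these assumptions, an audio token and a video token at the same time step receive the same temporal rotation, while the audio token's spatial channels are left unrotated, so the only positional signal that distinguishes audio tokens from one another is time.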

3. Progressive Training

Because the video branch (based on Wan2.2) is already "smart" and the audio branch starts "from scratch," the authors used a two-step process:

Step 1: Train the audio branch on Text-to-Speech (TTS) to give it basic linguistic "intelligence."
Step 2: Jointly train both branches on 2.3 million curated human-centric video clips to master the "dance" between the two.
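The two steps above can be written down as a small schedule. The sketch assumes the video branch receives no updates in Step 1, which the description implies but does not state outright; the class, stage names, and step counts are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset   # branches that receive gradient updates

SCHEDULE = (
    Stage("audio_pretrain", "text-to-speech pairs", frozenset({"audio"})),
    Stage("joint_training", "curated human-centric video clips",
          frozenset({"audio", "video"})),
)

def run_schedule(schedule, steps_per_stage=2):
    """Count how many update steps each branch would receive."""
    updates = {"audio": 0, "video": 0}
    for stage in schedule:
        for _ in range(steps_per_stage):
            for branch in stage.trainable:
                updates[branch] += 1
    return updates

counts = run_schedule(SCHEDULE)
```

The audio branch is updated in both stages and the video branch only in the second, which mirrors the idea of letting the from-scratch branch catch up before joint training.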

Experiments & Results: Rivaling Closed Models

UniTalking was tested against both open-source SOTA (Universe-1, OVI) and closed-source giants (Sora2).

Audio-Visual Sync Performance

The model achieved a Sync-C score of 4.87, significantly outperforming Universe-1 and narrowing the gap with Sora2. In user evaluations, UniTalking's ratings for audio quality and synchronization improved by over 100% relative to earlier open-source models.
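The summary does not define Sync-C. In lip-sync evaluation it usually refers to the SyncNet confidence score, and the sketch below assumes that definition: given the mean audio-video embedding distance at each candidate time offset, confidence is the median distance minus the minimum. The function name and the toy distances are illustrative only.

```python
import statistics

def sync_confidence(mean_dist_per_offset):
    """SyncNet-style confidence: median distance minus the best (lowest) distance.

    A sharp dip at one offset (clear lip-sync) gives a high score;
    a flat curve (no clear alignment) gives a score near zero.
    """
    best = min(mean_dist_per_offset)
    return statistics.median(mean_dist_per_offset) - best

well_synced = sync_confidence([9.0, 8.0, 3.0, 8.5, 9.5])
unsynced = sync_confidence([8.0, 8.0, 8.0, 8.0, 8.0])
```

Under this reading, a higher Sync-C means the model's lip motion matches the audio at one clear offset rather than ambiguously across many.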

[Figure: Lip-Sync Samples]

Personalized Voice Cloning

Beyond just talking, the model supports personalization. By providing a 3-5 second audio clip, the model can clone the target's timbre, achieving speaker similarity scores (0.703) comparable to dedicated industry solutions like ElevenLabs.
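Speaker similarity is typically reported as the cosine similarity between speaker embeddings of the reference clip and the generated speech. The sketch below assumes that convention; the summary does not say which embedding model was used, and the vectors here are toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

reference_embedding = [3.0, 4.0]   # hypothetical embedding of the reference clip
generated_embedding = [4.0, 3.0]   # hypothetical embedding of the cloned speech
similarity = cosine_similarity(reference_embedding, generated_embedding)
```

On this scale a score such as 0.703 indicates embeddings that point in broadly the same direction without being identical.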

Critical Analysis & Conclusion

Takeaway

UniTalking suggests that a unified architecture combined with symmetric scaling is among the most promising paths for complex multi-modal generation. By leveraging a high-quality visual prior (Wan2.2) and forcing inter-modal dependency through joint attention, the researchers have created a framework that is both efficient and accurate.

Limitations

Despite its success, the model currently struggles with:

Multi-person scenes: It is primarily optimized for single-portrait talking heads.
Data Scale: Its training data still falls somewhat short of the "limitless" data access of proprietary giants.

Future Outlook

The "UniTalking" framework isn't just for speech. The authors suggest this architecture can be extended to general video-to-audio (foley, music, environmental sounds) by simply changing the training data, paving the way for a truly "Universal Audio-Video Generator."

