LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

WisPaper

Scholar Search

Scholar QA

Pricing

TrueCite

Workspace

Home

Blog

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

[CVPR 2026] LumosX: Harmonizing Identities and Attributes for Personalized Multi-Subject Video Generation

Summary

Problem

Method

Results

Takeaways

Abstract

LumosX is a novel framework for personalized multi-subject video generation that introduces explicit face-attribute dependency modeling. It leverages a new data collection pipeline and "Relational Attention" mechanisms to achieve state-of-the-art results in identity-consistent and subject-consistent video synthesis using the Wan2.1 Diffusion Transformer (DiT) backbone.

TL;DR

LumosX solves the "who is wearing what" problem in AI video generation. By introducing Relational Attention and a specialized data pipeline, it explicitly binds faces to their specific attributes (clothes, accessories), preventing the common "ID-swapping" or "attribute bleeding" seen in previous multi-subject models like Phantom or SkyReels-A2.

The Problem: Identity Confusion in Multi-Subject Scenes

While Text-to-Video (T2V) models have reached incredible levels of realism, personalization remains a "wild west." When you ask a model to generate "A man in a red hoodie and a woman in a blue dress," current DiT (Diffusion Transformer) architectures often mix them up. The model might put the red hoodie on the woman or blend their facial features. This happens because the Inductive Bias of standard attention is too global; it doesn't know that specific visual tokens (the face) must "stay" with specific attribute tokens (the clothes).

Methodology: The "Relational" Secret Sauce

LumosX tackles this through two core innovations: a sophisticated data engine and a relational architectural overhaul.

1. The Relational Data Engine

The researchers built a pipeline that transforms raw video into "relational training pairs." Using Qwen2.5-VL, the system doesn't just caption a video; it identifies subjects and matches them to their attributes (e.g., "Person A: green shirt, glasses"). It then segments these elements to create clean condition images for training.

Data Construction Pipeline

2. Relational Rotary Position Embedding (R2PE)

In a standard DiT, tokens are indexed sequentially. LumosX modifies 3D-RoPE so that a face and its attributes share the same "temporal index" ( $i$ -index) and are grouped in the spatial manifold. This forces the model to realize these tokens belong to the same "entity" before any attention is even calculated.

3. Masked Attention (CSAM & MCAM)

The architecture introduces two specialized masks:

Causal Self-Attention Mask (CSAM): Ensures that video denoising tokens can "see" the reference images (unidirectional attention), but condition images only interact within their own subject group.
Multilevel Cross-Attention Mask (MCAM): A numerical mask that strengthens the bond between a visual subject and its corresponding text description, while suppressing interference from other subjects' descriptions.

Framework Overview

Experiments: SOTA Performance

LumosX was tested against heavyweights like Phantom and SkyReels-A2. While competitors often struggle with "character confusion" in scenes with 2 or 3 people, LumosX maintains identity fidelity (ArcSim: 0.454 in 3-subject scenes, significantly outperforming baselines).

Subject Consistency Comparison

The ablation studies prove that the MCAM is the heavy hitter—setting the hyperparameter $r = 0.5$ provides the optimal balance between keeping the face consistent and keeping the video quality high.

Critical Insight & Conclusion

LumosX proves that attention is not all you need—you also need structured attention. By moving away from "global" attention towards a more "hierarchical/relational" approach, LumosX paves the way for complex cinematic generation where multiple actors can interact without losing their unique visual identities.

The main limitation? It is currently capped by the base model's (Wan2.1-1.3B) inherent capacity. As the authors scale this to 14B+ models, we can expect "LumosX-Large" to potentially replace human editors for simple multi-actor scene assembly.

Find Similar Papers

Try Our Examples

Search for recent papers on multi-subject personalization in diffusion models that utilize structural or relational attention masks to prevent attribute leakage.
Which original research introduced the 3D-RoPE (Rotary Position Embedding) for video transformers, and how does LumosX's R2PE modification specifically break its sequential indexing logic?
Investigate how large vision-language models like Qwen2.5-VL are increasingly being used as "data engine" teachers to create specialized datasets for video generation controllability.

Contents

[CVPR 2026] LumosX: Harmonizing Identities and Attributes for Personalized Multi-Subject Video Generation

1. TL;DR

2. The Problem: Identity Confusion in Multi-Subject Scenes

3. Methodology: The "Relational" Secret Sauce

3.1. 1. The Relational Data Engine

3.2. 2. Relational Rotary Position Embedding (R2PE)

3.3. 3. Masked Attention (CSAM & MCAM)

4. Experiments: SOTA Performance

5. Critical Insight & Conclusion