RefAlign is a novel Reference-to-Video (R2V) generation framework that introduces a Reference Alignment (RA) loss to align Diffusion Transformer (DiT) internal features with the semantic space of Vision Foundation Models (VFMs). It achieves SOTA performance on the OpenS2V-Eval benchmark, significantly improving identity consistency and reducing multi-subject confusion.
TL;DR
RefAlign is a training-only strategy that forces video diffusion models to "understand" reference images by aligning their internal features with Vision Foundation Models (VFMs). By using a contrastive-style Reference Alignment (RA) loss, it eliminates the common "copy-paste" look of AI videos and multi-subject confusion without adding a single millisecond to inference time.
Background: The Modality Mismatch Problem
Reference-to-Video (R2V) generation is the holy grail of personalized content creation: users specify a character (via image) and an action (via text). However, current models often treat the reference image as a "texture patch" pasted onto the video (the copy-paste artifact), or they get confused when two different characters appear in the same scene (multi-subject confusion).
The root cause is a modality mismatch: the VAE latents used for generation are optimized for pixel reconstruction, while text prompts live in a semantic space. When both are fed into a DiT, the model has no explicit bridge between the reference image's pixels and the "identity" of the subject they depict.
Methodology: Anchor, Pull, and Push
The core innovation of RefAlign is the Reference Alignment (RA) Loss. Instead of just letting the model learn implicitly, the authors use a "teacher" model (a VFM like DINOv3) to provide a semantic anchor.
1. The Architecture
RefAlign builds on the Wan2.1 backbone. During training, it extracts intermediate features from the DiT's reference branch and projects them, via a lightweight alignment MLP, into the VFM's feature space.
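The paper's projector details aren't reproduced here, but conceptually it is a small trainable head that maps DiT hidden states into the VFM's embedding dimension. A minimal sketch, assuming a two-layer MLP over per-token features (the class name and all dimensions are illustrative placeholders, not the authors' code):

```python
import torch
import torch.nn as nn

class AlignmentProjector(nn.Module):
    """Maps intermediate DiT tokens into the VFM feature space.
    Hypothetical sketch: depth, width, and dimensions are placeholders."""

    def __init__(self, dit_dim: int = 5120, vfm_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dit_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, vfm_dim),
        )

    def forward(self, dit_tokens: torch.Tensor) -> torch.Tensor:
        # dit_tokens: (batch, num_ref_tokens, dit_dim), taken from an
        # intermediate block of the DiT's reference branch during training.
        return self.mlp(dit_tokens)
```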

2. The Loss Function: Positive vs. Negative
RefAlign doesn't just pull features together; it performs a sophisticated alignment:
- Positive Term: Pulls the DiT's representation of Subject A closer to the VFM's representation of Subject A.
- Negative Term: Pushes the DiT's representation of Subject A away from that of Subject B. This is critical for preventing "identity leakage," where two characters start to look like each other.
The beauty of this approach is that the VFM and the alignment MLP are discarded at inference time. You get a "smarter" model with zero extra VRAM or latency cost.
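The paper's exact loss formulation isn't reproduced here, but the pull/push behaviour described above can be sketched as an InfoNCE-style contrastive objective over per-subject embeddings: the diagonal of a similarity matrix supplies the positive (pull) term, and the off-diagonal entries supply the negative (push) term. A minimal sketch, assuming one pooled embedding per subject and a placeholder temperature:

```python
import torch
import torch.nn.functional as F

def reference_alignment_loss(dit_emb: torch.Tensor,
                             vfm_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Contrastive-style sketch of a Reference Alignment loss.

    dit_emb: (num_subjects, dim) projected DiT embeddings, one per reference subject.
    vfm_emb: (num_subjects, dim) frozen VFM embeddings of the same subjects.
    The diagonal (subject i vs. its own VFM anchor) is the positive term;
    off-diagonal entries (subject i vs. other subjects' anchors) act as negatives.
    """
    dit_emb = F.normalize(dit_emb, dim=-1)
    vfm_emb = F.normalize(vfm_emb, dim=-1)
    logits = dit_emb @ vfm_emb.T / temperature          # (S, S) similarity matrix
    targets = torch.arange(dit_emb.size(0), device=dit_emb.device)
    return F.cross_entropy(logits, targets)             # pull diagonal, push off-diagonal
```

Note that only `dit_emb` carries gradients; the VFM anchors stay frozen, and both the projector and the VFM are dropped at sampling time, which is where the zero-overhead claim comes from.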
Experimental Battleground
The researchers tested RefAlign on the OpenS2V-Eval benchmark, comparing it against industry giants like Kling1.6 and academic SOTAs like VINO.

Key Findings:
- Performance Jump: RefAlign-14B hit a TotalScore of 60.42%, establishing a new state-of-the-art.
- Identity Fidelity: The FaceSim and NexusScore (consistency) metrics saw significant boosts, proving that the model actually "knows" who it is generating.
- Ablation Success: Removing the negative term L_neg led to immediate drops in naturalness and identity separation, proving that "pushing" subjects apart is as important as "pulling" them toward the ground truth.

Critical Insight: Why DINOv3?
The authors explored different encoders for alignment:
- DINOv3: Excellent for Identity & Space. It provides strong instance-level discriminability.
- SigLIP2: Better for Global Semantics but can accidentally encode background noise into the subject's identity.
- Qwen2.5-VL: Good for Textual Alignment but adds heavy overhead and can lose local spatial stability.
DINOv3 emerged as the winner for R2V because it focuses on "what" the object is without being distracted by textual bias.
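For context on what the alignment anchor actually is: it is simply frozen features from the chosen encoder applied to the reference image. A hedged sketch using Hugging Face transformers; the checkpoint id, the pooling, and the helper name are placeholders, not the authors' pipeline:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

# Placeholder checkpoint id; substitute whichever DINOv3 weights you have access to.
CKPT = "facebook/dinov3-vitl16-pretrain-lvd1689m"

processor = AutoImageProcessor.from_pretrained(CKPT)
vfm = AutoModel.from_pretrained(CKPT).eval()

@torch.no_grad()
def vfm_anchor(image) -> torch.Tensor:
    """Return a pooled, frozen VFM embedding for one reference image (PIL.Image)."""
    inputs = processor(images=image, return_tensors="pt")
    tokens = vfm(**inputs).last_hidden_state        # (1, num_tokens, dim)
    return tokens.mean(dim=1)                       # simple pooling; the paper's pooling may differ
```

Swapping this encoder for SigLIP2 or a VLM vision tower behind the same interface is, at a high level, what the comparison above varies.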
Conclusion & Future Outlook
RefAlign proves that the bottleneck in controllable video generation isn't just "more data" or "bigger models," but how we supervise the internal representations. By using VFMs as a "guide" during training, we can bridge the gap between pixel reconstruction and semantic understanding.
Limitations: The model is currently capped at 81 frames and relies on the diversity of the training dataset (OpenS2V-5M). Future iterations will likely combine multiple VFM signals to capture both the "identity" of a person and the "physics" of their motion.
