RefAlign: Solving the "Copy-Paste" Artifact in Video Generation via Explicit Representation Alignment
Abstract

RefAlign is a novel Reference-to-Video (R2V) generation framework that introduces a Reference Alignment (RA) loss to align Diffusion Transformer (DiT) internal features with the semantic space of Vision Foundation Models (VFMs). It achieves SOTA performance on the OpenS2V-Eval benchmark, significantly improving identity consistency and reducing multi-subject confusion.

TL;DR

RefAlign is a training-only strategy that forces video diffusion models to "understand" reference images by aligning their internal features with Vision Foundation Models (VFMs). By using a contrastive-style Reference Alignment (RA) loss, it eliminates the common "copy-paste" look of AI videos and multi-subject confusion without adding a single millisecond to inference time.

Background: The Modality Mismatch Problem

Reference-to-Video (R2V) generation is the holy grail of personalized content creation—allowing users to specify a character (via image) and an action (via text). However, current models often treat the reference image as a "texture patch" to be stuck onto the video (the copy-paste artifact) or get confused when two different characters appear in the same scene (multi-subject confusion).

The root cause is Modality Mismatch: The VAE latents used for generation are optimized for reconstruction, while text prompts are semantic. When these two are fed into a DiT, the model lacks an explicit bridge to align the pixels of the image with the "identity" of the subject.

Methodology: Anchor, Pull, and Push

The core innovation of RefAlign is the Reference Alignment (RA) Loss. Instead of just letting the model learn implicitly, the authors use a "teacher" model (a VFM like DINOv3) to provide a semantic anchor.

1. The Architecture

RefAlign utilizes the Wan2.1 backbone. During training, it extracts intermediate features from the DiT's reference branch and projects them to match the VFM's feature space.

Model Architecture
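The projection step described above can be sketched in a few lines. This is not the paper's implementation; it is a minimal numpy illustration assuming hypothetical feature dimensions (1536 for the DiT hidden size, 768 for the VFM space) and a small two-layer MLP with a tanh-based GELU approximation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: DiT hidden size and VFM (e.g. DINOv3) feature size.
DIT_DIM, VFM_DIM = 1536, 768

# A small projection MLP (linear -> GELU -> linear) mapping intermediate
# DiT reference-branch features into the VFM's semantic space.
W1 = rng.standard_normal((DIT_DIM, DIT_DIM)) * 0.02
W2 = rng.standard_normal((DIT_DIM, VFM_DIM)) * 0.02

def project_to_vfm(dit_features):
    h = dit_features @ W1
    # tanh approximation of GELU
    h = h * 0.5 * (1.0 + np.tanh(0.79788456 * (h + 0.044715 * h**3)))
    return h @ W2

tokens = rng.standard_normal((4, DIT_DIM))   # 4 reference-branch tokens
aligned = project_to_vfm(tokens)              # shape (4, VFM_DIM)
```

During training, `aligned` would be compared against the frozen VFM's features of the reference image; at inference the MLP is simply dropped.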

2. The Loss Function: Positive vs. Negative

RefAlign doesn't just pull features together; it performs a sophisticated alignment:

  • Positive Term: Pulls the DiT's representation of Subject A closer to the VFM's representation of Subject A.
  • Negative Term: Pushes the DiT's representation of Subject A away from Subject B. This is critical for preventing "identity leakage" where two characters start to look like each other.
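The pull/push structure above can be sketched as a simple contrastive-style objective. The paper's exact formulation is not reproduced here; this is a minimal numpy sketch in which the margin hyperparameter and the plain cosine-similarity form are assumptions for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ra_loss(dit_feat, vfm_pos, vfm_neg, margin=0.2):
    """Reference Alignment loss sketch:
    - positive term pulls the DiT feature toward the matching VFM anchor
      (Subject A's own embedding),
    - negative term pushes it away from other subjects' anchors
      (Subject B), once similarity exceeds a margin."""
    pos = 1.0 - cosine_sim(dit_feat, vfm_pos)                # pull toward A
    neg = max(0.0, cosine_sim(dit_feat, vfm_neg) - margin)   # push away from B
    return pos + neg
```

A feature already aligned with its own VFM anchor incurs near-zero loss, while one drifting toward the other subject's anchor is penalized by both terms, which is exactly the "identity leakage" behavior the negative term is meant to suppress.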

The beauty of this approach is that the VFM and the alignment MLP are discarded at inference time. You get a "smarter" model with zero extra VRAM or latency cost.

Experimental Battleground

The researchers tested RefAlign on the OpenS2V-Eval benchmark, comparing it against industry giants like Kling1.6 and academic SOTAs like VINO.

Qualitative Results

Key Findings:

  • Performance Jump: RefAlign-14B hit a TotalScore of 60.42%, establishing a new state-of-the-art.
  • Identity Fidelity: The FaceSim and NexusScore (consistency) metrics saw significant boosts, proving that the model actually "knows" who it is generating.
  • Ablation Success: Removing the L_neg (negative loss) led to immediate drops in naturalness and identity separation, proving that "pushing" subjects apart is as important as "pulling" them toward the ground truth.

Ablation Comparison

Critical Insight: Why DINOv3?

The authors explored different encoders for alignment:

  1. DINOv3: Excellent for Identity & Space. It provides strong instance-level discriminability.
  2. SigLIP2: Better for Global Semantics but can accidentally encode background noise into the subject's identity.
  3. Qwen2.5-VL: Good for Textual Alignment but adds heavy overhead and can lose local spatial stability.

DINOv3 emerged as the winner for R2V because it focuses on "what" the object is without being distracted by textual bias.

Conclusion & Future Outlook

RefAlign proves that the bottleneck in controllable video generation isn't just "more data" or "bigger models," but how we supervise the internal representations. By using VFMs as a "guide" during training, we can bridge the gap between pixel reconstruction and semantic understanding.

Limitations: The model is currently capped at 81 frames and relies on the diversity of the training dataset (OpenS2V-5M). Future iterations will likely combine multiple VFM signals to capture both the "identity" of a person and the "physics" of their motion.
