MegaStyle introduces a scalable data curation pipeline that leverages the consistent text-to-image (T2I) mapping of large generative models to create a high-quality style dataset. The authors release MegaStyle-1.4M, a massive dataset of style-consistent pairs, alongside MegaStyle-Encoder for style representation and MegaStyle-FLUX for generalizable style transfer.
TL;DR
MegaStyle addresses the long-standing "content leakage" and inconsistency issues in image style transfer by shifting from self-supervised learning to paired supervision. The authors leverage the innate ability of large T2I models (like Qwen-Image) to map a single text description to a consistent visual style across multiple contents. This results in MegaStyle-1.4M, a dataset that enables state-of-the-art style encoding and transfer models.
Problem & Motivation: The Consistency Gap
Why is style transfer still hard? Most current SOTA models use CLIP-based encoders. However, CLIP's latent space is primarily semantic, not stylistic. When you ask a model to "copy" the style of an image, it often captures the objects in that image too—this is known as content leakage.
Furthermore, existing datasets like WikiArt make for noisy supervision: an artist's "style" actually drifts over time and across subject matter. The authors argue that good training data needs intra-style consistency and inter-style diversity. If these "perfect pairs" (same style, different content) can't be found in the wild, they must be curated synthetically.
Methodology: The MegaStyle Pipeline
The core innovation is Consistent T2I Style Mapping. Instead of running traditional style transfer to create data (which is often low quality), they use a VLM to generate highly detailed style "recipes" that the T2I model can render consistently:
- Style Captioning: Using Qwen3-VL to describe color, light, medium, texture, and brushwork while ignoring objects.
- Content Captioning: Describing only the objects and relationships, ignoring aesthetics.
- Generation: Combining a single style prompt with multiple content prompts to generate "style pairs" that are perfectly aligned in aesthetic but distinct in structure.
Figure 3: Overview of the MegaStyle data curation pipeline.
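
To make the generation step concrete, below is a minimal sketch of fanning a single style recipe out across several content captions. The checkpoint, prompt template, and variable names are illustrative assumptions rather than the paper's exact setup; any capable T2I pipeline would demonstrate the idea.

```python
# Illustrative sketch of MegaStyle-style pair generation: one style "recipe"
# combined with multiple content captions, so outputs share aesthetics but
# not scenes. The prompt template and checkpoint choice are assumptions.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

style = (
    "thick impasto oil painting, warm ochre palette, visible directional "
    "brushwork, soft diffused light, coarse canvas texture"
)
contents = [
    "a fishing boat moored in a quiet harbor",
    "a cat sleeping on a sunlit windowsill",
    "a narrow street lined with market stalls",
]

# Every image rendered from the same style recipe forms one "style group";
# any two group members are a style-consistent, content-distinct pair.
style_group = [
    pipe(prompt=f"{content}, rendered as: {style}").images[0]
    for content in contents
]
```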
MegaStyle-Encoder & SSCL
To measure style similarity accurately, the authors propose Style-Supervised Contrastive Learning (SSCL). Using the style labels known from their synthetic dataset, they force the encoder to pull images with the same "style prompt" together in latent space, regardless of their content. This yields a representation that is truly "style-specific."
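
A minimal sketch of what SSCL could look like, assuming a standard supervised contrastive objective in the spirit of SupCon keyed on style-group IDs; the function name, temperature, and batch layout are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def sscl_loss(embeddings: torch.Tensor, style_ids: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """embeddings: (B, D) encoder outputs; style_ids: (B,) style-group labels."""
    z = F.normalize(embeddings, dim=1)               # compare in cosine space
    sim = z @ z.T / temperature                      # (B, B) similarity logits
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # an image never matches itself

    # Positives: a *different* image generated from the *same* style prompt.
    pos_mask = (style_ids[:, None] == style_ids[None, :]) & ~self_mask

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    # Maximize the mean log-likelihood of each anchor's positives.
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
    return loss[pos_mask.any(dim=1)].mean()
```

Because the labels come from style prompts rather than artists or genres, two images of completely different scenes count as positives whenever they share a recipe, which is exactly what pushes content out of the representation.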
MegaStyle-FLUX
For the transfer model, they build on the FLUX.1 (DiT) architecture. By concatenating reference style tokens with the noisy target tokens and shifting the RoPE (Rotary Positional Embeddings) of the reference, they prevent positional collisions, ensuring the model knows which image provides the style and which provides the canvas.
Figure 7: The architecture of MegaStyle-FLUX showing the token concatenation and shifted RoPE.
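
The token layout can be sketched as follows, assuming FLUX-style 3D RoPE position ids of the form (t, h, w); the shift axis and offset here are illustrative guesses, not the paper's exact values.

```python
import torch

def build_joint_sequence(target_tokens, ref_tokens, h, w):
    """target_tokens, ref_tokens: (B, h*w, D) packed latent tokens."""
    def grid_ids(w_offset=0):
        # One (t, y, x) triple per latent token, row-major over the grid.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        t = torch.zeros_like(ys)  # single-image "time" axis
        return torch.stack([t, ys, xs + w_offset], dim=-1).reshape(-1, 3)

    target_ids = grid_ids()            # target canvas occupies columns [0, w)
    ref_ids = grid_ids(w_offset=w)     # reference shifted to columns [w, 2w)

    tokens = torch.cat([target_tokens, ref_tokens], dim=1)  # one joint sequence
    ids = torch.cat([target_ids, ref_ids], dim=0)           # (2*h*w, 3) RoPE ids
    return tokens, ids
```

Since the reference grid never overlaps the target grid in position space, attention can freely read style cues from the reference without the two images competing for the same coordinates.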
Experiments & Results: A New Benchmark
The results on the new StyleRetrieval benchmark are staggering. While standard CLIP models struggle to find similar styles when content changes, MegaStyle-Encoder identifies the correct style with nearly 90% accuracy.
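
A benchmark of this kind reduces to a nearest-neighbor check over encoder embeddings. The sketch below assumes a top-1, cosine-similarity, leave-one-out protocol; the benchmark's published definition may differ.

```python
import torch
import torch.nn.functional as F

def top1_style_retrieval_accuracy(embeddings, style_ids):
    """Fraction of images whose nearest neighbor shares their style label."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T
    sim.fill_diagonal_(float("-inf"))   # never retrieve the query itself
    nearest = sim.argmax(dim=1)         # most similar other image
    return (style_ids[nearest] == style_ids).float().mean().item()
```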
Figure 1: Visualizations showing high intra-style consistency in the dataset and precise transfer in the FLUX model.
In the human preference study, MegaStyle-FLUX dominated in both style consistency and text alignment. Unlike previous methods such as Attention-Distillation, which often produce "carbon copies" of the reference image, MegaStyle-FLUX successfully creates new content in the target style.
Critical Analysis & Conclusion
The main takeaway from MegaStyle is the transition from "Retrieval-based" styles to "Prompt-defined" styles. By defining style through language and then synthesizing a dataset, the authors have bypassed the "messiness" of real-world art datasets.
Limitations: The model's "imagination" is bounded by the VLM's vocabulary and the T2I model's biases (e.g., "Japanese painting" prompts tend to force traditional Japanese subject matter). However, the scalability of this pipeline suggests that as VLMs improve, MegaStyle can scale to 10M+ images, potentially "solving" the style transfer problem for good.
