[CVPR 2025] PixelSmile: Mastering the Continuous Manifold of Facial Expressions

The paper introduces PixelSmile, a diffusion-based facial expression editing framework that achieves fine-grained, linearly controllable manipulation. It leverages the new Flex Facial Expression (FFE) dataset with continuous affective annotations and a fully symmetric joint training paradigm to reach SOTA performance in expression disentanglement and identity preservation.

TL;DR

Facial expression editing has moved beyond simple "Happy vs. Sad" swaps. PixelSmile addresses the "uncanny valley" of overlapping emotions (like the subtle line between Surprise and Fear) by introducing a diffusion framework trained on a new 60,000-image dataset with continuous affective annotations. By using symmetric joint training and textual latent interpolation, it allows for smooth, linear control over expression intensity while keeping the person's identity rock-solid.

The Problem: The "Semantic Overlap" Trap

Humans don't express emotions in discrete buckets. Expressions exist on a continuous semantic manifold. Current generative models, however, are often trained on rigid, one-hot labels. This "one-hot bottleneck" leads to two major failures:

  1. Structured Confusion: Trying to make someone "Surprised" might accidentally make them look "Fearful" because the model hasn't learned the boundary between these overlapping features.
  2. Identity Drift: Increasing the intensity of an emotion often warps the person's facial structure so much they become unrecognizable.

[Figure: Semantic Overlap Observation]

Methodology: Precision Through Symmetry

PixelSmile breaks these limitations through three core innovations:

1. Continuous Supervision (FFE Dataset)

The authors built the Flex Facial Expression (FFE) dataset. Instead of labeling an image as just "Angry," they provide a 12-dimensional vector of scores. This allows the model to understand that a face can be 80% Angry and 20% Disgusted simultaneously, reflecting real-world complexity.
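Conceptually, such a continuous annotation can be sketched as a normalized soft-label vector. The category names and helper below are illustrative assumptions, not the FFE dataset's actual schema:

```python
# Hypothetical 12 emotion categories; the real FFE label set may differ.
EMOTIONS = [
    "happy", "sad", "angry", "surprised", "fearful", "disgusted",
    "contempt", "neutral", "confused", "embarrassed", "excited", "bored",
]

def make_annotation(raw_scores):
    """Normalize 12 raw affect scores into a soft-label vector summing to 1."""
    if len(raw_scores) != len(EMOTIONS):
        raise ValueError("expected one score per emotion category")
    total = sum(raw_scores)
    return {emotion: score / total for emotion, score in zip(EMOTIONS, raw_scores)}

# A face that is mostly angry with some disgust, as in the paper's example.
label = make_annotation([0, 0, 8, 0, 0, 2, 0, 0, 0, 0, 0, 0])
# label["angry"] == 0.8, label["disgusted"] == 0.2
```

The key difference from one-hot labels is that mass can be spread across several categories, so the model is never told that "Angry" and "Disgusted" are mutually exclusive.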

2. Fully Symmetric Joint Training

To solve the "confusion" problem, the model is trained on triplets of images. If it's learning "Fear" and "Surprise," it treats the correct "Fear" image as a positive and the "Surprise" image as a hard negative. The training is "fully symmetric," meaning both emotions in a confusing pair are optimized simultaneously to ensure the latent space truly pushes them apart.
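A minimal sketch of this symmetric objective, assuming a standard triplet margin loss on cosine similarity (the paper's actual loss function and margin may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Penalize the anchor for being closer to the negative than the positive."""
    return max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)

def symmetric_pair_loss(emb_fear, pos_fear, emb_surprise, pos_surprise, margin=0.2):
    """Fully symmetric: each emotion in the confusing pair is an anchor,
    with the other emotion's embedding as its hard negative."""
    return (triplet_margin_loss(emb_fear, pos_fear, emb_surprise, margin)
            + triplet_margin_loss(emb_surprise, pos_surprise, emb_fear, margin))
```

Because both directions are optimized in the same step, the latent space cannot satisfy the objective by separating "Fear" from "Surprise" while leaving "Surprise" entangled with "Fear".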

[Figure: PixelSmile Architecture]

3. Textual Latent Interpolation

Instead of relying on image-to-image reference, PixelSmile manipulates the textual embedding space. By calculating a "residual direction" between a neutral prompt and a target expression, the model can navigate the intensity of an emotion using a simple coefficient $\alpha$.
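The interpolation itself is simply linear movement along the residual vector; a toy sketch, with plain lists standing in for the actual text embeddings:

```python
def interpolate_expression(neutral_emb, target_emb, alpha):
    """Walk alpha of the way from the neutral prompt embedding toward the
    target-expression embedding along the residual direction.
    alpha = 0 reproduces neutral, alpha = 1 gives the full expression,
    and alpha > 1 extrapolates to an exaggerated expression."""
    return [n + alpha * (t - n) for n, t in zip(neutral_emb, target_emb)]

# Halfway between a (toy) neutral embedding and a target-expression embedding.
half_smile = interpolate_expression([0.0, 0.0], [2.0, 4.0], 0.5)
# -> [1.0, 2.0]
```

The appeal of this formulation is that intensity control is a single scalar, so "20% more surprised" needs no retraining or reference image.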

Experimental Results: SOTA Disentanglement

In the newly established FFE-Bench, PixelSmile dominates both general-purpose editors (like GPT-Image) and specialized control models (like SliderEdit).

  • Structural Disentanglement: PixelSmile achieved an mSCR of 0.055. For context, most open-source models like FLUX or Qwen-Edit score above 0.20, showing significantly more confusion.
  • Identity Fidelity: While most models see a "crash" in identity similarity as expression intensity increases, PixelSmile holds identity similarity in a stable 0.6-0.7 range even when expressions are pushed to their extremes.

[Figure: Performance Comparison]

Critical Insight: Beyond Single Emotions

One of the most fascinating capabilities of PixelSmile is Expression Blending. Because the model learns a continuous and well-disentangled manifold, users can perform zero-shot "emotion mixing." By interpolating between "Happy" and "Surprised," the model generates a perceptually coherent "Ecstatic" expression—something that rigid categorical models fail to achieve.
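Under the same linear-manifold view, blending reduces to a weighted sum of residual directions. A hypothetical sketch (the direction vectors and weights are illustrative, not the paper's values):

```python
def blend_expressions(neutral_emb, dir_a, dir_b, weight_a, weight_b):
    """Zero-shot emotion mixing: add two expression residual directions
    to the neutral embedding, each scaled by its own intensity weight."""
    return [n + weight_a * a + weight_b * b
            for n, a, b in zip(neutral_emb, dir_a, dir_b)]

# "Ecstatic" as a toy mix of a Happy direction and a Surprised direction.
ecstatic = blend_expressions([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], 0.7, 0.5)
```

This only produces coherent faces because the directions were disentangled during training; in an entangled space, the same sum would drift into unrelated attributes.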

Conclusion

PixelSmile represents a shift in facial editing from "categorical swapping" to "manifold navigation." By combining metric learning (contrastive loss) with generative modeling (diffusion) and high-quality continuous data, it provides the most precise control over facial affect to date.

Future Outlook: While PixelSmile excels at 12 categories, scaling this to the thousands of micro-expressions found in human interaction remains the next frontier. The reliance on VLMs like Gemini 3 Pro for annotation also highlights the industry's shift toward "AI-supervising-AI" for dataset creation.
