WisPaper
[Technion 2025] The Universal Normal Embedding: Unifying Generation and Encoding through Gaussian Geometry
Abstract

The paper proposes the Universal Normal Embedding (UNE) hypothesis, asserting that vision encoders and generative models share a common, approximately Gaussian latent geometry. By introducing the NoiseZoo dataset, the authors demonstrate that DDIM-inverted diffusion noise contains linearly separable semantic information comparable to SOTA encoders like CLIP and DINO.

TL;DR

Is "noise" truly random, or is it a structured map of human semantics? This paper introduces the Universal Normal Embedding (UNE) hypothesis, arguing that the inverted noise of diffusion models (like Stable Diffusion) and the embeddings of vision encoders (like CLIP) are simply different "views" of the same Gaussian latent space. By treating noise as a linear projection of this shared space, the authors enable precise, training-free semantic image editing and demonstrate that generative noise is nearly as "smart" as specialized semantic encoders.

The Hidden Geometry of Deep Learning

For years, the AI community has treated Generative Models and Vision Encoders as two separate species. One is a master of creation (mapping noise to pixels), the other a master of understanding (mapping pixels to semantic vectors).

However, the authors observe a striking convergence:

  1. Generative Models start with Gaussian noise.
  2. Encoders produce embeddings that empirically follow a Gaussian distribution.

This is not a coincidence. The UNE Hypothesis suggests there exists an underlying, ideal Gaussian space where every semantic attribute (e.g., "Smiling", "Young", "Wearing Glasses") is represented as a simple linear direction.

Methodology: Noise as a Semantic Map

The core of the paper lies in the Induced Normal Embedding (INE). If the UNE is the "ideal" map, then each model (CLIP, DINO, SD 1.5) provides a model-specific, noisy linear projection of that map.

1. The NoiseZoo Dataset

To test this, the authors created NoiseZoo. They took the CelebA dataset and, for every image, extracted:

  • Inverted Noise: Using DDIM inversion to find the specific starting noise that generates that exact image.
  • Encoder Embeddings: Standard vectors from CLIP and DINO.
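The DDIM inversion step underlying the first bullet can be sketched in a few lines. Below is a minimal numpy rendition of the deterministic update, where `eps_model` is a hypothetical placeholder for the diffusion model's noise predictor (the real pipeline uses Stable Diffusion's U-Net):

```python
import numpy as np

def ddim_invert(x0, eps_model, alphas_cumprod):
    """Deterministic DDIM run in reverse: walk a clean latent x0 back
    toward the Gaussian noise that would regenerate it.
    alphas_cumprod must be ordered from ~1 (clean) down to ~0 (noise)."""
    x = x0
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)  # model's noise prediction at step t
        # Estimate the clean sample implied by the current state...
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # ...then step deterministically toward the noisier timestep.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x  # approximation of the "inverted noise" z_T
```

Because the update is deterministic, running the same schedule forward from the returned latent approximately reproduces the original image.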

2. Probing and Editing

If the hypothesis holds, we should be able to "find" a person's smile in their starting noise vector. By training a simple linear classifier (Logistic Regression) on the noise vectors, the authors found that semantic attributes are linearly separable even in pure diffusion noise.
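This probing setup can be sketched as follows, using synthetic stand-in data rather than the real NoiseZoo vectors (the hidden attribute direction `w_true` and the shift strength are illustrative assumptions), with a minimal logistic-regression probe written in plain numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for NoiseZoo noise vectors: positive samples are
# shifted along a hidden attribute direction, mimicking "Smiling" vs not.
n, d = 2000, 64
w_true = rng.standard_normal(d)
w_true /= np.linalg.norm(w_true)
y = rng.integers(0, 2, size=n).astype(float)
z = rng.standard_normal((n, d)) + 1.5 * y[:, None] * w_true

# Minimal logistic-regression probe trained by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(z @ w + b)))   # predicted probabilities
    w -= 0.5 * (z.T @ (p - y) / n)           # gradient step on weights
    b -= 0.5 * np.mean(p - y)                # gradient step on bias

acc = np.mean(((z @ w + b) > 0) == (y > 0.5))
print(f"linear-probe accuracy: {acc:.2f}")
```

If the attribute is linearly encoded, the probe's accuracy lands well above the 50% chance level, which is the paper's operational test for linear separability.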

Figure 1: Different models like CLIP and Stable Diffusion provide different views of the same underlying Gaussian structure.

Experimental Breakthroughs: Editing Without Training

One of the most impressive results is Training-Free Semantic Editing. Usually, changing a person's age in a generated image requires complex prompt engineering or ControlNet. Under the UNE framework, you simply:

  1. Find the "Age" direction $w$ in the noise space.
  2. Shift the noise: $z_{new} = z_{old} + \alpha w$.
  3. Run the diffusion process.
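The three steps above reduce to a one-line shift in latent space. A minimal sketch, where `w_age` is a hypothetical "Age" direction (in practice obtained from a linear probe's weight vector) and the latent dimension is illustrative:

```python
import numpy as np

def edit_noise(z, w, alpha):
    """Shift a starting-noise latent along a semantic direction:
    z_new = z + alpha * w  (the linear-edit rule, with w normalized)."""
    w_unit = w / np.linalg.norm(w)
    return z + alpha * w_unit

rng = np.random.default_rng(1)
z = rng.standard_normal(512)       # stand-in for an inverted SD latent
w_age = rng.standard_normal(512)   # hypothetical "Age" direction
z_older = edit_noise(z, w_age, alpha=2.0)
# z_older would then be fed through the normal diffusion sampling loop.
```

The scalar `alpha` controls edit strength; negative values move the attribute in the opposite direction.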

Disentanglement through Orthogonality

Often, editing "Goatee" might accidentally change "Face Shape" because the features are correlated. The authors solve this using linear algebra: by projecting the "Goatee" vector into the null space of the "Face Shape" vector, they achieve clean, isolated edits.
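This projection is a standard Gram-Schmidt step. A minimal sketch, with the attribute names purely illustrative:

```python
import numpy as np

def orthogonalize(w_edit, w_protect):
    """Project w_edit onto the orthogonal complement (null space) of
    w_protect, so the edit leaves the protected attribute untouched."""
    u = w_protect / np.linalg.norm(w_protect)
    return w_edit - (w_edit @ u) * u

rng = np.random.default_rng(2)
w_goatee = rng.standard_normal(128)      # direction to edit
w_face_shape = rng.standard_normal(128)  # direction to protect
w_clean = orthogonalize(w_goatee, w_face_shape)
print(abs(w_clean @ w_face_shape))  # ~0: no leakage into face shape
```

After projection, moving along `w_clean` changes the goatee attribute while, to first order, leaving face shape fixed.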

Figure 2: Moving along linear directions in the noise space allows for smooth, local, and controllable attribute changes.

Evidence: The Gaussianity Test

The authors performed rigorous statistical tests (Anderson-Darling, D’Agostino-Pearson) on the latent dimensions of various models.

  • Generative Models: ~95% of latent dimensions are Gaussian.
  • Encoders: ~90% are Gaussian.
  • Alignment: The accuracy of linear probes on noise is highly correlated with the accuracy of probes on CLIP embeddings (as seen in Figure 3 of the paper).
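A per-dimension normality check of this kind can be sketched with SciPy's D'Agostino-Pearson test (`scipy.stats.normaltest`). Here synthetic Gaussian latents stand in for real model latents, so the pass rate simply illustrates the mechanics:

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(3)
# Stand-in latents: 1000 samples x 512 dimensions, drawn Gaussian here;
# in the paper these come from actual model latents and embeddings.
latents = rng.standard_normal((1000, 512))

# D'Agostino-Pearson test per dimension; a dimension "passes" as
# Gaussian if the test fails to reject normality at p > 0.05.
_, pvals = normaltest(latents, axis=0)
frac_gaussian = np.mean(pvals > 0.05)
print(f"fraction of Gaussian dimensions: {frac_gaussian:.2f}")
```

On truly Gaussian data roughly 95% of dimensions pass at this threshold, which gives context for the ~95% and ~90% figures reported above.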

Figure 3: Semantic information is linearly accessible across diverse models, with DDIM noise (bottom) matching the performance of semantic encoders (top).

Critical Insight & Conclusion

The Universal Normal Embedding offers a bridge between the Platonic Representation Hypothesis (that all models learn the same "reality") and the practical mechanics of diffusion. It suggests that the "randomness" we use to seed our generators is actually a highly organized semantic manifold.

Takeaway for the Industry:

  • Interoperability: If latents are just linear transforms of each other, we can "stitch" a CLIP encoder to a Stable Diffusion generator with a simple matrix multiplication.
  • Efficiency: We can perform semantic search and editing directly in the compressed noise space, bypassing heavy inference or fine-tuning.
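Under that interoperability claim, stitching reduces to fitting a single least-squares matrix. A minimal sketch on synthetic data, where both "CLIP" and "noise" latents are generated as linear views of one shared Gaussian space (all dimensions and the generative setup are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# UNE toy model: z_clip = u @ A.T and z_noise = u @ B.T are two linear
# views of the same shared Gaussian latent u.
n, d_u, d_clip, d_noise = 5000, 32, 64, 64
u = rng.standard_normal((n, d_u))          # shared latent samples
A = rng.standard_normal((d_clip, d_u))
B = rng.standard_normal((d_noise, d_u))
z_clip, z_noise = u @ A.T, u @ B.T

# One least-squares fit "stitches" the spaces: z_noise ≈ z_clip @ W.
W, *_ = np.linalg.lstsq(z_clip, z_noise, rcond=None)
err = np.linalg.norm(z_clip @ W - z_noise) / np.linalg.norm(z_noise)
print(f"relative stitching error: {err:.3f}")
```

When the linear-view assumption holds exactly, as constructed here, the residual is essentially zero; on real models the residual measures how far the pair deviates from the UNE picture.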

While the paper focuses on images, the UNE hypothesis potentially extends to Audio and Video, promising a "Universal Language" of Gaussian latents across all of AI.
