Do Sparse Autoencoders Capture Concept Manifolds?

学术搜索

学术问答

价格

TrueCite

Do Sparse Autoencoders Capture Concept Manifolds?

Beyond Directions: Why Sparse Autoencoders Fragment the Manifolds of Thought

总结

问题

方法

结果

要点

摘要

This paper introduces a theoretical and empirical framework to evaluate whether Sparse Autoencoders (SAEs) capture nonlinearly curved global structures called "concept manifolds." By analyzing Llama 3.1-8B representations, the authors demonstrate that SAEs typically recover these structures through "dilution"—a fragmented tiling of the manifold across many localized features—rather than compact linear directions.

TL;DR

The "Linear Representation Hypothesis"—the idea that AI models store concepts as neat, straight lines—might be an oversimplification. This paper proves that LLMs actually represent concepts as curved, multi-dimensional manifolds. While Sparse Autoencoders (SAEs) can see these structures, they don't capture them as unified objects; instead, they "tile" them into a fragmented mosaic of local detectors, a phenomenon the authors call Dilution.

Background: The Limits of the Straight Line

For years, mechanistic interpretability has relied on the assumption that if you find the right direction in an LLM's activation space, you’ve found a "concept" (e.g., "Paris" or "Sentiment"). However, if you look at how a model represents the "Days of the Week," it doesn't look like seven separate sticks. It looks like a circle. Colors look like a paraboloid.

If our tools (SAEs) assume concepts are linear but the reality is curved, we get interventional brittleness: trying to steer a model by a single feature direction often fails because you're pushing the representation off the curved surface it lives on.

Methodology: The Three Regimes of SAE Capture

The authors propose that SAEs represent these curved manifolds in one of three ways:

Subspace Capture: A few atoms perfectly span the subspace containing the manifold.
Shattering: Features become hyper-specialized, with each atom covering a tiny, non-overlapping patch (like "Tuesday at 4 PM" instead of "Tuesday").
Dilution: Many redundant features overlap to cover the same geometry. This is where current SAEs (TopK, JumpReLU, etc.) actually live.

The Ising Pipeline for Discovery

Because individual features are fragmented, the authors used an Ising Model (borrowed from statistical physics) to find "feature communities." Instead of asking "do these features point in the same direction?", they ask "do these features fire together or inhibit each other predictably?"

Model Architecture and Manifold Geometry Figure 1: PCA projections of Llama 3.1-8B show that concepts like Color and Temperature are organized as smooth geometric manifolds, while steering interventions show these structures are causally functional.

Experiments and Results: The Dilution Reality

The authors tested five different SAE architectures on Llama 3.1-8B. They found that to reconstruct a manifold with an ambient dimension of, say, 2, an SAE often used dozens of features.

Tuning Curves: Features displayed "population coding" similar to biological neurons. A single "Year" feature might fire every 10 years (detecting the 'ones' digit), tiling the temporal manifold.
Unsupervised Success: The Ising-based discovery tool didn't just find known manifolds; it discovered a new one for "epistemic uncertainty" in scientific text—grouping features that detect measurement error and imprecision.

Experimental Evidence of Tiling Figure 2: Turning curves in SAEs. Note how features tile the "years" manifold with localized, periodic activation profiles resembling neural population codes.

Critical Analysis: Why This Matters for the Future

The current state of interpretability suffers from a "mismatch between featurizers and geometry." We are using 1D probes to analyze 3D (or ND) shapes.

Key Takeaways:

Steering: We should stop steering single features and start steering "feature groups" or "subspaces."
Stability: The "instability" of SAEs (learning different features across runs) is actually a feature, not a bug—there are many ways to tile the same manifold.
New Architectures: We need next-generation SAEs that reward the recovery of geometric objects, not just sparse directions.

Conclusion

This work marks a shift from a "dictionary of words" view of LLMs to a "geometry of thought." If we want to truly understand how models reason, we must start thinking in manifolds, not just vectors.

Discovery Pipeline Results Figure 3: Contrast between decoder similarity (which fails to group related concepts) and the Ising/Conditional Co-activation metrics which reveal clear manifold clusters.

发现相似论文

试试这些示例

Search for recent papers that propose non-linear extensions to the Linear Representation Hypothesis (LRH) in Large Language Models.
Which original research first established the use of Sparse Autoencoders for mechanistic interpretability, and how did they address the potential for multi-dimensional concept structures?
Find studies that apply Manifold Learning techniques, such as Isomap or Locally Linear Embedding (LLE), to analyze the internal activations of Transformer-based models.

目录

Beyond Directions: Why Sparse Autoencoders Fragment the Manifolds of Thought

1. TL;DR

2. Background: The Limits of the Straight Line

3. Methodology: The Three Regimes of SAE Capture

3.1. The Ising Pipeline for Discovery

4. Experiments and Results: The Dilution Reality

5. Critical Analysis: Why This Matters for the Future

6. Conclusion