LoST: Level of Semantics Tokenization for 3D Shapes


[CVPR 2025] LoST: Transforming 3D Generation by Prioritizing Semantics over Geometry

Abstract

LoST (Level of Semantics Tokenization) is a novel 3D shape tokenization method that orders tokens by semantic salience rather than geometric detail. It allows Autoregressive (AR) models to decode complete, plausible shapes from just a few tokens, setting a new SOTA in 3D generation.

TL;DR

LoST (Level of Semantics Tokenization) changes how 3D shapes are represented for generative models. Instead of breaking a shape down into smaller and smaller geometric boxes (Level-of-Detail), it breaks it down by "meaning" (Level-of-Semantics). This allows a model to capture a "chair" from its first token, using subsequent tokens only to refine the specific style and finer details. The result? SOTA 3D generation using 90-99% fewer tokens than prior AR models.

Problem: The "Token Bloat" of Geometric Hierarchies

Traditional 3D AR models order their tokens by geometry. They use octrees or meshes to define space, so the early tokens describe coarse spatial structure rather than what the object is. If you stop the generation early, you get a "cloud" or a "scaffold" that looks nothing like the final object. This is geometrically logical but semantically meaningless.

For an Autoregressive (AR) model, this is a nightmare:

High Perplexity: The model spends thousands of tokens defining "empty space" or "coarse blocks" before getting to the actual object.
Inflexible Decoding: You can't stop early to save compute; the intermediate results are useless.

Methodology: Semantics First, Geometry Later

LoST introduces a Level-of-Semantics (LoS) hierarchy. The goal is simple: the first token should represent the most "salient" part of the shape (e.g., the essence of a car), and the 512th token should represent a tiny scratch on the fender.
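
One way to picture a semantics-ordered sequence is as a prefix code: any leading slice of the tokens should still decode to a plausible shape. The sketch below is illustrative only. It assumes a nested-dropout-style training trick, in which a random prefix length is drawn and the remaining tokens are masked. This summary does not say how LoST actually enforces the ordering, and `keep_prefix`, `sample_prefix_length`, and the power-of-two schedule are hypothetical names and choices.

```python
import random


def keep_prefix(tokens, k, mask_token=None):
    """Keep the first k tokens and replace the rest with a mask placeholder."""
    if not 0 <= k <= len(tokens):
        raise ValueError("k must be between 0 and len(tokens)")
    return tokens[:k] + [mask_token] * (len(tokens) - k)


def sample_prefix_length(max_len, rng):
    """Draw a prefix length from the powers of two up to max_len.

    The power-of-two schedule is an assumption made for this sketch.
    """
    choices = []
    k = 1
    while k <= max_len:
        choices.append(k)
        k *= 2
    return rng.choice(choices)


rng = random.Random(0)
tokens = list(range(512))  # stand-in for a 512-token shape sequence
k = sample_prefix_length(len(tokens), rng)
masked = keep_prefix(tokens, k)  # what a decoder would see during training
```

Training a decoder on such truncated sequences pushes the most broadly useful information into the earliest positions, because those positions are the only ones that are always visible.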

The RIDA Innovation

To train this, the authors needed a way to tell the model what "semantics" are. While we have DINO for images, we don't have a "DINO for 3D." The authors proposed Relational Inter-Distance Alignment (RIDA).

Instead of forcing 3D features to match 2D image features directly (which fails due to dimensionality mismatch), RIDA aligns the relationships between objects. If a "Sports Car" is semantically close to a "Sedan" but far from a "Tree" in the image domain (DINO), RIDA ensures the 3D features keep the same relative distances in the 3D latent space.
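
The relational idea can be sketched as a loss over pairwise distance matrices. This is a minimal illustration, not the authors' formulation: the use of cosine distance, the mean-squared penalty, and the names `cosine_distance_matrix` and `relational_alignment_loss` are all assumptions, and the toy feature vectors are made up. Note that the two feature sets may have different dimensionality, since only their distance matrices are compared.

```python
import math


def cosine_distance_matrix(features):
    """Pairwise cosine distances between row vectors (a list of lists)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in features]
    n = len(features)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            dist[i][j] = 1.0 - dot / (norms[i] * norms[j])
    return dist


def relational_alignment_loss(feats_3d, feats_2d):
    """Mean squared gap between the two pairwise-distance matrices."""
    d3 = cosine_distance_matrix(feats_3d)
    d2 = cosine_distance_matrix(feats_2d)
    n = len(d3)
    total = sum((d3[i][j] - d2[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


# Toy batch in the order: sports car, sedan, tree (made-up numbers).
image_feats = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]  # 3-D "image" features
aligned_3d = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0]]    # car near sedan, tree far
scrambled_3d = [[1.0, 0.0], [0.0, 1.0], [0.95, 0.1]]  # car near tree instead

loss_aligned = relational_alignment_loss(aligned_3d, image_feats)
loss_scrambled = relational_alignment_loss(scrambled_3d, image_feats)
```

Here `loss_aligned` comes out much smaller than `loss_scrambled`, because only the first set of 3D features preserves the "car is near sedan, far from tree" structure of the image features.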

Figure (model architecture): Overview of LoST. Left: mapping 3D latents to semantic sequences. Right: RIDA pre-training for semantic extractors.

Experiments: Doing More with Less

The results are staggering. In 3D reconstruction benchmarks, LoST with just 4 tokens often outperforms baseline models (like OctGPT) using thousands of tokens.

For autoregressive generation, the LoST-GPT model uses only 128 tokens. Compared to Llama-Mesh or ShapeLLM-Omni, it achieves significantly better (lower) FID scores, a measure of generation quality, while being orders of magnitude faster and more efficient.
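
To put the 128-token figure next to the "90-99% fewer tokens" claim, the short calculation below shows what baseline sequence lengths those percentages would imply. The baseline lengths are hypothetical values chosen to match the two ends of the range, not measurements from the paper, and `token_reduction` is a helper named here for illustration.

```python
def token_reduction(compact_tokens, baseline_tokens):
    """Fraction of tokens saved relative to a baseline sequence length."""
    if baseline_tokens <= 0 or compact_tokens < 0:
        raise ValueError("baseline must be positive and compact non-negative")
    return 1.0 - compact_tokens / baseline_tokens


lost_tokens = 128  # sequence length quoted for LoST-GPT
for baseline in (1_280, 12_800):  # hypothetical baselines, not measured values
    saved = token_reduction(lost_tokens, baseline)
    print(f"baseline {baseline:>6} tokens -> {saved:.0%} fewer")
```

In other words, a 90% saving corresponds to a baseline ten times longer than 128 tokens, and a 99% saving to one a hundred times longer.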

Figure (SOTA comparison): LoST vs. LoD baselines. Notice how the 1st token of LoST already looks like a recognizable shape.

Critical Analysis & Conclusion

LoST proves that for 3D generative AI, how we tokenize is as important as the model architecture itself. By aligning 3D latents with proven 2D semantic foundations (DINO), the authors bypassed the need for massive 3D-native semantic datasets.

Limitations:

The method currently relies on a Diffusion decoder to turn tokens back into 3D shapes, which adds a computational step.
While it handles geometry brilliantly, expanding this to high-fidelity "Gaussian Splatting" or complex textures is the next frontier.

The Takeaway: If you want to build the "LLM for 3D," stop thinking in voxels and start thinking in semantics.
