GaussianGPT: Redefining 3D Scene Generation via Autoregressive Next-Token Prediction
Abstract

GaussianGPT is a novel autoregressive transformer framework for 3D scene generation that treats 3D space as a sequence of discrete tokens. By combining a sparse 3D convolutional autoencoder with a causal GPT-style transformer, it achieves state-of-the-art results in unconditional shape synthesis and enables flexible indoor scene completion and infinite outpainting.

TL;DR

GaussianGPT shifts the 3D generation paradigm from "global denoising" (Diffusion) to "sequential construction" (Autoregressive). By tokenizing 3D Gaussian primitives and feeding them into a GPT-style transformer, the model can "write" 3D scenes step-by-step, enabling seamless scene completion, infinite outpainting, and state-of-the-art object synthesis.

Background: The Limits of Holistic Denoising

In the current generative landscape, Diffusion Models (DDPM) and Flow Matching rule the 3D domain. While they produce high-fidelity results, they treat a 3D scene as a single static block to be refined. However, real-world environments are built incrementally—adding a chair, extending a hallway, or filling a corner. GaussianGPT argues that 3D generation should mirror this compositional nature, treating 3D space as a structured sequence.

Methodology: Tokenizing the 3D World

The core challenge of autoregressive 3D generation is converting unstructured 3D Gaussians into a format a Transformer can understand. The authors propose a three-stage pipeline:

  1. Scene Compression: A sparse 3D convolutional autoencoder maps Gaussian primitives (position, opacity, rotation, color) into a discrete latent grid.
  2. Lookup-Free Quantization (LFQ): Instead of the learned codebook of a standard VQ-VAE, LFQ binarizes each latent channel and treats the resulting bit vector as an implicit code index. This markedly improves codebook utilization, so every "token" in the vocabulary contributes to representing geometric detail.
  3. 3D-Aware Transformer: The serialized tokens are processed by a causal, GPT-style transformer. To keep the model from getting "lost" in the 1D serialization of 3D space, the authors use 3D Rotary Positional Embeddings (RoPE), which let the attention mechanism measure relationships by actual (x, y, z) proximity rather than by sequence distance.
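The LFQ step above can be sketched in a few lines. This is a toy illustration of the general lookup-free idea (sign-binarize each latent channel, read the bits as a code index), not the paper's actual tokenizer; the array shapes and values are made up for demonstration.

```python
import numpy as np

def lfq_quantize(z):
    """Toy Lookup-Free Quantization: each latent channel is binarized by
    its sign, so a D-dim latent maps to one of 2^D implicit codes with no
    learned codebook lookup."""
    bits = (z > 0).astype(np.int64)            # (..., D) binary code
    weights = 2 ** np.arange(z.shape[-1])      # read bits as a base-2 integer
    idx = (bits * weights).sum(axis=-1)        # discrete token id per cell
    q = np.where(z > 0, 1.0, -1.0)             # quantized latent (+/-1)
    return q, idx

# Hypothetical latent grid: 4 spatial cells, 3 channels -> vocab of 2^3 = 8.
z = np.array([[ 0.3, -1.2,  0.7],
              [-0.5,  0.9, -0.1],
              [ 1.1,  1.4,  2.0],
              [-2.0, -0.3, -0.8]])
q, idx = lfq_quantize(z)
print(idx.tolist())  # → [5, 2, 7, 0]
```

Because the "codebook" is just the set of sign patterns, every index is reachable by construction, which is the utilization advantage LFQ claims over a learned VQ codebook.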

Figure 1: The GaussianGPT pipeline, transforming raw Gaussians into discrete tokens for autoregressive modeling.

Experiments: From Objects to Infinite Scenes

The model was evaluated on both the PhotoShape (objects) and 3D-FRONT (indoor scenes) datasets.

  • Unconditional Shape Synthesis: GaussianGPT achieved a significant lead in visual fidelity metrics (FID/KID) compared to diffusion-based baselines like L3DG.
  • Infinite Outpainting: Because it is autoregressive, the model can generate a scene, then use the end of that scene as a "prompt" to generate the next section. This results in coherent, large-scale environments that aren't limited by a fixed bounding box.
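The outpainting loop described above reduces to a sliding-window prompt. The sketch below is a schematic of that control flow only; `generate_chunk` stands in for the transformer's sampler, and the overlap size and token values are invented for illustration.

```python
def outpaint(generate_chunk, seed_tokens, n_chunks, overlap):
    """Autoregressive outpainting sketch: the tail of each generated
    chunk becomes the prompt for the next one, so the scene grows
    without a fixed bounding box."""
    scene = list(seed_tokens)
    for _ in range(n_chunks):
        prompt = scene[-overlap:]         # condition on the scene boundary
        scene += generate_chunk(prompt)   # append the newly "written" region
    return scene

# Dummy stand-in for the sampler: shifts each prompt token by one.
toy_sampler = lambda prompt: [t + 1 for t in prompt]
print(outpaint(toy_sampler, [0, 1, 2, 3], n_chunks=2, overlap=2))
# → [0, 1, 2, 3, 3, 4, 4, 5]
```

The key property is that nothing in the loop depends on total scene size: coherence across chunk boundaries comes entirely from conditioning on the overlap region.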

Table 1: GaussianGPT outperforms existing methods in FID and KID metrics while maintaining high diversity (COV).

Deep Insight: Why xyz Ordering Works

A fascinating finding in the paper is that a simple xyz traversal (column-wise serialization) performs just as well as complex space-filling curves like Hilbert or Z-order. The authors attribute this to the 3D RoPE: once the transformer knows the true 3D coordinates of each token, the specific serialization order of the "alphabet" becomes less critical, because the positional "grammar" of 3D space is handled by the attention mechanism itself.
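A toy example makes the point concrete: under plain xyz serialization, 1D sequence distance badly misrepresents 3D proximity, which is exactly the gap 3D RoPE closes by attaching (x, y, z) to each token. The grid size below is arbitrary.

```python
import numpy as np

def xyz_serialize(n):
    """Column-wise xyz traversal of an n^3 voxel grid (x varies fastest)."""
    return [(x, y, z) for z in range(n) for y in range(n) for x in range(n)]

coords = xyz_serialize(8)
# Two tokens that are 8 positions apart in the 1D sequence...
p, q = np.array(coords[0]), np.array(coords[8])
eucl = float(np.linalg.norm(p - q))
print(coords[0], coords[8], eucl)  # sequence says 8 apart; 3D distance is 1.0
```

With sequence-position embeddings alone, the model would have to learn these wrap-around adjacencies from data; with coordinates in the positional encoding, they are given for free, so the choice of traversal order matters far less.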

Figure 2: The model naturally handles sparse input (left) to generate plausible, diverse completed scenes (right).

Conclusion & Future Outlook

GaussianGPT proves that the "LLM for 3D" approach is not just a curiosity—it is a competitive paradigm. By treating 3D primitives like words in a sentence, we unlock a level of interactive control (completion, editing, and expansion) that is difficult to achieve with holistic diffusion models.

Limitations: The model currently struggles with extremely high-frequency details in real-world scans (such as ScanNet++), as the compression bottleneck of the autoencoder can introduce noise. Future iteration on the 3D tokenizer will be key to moving from synthetic scenes to real-world ("sim-to-real") generation.
