GaussianGPT is a novel autoregressive transformer framework for 3D scene generation that treats 3D space as a sequence of discrete tokens. By combining a sparse 3D convolutional autoencoder with a causal GPT-style transformer, it achieves state-of-the-art results in unconditional shape synthesis and enables flexible indoor scene completion and infinite outpainting.
TL;DR
GaussianGPT shifts the 3D generation paradigm from "global denoising" (Diffusion) to "sequential construction" (Autoregressive). By tokenizing 3D Gaussian primitives and feeding them into a GPT-style transformer, the model can "write" 3D scenes step-by-step, enabling seamless scene completion, infinite outpainting, and state-of-the-art object synthesis.
Background: The Limits of Holistic Denoising
In the current generative landscape, Diffusion Models (DDPM) and Flow Matching rule the 3D domain. While they produce high-fidelity results, they treat a 3D scene as a single static block to be refined. However, real-world environments are built incrementally—adding a chair, extending a hallway, or filling a corner. GaussianGPT argues that 3D generation should mirror this compositional nature, treating 3D space as a structured sequence.
Methodology: Tokenizing the 3D World
The core challenge of autoregressive 3D generation is converting unstructured 3D Gaussians into a format a Transformer can understand. The authors propose a three-stage pipeline:
- Scene Compression: A sparse 3D convolutional autoencoder maps Gaussian primitives (position, opacity, rotation, color) into a discrete latent grid.
- Lookup-Free Quantization (LFQ): Unlike standard VQ-VAE, LFQ improves codebook utilization, ensuring every "token" in the vocabulary is used effectively to represent geometric details.
- 3D-Aware Transformer: The serialized tokens are processed by a causal transformer. To prevent the model from getting "lost" in the 1D serialization of 3D space, researchers implemented 3D Rotary Positional Embeddings (RoPE). This allows the attention mechanism to calculate relationships based on actual (x, y, z) proximity rather than sequence distance.
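The LFQ step in the pipeline above can be sketched in a few lines. This is an illustrative assumption of how lookup-free quantization typically works, not the paper's implementation: instead of a nearest-neighbor search over a learned codebook (as in VQ-VAE), each latent dimension is independently binarized, so the implicit codebook has 2^d entries and every entry is reachable by construction.

```python
import numpy as np

def lfq_quantize(z: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Binarize each latent dim to {-1, +1} and derive the token index.

    Sketch of Lookup-Free Quantization: the sign pattern of the latent
    vector directly *is* the code; no codebook lookup is performed.
    """
    codes = np.where(z >= 0, 1.0, -1.0)      # (..., d) values in {-1, +1}
    bits = (codes > 0).astype(np.int64)      # (..., d) values in {0, 1}
    powers = 2 ** np.arange(z.shape[-1])     # weight of each bit (LSB first)
    index = (bits * powers).sum(axis=-1)     # integer token id per latent
    return codes, index

# One 4-dim latent: signs (+,-,+,-) give bits (1,0,1,0) -> id 1 + 4 = 5
codes, idx = lfq_quantize(np.array([0.7, -1.2, 0.1, -0.3]))
```

Because every sign pattern is a valid token, codebook collapse (dead entries that are never selected) cannot occur, which is the utilization benefit the authors highlight.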
Figure 1: The GaussianGPT pipeline—transforming raw Gaussians into discrete tokens for autoregressive modeling.
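The 3D RoPE idea can also be made concrete. In this minimal sketch (function names and the frequency schedule are assumptions, not the paper's code), the feature vector is split into three equal chunks and a standard 1D rotary rotation is applied per chunk using the token's x, y, and z grid coordinates, so attention scores depend on relative 3D offsets rather than 1D sequence distance.

```python
import numpy as np

def rope_1d(vec: np.ndarray, pos: float) -> np.ndarray:
    """Rotate consecutive pairs of `vec` by angles pos * theta_i (standard RoPE)."""
    half = vec.shape[-1] // 2
    theta = 10000.0 ** (-np.arange(half) / half)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(vec: np.ndarray, xyz: tuple[float, float, float]) -> np.ndarray:
    """Apply RoPE per axis: one third of the dims encodes each coordinate."""
    d = vec.shape[-1] // 3
    return np.concatenate(
        [rope_1d(vec[i * d:(i + 1) * d], p) for i, p in enumerate(xyz)]
    )
```

The defining property carries over from 1D RoPE: the dot product between a rotated query at voxel (1, 2, 3) and a rotated key at (4, 5, 6) equals the dot product for the pair (0, 0, 0) and (3, 3, 3), since only the coordinate offset matters.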
Experiments: From Objects to Infinite Scenes
The model was evaluated on both the PhotoShape (objects) and 3D-FRONT (indoor scenes) datasets.
- Unconditional Shape Synthesis: GaussianGPT achieved a significant lead in visual fidelity metrics (FID/KID, where lower is better) compared to diffusion-based baselines like L3DG.
- Infinite Outpainting: Because it is autoregressive, the model can generate a scene, then use the end of that scene as a "prompt" to generate the next section. This results in coherent, large-scale environments that aren't limited by a fixed bounding box.
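The outpainting mechanic described above is just a sliding-window sampling loop. The sketch below is illustrative only: `sample_next_token` stands in for the trained transformer (here it is a dummy callable), and the window/chunk parameters are hypothetical; the point is how the tail of the generated sequence becomes the prompt for the next chunk.

```python
from typing import Callable

def outpaint(
    sample_next_token: Callable[[list[int]], int],
    prompt: list[int],
    chunk_len: int,
    n_chunks: int,
    window: int,
) -> list[int]:
    """Grow a token sequence chunk by chunk, conditioning on the recent tail."""
    scene = list(prompt)
    for _ in range(n_chunks):
        for _ in range(chunk_len):
            context = scene[-window:]          # only the tail fits in context
            scene.append(sample_next_token(context))
    return scene

# Dummy "model" that just increments the last token, for demonstration.
tokens = outpaint(lambda ctx: (ctx[-1] + 1) % 7, [0, 1, 2],
                  chunk_len=4, n_chunks=3, window=8)
```

Because each chunk only conditions on the last `window` tokens, the scene can grow without bound while memory stays constant, which is what removes the fixed bounding box.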
Table 1: GaussianGPT outperforms existing methods in FID and KID metrics while maintaining high diversity (COV).
Deep Insight: Why xyz Ordering Works
A fascinating finding in the paper is that a simple xyz traversal (column-wise serialization) performs just as well as complex space-filling curves like Hilbert or Z-order. The authors attribute this to the 3D RoPE: when the transformer knows the true 3D coordinates of every token, the specific serialization order becomes less critical, because the "grammar" of 3D space is recovered by the position-aware attention rather than by the sequence itself.
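For concreteness, the xyz traversal amounts to sorting the occupied voxels lexicographically. This is a sketch of the assumed ordering, not the paper's code: coordinates are sorted by x first, then y, then z, producing the 1D token order fed to the transformer.

```python
import numpy as np

def serialize_xyz(coords: np.ndarray) -> np.ndarray:
    """Sort (N, 3) integer voxel coords in x-major, then y, then z order.

    np.lexsort treats its *last* key as the primary sort key, so the keys
    are passed as (z, y, x) to get x-major ordering.
    """
    order = np.lexsort((coords[:, 2], coords[:, 1], coords[:, 0]))
    return coords[order]

voxels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]])
ordered = serialize_xyz(voxels)  # starts at the origin, sweeps along z first
```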
Figure 2: The model naturally handles sparse input (left) to generate plausible, diverse completed scenes (right).
Conclusion & Future Outlook
GaussianGPT proves that the "LLM for 3D" approach is not just a curiosity—it is a competitive paradigm. By treating 3D primitives like words in a sentence, we unlock a level of interactive control (completion, editing, and expansion) that is difficult to achieve with holistic diffusion models.
Limitations: The model currently struggles with extremely high-frequency details in real-world scans (like ScanNet++), as the compression bottleneck of the autoencoder can introduce noise. Future iteration on the 3D tokenizer will be the key to moving from synthetic generation toward real-world ("sim-to-real") scenes.
