The paper presents GR4AD (Generative Recommendation for ADvertising), a production-oriented generative recommendation system deployed at Kuaishou. It introduces a co-designed architecture featuring Unified Advertisement Semantic IDs (UA-SID) and a Lazy Autoregressive (LazyAR) decoder, achieving a 4.2% ad revenue increase while serving 400 million users in real time within a 100ms latency budget.
TL;DR
Kuaishou has successfully transitioned its massive advertising stack from traditional deep learning recommendation models (DLRMs) to a generative paradigm. GR4AD introduces a "recommendation-native" generative design that moves past simple "next-token prediction." By leveraging UA-SID for multi-modal ad representation, a LazyAR decoder for high-throughput inference, and RSPO for value-aware reinforcement learning, the system achieved a 4.2% revenue increase and a 10.17% boost in conversion rate (CVR) while maintaining sub-100ms latency for 400 million users.
Problem & Motivation: The "LLM Gap" in Advertising
While Generative Recommendation (GenRec) is the new frontier, the industry has struggled to deploy it in real-time advertising. Standard LLM recipes fail here for three reasons:
- Semantic Collision: Ads are more than just "text." Identical videos might target different conversion goals (e.g., "purchase" vs. "app install"), creating collisions in standard semantic ID spaces.
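- Point-wise vs. List-wise: LLMs are trained to predict the next token, but ad platforms need to optimize the value of a whole ranked list, measured by eCPM (effective cost per mille, i.e. expected revenue per thousand impressions).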
- The Inference Tax: Autoregressive decoding is notoriously slow. Serving hundreds of candidates per request under a 100ms budget is a nightmare for standard Transformer decoders.
Methodology: The Core Innovations
1. UA-SID: Beyond Basic Embeddings
To solve the tokenization problem, GR4AD uses Unified Advertisement Semantic IDs. The authors fine-tune a Multi-modal LLM (MLLM) on ad-specific instructions and use Co-occurrence Learning to inject collaborative filtering signals into the IDs.
To reduce collisions, they apply Multi-Granularity-Multi-Resolution (MGMR) RQ-Kmeans, a residual-quantization scheme built on k-means codebooks. Earlier levels of the ID capture high-level semantics, while final levels use hash-based numeric mapping for business-specific signals.
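To make the "coarse levels first, hashed level last" idea concrete, here is a minimal sketch of residual k-means quantization followed by a hashed final token. It is an illustration under stated assumptions, not the paper's MGMR procedure: the function names, the codebook sizes, the single hashed level, and the use of CRC32 as the hash are all hypothetical choices made for the example.

```python
import zlib
import numpy as np

def nearest(x, centroids):
    """Index of the closest centroid (squared Euclidean) for each row of x."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centroids never touches x.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        assign = nearest(x, centroids)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(x, centroids)

def residual_kmeans_ids(embeddings, codebook_sizes, business_keys, hash_buckets):
    """Hypothetical ID builder: one k-means code per level on the running
    residual, then one hashed token derived from a business key string."""
    residual = np.asarray(embeddings, dtype=np.float64).copy()
    levels = []
    for level, k in enumerate(codebook_sizes):
        centroids, assign = kmeans(residual, k, seed=level)
        levels.append(assign)
        # The next level only sees what this level failed to explain.
        residual = residual - centroids[assign]
    # Assumption: the business signal is a string such as "purchase";
    # CRC32 is used because it is deterministic across processes.
    hashed = np.array(
        [zlib.crc32(key.encode("utf-8")) % hash_buckets for key in business_keys]
    )
    levels.append(hashed)
    return np.stack(levels, axis=1)  # shape: (num_ads, num_levels + 1)
```

In this sketch, two ads with identical content embeddings always share the same semantic prefix, and only the hashed final token can separate them by conversion goal, which is the kind of collision the section describes.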

2. LazyAR: Engineering the Latency Breakthrough
The "Aha!" moment in this paper is the LazyAR (Lazy Autoregressive) Decoder. The authors noticed that the first token of a Semantic ID carries the most weight, while later tokens become cheaper to compute once the strict token-by-token dependency is relaxed.
Instead of running full autoregression through every layer, LazyAR computes the first layers in parallel (shared across all beams) and runs only the final layers autoregressively. By the authors' account, this roughly doubles inference throughput (QPS) without hurting recommendation quality.
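A back-of-envelope cost model shows why sharing the early layers across beams matters. The sketch below only counts layer evaluations, which is a crude proxy for latency, and every number in it (8 layers, 4 shared, 3 ID tokens, 100 beams) is a made-up illustration rather than a configuration from the paper.

```python
def layer_evals_full_ar(num_layers, num_steps, beam_width):
    """Every layer runs for every beam at every decoding step."""
    return num_layers * num_steps * beam_width

def layer_evals_lazy(num_layers, num_shared, num_steps, beam_width):
    """The first `num_shared` layers run once per position, shared by all
    beams; only the remaining layers run separately for each beam."""
    shared = num_shared * num_steps
    per_beam = (num_layers - num_shared) * num_steps * beam_width
    return shared + per_beam

# Hypothetical configuration, chosen only for illustration.
full = layer_evals_full_ar(num_layers=8, num_steps=3, beam_width=100)
lazy = layer_evals_lazy(num_layers=8, num_shared=4, num_steps=3, beam_width=100)
print(full, lazy)  # 2400 1212
```

With half of the layers shared, the per-beam work is roughly halved, and the shared part becomes negligible as the beam width grows. Real speedups depend on memory traffic, batching, and kernel efficiency, none of which this count captures.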

3. RSPO: Aligning with Business Value
Standard cross-entropy alone isn't enough. GR4AD introduces RSPO (Ranking-Guided Softmax Preference Optimization), an RL-based approach that treats the candidate list as a whole and optimizes for an upper bound of NDCG (normalized discounted cumulative gain), specifically weighted by business value (eCPM). This ensures the model isn't just "predicting what's next" but "predicting what's valuable."
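To illustrate what "list-wise and value-weighted" can look like in code, here is a generic listwise softmax loss whose targets are proportional to each candidate's eCPM, next to an NDCG that uses eCPM as the gain. This is not the RSPO objective from the paper, whose exact form is not given here; the function names and the choice of normalized eCPM as the target distribution are assumptions for the sketch.

```python
import numpy as np

def value_weighted_softmax_loss(scores, values):
    """Cross-entropy between softmax(scores) and a target distribution
    proportional to business value (assumed here to be eCPM)."""
    scores = np.asarray(scores, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    shifted = scores - scores.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    weights = values / values.sum()
    return float(-(weights * log_probs).sum())

def value_ndcg(scores, values):
    """NDCG of the ranking induced by `scores`, with `values` as gains."""
    scores = np.asarray(scores, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(-scores)  # highest score first
    discounts = 1.0 / np.log2(np.arange(2, len(values) + 2))
    dcg = (values[order] * discounts).sum()
    ideal = (np.sort(values)[::-1] * discounts).sum()
    return float(dcg / ideal)

# Hypothetical candidate list: eCPM values and two sets of model scores.
ecpm = [5.0, 1.0, 0.5]
aligned = [2.0, 0.0, -1.0]   # ranks the most valuable ad first
reversed_ = [-1.0, 0.0, 2.0]  # ranks the most valuable ad last
```

On this toy list, the aligned scores give a lower loss and a higher NDCG than the reversed scores, which is the direction a value-aware listwise objective should push the model.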
Experiments & Results: Scaling Laws are Real
The reported results from Kuaishou’s production environment:
- Revenue Growth: Up to +4.32% relative to GR-Base.
- Efficiency: LazyAR combined with serving optimizations reached a 117% QPS improvement over vanilla generative setups.
- Scaling Laws: The authors observed a linear relationship between revenue and both model size and inference beam width.

Critical Insight & Conclusion
GR4AD makes a strong case that the future of recommendation is generative, but only if we stop treating recommenders as "text completion engines." The success here stems from architectural relaxation (LazyAR) and metric alignment (RSPO).
Takeaway: If you are scaling GenRec, don't just scale the parameters. Scale the inference-time search (beam width) and optimize the sharing of hidden states across beams.
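For readers less familiar with inference-time search, the following toy beam search over two-token IDs shows why a wider beam can surface a better candidate than greedy decoding. The scoring table is invented for the example and `beam_search` is a generic textbook routine, not GR4AD's serving code.

```python
import numpy as np

def beam_search(step_log_probs, num_levels, beam_width):
    """Generic beam search. `step_log_probs(prefix)` returns log-probabilities
    over the next token given the tuple of tokens chosen so far."""
    beams = [((), 0.0)]
    for _ in range(num_levels):
        candidates = []
        for prefix, score in beams:
            for token, lp in enumerate(step_log_probs(prefix)):
                candidates.append((prefix + (token,), score + float(lp)))
        # Keep the highest-scoring prefixes; ties keep their original order.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical next-token probabilities for a 2-token vocabulary.
table = {
    (): [0.6, 0.4],
    (0,): [0.5, 0.5],
    (1,): [0.9, 0.1],
}

def toy_scorer(prefix):
    return np.log(table[prefix])

greedy = beam_search(toy_scorer, num_levels=2, beam_width=1)
wide = beam_search(toy_scorer, num_levels=2, beam_width=2)
```

Greedy decoding commits to token 0 first and ends at (0, 0) with probability 0.30, while a beam of width 2 keeps token 1 alive and finds (1, 0) with probability 0.36. The cost is that every extra beam multiplies the per-step decoder work, which is why sharing hidden states across beams matters.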
Limitations
- The LazyAR design is specific to short-sequence generation (Semantic IDs) and likely won't translate to long-form LLM text generation.
- The reliance on a Reward Model for RL signal means the upper bound of performance is still capped by the quality of the teacher reward model.
