The paper introduces Speculative Speculative Decoding (SSD) and its optimized implementation, Saguaro. This method parallelizes the traditionally sequential drafting and verification steps of speculative decoding, achieving up to 2x speedup over standard speculative decoding and 5x over autoregressive baselines for models like Llama-3.1-70B.
TL;DR
While standard Speculative Decoding (SD) uses a small model to guess tokens for a large one, it still forces the two to work in a "wait-your-turn" sequential loop. Speculative Speculative Decoding (SSD) and its reference implementation, Saguaro, break this loop. By predicting verification outcomes while the target model is still busy, SSD hides drafting latency entirely. The result? A 2x speed boost over standard speculative decoding and a 5x leap over traditional autoregressive generation.
The Bottleneck: The "Wait-and-See" Problem
Standard speculative decoding is a game of "Draft -> Verify -> Repeat." Even though verification happens in parallel across a chunk of tokens, the draft model cannot start its next round of guesses until it knows exactly where the verifier left off.
This creates a "bubble" of idle time. If your draft model takes 20ms and your verifier takes 80ms, you are stuck in a 100ms cycle. SSD asks: What if the draft model started guessing the next round while the verifier was still working on the current one?
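The arithmetic above can be captured in a toy latency model (a sketch; the 20 ms / 80 ms figures are the illustrative ones from this section, not measurements):

```python
# Toy latency model for one speculation round (times in ms).
DRAFT_MS = 20.0
VERIFY_MS = 80.0

def sequential_cycle(draft_ms: float, verify_ms: float) -> float:
    """Standard SD: drafting and verification alternate, so their
    latencies add up every round."""
    return draft_ms + verify_ms

def overlapped_cycle(draft_ms: float, verify_ms: float) -> float:
    """SSD: the draft model speculates on the next round while the
    verifier is still running, so the cycle is bounded by the slower
    of the two stages (here, the verifier)."""
    return max(draft_ms, verify_ms)

seq = sequential_cycle(DRAFT_MS, VERIFY_MS)   # 100.0 ms
ovl = overlapped_cycle(DRAFT_MS, VERIFY_MS)   # 80.0 ms
print(f"sequential: {seq} ms, overlapped: {ovl} ms, "
      f"speedup: {seq / ovl:.2f}x")
```

With these illustrative numbers, perfectly hiding the draft stage turns a 100 ms cycle into an 80 ms one; the faster the verifier relative to the draft, the larger the recovered bubble.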
Methodology: How to Predict the Future?
The challenge with asynchronous drafting is that the draft model doesn't know the verification outcome (how many tokens the target model accepted and what "bonus" token it sampled). SSD solves this by building a Speculation Cache.
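A minimal sketch of what such a cache might look like (names and structure are illustrative, not the paper's actual API): a table of precomputed next-round drafts, keyed by the verification outcome the draft model is betting on.

```python
# Hedged sketch of a speculation cache. Keys are the two unknowns of
# a verification round: how many drafted tokens were accepted, and
# which "bonus" token the target sampled.
from typing import Dict, List, Optional, Tuple

SpecCache = Dict[Tuple[int, int], List[int]]

def lookup(cache: SpecCache, accepted_len: int,
           bonus_token: int) -> Optional[List[int]]:
    """On a hit, the next draft is ready the instant verification
    finishes; on a miss, SSD must fall back to drafting from scratch."""
    return cache.get((accepted_len, bonus_token))

cache: SpecCache = {
    (3, 42): [7, 19, 3],   # bet: 3 tokens accepted, bonus token 42
    (2, 11): [42, 5, 8],   # bet: 2 tokens accepted, bonus token 11
}
print(lookup(cache, 3, 42))   # cache hit -> precomputed draft
print(lookup(cache, 0, 99))   # cache miss -> None
```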
1. The Saguaro Cache & Geometric Fan-out
Since it is impossible to prepare for every possible verification outcome, Saguaro uses a Geometric Fan-out strategy. Based on the intuition that shorter acceptance lengths are more common, the algorithm allocates more "guesses" (fan-out) to the positions most likely to be the rejection point.
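One way to sketch such an allocator (an illustrative reconstruction of the intuition, not the paper's exact algorithm): weight each candidate rejection position by a geometric acceptance model and spend the guess budget accordingly.

```python
# Hedged sketch of geometric fan-out. Assumes each drafted token is
# accepted independently with probability `accept_p`, so the chance
# the verifier rejects exactly at position i is accept_p**i * (1 - accept_p).
def geometric_fanout(num_positions: int, budget: int,
                     accept_p: float = 0.7) -> list:
    """Split `budget` precomputed guesses across rejection positions,
    giving more fan-out to the (earlier) positions that are more
    likely to be where verification stops."""
    weights = [accept_p**i * (1 - accept_p) for i in range(num_positions)]
    total = sum(weights)
    # Guarantee at least one guess per position, spend the rest by weight.
    alloc = [1] * num_positions
    remaining = budget - num_positions
    for i, w in enumerate(weights):
        alloc[i] += round(remaining * w / total)
    return alloc

print(geometric_fanout(num_positions=4, budget=16))
```

The allocation is front-loaded: position 0 (immediate rejection) gets the most guesses, and later positions taper off geometrically.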
Figure: Comparing standard SD (Left) where the verifier waits, and SSD (Center) where the draft precomputes multiple outcomes in parallel.
2. Saguaro Sampling: Making Rejections Predictable
In speculative decoding, when a token is rejected, the target model samples a "bonus token" from a residual distribution. This distribution is usually hard to predict. Saguaro introduces a novel sampling constant to downweight certain tokens in the draft. This "paints" the residual distribution into a shape that is easier for the speculator to guess, drastically increasing the Cache Hit Rate.
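To make the mechanism concrete, here is a sketch of the standard rejection-time residual, r(x) ∝ max(0, p(x) − q(x)), and of how downweighting a token in the draft reshapes it. The constant `c` below is an illustrative stand-in, not the paper's published sampling rule.

```python
# Standard speculative sampling: on rejection, the target samples from
# the normalized residual r(x) ∝ max(0, p(x) - q(x)).
def residual(p, q):
    """Distribution the target samples from when a draft token is rejected."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(raw)
    return [x / z for x in raw] if z > 0 else list(p)

p = [0.4, 0.3, 0.3]            # target distribution over 3 tokens
q = [0.5, 0.25, 0.25]          # draft distribution
print(residual(p, q))           # residual mass split over tokens 1 and 2

# Downweighting the draft's probability of one token (then renormalizing
# the draft) pushes residual mass onto that token, so the rejection
# outcome becomes easier for the speculator to predict and cache.
c = 0.5                         # illustrative downweighting constant
q2 = [q[0], q[1], q[2] * c]
z = sum(q2)
q2 = [x / z for x in q2]
print(residual(p, q2))          # residual now concentrates on token 2
```

In the first case the residual is spread 50/50, so a cached guess is right half the time at best; after downweighting, almost all residual mass sits on one token, which is exactly what a speculation cache wants.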
Experimental Performance: Pushing the Frontier
The authors tested Saguaro on Llama-3.1-70B using 4xH100 GPUs for the target and 1xH100 for the draft.
- Speed: Saguaro reached 255.8 tok/s on Llama-3.1-70B, compared to 161.8 tok/s for standard SD and 54.7 tok/s for autoregressive (AR) decoding.
- Efficiency: Unlike many speculative methods that kill throughput to save latency, SSD actually improves the throughput-latency Pareto frontier, meaning it provides more "bang for the buck" across various batch sizes.
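For reference, the throughput figures above imply roughly these relative speedups (the headline "up to 2x / 5x" numbers presumably come from other configurations):

```python
# Speedups implied by the reported throughputs (tok/s) on Llama-3.1-70B.
AR, SD, SSD = 54.7, 161.8, 255.8

print(f"SSD vs AR: {SSD / AR:.2f}x")   # ~4.68x
print(f"SSD vs SD: {SSD / SD:.2f}x")   # ~1.58x
print(f"SD  vs AR: {SD / AR:.2f}x")    # ~2.96x
```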
Figure: SSD (Saguaro) consistently outperforms SD and AR across multiple datasets like HumanEval and GSM8K.
Deep Insight: A New Hardware Paradigm
The most profound shift in SSD is the requirement for distinct hardware. In standard SD, the draft and target models usually share the same GPU to avoid communication overhead. SSD deliberately separates them. Because the communication (sending a few token IDs and logits) is tiny compared to the heavy lifting of a 70B forward pass, the asynchrony more than pays for the NCCL transfer time.
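A back-of-envelope check, using assumed round numbers for payload size, link bandwidth, and compute throughput (none of these figures are from the paper), shows why the transfer is in the noise:

```python
# Rough, illustrative cost comparison: inter-GPU transfer vs. target
# forward pass. All constants below are assumptions for the sketch.
TOKENS_PER_ROUND = 8          # speculated tokens per round (assumed)
VOCAB = 128_256               # Llama-3.1 vocabulary size
BYTES_FP16 = 2

# Payload: int32 token IDs plus fp16 logits for each drafted position.
payload_bytes = TOKENS_PER_ROUND * 4 + TOKENS_PER_ROUND * VOCAB * BYTES_FP16
LINK_GBPS = 400               # NVLink-class bandwidth, assumed
transfer_ms = payload_bytes * 8 / (LINK_GBPS * 1e9) * 1e3

# A dense forward pass costs roughly 2 * params FLOPs per token.
flops = 2 * 70e9 * TOKENS_PER_ROUND
TARGET_FLOPS = 4e15           # ~4x H100 fp16 peak, assumed round figure
compute_ms = flops / TARGET_FLOPS * 1e3

print(f"transfer ~ {transfer_ms:.3f} ms, compute ~ {compute_ms:.3f} ms")
```

Even this optimistic compute estimate (real decoding is often memory-bandwidth-bound and slower) leaves the transfer an order of magnitude cheaper than the forward pass it overlaps with.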
Limitations & Future Work
While SSD is a breakthrough, it relies on a "Primary" and "Backup" speculator strategy to handle Cache Misses. At very large batch sizes (e.g., >32), the probability that at least one sequence misses the cache increases, which can bottleneck the entire batch. Future iterations may focus on a more "disaggregated" prefill-decode architecture where speculation endpoints are shared across clusters.
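The batch-size effect follows from simple probability: assuming an independent per-sequence miss rate p (an illustrative assumption), the chance that at least one of B sequences misses is 1 − (1 − p)^B, which climbs quickly with B.

```python
# Illustrative batch-level cache-miss probability, assuming each
# sequence misses independently with probability p.
def batch_miss_prob(p: float, batch: int) -> float:
    """P(at least one of `batch` sequences misses the cache)."""
    return 1 - (1 - p) ** batch

for b in (1, 8, 32):
    print(b, round(batch_miss_prob(0.05, b), 3))
```

Even a modest 5% per-sequence miss rate means a batch of 32 stalls on a miss roughly 80% of the time, which is why large batches erode the benefit.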
Conclusion
Saguaro proves that the sequential dependence between drafting and verification is not a fundamental law of LLM inference, but a design choice. By treating the verifier's outcome as a predictable event, SSD opens the door to a new era of ultra-low latency AI.
Main Takeaway: If you have an extra GPU to spare for drafting, SSD/Saguaro is currently the most efficient way to turn that extra silicon into raw generation speed.
