QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

[QUITOBENCH] Beyond Domain Labels: A Billion-Scale, Regime-Balanced Benchmark for the Foundation Model Era

总结

问题

方法

结果

要点

摘要

QUITOBENCH is a billion-scale, regime-balanced open benchmark for time series forecasting, built from the single-provenance Alipay "QUITO" corpus. It categorizes series into eight "Trend × Seasonality × Forecastability" (TSF) regimes to evaluate models based on intrinsic data properties, achieving a new SOTA evaluation standard for both deep learning and foundation models.

TL;DR

The time series forecasting field is facing an "evaluation crisis" due to data leakage and skewed benchmarks. QUITOBENCH introduces a massive, contamination-free dataset from Alipay, shifting the focus from application-domain labels (like "Finance" or "Traffic") to intrinsic statistical TSF Regimes (Trend, Seasonality, Forecastability). The results challenge the "bigger is better" dogma, showing that 1M-parameter models can beat 200M-parameter giants in specific settings.

The Problem: The "Traffic" Trap

In traditional benchmarks like GIFT-Eval or Timer, datasets are grouped by domain. However, a "traffic" series can be a smooth diurnal curve or a chaotic burst of events. This coarse grouping hides the real drivers of forecasting difficulty. Furthermore, existing benchmarks are often imbalanced, where 76% of the data might fall into a single "easy" category, allowing models to "rank-squat" without truly being robust.

Methodology: The TSF Taxonomy

To solve this, the authors propose the TSF Regime. Every series is analyzed for:

Trend (T): Long-term drift (via STL decomposition).
Seasonality (S): Periodic patterns (via STL decomposition).
Forecastability (F): Signal regularity (1 - Normalized Spectral Entropy).

By binarizing these (High/Low), they create 8 distinct cells (e.g., HIGH_LOW_LOW is the "pathological" regime with high drift but high noise). QUITOBENCH ensures a near-uniform distribution across these cells, forcing models to prove their mettle across all conditions.

Overall Pipeline of QUITOBENCH

Key Insight 1: The Context-Length Crossover

One of the most profound findings is that the "winner" changes with the amount of history provided.

At Short Context (L=96): Deep Learning models (like CrossFormer) dominate, fitting local dependencies with high precision.
At Long Context (L≥576): Foundation Models (Chronos-2, TimesFM) take the lead. Their pre-trained "world model" of time series allows them to exploit recurring motifs and long-term seasonality that smaller models miss.

Key Insight 2: Parameter Efficiency is Not Dead

Despite the hype around 200M+ parameter foundation models, the paper shows an "Efficiency Frontier." Deep leaning models with ~1M parameters (like CrossFormer) are 59x more parameter-efficient, often matching or surpassing the MAE of foundation models.

Efficiency Frontier: Rank vs Model Scale

Experiments and Results

Across 232,200 evaluation instances, CrossFormer emerged as the overall champion in MAE and Rank. However, the HIGH_LOW_LOW regime remains a "stress test" where all current models fail significantly—3.64x worse performance than the easiest regime. This indicates that while we have made progress in seasonal data, "turbulent" trend-dominated noise is still an open frontier.

MAE vs MSE Rankings Consistency

Critical Analysis & Conclusion

QUITOBENCH proves that single-provenance data (from one large platform like Alipay) can actually be more diverse than a collection of public datasets if sampled correctly across TSF regimes.

Takeaway for Practitioners:

Don't just scale the model: Scaling the training data tokens is far more effective than increasing parameter counts.
Context Matters: If your system only provides short historical windows, stick to optimized Deep Learning models like CrossFormer or PatchTST. If you have years of historical data, Foundation Models will finally earn their keep.

Limitations: The data is currently anonymized for commercial sensitivity, and while it covers 9 verticals, it may still reflect the specific "gravity" of a massive fintech ecosystem.

Future work will likely look at probabilistic forecasting and integrating these TSF-aware models into automated CloudOps pipelines.

发现相似论文

试试这些示例

Search for recent papers that utilize spectral entropy or STL decomposition to quantify the "forecastability" of time series datasets for benchmarking purposes.
Identify the foundational research on "Cross-Dimension Dependency" in Transformers (such as CrossFormer) and how it compares to "Patching" techniques used in PatchTST.
Find studies exploring the application of zero-shot time series foundation models (like Chronos or TimesFM) in multi-vertical industrial IoT or cloud infrastructure monitoring.

[QUITOBENCH] Beyond Domain Labels: A Billion-Scale, Regime-Balanced Benchmark for the Foundation Model Era

1. TL;DR

2. The Problem: The "Traffic" Trap

3. Methodology: The TSF Taxonomy

4. Key Insight 1: The Context-Length Crossover

5. Key Insight 2: Parameter Efficiency is Not Dead

6. Experiments and Results

7. Critical Analysis & Conclusion