Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling


[arXiv 2026] Timer-S1: Decoding the Serial Scaling Law for Billion-Scale Time Series Foundation Models

Abstract

Timer-S1 is a billion-scale time series foundation model featuring 8.3B total parameters and a Mixture-of-Experts (MoE) architecture. It introduces "Serial Scaling" through a novel Serial-Token Prediction (STP) objective and TimeSTP blocks, achieving SOTA performance on the GIFT-Eval leaderboard.

TL;DR

Timer-S1 scales Time Series Foundation Models (TSFMs) to 8.3 billion parameters (32 experts, 2 activated) and introduces Serial-Token Prediction (STP) to ease the efficiency-accuracy trade-off of traditional autoregressive models. Prior approaches either roll forward step by step, which is slow, or predict all future points in parallel, which loses precision over long horizons. Timer-S1 instead uses specialized TimeSTP blocks to perform deep serial computation in a single inference pass.

Context: The Scaling Bottleneck in Time Series

While Large Language Models (LLMs) have scaled gracefully using Next-Token Prediction (NTP), time series data is a different problem: it is non-stationary, heterogeneous, and lacks the discrete grammatical structure of language.

Existing approaches fall into two camps:

Iterative Autoregression (NTP): Conceptually sound but suffers from "Exposure Bias" and massive computational overhead during long-term inference.
Parallel Multi-Token Prediction (MTP): Fast, but because it predicts all future points from the same representation layer, it lacks the "progressive reasoning" required to model how error and uncertainty evolve over time.

Timer-S1 bridges this gap via the Serial Scaling Hypothesis, arguing that long-term accuracy is a function of serial computation depth.
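To make the two baseline regimes concrete, here is a minimal sketch in plain Python. The functions `ntp_rollout` and `mtp_direct`, and the toy "last value plus last difference" predictors, are hypothetical stand-ins for illustration only; they are not taken from the paper or from any Timer-S1 code.

```python
from typing import Callable, List


def ntp_rollout(step: Callable[[List[float]], float],
                context: List[float], horizon: int) -> List[float]:
    """Iterative next-token prediction: one model call per future step,
    with each call consuming the model's own earlier predictions."""
    window = list(context)
    preds = []
    for _ in range(horizon):
        y = step(window)            # one forward pass per step
        preds.append(y)
        window = window[1:] + [y]   # feed the prediction back in
    return preds


def mtp_direct(heads: List[Callable[[List[float]], float]],
               context: List[float]) -> List[float]:
    """Parallel multi-token prediction: every horizon is read off the same
    context, with no dependence between the individual outputs."""
    return [head(context) for head in heads]


# Toy predictors (assumptions, purely illustrative): linear extrapolation.
def toy_step(window: List[float]) -> float:
    return window[-1] + (window[-1] - window[-2])


toy_heads = [lambda w, k=k: w[-1] + k * (w[-1] - w[-2]) for k in range(1, 5)]

context = [1.0, 2.0, 3.0]
ntp_preds = ntp_rollout(toy_step, context, horizon=4)
mtp_preds = mtp_direct(toy_heads, context)
```

On this perfectly linear toy series both routes agree. The structural difference is what matters: `ntp_rollout` needs `horizon` sequential calls and any early error is fed back into later inputs, while `mtp_direct` gives every horizon the same amount of computation.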

Methodology: High-Depth Serial Forecasting

1. The TimeSTP Block

The heart of Timer-S1 is the TimeSTP (Serial-Token Prediction) block. In a standard Transformer, capacity is added by stacking general-purpose layers. In Timer-S1, layers are specialized:

Main TimeMoE Blocks (L=24): Responsible for global contextual representation.
TimeSTP Blocks (H=16): Each block corresponds to a specific future horizon rather than acting as a generic processing layer. The $j$-th block refines the prediction for the $(j+1)$-th patch by conditioning on both the previous block's output and the initial lookback series.
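The data flow described above can be sketched as a simple loop. This is a toy illustration, not the authors' implementation: `serial_forecast`, `toy_block`, and `toy_head` are hypothetical names, vectors are plain lists, and reading the first patch directly off the main blocks' output is an assumption inferred from the "$j$-th block, $(j+1)$-th patch" indexing.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A serial block maps (previous state, lookback embedding) -> new state.
Block = Callable[[Vector, Vector], Vector]


def serial_forecast(main_state: Vector, lookback: Vector,
                    stp_blocks: Sequence[Block],
                    head: Callable[[Vector], float]) -> List[float]:
    """Return one prediction per patch: patch 1 from the main blocks' state
    (assumption), patch j+1 from the j-th serial block."""
    state = main_state
    preds = [head(state)]
    for block in stp_blocks:
        # Each block sees the previous block's output AND the lookback series.
        state = block(state, lookback)
        preds.append(head(state))
    return preds


# Toy stand-ins (assumptions): blend the state with the lookback embedding,
# and read the prediction off the first coordinate.
def toy_block(prev: Vector, lookback: Vector) -> Vector:
    return [0.5 * p + 0.5 * l for p, l in zip(prev, lookback)]


def toy_head(state: Vector) -> float:
    return state[0]


preds = serial_forecast(main_state=[4.0, 0.0], lookback=[0.0, 4.0],
                        stp_blocks=[toy_block, toy_block], head=toy_head)
```

With two serial blocks the loop yields three patch predictions. Later patches pass through more blocks, so computation depth grows with the forecast horizon while everything still happens in one forward pass.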

[Figure: Overall architecture]

2. TimeBench: Trillion-Scale Diversity

Scaling requires data. The authors curated TimeBench, a corpus of 1.03 Trillion time points. To prevent the model from learning "lazy" biased trends (like always predicting a mean-reverting line), they employed:

Resampling: Training on multiple temporal resolutions.
Value-Flipping: Multiplying series by -1 to ensure the model learns dynamics rather than directional bias.
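A minimal sketch of the two augmentations, assuming that resampling is done by averaging non-overlapping windows; the source does not specify the exact resampling scheme, and `downsample`, `flip`, and `augment` are hypothetical helper names.

```python
from typing import List, Sequence


def downsample(series: List[float], factor: int) -> List[float]:
    """Average consecutive non-overlapping windows of length `factor`
    to obtain a coarser temporal resolution (trailing remainder dropped)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    n = len(series) // factor
    return [sum(series[i * factor:(i + 1) * factor]) / factor for i in range(n)]


def flip(series: List[float]) -> List[float]:
    """Value-flipping: multiply every point by -1."""
    return [-x for x in series]


def augment(series: List[float], factors: Sequence[int] = (1, 2)) -> List[List[float]]:
    """For each resolution, emit the resampled series and its flipped copy."""
    out = []
    for f in factors:
        resampled = downsample(series, f)
        out.append(resampled)
        out.append(flip(resampled))
    return out


augmented = augment([1.0, 3.0, 5.0, 7.0])
```

Here an upward-trending series produces a downward-trending twin at each resolution, so a model trained on both cannot lower its loss simply by assuming a direction.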

Experimental Results: Dominating the Leaderboard

Timer-S1 was evaluated on GIFT-Eval, a benchmark and public leaderboard for general-purpose time series forecasting.

SOTA Performance: It achieved a MASE of 0.693, outperforming Chronos-2 and earlier Timer iterations.
The Horizon Advantage: The "Serial" advantage is most visible in medium-to-long-term tasks. While MTP models lose accuracy quickly as the horizon extends, Timer-S1’s serial computations keep the error accumulation in check.
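For readers unfamiliar with the metric, MASE (Mean Absolute Scaled Error) divides the forecast's mean absolute error by the in-sample error of a seasonal naive forecast, so values below 1 mean the model beats that baseline. The sketch below follows the standard definition; how GIFT-Eval aggregates the score across datasets is not covered here, and the numbers in the example are made up.

```python
from typing import List


def mase(y_true: List[float], y_pred: List[float],
         y_train: List[float], season: int = 1) -> float:
    """Mean Absolute Scaled Error.

    Numerator: mean absolute error of the forecast.
    Denominator: mean absolute error of the seasonal naive forecast
    (y[t] predicted by y[t - season]) on the training series.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_train) <= season:
        raise ValueError("training series must be longer than the season")
    forecast_mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)
    naive_mae = sum(abs(y_train[t] - y_train[t - season])
                    for t in range(season, len(y_train))) / (len(y_train) - season)
    if naive_mae == 0:
        raise ValueError("naive forecast error is zero; MASE is undefined")
    return forecast_mae / naive_mae


# Illustrative numbers only: naive in-sample MAE is 2.0, forecast MAE is 1.0.
score = mase(y_true=[8.0, 10.0], y_pred=[9.0, 9.0], y_train=[1.0, 2.0, 4.0, 7.0])
```

In this example `score` is 0.5, meaning the forecast's error is half that of the naive baseline.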

[Figure: Performance across horizons]

Ablation: Why NTP and MTP Fail

The authors show that Timer-S1 (24-MoE + 16-STP) outperforms a pure 40-layer NTP model. This suggests that specializing layers for future horizons is more parameter-efficient than general-purpose stacking. Timer-S1 is also faster than NTP models at inference time because it avoids the $O(N)$ iterative rolling steps, one forward pass per step over an $N$-step horizon.
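A back-of-the-envelope count of sequential block applications illustrates the gap. This is only a depth count under simplifying assumptions (all blocks cost the same, MoE sparsity and caching effects are ignored, and 16 serial blocks are taken to cover 17 patches), not a measured latency; the function names are hypothetical.

```python
def ntp_sequential_blocks(num_layers: int, horizon_patches: int) -> int:
    """Rolling NTP: every future patch needs a full pass through all layers."""
    return num_layers * horizon_patches


def serial_sequential_blocks(main_blocks: int, stp_blocks: int) -> int:
    """Single-pass serial forecasting: each block is applied exactly once."""
    return main_blocks + stp_blocks


MAIN_BLOCKS, STP_BLOCKS = 24, 16
# Assumption: patch 1 comes from the main blocks, patches 2..17 from the 16 serial blocks.
horizon_patches = STP_BLOCKS + 1

ntp_cost = ntp_sequential_blocks(40, horizon_patches)
serial_cost = serial_sequential_blocks(MAIN_BLOCKS, STP_BLOCKS)
ratio = ntp_cost / serial_cost
```

Under these assumptions the 40-layer NTP baseline applies 680 blocks in sequence against 40 for the single pass, a ratio that grows linearly with the number of patches rolled out.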

[Figure: Comparison of objectives]

Critical Insights & Future Directions

The Takeaway: Timer-S1 suggests that the "Next-Token" paradigm is not the only way to scale. By unrolling the time dimension into the depth of the architecture, a model can keep the step-by-step dependence of autoregression while decoding at close to parallel speed.

Limitations:

Univariate Focus: Currently, Timer-S1 handles multivariate data by flattening it. Native multivariate modeling (capturing cross-variate correlations explicitly) remains a frontier.
Exogenous Variables: The model does not yet natively "read" external drivers (like weather or news) alongside the numerical series.

Future Outlook: The team plans to integrate Timer-S1 into Agentic AI Systems, where the model acts as the "temporal reasoning engine" for multimodal agents performing planning and complex decision-making.

Author's Note: This research suggests that for time series, Depth is Time. The more we want to see into the future, the deeper the computation must be.
