Timer-S1 is a billion-scale time series foundation model featuring 8.3B total parameters and a Mixture-of-Experts (MoE) architecture. It introduces "Serial Scaling" through a novel Serial-Token Prediction (STP) objective and TimeSTP blocks, achieving SOTA performance on the GIFT-Eval leaderboard.
TL;DR
Timer-S1 marks a milestone in the evolution of Time Series Foundation Models (TSFMs). By scaling to 8.3 billion parameters (32 experts, 2 activated) and introducing Serial-Token Prediction (STP), it targets the efficiency-accuracy trade-off that constrains existing forecasting approaches. Prior work either rolls forward step by step (slow) or predicts every future point in parallel (less accurate over long horizons). Timer-S1 instead uses specialized TimeSTP blocks to provide deep serial computation in a single inference pass.
Context: The Scaling Bottleneck in Time Series
While Large Language Models (LLMs) have scaled gracefully using Next-Token Prediction (NTP), time series data is a different beast. It is non-stationary, heterogeneous, and lacks the discrete grammatical structure of language.
The industry has been stuck between two worlds:
- Iterative Autoregression (NTP): Conceptually sound but suffers from "Exposure Bias" and massive computational overhead during long-term inference.
- Parallel Multi-Token Prediction (MTP): Fast, but because it predicts all future points from the same representation layer, it lacks the "progressive reasoning" required to model how error and uncertainty evolve over time.
Timer-S1 bridges this gap via the Serial Scaling Hypothesis, arguing that long-term accuracy is a function of serial computation depth.
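The two existing regimes listed above can be contrasted with a minimal sketch. The functions `rollout_forecast`, `parallel_forecast`, `toy_step`, and `toy_multi` are hypothetical stand-ins invented for illustration, not part of Timer-S1; the toy "model" is a simple decay rule, so only the call pattern matters here, not the forecast quality.

```python
import numpy as np

def rollout_forecast(step_fn, context, horizon):
    """Iterative (NTP-style) forecasting: one model call per future step."""
    history = list(context)
    calls = 0
    for _ in range(horizon):
        next_value = step_fn(np.asarray(history))
        calls += 1
        # The model's own output is fed back as input, which is where
        # exposure bias and error accumulation come from.
        history.append(next_value)
    return np.asarray(history[len(context):]), calls

def parallel_forecast(multi_fn, context, horizon):
    """Parallel (MTP-style) forecasting: every step from a single model call."""
    return multi_fn(np.asarray(context), horizon), 1

# Hypothetical toy models: each future value is 0.9 times the previous one.
def toy_step(history):
    return 0.9 * history[-1]

def toy_multi(context, horizon):
    return context[-1] * 0.9 ** np.arange(1, horizon + 1)

context = np.array([1.0, 2.0, 4.0])
ntp_pred, ntp_calls = rollout_forecast(toy_step, context, horizon=8)
mtp_pred, mtp_calls = parallel_forecast(toy_multi, context, horizon=8)
print(ntp_calls, mtp_calls)
```

With a horizon of 8 steps, the rollout needs eight sequential model calls while the parallel variant needs one. The cost of the rollout therefore grows linearly with the horizon.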
Methodology: High-Depth Serial Forecasting
1. The TimeSTP Block
The heart of Timer-S1 is the TimeSTP (Serial-Token Prediction) block. In a standard Transformer, complexity is added by stacking layers. In Timer-S1, layers are specialized:
- Main TimeMoE Blocks (L=24): Responsible for global contextual representation.
- TimeSTP Blocks (H=16): These blocks don't just "process" data; each block corresponds to a specific future horizon. The h-th block refines the prediction for the h-th patch by conditioning on both the previous block's output and the initial lookback series.
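The serial chaining described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `ToySTPBlock`, the hidden size `D`, the patch length `PATCH`, the random weights, and the use of a single vector `lookback_repr` to represent the lookback series are all hypothetical. Only the value H=16 and the conditioning pattern (previous block's output plus the lookback) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D, PATCH, H = 8, 4, 16  # hidden size and patch length are assumed; H=16 is from the text

class ToySTPBlock:
    """Hypothetical stand-in for one TimeSTP block (random, untrained weights)."""

    def __init__(self):
        self.w_prev = rng.normal(scale=0.1, size=(D, D))
        self.w_ctx = rng.normal(scale=0.1, size=(D, D))
        self.w_out = rng.normal(scale=0.1, size=(D, PATCH))

    def __call__(self, prev_state, lookback_repr):
        # Condition on the previous block's output AND the lookback representation.
        state = np.tanh(prev_state @ self.w_prev + lookback_repr @ self.w_ctx)
        patch = state @ self.w_out  # prediction for this block's horizon
        return state, patch

def serial_forecast(blocks, backbone_state, lookback_repr):
    """Run the blocks in order; the h-th block emits the h-th future patch."""
    state, patches = backbone_state, []
    for block in blocks:
        state, patch = block(state, lookback_repr)
        patches.append(patch)
    return np.concatenate(patches)

blocks = [ToySTPBlock() for _ in range(H)]
backbone_state = rng.normal(size=D)  # stands in for the main blocks' output
lookback_repr = rng.normal(size=D)   # stands in for the encoded lookback series
forecast = serial_forecast(blocks, backbone_state, lookback_repr)
print(forecast.shape)  # (64,)
```

The point of the sketch is the data flow: later horizons receive more sequential computation because their prediction passes through every earlier block, yet the whole forecast comes out of one forward pass.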

2. TimeBench: Trillion-Scale Diversity
Scaling requires data. The authors curated TimeBench, a corpus of 1.03 trillion time points. To prevent the model from learning "lazy" biased trends (like always predicting a mean-reverting line), they employed:
- Resampling: Training on multiple temporal resolutions.
- Value-Flipping: Multiplying series by -1 to ensure the model learns dynamics rather than directional bias.
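A minimal sketch of the two augmentations listed above, assuming a 1-D NumPy series. The helper names `resample`, `value_flip`, and `augment`, the choice of window averaging as the resampling method, and the factors `(1, 2, 4)` are illustrative assumptions; the source only says that multiple temporal resolutions are used and that series are multiplied by -1.

```python
import numpy as np

def resample(series, factor):
    """Coarsen the resolution by averaging non-overlapping windows (assumed method)."""
    usable = len(series) - len(series) % factor  # drop the ragged tail
    return series[:usable].reshape(-1, factor).mean(axis=1)

def value_flip(series):
    """Mirror the series around zero so upward and downward moves are equally likely."""
    return -series

def augment(series, factors=(1, 2, 4)):
    """Return each resolution of the series together with its flipped copy."""
    views = []
    for factor in factors:
        coarse = resample(series, factor)
        views.append(coarse)
        views.append(value_flip(coarse))
    return views

series = np.arange(8, dtype=float)
views = augment(series)
print(len(views))  # 6
```

Flipping leaves the temporal dynamics of the series intact while reversing its direction, which is why it counteracts a directional bias without changing what the model has to learn about structure.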
Experimental Results: Dominating the Leaderboard
Timer-S1 was evaluated on GIFT-Eval, the current gold standard for general forecasting.
- SOTA Performance: It achieved a MASE of 0.693, outperforming Chronos-2 and earlier Timer iterations.
- The Horizon Advantage: The "Serial" advantage is most visible in medium-to-long-term tasks. While MTP models lose accuracy quickly as the horizon extends, Timer-S1’s serial computations keep the error accumulation in check.
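For readers unfamiliar with the metric quoted above, MASE (Mean Absolute Scaled Error) divides the forecast's mean absolute error by the in-sample error of a naive forecast, so values below 1 mean the model beats that baseline. The sketch below shows the standard per-series definition; the function name `mase` and the toy numbers are illustrative, and how GIFT-Eval aggregates scores across datasets is not covered here.

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Forecast MAE scaled by the in-sample MAE of a (seasonal) naive forecast."""
    y_true, y_pred, y_train = (
        np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train)
    )
    # Naive baseline: predict each training point with the value `season` steps earlier.
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

y_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy history
y_true = np.array([6.0, 7.0])                  # toy ground truth
y_pred = np.array([5.5, 7.5])                  # toy forecast
score = mase(y_true, y_pred, y_train)
print(score)  # 0.5
```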

Ablation: Why NTP and MTP Fail
The authors showed that Timer-S1 (24-MoE + 16-STP) outperforms a pure 40-layer NTP model. This suggests that specializing layers for future horizons is more parameter-efficient than general-purpose stacking. Furthermore, Timer-S1 is significantly faster than NTP models during inference because it avoids the iterative rolling steps.
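A back-of-the-envelope count makes the speed argument concrete. This is only rough arithmetic under simplifying assumptions that are not from the source: the NTP baseline emits one patch per forward pass, caching and per-block cost differences are ignored, and the horizon of 16 patches is chosen to match the number of STP blocks. The helper `sequential_block_evals` is hypothetical.

```python
def sequential_block_evals(num_blocks, passes):
    """Blocks evaluated one after another = blocks per pass x number of passes."""
    return num_blocks * passes

HORIZON_PATCHES = 16  # assumed horizon, matching the 16 STP blocks

# 40-layer NTP model: one full pass per predicted patch (assumption).
ntp_evals = sequential_block_evals(40, HORIZON_PATCHES)

# Timer-S1: 24 main blocks + 16 STP blocks, traversed once.
stp_evals = sequential_block_evals(24 + 16, 1)

print(ntp_evals, stp_evals)  # 640 40
```

Under these assumptions both models have the same depth per pass, and the gap comes entirely from the number of passes.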

Critical Insights & Future Directions
The Takeaway: Timer-S1 shows that the "Next-Token" paradigm is not the only way to scale. By unrolling the "time" dimension into the "depth" of the architecture, a model can combine the benefits of autoregression with the speed of parallel decoding.
Limitations:
- Univariate Focus: Currently, Timer-S1 handles multivariate data by flattening it. Native multivariate modeling (capturing cross-variate correlations explicitly) remains a frontier.
- Exogenous Variables: The model does not yet natively "read" external drivers (like weather or news) alongside the numerical series.
Future Outlook: The team plans to integrate Timer-S1 into Agentic AI Systems, where the model acts as the "temporal reasoning engine" for multimodal agents performing planning and complex decision-making.
Author's Note: This research suggests that for time series, Depth is Time. The more we want to see into the future, the deeper the computation must be.
