This paper presents a large-scale benchmark of modern deep learning architectures, including Transformers, State-Space Models (Mamba), and xLSTM, for financial time-series prediction and position sizing. The study finds that hybrid models such as VSN+LSTM (VLSTM) and xLSTM-based variants achieve superior risk-adjusted performance (a state-of-the-art Sharpe ratio of 2.40) over traditional linear baselines and generic deep learning models across 15 years of cross-asset data.
TL;DR
A massive 15-year benchmark from the University of Oxford reveals that the "bigger is better" Transformer philosophy fails in financial markets. Instead, models with adaptive gating and structured memory, specifically VLSTM (VSN + LSTM) and the new xLSTM, dominate the leaderboard with Sharpe ratios exceeding 2.30, suggesting that denoised temporal representations are the secret sauce for navigating "noisy" market regimes.
Background: The Financial "Noise" Problem
Most deep learning benchmarks focus on datasets like weather or electricity, where the signal is loud and clear. Finance is the opposite. It is a low signal-to-noise environment where patterns are fleeting. This paper asks: can modern heavyweights like Mamba2, PatchTST, and xLSTM actually survive a real-world trading backtest across Commodities, FX, Bonds, and Equities?
Methodology: Beyond Simple Forecasting
The authors didn't just predict price; they built an end-to-end portfolio optimization pipeline.
- Input: Statistical/technical indicators + Ticker Embeddings.
- Architecture: A variety of encoders (Linear, Transformer, SSM, Recurrent).
- Loss Function: Direct optimization of the Differentiable Sharpe Ratio, forcing the model to find risk-adjusted returns rather than just minimizing Mean Squared Error.
- Risk Control: Volatility targeting to equalize contributions across different assets.
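The pipeline above can be sketched in a few lines. The paper's exact loss and volatility-targeting formulas are not reproduced here; this is a minimal numpy sketch of a Sharpe-style objective and a per-asset vol scaler (the `eps` stabilizer, the 15% target, and the un-annualized Sharpe are illustrative assumptions — in a real pipeline the same expressions would be written in an autodiff framework such as PyTorch so gradients flow back into the encoder):

```python
import numpy as np

def sharpe_loss(weights, returns, eps=1e-8):
    """Negative Sharpe ratio of portfolio returns (lower is better).

    weights: (T, N) model-output positions per asset per step
    returns: (T, N) next-period asset returns
    Minimizing this directly optimizes risk-adjusted return instead of MSE.
    """
    port = (weights * returns).sum(axis=1)       # portfolio return per step
    return -port.mean() / (port.std() + eps)     # negative (un-annualized) Sharpe

def vol_target(weights, asset_vol, target=0.15):
    """Scale each asset's position so its ex-ante volatility contribution
    matches a common target -- a simple form of volatility targeting."""
    return weights * (target / np.maximum(asset_vol, 1e-8))
```

Note that the Sharpe objective is scale-invariant in the weights, which is one reason a separate volatility-targeting step is needed to pin down position sizes.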
Figure 1: The architecture transforms raw features into weights by maximizing the Sharpe Ratio.
Key Architectures Compared
1. The Comeback of Recurrence: xLSTM & VLSTM
While Transformers are the current trend, this study finds that Recurrent Neural Networks (RNNs) are far from dead.
- xLSTM: Uses exponential gating and matrix memory to prevent the "forgetting" issue of old LSTMs.
- VLSTM: Adds a Variable Selection Network (VSN) to "denoise" the input before it even hits the LSTM.
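The core idea of the VSN stage can be sketched as learned softmax weights that softly select input features before the recurrent encoder sees them. The single-layer linear scorer below is an illustrative assumption (the published VSN design, as in the Temporal Fusion Transformer, uses gated residual networks):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def variable_selection(features, W_score, b_score):
    """Soft feature selection: score each input variable, normalize the
    scores with softmax, and reweight the features before the LSTM.
    Noisy features end up with gate weights near zero."""
    scores = features @ W_score + b_score   # (T, F) raw relevance scores
    gates = softmax(scores)                 # gate weights sum to 1 per step
    return features * gates, gates
```

Because the gates are learned jointly with the downstream Sharpe objective, features that only add noise to the trading signal are suppressed end-to-end.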
2. State-Space Models (Mamba & Mamba2)
Mamba offers linear scaling and an effectively unbounded lookback. In this benchmark, however, it showed heterogeneous behavior: it worked well in some years (e.g., 2020) but struggled to match the consistent risk-adjusted returns of gated recurrent models.
3. Transformers (PatchTST & iTransformer)
PatchTST breaks time series into "patches" to smooth noise. While it performs better than basic Transformers, it often lacks the stable "state" required to handle market regime shifts compared to VLSTM.
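The patching step itself is simple: the series is cut into (possibly overlapping) windows, each of which becomes one token for the Transformer. A minimal sketch, with patch length and stride chosen arbitrarily for illustration:

```python
import numpy as np

def make_patches(series, patch_len=16, stride=8):
    """Split a 1-D series into overlapping patches, as in PatchTST:
    each patch is embedded and treated as one attention token, which
    smooths point-level noise and shortens the token sequence."""
    n = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len] for i in range(n)])
```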
Experimental Results: The Leaderboard
The results were striking. VLSTM and LPatchTST (a hybrid of LSTM and Patching) were the clear winners.
| Strategy | Sharpe Ratio (2010-2025) | CAGR | Max Drawdown |
| :--- | :--- | :--- | :--- |
| VLSTM | 2.40 | 26.3% | -22.9% |
| LPatchTST | 2.31 | 25.5% | -17.4% |
| xLSTM | 1.80 | 19.3% | -14.1% |
| Mamba2 | 0.78 | 5.8% | -26.3% |
| AR1x (Linear) | 0.77 | 8.1% | -16.7% |
Depth Insight: Transaction Cost & Efficiency
One of the most valuable parts of this paper is the Breakeven Transaction Cost analysis. A model can have a high Sharpe Ratio but trade so frequently that costs eat all profits.
- xLSTM showed the highest signal-to-trade efficiency, maintaining a larger cost buffer (breakeven cost) in liquid contracts than the other models.
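A back-of-the-envelope version of the breakeven idea: divide total gross PnL by total turnover to get the per-trade cost at which profits vanish. The paper's exact definition is not reproduced here; this numpy sketch is a simplified approximation:

```python
import numpy as np

def breakeven_cost_bps(weights, returns):
    """Approximate cost (in basis points per unit of notional traded)
    at which the strategy's gross PnL is fully consumed by trading.

    weights: (T, N) positions held each period
    returns: (T, N) per-period asset returns
    """
    gross_pnl = (weights * returns).sum()               # total gross return
    turnover = np.abs(np.diff(weights, axis=0)).sum()   # total notional traded
    if turnover == 0:
        return float("inf")                             # buy-and-hold never pays costs
    return 1e4 * gross_pnl / turnover                   # bps per unit turnover
```

A model that trades rarely but captures the same returns has a much larger breakeven buffer, which is exactly the property attributed to xLSTM here.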
Figure 2: Cumulative PnL paths. Notice the stability of the sequence-based models compared to the flat performance of linear baselines.
Critical Analysis & Takeaways
- Inductive Bias > Model Size: Financial data is too small and noisy for "foundation model" scaling to work out-of-the-box. The "statefulness" of LSTMs provides an Inductive Bias that filters noise better than global attention.
- Hybrids are King: Combining feature selection (VSN) with sequence modeling (LSTM/xLSTM) provides two layers of denoising—essential for survival in non-stationary markets.
- The "Why" behind xLSTM's Success: xLSTM's exponential gating allows it to "ignore" high-frequency noise while latching onto rare but powerful economic signals that standard sigmoidal LSTMs would saturate and forget.
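The saturation argument in the last bullet can be made concrete. A toy illustration, simplified from the xLSTM formulation (the max-based stabilizer mirrors xLSTM's m-state trick; the specific pre-activation values are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Two gate pre-activations: a common strong signal and a rarer, stronger one.
a, b = 10.0, 14.0

# Sigmoid gating saturates: both gates are ~1.0, so the rare signal
# is indistinguishable from the common one (and gradients vanish).
sig_ratio = sigmoid(b) / sigmoid(a)

# Exponential gating, stabilized against overflow by subtracting the max
# (as in xLSTM's m-state), preserves the relative strength of the signals.
m = max(a, b)
exp_ratio = math.exp(b - m) / math.exp(a - m)   # = e^(b - a)
```

With sigmoid gates the ratio collapses to roughly 1, while the exponential gate keeps the rarer signal about e^4 ≈ 55x stronger, so it can dominate the memory update instead of being forgotten.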
Conclusion
This benchmark provides a sobering reality check for AI in finance: theoretical efficiency (like Mamba's scaling) does not always translate into empirical profitability. For practitioners, the direction of travel is toward hybridized recurrent architectures that prioritize stability and feature selection over the raw complexity of generic Transformers.
Are you still using standard LSTMs for Alpha generation? It might be time to look at VSN+xLSTM.
