WisPaper
WisPaper
学术搜索
学术问答
价格
TrueCite
[NIPS 2024] Offline Decision Transformers for TSP: From Imitation to Optimization
总结
问题
方法
结果
要点
摘要

This paper introduces an offline Reinforcement Learning (RL) approach for the Traveling Salesman Problem (TSP) by adapting the Decision Transformer (DT) framework. By integrating a Pointer Network and optimistic Return-to-Go (RTG) prediction via expectile regression, the method significantly outperforms the four classical heuristics (NN, NI, FI, SA) it was trained on.

TL;DR

Researchers have successfully applied the Decision Transformer (DT) to the Traveling Salesman Problem (TSP) in an offline setting. By training on data from existing heuristics (like Simulated Annealing), the model doesn't just copy them—it outperfroms them. Key innovations include a Pointer Network head for node selection and Expectile Regression for predicting optimistic performance targets, allowing the model to "stitch together" the best parts of sub-optimal solutions.

Background & Motivation: Beyond the Online Bottleneck

Neural Combinatorial Optimization (NCO) has traditionally lived in the realm of Online Reinforcement Learning. While effective, online RL is a "data hungry" beast that ignores the treasure trove of existing heuristic algorithms (NN, FI, SA) developed over decades.

The challenge? Standard Offline RL frameworks like the Decision Transformer are built for environments with fixed action meanings (e.g., "move left"). In TSP, "Node 5" means something different in every instance. Furthermore, the "Return-to-Go" (RTG) varies wildly; a tour length of 4.0 might be amazing for one graph but terrible for another.

Methodology: The NCO-Decision Transformer

To bridge the gap between sequence modeling and graph optimization, the authors introduced two critical modifications:

1. Pointer Network Action Head

Instead of a fixed softmax over indices, the model uses a Pointer Network mechanism. It looks at the current Transformer state and "points" to node embeddings generated by an initial Encoder. This preserves spatial context and allows the model to handle variable graph sizes.

2. Optimistic RTG Prediction with Expectile Regression

If you tell a DT to reach a target it has never seen, it fails. If you give it an average target, it acts... average. The authors use Expectile Regression ($\alpha=0.99$) to predict the best possible return (lowest cost) for a given graph. This "optimistic conditioning" pushes the model to explore the upper bounds of the behavioral policy it observed in the training data.

Model Architecture

Experimental Breakthroughs

The results are striking. Across various heuristics, the DT model consistently achieved a lower Optimality Gap than the data it was trained on.

  • The "Stitching" Effect: On Simulated Annealing (SA) data for N=100, the original heuristic had a 12.36% gap. The DT reduced it to 6.07%. This suggests that the Transformer is learning to identify high-quality sub-tours and recombine them more effectively than the stochastic SA process.
  • RTG vs. Behavior Cloning (BC): Simple imitation (BC) failed to match the DT's performance. Without the RTG "goal," the model lacks the guidance to choose the trajectory leading to the global optimum.

Performance Table

Deep Insight: The Value of "Optimism"

The ablation study on the expectile parameter $\alpha$ reveals a core truth: higher $\alpha$ values (putting more weight on under-prediction errors) lead to better solutions. By training the model to expect better-than-average returns, we transform a sequence predictor into an optimizer. However, the authors noted that even their "optimistic" predictions were sometimes conservative—adding a manual offset to the RTG during inference yielded even better results in some cases.

Conclusion & Future Outlook

This work repositioned the Decision Transformer as a powerful tool for Combinatorial Optimization. It proves that we don't always need to "reinvent the wheel" with expensive online RL; instead, we can use the "wheels" we already have (heuristics) to train models that are faster and smarter.

Limitations: The model is still sensitive to the diversity of training data. As seen with the Nearest Neighbor (NN) heuristic, if the training data lacks variety, the DT has no "better paths" to discover. Future research will likely focus on mixing multiple heuristics to create a "Super-Heuristic" via Offline RL.

发现相似论文

试试这些示例

  • Search for recent papers that apply Offline Reinforcement Learning or Decision Transformers to other NP-hard combinatorial optimization problems like the Vehicle Routing Problem (VRP) or Job Shop Scheduling.
  • Who first proposed the Pointer Network for combinatorial optimization, and how have subsequent Transformer-based architectures improved upon the original pointer mechanism for TSP?
  • Investigate studies that utilize expectile regression or "optimistic conditioning" in Decision Transformers to improve out-of-distribution performance in offline RL tasks.
目录
[NIPS 2024] Offline Decision Transformers for TSP: From Imitation to Optimization
1. TL;DR
2. Background & Motivation: Beyond the Online Bottleneck
3. Methodology: The NCO-Decision Transformer
3.1. 1. Pointer Network Action Head
3.2. 2. Optimistic RTG Prediction with Expectile Regression
4. Experimental Breakthroughs
5. Deep Insight: The Value of "Optimism"
6. Conclusion & Future Outlook