DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference


[DeepSeek Tech] DualPath: Shattering the Storage Bandwidth Wall in Agentic LLM Inference

Abstract

The paper introduces DualPath, a specialized LLM inference system designed for multi-turn agentic workloads. It breaks the KV-Cache storage I/O bottleneck by utilizing the idle storage bandwidth of decoding engines to assist prefill engines, achieving up to 1.87× throughput improvement for offline tasks and 1.96× for online serving.

Executive Summary

TL;DR: The era of "compute-is-all-you-need" is fading for AI agents. As LLMs transition from simple chatbots to multi-turn agents (like coding assistants or autonomous researchers), the bottleneck has shifted from GPU FLOPS to Storage I/O. Traditional systems choke because they only load KV-Cache through the "Prefill" node's NIC. DualPath solves this by siphoning data through the "Decoding" nodes' idle NICs, effectively doubling the available bandwidth.

Strategic Positioning: This work, emerging from DeepSeek-AI and top-tier universities, represents a "system-level breakthrough." It recognizes that for models like DeepSeek-V3, the cache-compute ratio has reached a tipping point where storage bandwidth is the primary inhibitor of SOTA performance.

The Core Conflict: Why Agents are Different

In a standard chatbot, you ask a question, and the model generates an answer. In the agentic paradigm, the model interacts with an environment (e.g., a Python interpreter) hundreds of times within a single session. That changes the workload in three ways:

95%+ Hit Rate: Almost all of the session's history is already in storage.
I/O Bound: Loading 32k+ tokens of KV-Cache takes longer than computing the 400 new tokens of the turn.
Hardware Asymmetry: Since NVIDIA Ampere, GPU compute power has grown 14x faster than NIC bandwidth. We are officially hitting the "Storage Wall."
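To make the I/O-bound claim concrete, here is a back-of-the-envelope sketch that compares the time to load cached KV entries with the time to compute a turn's new tokens. The 32k and 400 token counts come from the list above; the KV size per token, the NIC speed, and the prefill rate are hypothetical placeholders, not figures from the paper.

```python
def kv_load_seconds(cached_tokens, kv_bytes_per_token, nic_gbps):
    """Time to read the cached KV entries through a single storage NIC."""
    nic_bytes_per_s = nic_gbps * 1e9 / 8  # gigabits/s -> bytes/s
    return cached_tokens * kv_bytes_per_token / nic_bytes_per_s


def prefill_seconds(new_tokens, prefill_tokens_per_s):
    """Time to compute the tokens that are not already cached."""
    return new_tokens / prefill_tokens_per_s


def is_io_bound(load_s, compute_s):
    """A turn is I/O bound when loading takes longer than computing."""
    return load_s > compute_s


# Hypothetical inputs: 70 KB of KV-Cache per token, a 100 Gb/s storage NIC,
# and a prefill rate of 20,000 tokens/s. Only 32k and 400 come from the text.
load = kv_load_seconds(32_000, 70_000, 100)
compute = prefill_seconds(400, 20_000)
print(f"load={load:.4f}s compute={compute:.4f}s io_bound={is_io_bound(load, compute)}")
```

With these placeholder values the load time is several times the compute time, which is the shape of the imbalance the post describes. Real ratios depend on the model, the cache format, and how many GPUs share one NIC.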

Figure 1: Existing bottleneck vs. DualPath. Traditional systems (left) saturate the prefill NIC, leaving the decode NIC idle. DualPath (right) utilizes all available paths.

Methodology: The Dual-Path Innovation

1. Breaking the Path Monotony

The fundamental "Aha!" moment of DualPath is treating the Compute Network (East-West) as a high-speed bypass for the Storage Network (North-South).

Path A: Storage → Prefill Engine (Traditional).
Path B: Storage → Decode Engine → (RDMA via Compute Net) → Prefill Engine.

By "pooling" the storage NICs of every node in the cluster, DualPath transforms a single-node bottleneck into a cluster-wide resource.
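The effect of pooling can be sketched as an upper bound on the storage bandwidth available to each prefill node. This is an idealized model with several assumptions that are not from the paper: every node has one storage NIC of equal speed, the decode nodes' storage NICs are otherwise idle, and the compute network never limits Path B. The function name and parameters are illustrative.

```python
def per_prefill_bandwidth_gbps(prefill_nodes, decode_nodes, snic_gbps, dual_path):
    """Upper bound on storage read bandwidth available per prefill node.

    Without DualPath only the prefill nodes' storage NICs read KV-Cache.
    With DualPath, decode nodes' storage NICs read as well and forward the
    data over the compute network (assumed here to have spare capacity).
    """
    if prefill_nodes <= 0:
        raise ValueError("need at least one prefill node")
    readers = prefill_nodes + decode_nodes if dual_path else prefill_nodes
    return readers * snic_gbps / prefill_nodes


baseline = per_prefill_bandwidth_gbps(1, 1, 100, dual_path=False)
pooled = per_prefill_bandwidth_gbps(1, 1, 100, dual_path=True)
print(baseline, pooled)
```

Under this model a 1:1 prefill-to-decode ratio doubles the bound, and a cluster with more decode nodes than prefill nodes gains more. Actual gains will be lower whenever decode nodes use their own storage NICs or the compute network is busy.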

2. CNIC-Centric Traffic Management

One might worry: won't moving massive KV-Caches over the compute network slow down the model's actual inference? To prevent this, DualPath uses Virtual Lanes (VLs) and QoS on InfiniBand/RoCE. Model execution traffic (Expert Parallel/Tensor Parallel) is given 99% priority on a "High-Priority VL," while KV-Cache transfers crawl through the "Low-Priority VL." This ensures KV loading only uses bandwidth that inference traffic leaves spare.
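One way to picture the two lanes is as a work-conserving weighted share of a single link. The sketch below is a toy model, not the InfiniBand VL arbitration mechanism or DualPath's configuration: it assumes the 99% figure acts as a guaranteed bandwidth share for the high-priority lane, and that either lane may use whatever the other leaves idle.

```python
def share_link(link_gbps, hi_demand, lo_demand, hi_weight=0.99):
    """Split one link between a high- and a low-priority lane.

    Each lane is guaranteed its weighted share; capacity a lane does not
    need is handed to the other lane (work-conserving sharing).
    Returns (hi_gbps, lo_gbps).
    """
    hi_guaranteed = hi_weight * link_gbps
    lo_guaranteed = link_gbps - hi_guaranteed
    if hi_demand <= hi_guaranteed:
        # Inference traffic fits in its share; KV transfers take the rest.
        hi = hi_demand
        lo = min(lo_demand, link_gbps - hi)
    elif lo_demand <= lo_guaranteed:
        # KV traffic is light; inference may exceed its guaranteed share.
        lo = lo_demand
        hi = min(hi_demand, link_gbps - lo)
    else:
        # Both lanes are saturated; each gets exactly its guarantee.
        hi, lo = hi_guaranteed, lo_guaranteed
    return hi, lo


# Hypothetical 400 Gb/s link: inference needs 100, KV loading wants 500.
print(share_link(400, 100, 500))
```

When inference traffic is light, KV loading absorbs most of the link; when inference traffic saturates the link, KV loading is squeezed down to its small guaranteed slice.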

3. Adaptive Request Scheduling

DualPath doesn't just copy data blindly. It uses a global scheduler that tracks:

Storage NIC queue lengths.
GPU token load.
HBM capacity.

It dynamically decides: "Engine A is busy computing, so let's use Engine B's NIC to fetch the data for Engine A."
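A minimal sketch of such a decision, using the three tracked signals listed above, might look like the following. The `Engine` fields, the `schedule` function, and the selection rules (least-loaded prefill engine, shortest storage NIC queue) are illustrative assumptions; the post does not describe the scheduler's actual policy at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    role: str              # "prefill" or "decode"
    snic_queue_bytes: int  # bytes already queued on this engine's storage NIC
    gpu_token_load: int    # tokens currently scheduled on this engine's GPUs
    hbm_free_bytes: int    # free HBM available for incoming KV-Cache


def schedule(engines, kv_bytes):
    """Pick (compute engine, reader engine, path) for one prefill request."""
    # Only prefill engines with enough free HBM can run the request.
    candidates = [e for e in engines
                  if e.role == "prefill" and e.hbm_free_bytes >= kv_bytes]
    if not candidates:
        return None  # a real scheduler would queue the request and retry
    compute = min(candidates, key=lambda e: e.gpu_token_load)
    # Any engine's storage NIC may fetch the data: take the shortest queue,
    # preferring the compute engine's own NIC when queues are tied.
    reader = min(engines, key=lambda e: (e.snic_queue_bytes, e is not compute))
    path = "A" if reader is compute else "B"
    # Record the new work so later decisions see the updated state.
    reader.snic_queue_bytes += kv_bytes
    compute.hbm_free_bytes -= kv_bytes
    return compute.name, reader.name, path


engines = [
    Engine("P0", "prefill", snic_queue_bytes=8_000, gpu_token_load=1_000, hbm_free_bytes=50_000),
    Engine("D0", "decode", snic_queue_bytes=1_000, gpu_token_load=200, hbm_free_bytes=50_000),
]
print(schedule(engines, kv_bytes=2_000))
```

Here the prefill engine's storage NIC has the longer queue, so the request is computed on `P0` while `D0` fetches the data (Path B). Updating the queue length after each decision keeps one idle NIC from being picked repeatedly until it becomes the new bottleneck.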

Performance: Benchmarking the "Agent" Era

The system was tested on massive 660B MoE models (DeepSeek-V3.2).

Offline Batching: For RL training (where agents "roll out" thousands of trajectories), DualPath reduced completion time by nearly 50%.
Online Serving: Under heavy loads, the Time-to-First-Token (TTFT) in baseline systems explodes due to queuing for the NIC. DualPath keeps TTFT flat by distributing the load.
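The relationship between a throughput multiplier and a completion-time reduction is simple arithmetic: a fixed amount of work at s times the throughput takes 1/s of the time. The sketch below assumes the "nearly 50%" figure corresponds to the up-to-1.87× offline throughput quoted in the abstract; the post does not state that link explicitly.

```python
def completion_time_reduction(speedup):
    """Fraction of completion time saved for a fixed workload.

    A speedup of s means the same work finishes in 1/s of the original
    time, so the saving is 1 - 1/s.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


print(f"{completion_time_reduction(1.87):.1%}")
```

Under that assumption the saving is a little under half, which is consistent with "nearly 50%". A full 50% reduction would require exactly 2× throughput.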

Figure 2: Experimental results. Throughput across different model sizes. DualPath (green) consistently hugs the "Oracle" line (theoretical max).

Critical Insight: The "I/O Wall" is the New Frontier

The most profound takeaway from the DualPath paper is the Cache-Compute Ratio analysis. For models like DeepSeek-V3, the optimized Multi-head Latent Attention (MLA) reduces KV-Cache size, but because the model is so fast at computing, it finishes its math before the NIC can finish the "Read" operation.

The Future: As we move toward 1-million-token contexts for agents, we cannot simply buy more GPUs. We must rethink the PCIe and Network topology. DualPath proves that software-defined I/O pooling is the most cost-effective way to sustain the next generation of autonomous AI.

Limitations

Complexity: Implementation requires tight control over RDMA and Network Switches (QoS).
Memory Overhead: Requires a dedicated DRAM buffer on nodes to act as a "staging area" for the dual paths.

Conclusion

DualPath is a masterclass in modern systems engineering. It identifies a hardware imbalance (NIC vs GPU) and solves it not with more hardware, but with a more intelligent, non-local data path. For anyone building the "operating system" for AI agents, this is required reading.
