The paper introduces DualPath, a specialized LLM inference system designed for multi-turn agentic workloads. It breaks the KV-Cache storage I/O bottleneck by utilizing the idle storage bandwidth of decoding engines to assist prefill engines, achieving up to 1.87× throughput improvement for offline tasks and 1.96× for online serving.
Executive Summary
TL;DR: The era of "compute-is-all-you-need" is fading for AI agents. As LLMs transition from simple chatbots to multi-turn agents (like coding assistants or autonomous researchers), the bottleneck has shifted from GPU FLOPS to storage I/O. Traditional systems choke because they load KV-Cache only through the prefill node's storage NIC. DualPath also pulls data through the decode nodes' idle storage NICs and forwards it over the compute network, which can roughly double the bandwidth available for KV-Cache loading.
Strategic Positioning: This work, emerging from DeepSeek-AI and top-tier universities, represents a "system-level breakthrough." It recognizes that for models like DeepSeek-V3, the cache-compute ratio has reached a tipping point where storage bandwidth is the primary limit on performance.
The Core Conflict: Why Agents are Different
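In a standard chatbot, you ask a question, and the model generates an answer. In the agentic paradigm, the model interacts with an environment (e.g., a Python interpreter) hundreds of times within one session, so each turn appends a small amount of new input to a long, already-processed history. Three consequences follow: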
- 95%+ Hit Rate: Almost all of the session's history is already in storage.
- I/O Bound: Loading 32k+ tokens of KV-Cache takes longer than computing the 400 new tokens of the turn.
- Hardware Asymmetry: Since NVIDIA Ampere, GPU compute power has grown 14× faster than NIC bandwidth. We are officially hitting the "Storage Wall."
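To make the I/O-bound claim concrete, here is a back-of-envelope sketch in Python. Only the 32k cached tokens and 400 new tokens come from the text above; the per-token KV size, NIC bandwidth, FLOPs per token, and GPU throughput are hypothetical placeholders, and the compute estimate ignores attention over the cached context.

```python
def hit_rate(cached_tokens: int, new_tokens: int) -> float:
    """Fraction of the turn's context that is already stored as KV-Cache."""
    return cached_tokens / (cached_tokens + new_tokens)


def kv_load_seconds(cached_tokens: int, kv_bytes_per_token: float,
                    nic_bytes_per_s: float) -> float:
    """Time to read the cached KV entries through a single storage NIC."""
    return cached_tokens * kv_bytes_per_token / nic_bytes_per_s


def prefill_seconds(new_tokens: int, flops_per_token: float,
                    gpu_flops_per_s: float) -> float:
    """Time to compute the new tokens (attention over the cache is ignored)."""
    return new_tokens * flops_per_token / gpu_flops_per_s


# Token counts follow the text; everything else is an assumed placeholder.
CACHED_TOKENS = 32_000
NEW_TOKENS = 400
KV_BYTES_PER_TOKEN = 70_000   # assumed
NIC_BYTES_PER_S = 50e9        # assumed (a 400 Gb/s NIC)
FLOPS_PER_TOKEN = 7.4e10      # assumed
GPU_FLOPS_PER_S = 1e15        # assumed effective throughput

load_s = kv_load_seconds(CACHED_TOKENS, KV_BYTES_PER_TOKEN, NIC_BYTES_PER_S)
compute_s = prefill_seconds(NEW_TOKENS, FLOPS_PER_TOKEN, GPU_FLOPS_PER_S)

print(f"hit rate: {hit_rate(CACHED_TOKENS, NEW_TOKENS):.3f}")
print(f"KV load:  {load_s * 1e3:.1f} ms")
print(f"compute:  {compute_s * 1e3:.1f} ms")
print("I/O bound" if load_s > compute_s else "compute bound")
```

With these placeholder numbers the load step dominates; real ratios depend on the model, the hardware, and the batch.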
Figure 1: Traditional systems (Left) saturate the prefill NIC, leaving the decode NIC idle. DualPath (Right) utilizes all available paths.
Methodology: The Dual-Path Innovation
1. Breaking the Path Monotony
The fundamental "Aha!" moment of DualPath is treating the Compute Network (East-West) as a high-speed bypass for the Storage Network (North-South).
- Path A: Storage → Prefill Engine (Traditional).
- Path B: Storage → Decode Engine → (RDMA via Compute Net) → Prefill Engine.
By "pooling" the storage NICs of every node in the cluster, DualPath transforms a single-node bottleneck into a cluster-wide resource.
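The effect of adding Path B can be sketched with a simple bandwidth model. This is an illustration rather than the paper's algorithm: the function names are hypothetical, and it assumes a read can be split freely so that both paths finish at the same time.

```python
from typing import Sequence, Tuple


def bypass_bandwidth(idle_decode_storage_bw: Sequence[float],
                     compute_net_spare_bw: float) -> float:
    """Path B is capped by both the idle decode-side storage NICs and the
    spare bandwidth on the compute network that carries the forwarded data."""
    return min(sum(idle_decode_storage_bw), compute_net_spare_bw)


def split_across_paths(total_bytes: float, bw_direct: float,
                       bw_bypass: float) -> Tuple[float, float, float]:
    """Split one KV-Cache read proportionally to path bandwidth.

    Returns (bytes on Path A, bytes on Path B, seconds until both finish).
    """
    if bw_direct <= 0:
        raise ValueError("the direct path needs positive bandwidth")
    bw_bypass = max(bw_bypass, 0.0)
    total_bw = bw_direct + bw_bypass
    bytes_direct = total_bytes * bw_direct / total_bw
    bytes_bypass = total_bytes - bytes_direct
    return bytes_direct, bytes_bypass, total_bytes / total_bw


# Assumed numbers: one idle decode NIC matching the prefill NIC.
direct_bw = 50e9
extra_bw = bypass_bandwidth([50e9], compute_net_spare_bw=100e9)
single_path = split_across_paths(4e9, direct_bw, 0.0)
dual_path = split_across_paths(4e9, direct_bw, extra_bw)
print(f"single path: {single_path[2] * 1e3:.0f} ms")
print(f"dual path:   {dual_path[2] * 1e3:.0f} ms")
```

When one equally fast decode NIC is idle and the compute network has room, the load time halves in this model; with more idle decode NICs the ceiling becomes the compute network's spare bandwidth.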
2. CNIC-Centric Traffic Management
One might worry: won't moving massive KV-Caches over the compute network slow down the model's actual inference? DualPath uses Virtual Lanes (VL) and QoS on InfiniBand/RoCE. Model execution traffic (expert-parallel and tensor-parallel communication) is assigned to a "High-Priority VL" with a 99% priority share, while KV-Cache transfers travel on a "Low-Priority VL." KV loading therefore consumes only the bandwidth that model traffic leaves unused.
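A toy model helps show why low-priority KV traffic should not delay model traffic. The sketch below uses strict priority per time slice, which is a simplification: it does not reproduce the 99% share mentioned above, and real VL arbitration is configured in the network hardware, not in application code. All names and numbers are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def share_link(capacity: float, model_demand: float,
               kv_demand: float) -> Tuple[float, float]:
    """Serve model traffic first; KV-Cache traffic gets only the leftover."""
    model_served = min(model_demand, capacity)
    kv_served = min(kv_demand, capacity - model_served)
    return model_served, kv_served


def slices_to_drain(capacity: float, model_demand_per_slice: Sequence[float],
                    kv_bytes: float) -> Optional[int]:
    """Count time slices until kv_bytes has been moved, or None if the
    observed window ends before the transfer completes."""
    remaining = kv_bytes
    for i, demand in enumerate(model_demand_per_slice, start=1):
        _, kv_served = share_link(capacity, demand, remaining)
        remaining -= kv_served
        if remaining <= 0:
            return i
    return None


# Assumed per-slice model traffic on a link with capacity 100 units per slice.
demands = [100, 60, 20, 0]
print(share_link(100, 100, 50))
print(slices_to_drain(100, demands, kv_bytes=100))
```

In the first slice the link is saturated by model traffic, so the KV transfer moves nothing; it completes only as model demand drops in later slices.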
3. Adaptive Request Scheduling
DualPath doesn't just copy data blindly. It uses a global scheduler that tracks:
- Storage NIC queue lengths.
- GPU token load.
- HBM capacity.
It dynamically decides: "Engine A is busy computing, so let's use Engine B's NIC to fetch the data for Engine A."
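A minimal sketch of such a decision, assuming a simple policy that the source does not specify: skip engines without enough free HBM, then prefer the shortest storage-NIC queue and break ties by GPU token load. The `EngineState` fields and `pick_reader` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class EngineState:
    name: str
    nic_queue_len: int    # pending reads on this engine's storage NIC
    token_load: int       # tokens currently queued for GPU compute
    hbm_free_bytes: int   # free HBM available for incoming KV-Cache


def pick_reader(engines: Sequence[EngineState],
                read_bytes: int) -> Optional[EngineState]:
    """Choose which engine's storage NIC should fetch the KV-Cache.

    Assumed policy: filter by HBM headroom, then minimize
    (NIC queue length, token load). Returns None if no engine qualifies.
    """
    candidates = [e for e in engines if e.hbm_free_bytes >= read_bytes]
    if not candidates:
        return None
    return min(candidates, key=lambda e: (e.nic_queue_len, e.token_load))


engines = [
    EngineState("A", nic_queue_len=8, token_load=4000, hbm_free_bytes=10 * 2**30),
    EngineState("B", nic_queue_len=1, token_load=500, hbm_free_bytes=10 * 2**30),
    EngineState("C", nic_queue_len=0, token_load=100, hbm_free_bytes=2**20),
]
chosen = pick_reader(engines, read_bytes=2 * 2**30)
print(chosen.name if chosen else "no engine available")
```

Here engine C has the emptiest queue but too little free HBM, so the read is routed through engine B's NIC even if the data is destined for engine A.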
Performance: Benchmarking the "Agent" Era
The system was tested on massive 660B MoE models (DeepSeek-V3.2).
- Offline Batching: For RL training (where agents "roll out" thousands of trajectories), DualPath reduced completion time by nearly 50%.
- Online Serving: Under heavy loads, the Time-to-First-Token (TTFT) in baseline systems explodes due to queuing for the NIC. DualPath keeps TTFT flat by distributing the load.
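The "nearly 50%" figure for offline batching lines up with the headline throughput number if both describe the same workload, which is an assumption here. For a fixed amount of work, a throughput speedup of s shortens completion time by 1 - 1/s:

```python
def completion_time_reduction(throughput_speedup: float) -> float:
    """Fractional reduction in completion time for a fixed amount of work."""
    if throughput_speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / throughput_speedup


offline_reduction = completion_time_reduction(1.87)
print(f"{offline_reduction:.1%}")
```

A 1.87× speedup therefore corresponds to a reduction of a little under half, and an exact 2× speedup would be needed to reach 50%.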
Figure 2: Throughput across different model sizes. DualPath (Green) consistently hugs the "Oracle" line (theoretical max).
Critical Insight: The "I/O Wall" is the New Frontier
The most profound takeaway from the DualPath paper is the Cache-Compute Ratio analysis. For models like DeepSeek-V3, the optimized Multi-head Latent Attention (MLA) reduces KV-Cache size, but because the model is so fast at computing, it finishes its math before the NIC can finish the "Read" operation.
The Future: As we move toward 1-million-token contexts for agents, we cannot simply buy more GPUs. We must rethink PCIe and network topology. DualPath suggests that software-defined I/O pooling is a cost-effective way to sustain the next generation of autonomous AI.
Limitations
- Complexity: Implementation requires tight control over RDMA and over QoS configuration on the network switches.
- Memory Overhead: Requires a dedicated DRAM buffer on nodes to act as a "staging area" for the dual paths.
Conclusion
DualPath is a masterclass in modern systems engineering. It identifies a hardware imbalance (NIC vs GPU) and solves it not with more hardware, but with a more intelligent, non-local data path. For anyone building the "operating system" for AI agents, this is required reading.
