Multi-stage Flow Scheduling for LLM Serving

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

Multi-stage Flow Scheduling for LLM Serving

[NSDI 2025] MFS: Harmonizing the "Network Chaos" in Disaggregated LLM Serving

总结

问题

方法

结果

要点

摘要

MFS (Multi-stage Flow Scheduling) is a holistic communication orchestrator for disaggregated LLM serving. It treats KV-cache retrieval, collective communication, and P2D transfers as interdependent stages, achieving a 1.2x–2.4x improvement in TTFT SLO attainment over state-of-the-art baselines.

TL;DR

As LLM serving transitions to disaggregated prefill-decode architectures, the network has become the new bottleneck. MFS (Multi-stage Flow Scheduling) recognizes that generating the first token is not a single event but a multi-stage workflow. By introducing a Reverse Multi-Level Queue (RMLQ) and a "Defer-and-Promote" strategy, MFS prioritizes flows that block GPU computation while deferring non-urgent data transfers, boosting SLO attainment by up to 2.4x.

Background: The Hidden Cost of Disaggregation

To improve throughput, modern systems like vLLM and DistServe separate the Prefill phase (compute-heavy) from the Decode phase (memory-heavy). However, this creates a "communication storm" during the prefill stage:

Stage 1 (KV-Cache Reuse): Fetching previous context from remote nodes.
Stage 2 (Collective Comm): Heavy-duty All-to-All or All-Reduce for model parallelism.
Stage 3 (P2D Transfer): Moving the newly generated KV-cache to decode workers.

Existing schedulers (like SJF or Fair Sharing) are stage-agnostic. They might give high priority to a small P2D transfer (Stage 3), unknowingly starving a critical All-to-All flow (Stage 2) that the GPU is currently idling for. This mismatch between network priority and computational dependency is the primary cause of TTFT (Time-to-First-Token) spikes.

Methodology: The Defer-and-Promote Intuition

The core innovation of MFS is the Reverse Multi-Level Queue (RMLQ). Unlike traditional MLFQ which demotes long-running flows, RMLQ starts flows at low priority and promotes them only as their "laxity" (time remaining before a deadline violation) vanishes.

1. Handling Explicit Deadlines (Stage 3)

For Prefill-to-Decode (P2D) transfers, the deadline is known (the TTFT SLO). MFS uses a metric called Minimal Link Utilization (MLU). If a flow can meet its deadline using only 20% of the link, it is kept in a low-priority queue. As the clock ticks, if it now requires 90% of the link to finish on time, MFS promotes it. This "just-in-time" completion leaves maximum bandwidth for early-stage flows.

2. Handling Implicit Deadlines (Stages 1 & 2)

Early stages don't have a fixed "finish-by" time, but they have a structural dependency. MFS introduces the Relative Layer Index (RLI):

RLI = 0: The data is needed for the current layer.
RLI > 0: The data is needed for a future layer.

MFS strictly prioritizes flows with the smallest RLI. This ensures that the GPU never waits for data that could have been transferred earlier.

MFS Design Overview Figure: The RMLQ Architecture, illustrating how different stages are mapped to priority queues based on urgency.

Evaluation: Efficiency at Scale

The authors tested MFS on a 32-GPU cluster and scaled simulations. The results are striking:

Collective Completion Time (CCT): Reduced by ~50%. By protecting Stage 2 flows, MFS keeps the GPUs saturated.
SLO Attainment: MFS handles 1.2x to 1.4x higher request rates than Karuna (the previous state-of-the-art) while maintaining the same SLO hit rate.

Simulation Results Figure: MFS consistently outperforms SJF, EDF, and Karuna across different MoE models (Mixtral, DBRX, Grok).

Critical Insight: Why it Works

The "magic" of MFS isn't just about speed; it's about regulating earliness. Traditional schedulers try to finish every flow as fast as possible. MFS realizes that finishing a Stage 3 flow "too early" is actually harmful if it creates contention that delays a Stage 2 flow. By "delaying the non-urgent," MFS creates a "communication lane" for the critical path of the prefill DAG.

Conclusion

MFS proves that as AI models grow, the distinction between "network engineering" and "ML systems" is blurring. To achieve sub-second TTFT in global-scale LLM clusters, our network switches must become aware of the Transformer's layer-by-layer execution logic. MFS provides the blueprint for this transition.

Limitations: The current implementation relies on a centralized coordinator for inter-node tie-breaking, which might face scalability challenges in clusters with tens of thousands of GPUs. Future work could explore decentralized laxity estimation.

发现相似论文

试试这些示例

Find recent papers on network contention and communication-computation overlap specifically in disaggregated LLM architectures like Splitwise or DistServe.
What are the theoretical foundations of Least-Laxity-First (LLF) scheduling in distributed DAG tasks, and how does MFS's RMLQ compare to classic real-time systems theory?
Explore if the "Defer-and-Promote" strategy has been applied to other multi-stage AI workloads, such as multi-modal video generation pipelines or large-scale recommendation system retrievals.

[NSDI 2025] MFS: Harmonizing the "Network Chaos" in Disaggregated LLM Serving

1. TL;DR

2. Background: The Hidden Cost of Disaggregation

3. Methodology: The Defer-and-Promote Intuition

3.1. 1. Handling Explicit Deadlines (Stage 3)

3.2. 2. Handling Implicit Deadlines (Stages 1 & 2)

4. Evaluation: Efficiency at Scale

5. Critical Insight: Why it Works

6. Conclusion