[Salesforce AI] Building Enterprise Realtime Voice Agents: Why Pipelining Beats "Native" S2S (For Now)
Abstract

Salesforce AI Research presents a comprehensive technical tutorial for building enterprise-grade, realtime voice agents using a cascaded STT-LLM-TTS streaming pipeline. Their implementation achieves a sub-1-second Time-to-First-Audio (TTFA) of ~755ms, significantly outperforming current native speech-to-speech models like Qwen2.5-Omni.

Executive Summary

TL;DR: Salesforce AI Research has released a definitive guide and codebase for building enterprise voice agents that actually work in production. By benchmarking the latest native speech-to-speech (S2S) models like Qwen2.5-Omni against cascaded pipelines, they prove that the secret to "realtime" is not a single fast model, but the orchestration of streaming components. Their architecture delivers a sub-second response time (~755ms) and, crucially, supports the function calling necessary for enterprise tasks.

Positioning: This work is both a SOTA engineering blueprint and a hands-on tutorial, pitched as the "missing manual" between academic S2S research and opaque production frameworks like Pipecat or LiveKit.

The Reality Check: Native S2S is Not Ready

The industry has been buzzing about native S2S models (where the model "thinks" in audio tokens). However, the authors' empirical results provide a sobering reality check.

Using Qwen2.5-Omni-7B as a case study, they found that even with optimized hardware, the Time-to-First-Audio (TTFA) hits a staggering 13.2 seconds in streaming mode.

  • The Bottleneck: DiT-based audio decoders (the "Talker") are currently too slow, taking ~2 seconds to generate just 1 second of audio.
  • The Feature Gap: No current native S2S model supports Function Calling, making them useless for booking appointments, checking databases, or managing orders.

Methodology: The Architecture of "Realtime"

The paper defines a Voice Agent as a simple but powerful equation:

Voice Agent = STT + LLM (Agent) + TTS

The "Realtime" magic comes from overlapping the execution of three distinct stages:

1. The Cascaded Pipeline

Instead of waiting for one stage to finish, the system uses a Sentence Buffer to bridge the LLM and TTS.

  • STT (Deepgram Nova-3): Streams audio chunks via WebSocket, returning "final" transcripts in <400ms.
  • LLM (vLLM / OpenAI): Streams tokens immediately.
  • The Sentence Buffer: This is the critical "glue." It buffers tokens until it detects a sentence boundary (., !, or ?) and sends the completed sentence to the TTS while the LLM is still generating the rest of the paragraph (see the sketch after this list).
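
To make the pattern concrete, here is a minimal sketch of the sentence-buffer idea in Python. It is illustrative, not the repo's actual code; `SentenceBuffer`, `tts_queue`, and the demo token stream are assumed names.

```python
import asyncio

SENTENCE_ENDINGS = (".", "!", "?")

class SentenceBuffer:
    """Accumulates streamed LLM tokens and flushes a complete sentence
    to the TTS stage as soon as a boundary character is seen, so speech
    synthesis starts while the LLM is still generating."""

    def __init__(self, tts_queue: asyncio.Queue):
        self.tts_queue = tts_queue
        self._parts: list[str] = []

    async def feed(self, token: str) -> None:
        self._parts.append(token)
        # Naive boundary check; production code would also handle
        # abbreviations, decimals, and minimum-length thresholds.
        if token.rstrip().endswith(SENTENCE_ENDINGS):
            await self.flush()

    async def flush(self) -> None:
        sentence = "".join(self._parts).strip()
        self._parts.clear()
        if sentence:
            await self.tts_queue.put(sentence)  # TTS can start speaking now

async def demo() -> None:
    tts_queue: asyncio.Queue = asyncio.Queue()
    buf = SentenceBuffer(tts_queue)
    for token in ["Sure", ",", " I", " can", " help", ".", " One", " moment", "."]:
        await buf.feed(token)
    while not tts_queue.empty():
        print("to TTS:", await tts_queue.get())

asyncio.run(demo())
```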

(Figure: The Streaming Pipeline Latency Model)
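
The figure itself is not reproduced here, but its decomposition can be written out. Under the overlapping design above, first audio cannot arrive before the final transcript, the LLM's first complete sentence, and the TTS's first byte; a first-order reading (an interpretation, not a formula quoted from the source) is:

$$\text{TTFA} \;\approx\; t_{\text{STT final}} \;+\; t_{\text{LLM first sentence}} \;+\; t_{\text{TTS TTFB}}$$

Notably, the measured self-hosted TTFA of ~755ms comes in below even the naive sum of the component P50s reported below (~337ms + ~337ms + ~220ms ≈ 894ms), reinforcing the paper's point that the win comes from overlapping stages, not from any single fast component.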

2. Implementation Stack

The authors utilized:

  • STT: Deepgram (P50 latency ~337ms).
  • LLM: Qwen2.5-7B-Instruct served via vLLM on NVIDIA A10G (TTFT ~337ms); a streaming sketch follows this list.
  • TTS: ElevenLabs Turbo v2.5 (TTFB ~220ms).
  • VAD: Silero VAD for low-latency voice activity detection.
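
As a concrete illustration of the LLM stage, here is a minimal streaming client against a vLLM OpenAI-compatible endpoint. The base URL, API key, and prompt are illustrative assumptions for a local deployment; only the model name comes from the stack above.

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the localhost URL and "EMPTY"
# key are assumptions, not values from the paper.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is my order status?"}],
    stream=True,  # tokens arrive as they are generated (low TTFT)
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        # In the full pipeline, each delta would be fed to the
        # SentenceBuffer instead of printed to stdout.
        print(delta, end="", flush=True)
```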

Experiments: Breaking the 1-Second Barrier

The study compared various configurations to see how close they could get to human-like response speeds (commonly cited as 500-1000ms).

| Approach | TTFA (Latency) | Function Calling |
| :--- | :--- | :--- |
| Qwen2.5-Omni (Native) | ~13,200ms | No |
| Cascaded (Cloud APIs) | ~958ms | Yes |
| Cascaded (vLLM Self-Hosted) | ~755ms | Yes |


The measured TTFA of ~755ms shows that a well-tuned pipeline can mask the latency of individual components through concurrency.

Deep Insights: The "Secret Sauce"

The paper shares several non-obvious engineering "gotchas" discovered during development:

  • Dependency Sensitivity: Using transformers >= 5.0 actually broke Qwen2.5-Omni's audio quality, requiring a specific rollback to 4.52.3.
  • Prompt Engineering for Speed: Simply adding "Please speak quickly" to the system prompt naturally reduced TTS duration without degrading audio quality.
  • VAD State Machine: Implementing a robust interruption path (SPEAKING → INTERRUPTED → LISTENING) is vital for natural turn-taking; a minimal sketch follows this list.
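
Here is a minimal sketch of such a state machine in Python. The states come from the bullet above; the event-handler names and the cancellation hook are hypothetical, not the repo's API.

```python
from enum import Enum, auto

class TurnState(Enum):
    LISTENING = auto()    # user may speak; agent is silent
    SPEAKING = auto()     # agent audio is playing
    INTERRUPTED = auto()  # user barged in; cancel TTS, drain buffers

class TurnManager:
    """Illustrative barge-in state machine driven by VAD events
    (e.g., speech start/stop signals from Silero VAD)."""

    def __init__(self) -> None:
        self.state = TurnState.LISTENING

    def on_user_speech_start(self) -> None:
        if self.state is TurnState.SPEAKING:
            self.state = TurnState.INTERRUPTED
            self.cancel_tts_playback()
        # If already LISTENING, the STT stream simply keeps running.

    def on_agent_speech_start(self) -> None:
        # Normally only reachable from LISTENING, after the LLM responds.
        self.state = TurnState.SPEAKING

    def on_user_speech_end(self) -> None:
        if self.state is TurnState.INTERRUPTED:
            self.state = TurnState.LISTENING  # hand the turn back to STT/LLM

    def cancel_tts_playback(self) -> None:
        # Placeholder hook: stop audio output and clear queued sentences.
        pass
```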

Conclusion & Future Outlook

The "hard part" of a voice agent isn't the voice—it's the LLM Agent (reasoning and tool use). While native S2S models are the architectural future, the cascaded pipeline is the current king of enterprise utility.

Takeaway for Devs: If you are building a voice agent today, focus on the Sentence Aggregator logic and efficient WebSocket handling. The model will eventually get faster, but the pipelining principles will remain the same.
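
For the WebSocket-handling half of that advice, here is a rough sketch of streaming STT against Deepgram's live endpoint. The URL, query parameters, and response shape follow Deepgram's public live-streaming API as best understood; treat the details as assumptions and verify against the official docs.

```python
import asyncio
import json

import websockets  # pip install websockets

DG_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true"
)

async def stream_pcm(pcm_chunks, api_key: str) -> None:
    """Send raw 16 kHz PCM frames and print final transcripts."""
    async with websockets.connect(
        DG_URL,
        # On websockets < 14, this keyword is extra_headers instead.
        additional_headers={"Authorization": f"Token {api_key}"},
    ) as ws:

        async def sender() -> None:
            for chunk in pcm_chunks:
                await ws.send(chunk)       # ~20-100 ms of audio per frame
                await asyncio.sleep(0.02)  # pace roughly in real time
            await ws.send(json.dumps({"type": "CloseStream"}))  # flush & end

        send_task = asyncio.create_task(sender())
        async for message in ws:
            result = json.loads(message)
            alt = result.get("channel", {}).get("alternatives", [{}])[0]
            if result.get("is_final") and alt.get("transcript"):
                print("final:", alt["transcript"])  # hand off to the LLM stage
        await send_task
```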

The full 9-chapter tutorial and codebase are available on GitHub, providing a transparent alternative to proprietary "Agent-as-a-Service" platforms.
