TiCo: Time-Controllable Training for Spoken Dialogue Models

TiCo: Mastering the Clock in Spoken Dialogue Models

Abstract

This paper introduces TiCo, a post-training framework that enables Spoken Dialogue Models (SDMs) to generate responses with controllable durations based on user instructions. By utilizing Spoken Time Markers (STMs) and a two-stage training process involving self-generation and Reinforcement Learning with Verifiable Rewards (RLVR), TiCo achieves state-of-the-art time-controllability in spoken interactions.

TL;DR

In the world of voice assistants, "brevity is the soul of wit", but for AI, controlling how long a spoken answer lasts is a hard technical problem. TiCo (Time-Controllable training) is a two-stage framework that teaches Spoken Dialogue Models (SDMs) how to tell time. By inserting Spoken Time Markers (STMs) into their internal "thought" process, models can follow instructions like "Summarize the news in exactly 30 seconds" far more closely, outperforming even GPT-4-based cascaded systems.

The "Time Blindness" of Modern Speech AI

While Large Language Models (LLMs) have become masters of text, their spoken counterparts (SDMs) remain "time-blind." In text, controlling length is a matter of counting tokens. In speech, duration does not scale linearly with text length. It depends on factors such as:

Phonetic Composition: Some words take longer to say than others despite having fewer characters.
Prosody: Emotional or emphasized speech changes the tempo.
Speaker Variation: Different "voices" have different natural speeds.

Existing models either ignore duration instructions entirely or rely on crude truncation, leading to unnatural, cut-off responses.

Methodology: Teaching AI to "Think" in Seconds

The core insight of TiCo is that time awareness must be an intermediate planning step. The authors introduce a two-stage training pipeline:

Stage 1: Building a Temporal Compass (SFT)

The model is first trained to predict Spoken Time Markers (STMs), which are discretized timestamps (e.g., <5.2 seconds>) interleaved with its text output. To avoid expensive human labeling, the authors use self-generation:

The model generates a response.
An ASR model (Whisper) aligns the text to the audio to obtain timestamps.
The model is fine-tuned to predict these timestamps as if they were part of its natural language.
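The three steps above can be sketched as a small data-preparation routine. This is a minimal sketch, not the authors' implementation: it assumes word-level end times are already available from an aligner, and the function name `interleave_stms`, the `(word, end_time)` input format, and the fixed 2-second marker spacing are all hypothetical choices made for illustration. Only the marker format `<5.2 seconds>` comes from the text above.

```python
from typing import List, Tuple

# Hypothetical input format: (word, end_time_in_seconds) pairs, as could be
# produced by a word-level aligner or a timestamped ASR pass.
WordTiming = Tuple[str, float]


def interleave_stms(words: List[WordTiming], interval: float = 2.0) -> str:
    """Build a fine-tuning target by inserting a marker such as
    <2.1 seconds> once at least `interval` seconds have elapsed since the
    previous marker slot, plus one final marker with the total duration.

    The fixed `interval` is an assumption; the article does not say how
    densely markers are placed.
    """
    pieces: List[str] = []
    next_mark = interval
    last_end = 0.0
    for word, end in words:
        pieces.append(word)
        last_end = end
        if end >= next_mark:
            pieces.append(f"<{end:.1f} seconds>")
            # Advance past `end` so one long word yields a single marker.
            while next_mark <= end:
                next_mark += interval
    final = f"<{last_end:.1f} seconds>"
    # Always close with the total duration, without duplicating a marker.
    if words and pieces[-1] != final:
        pieces.append(final)
    return " ".join(pieces)


if __name__ == "__main__":
    timings = [("Hello", 0.4), ("there", 0.8), ("how", 2.1),
               ("are", 2.4), ("you", 3.0)]
    print(interleave_stms(timings))
    # Hello there how <2.1 seconds> are you <3.0 seconds>
```

Because the markers are plain text, the fine-tuning target is an ordinary string and needs no change to the model's output format.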

Figure 1: The TiCo two-stage framework. Stage 1 focuses on learning time awareness through self-generated STMs; Stage 2 applies RLVR to enforce duration constraints.

Stage 2: Enforcing Constraints with RLVR

Once the model understands the concept of time, Stage 2 uses Group Relative Policy Optimization (GRPO) to reward the model for hitting specific time targets. The "Verifiable Reward" system checks:

Accuracy: Does the final STM match the requested duration?
Monotonicity: Do the timestamps only move forward?
Diversity: Is the model actually changing content rather than just repeating markers?
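The three checks above can be sketched as a rule-based reward, followed by the group-relative normalization that GRPO applies to rewards sampled for the same prompt. This is an illustrative sketch under stated assumptions: the article does not give the reward formula, so the linear accuracy term, the hard zero for monotonicity or diversity violations, and the names `duration_reward` and `group_relative_advantages` are hypothetical.

```python
import re
from statistics import mean, pstdev
from typing import List

# Marker format follows the article's example, e.g. "<5.2 seconds>".
STM_VALUE = re.compile(r"<(\d+(?:\.\d+)?) seconds>")
STM_SPLIT = re.compile(r"<\d+(?:\.\d+)? seconds>")


def duration_reward(response: str, target: float) -> float:
    """Return a reward in [0, 1] for a response containing STMs.

    Assumed design (not from the paper): violations of monotonicity or
    diversity give 0; otherwise the reward decays linearly with the
    relative error between the final marker and the target duration.
    """
    if target <= 0:
        raise ValueError("target duration must be positive")
    stms = [float(v) for v in STM_VALUE.findall(response)]
    if not stms:
        return 0.0
    # Monotonicity: every timestamp must be later than the previous one.
    if any(b <= a for a, b in zip(stms, stms[1:])):
        return 0.0
    # Diversity: each marker needs its own, non-repeated text segment.
    segments = [s.strip() for s in STM_SPLIT.split(response) if s.strip()]
    if len(segments) < len(stms) or len(set(segments)) != len(segments):
        return 0.0
    # Accuracy: compare the final marker with the requested duration.
    return max(0.0, 1.0 - abs(stms[-1] - target) / target)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / std.
    Population std is used here; implementations differ on this detail."""
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    mu = mean(rewards)
    return [(r - mu) / sigma for r in rewards]
```

Since every term is computed from the response string alone, the reward needs no learned judge, which is what makes it "verifiable" in the sense used above.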

Experiments: Precision Under Pressure

The researchers developed TiCo-Bench, sourcing queries from benchmarks like InstructS2S and LIFEBench.

Key Performance Gains:

MAE (Mean Absolute Error): TiCo achieved an error of only 4.54 seconds, compared to 13.01 seconds for the backbone Qwen2.5-Omni model.
Superiority over GPT: Even a cascaded system (GPT-4 generating text + a dedicated TTS) had a higher error (10.41 seconds), suggesting that text-level planning alone isn't enough for speech-level control.
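For reference, MAE here is the average absolute gap between requested and produced durations. The sketch below shows the computation on made-up numbers and derives the relative improvement implied by the two reported figures. The function names and the sample durations are hypothetical; only 13.01 and 4.54 come from the results above.

```python
from typing import Sequence


def mean_absolute_error(targets: Sequence[float],
                        actuals: Sequence[float]) -> float:
    """Average of |actual - target| over all test items, in seconds."""
    if len(targets) != len(actuals) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - t) for t, a in zip(targets, actuals)) / len(targets)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Made-up durations for illustration only (seconds).
    requested = [10.0, 20.0, 30.0]
    produced = [12.0, 17.0, 31.0]
    print(mean_absolute_error(requested, produced))  # 2.0

    # Reported MAE values: backbone 13.01 s, TiCo 4.54 s.
    print(round(relative_reduction(13.01, 4.54), 3))  # 0.651
```

On the reported numbers, TiCo cuts the backbone's duration error by roughly 65%.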

Table 1: TiCo consistently outperforms all baselines (Commercial, Open-Source, and Cascaded) across both short and long duration settings.

Zero-Shot Generalization

One notable finding is generalization. Despite being trained primarily on responses under 40 seconds, TiCo handled target durations up to 1 minute without a significant spike in error. It also worked when queries were provided as text instead of speech, suggesting that the temporal reasoning is "modality-agnostic."

Critical Insight: Why it Works

TiCo works because it doesn't try to "force" the speech synthesizer at the end of the pipeline. Instead, it adjusts the semantic content during the "thinking" phase. If the model sees it has only 5 seconds left but much to say, the learned policy encourages it to summarize or conclude more rapidly, mimicking human conversational adaptability.

Conclusion & Future Directions

TiCo represents a significant step toward functional speech agents. Whether it's a car navigation system that needs to finish a sentence before your next turn, or a medical assistant providing a 10-second summary in an emergency, time-controllability is a non-negotiable feature for real-world deployment.

The authors suggest that future work will focus on even higher precision and integrating this with other controllable behaviors like emotion and multi-turn reasoning.
