How Fast is your TTS?

Latency and the Speed of Conversation

March 13, 2024 | Latency, Conversation

When we speak, we do incredible things incredibly quickly. We are attuned to extremely subtle differences in the timing of speech, and if something is too slow, we notice immediately. For conversational AI and user experience, speed is just as important as the quality of the speech.

Rime offers best-in-class latency. With a time to first byte (TTFB) of ~175ms (and sub-100ms for our enterprise customers), Rime outpaces our competitors, who are closer to the ~500ms range.

The above plot shows both time to first byte and time to last byte speeds for the following request:

curl -X POST http://users.rime.ai/v1/rime-tts \
  -H 'Authorization: Bearer YOUR_KEY' \
  -H 'Accept: audio/x-mulaw' \
  -d '{"text": "Could you give me the serial number? Okay, that sounds great.", "speaker": "eva", "samplingRate": 8000, "reduceLatency": "True", "modelId": "mist"}'
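If you want to reproduce this kind of measurement yourself, the sketch below shows one way to time a streamed response in Python. It is a minimal illustration, not Rime's benchmarking code: `measure_stream`, `StreamTiming`, and `rime_stream` are hypothetical helper names, the 1024-byte chunk size is an arbitrary choice, and the endpoint, headers, and JSON fields are simply copied from the request above. Numbers measured this way include your own network latency.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamTiming(NamedTuple):
    ttfb_ms: float  # request start -> first non-empty chunk
    ttlb_ms: float  # request start -> end of stream
    num_bytes: int  # total audio bytes received


def measure_stream(
    open_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Time a byte stream. The clock starts before the stream is opened,
    so connection setup and server-side inference are both included."""
    start = clock()
    first = None
    total = 0
    for chunk in open_stream():
        if not chunk:
            continue  # ignore empty reads
        if first is None:
            first = clock()  # first audio byte has arrived
        total += len(chunk)
    end = clock()
    if first is None:
        raise ValueError("stream produced no audio bytes")
    return StreamTiming((first - start) * 1000.0, (end - start) * 1000.0, total)


def rime_stream(api_key: str, chunk_size: int = 1024) -> Iterator[bytes]:
    """Yield audio chunks for the request shown above.
    URL, headers, and body fields are taken verbatim from the curl command."""
    body = json.dumps({
        "text": "Could you give me the serial number? Okay, that sounds great.",
        "speaker": "eva",
        "samplingRate": 8000,
        "reduceLatency": "True",
        "modelId": "mist",
    }).encode("utf-8")
    request = urllib.request.Request(
        "http://users.rime.ai/v1/rime-tts",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Accept": "audio/x-mulaw"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

Calling `measure_stream(lambda: rime_stream("YOUR_KEY"))` returns the TTFB, the time to last byte, and the byte count for one request. Averaging over several runs gives a steadier picture than a single call.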

And that's just the TTS component! ASR, text generation, network latency, and other factors only add to the overall latency. With Rime's TTS, the resulting interspeaker latency is like the following:

rime latency

Our competitors' latency is palpably higher, and the difference is clear in the example below:

competitor's latency

Latency in Human Conversation

Though it can vary across languages, cultures, and speakers, the time between one person finishing speaking and another starting in a conversation is surprisingly short. On average, it is only ~200 milliseconds. Here's a chart from a recent paper measuring the average conversational pause length across a variety of languages (English is represented as En and the numbers are in ms).

Average Turn-taking Pause Duration.

Quick aside: Cognitive scientists and linguists don't yet fully understand how turn-taking happens so quickly; see this review for more. Planning to utter a single word can take around 600ms! Yet somehow we manage this on the fly and keep our own personal human latency far below that.
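To put the ~200ms turn-taking average next to the TTFB figures quoted earlier in this post, here is a small back-of-the-envelope sketch. Treating the human average as a budget for a voice agent is an assumption made purely for illustration, and `headroom_ms` is a hypothetical helper name.

```python
HUMAN_GAP_MS = 200  # average turn-taking gap cited in this post


def headroom_ms(tts_ttfb_ms: int, gap_ms: int = HUMAN_GAP_MS) -> int:
    """Milliseconds left over for ASR, text generation, and network once
    the TTS time to first byte is subtracted from the conversational gap.
    A negative value means TTS alone already exceeds the gap."""
    return gap_ms - tts_ttfb_ms


# TTFB figures taken from the comparison earlier in this post.
for label, ttfb in [("~175ms TTFB", 175), ("~500ms TTFB", 500)]:
    print(f"{label}: {headroom_ms(ttfb):+d} ms relative to a {HUMAN_GAP_MS} ms gap")
```

At ~175ms the TTS step alone still fits inside the average human gap, while at ~500ms it overshoots the gap before any other component is counted.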

Rime's take on Latency

There are two main contributors to latency for TTS services:

  1. Network latency
  2. Inference time

This is why Rime, in parallel with developing synthetic voices capable of the full range of truly conversational behaviors, has focused so heavily on inference time for our TTS API. We approach this both from a modeling perspective (how fast can the model generate speech?) and from an engineering perspective (how can we support and scale a large amount of compute globally to remove bottlenecks at the server and network level?).

Ask us about our enterprise solution for driving down latency with no reduction in reliability, and reach out to learn about on-premises and on-device deployments.

And keep an eye on this space as we pull further ahead in the push to diminish latency. Our goal is to deliver the highest-quality TTS at near-0ms latency!