How fast is fast when it comes to TTS?
When discussing responsiveness among the team, with customers, or with various vendors, a common metric we use is "time to first byte" (TTFB). In the wild, TTFB is used to describe various related things, but it is generally defined as the time from request initiation (including connection time and the SSL handshake) to the first byte of the server's response. This image from a Cloudflare blog post breaks down the various components.
As mentioned earlier, adherence to that definition varies. Some usages only cover the time between sending the request content (after the connection is established) and receiving the first byte. Baseten, for example, uses TTFB in its metrics dashboard to mean the time between receiving a request and responding with the first byte.
In context, it makes sense for Baseten to measure it that way; they don't have access to any of the client connection information anyway. This usage illustrates how TTFB ends up being highly contextual: the simple term can easily be conflated with several different measures, and TTFB may not even be the best measure, depending on that context.
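To make the broad definition above concrete, here's a minimal sketch using only Python's standard library. The clock starts before the TCP connection is opened, so connection setup falls inside the measured window (for HTTPS, the TLS handshake would too); the host, port, and path are placeholders, not any vendor's actual endpoint:

```python
import socket
import time

def measure_ttfb(host: str, port: int, path: str = "/") -> float:
    """Time from request initiation to the first byte of the response.

    Follows the broad definition of TTFB: the timer starts before the
    TCP connection is opened, so connection setup is included.
    """
    start = time.monotonic()
    with socket.create_connection((host, port)) as sock:
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode())
        sock.recv(1)  # block until the first response byte arrives
    return time.monotonic() - start
```

The narrower usages described above would instead start the timer just before `sendall`, excluding connection setup entirely.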
In the world of web development, an alternative measure of perceived responsiveness is "first contentful paint" (FCP). FCP measures the time until the first piece of content is rendered in the browser. To me, as a person who browses the web regularly, FCP is a more tangible measure than TTFB. That first byte won't change my blank browser tab into something I can look at; if anything, it's just a byte from a header, meaningful to my browser but not to me.
Given the context of Rime and its usage, it's worth considering what the appropriate measurement is for us.
A Useful Measure For Audio
The example of "first contentful paint" is illustrative, since it provides a useful analogy. If FCP is a metric for the time to a viewable response, even an incomplete one (a different metric, "largest contentful paint," covers the time to the most significant render), then we're looking for a metric that captures the time to a listenable response.
Rime, like other companies that provide text-to-speech, offers APIs that respond with an immediately playable audio stream (after all of the headers). Rather than measure the time to that first header byte, what if we stop the timer when we see the first byte of audio?
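One way to sketch that measurement: consume the response as a stream of byte chunks and stop the clock once we're past the audio container's header. The sketch below assumes a plain PCM WAV response with the standard 44-byte header; real responses vary (extra RIFF chunks, raw PCM, MP3 frames), so the header size is an assumption, and `chunks` stands in for whatever streaming iterator an HTTP client provides:

```python
import time
from typing import Iterable, Optional

# Standard RIFF/WAVE header for plain PCM. An assumption for illustration;
# real vendor responses may carry extra chunks or use a different container.
WAV_HEADER_SIZE = 44

def time_to_first_audio_byte(chunks: Iterable[bytes],
                             start: Optional[float] = None,
                             header_size: int = WAV_HEADER_SIZE) -> float:
    """Stop the timer at the first byte *after* the container header,
    rather than at the first response byte."""
    if start is None:
        start = time.monotonic()
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        if seen > header_size:  # this chunk holds the first audio sample
            return time.monotonic() - start
    raise ValueError("stream ended before any audio bytes arrived")
```

In practice, `start` would be captured just before the request is initiated, so connection setup and header delivery all count against the measured time, exactly as a listener would experience it.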
Methodology
To do this comparison, we created a corpus of 100 utterances, ranging from short one-word sentences ("Please.") to longer multi-sentence expressions ("Totally! Glad you see where I'm coming from. So, uh, let's craft a new one with that in mind. Maybe mix in some of your hobbies or a memorable date with symbols and numbers."). For each vendor, we measured the response time to the first byte of audio for each utterance.
Results
(For guidance on how to read box plots, see the Wikipedia article.)
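As a quick reference, a box plot encodes a distribution's five-number summary. Here's a small sketch that computes those five values from a list of latency samples; whisker and outlier conventions (commonly 1.5 × IQR) vary by plotting library, but these five numbers are the core of what each box shows:

```python
import statistics

def five_number_summary(samples):
    """The values a box plot draws: minimum, first quartile, median,
    third quartile, and maximum."""
    ordered = sorted(samples)
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, med, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, med, q3, ordered[-1]
```

A lower median means a typically faster response, while a tighter box (smaller interquartile range) means more consistent latency from request to request.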
Conclusion
First of all, Deepgram is fast, and they deserve credit for that. Their performance is a prompt for us to find even more ways to reduce latency. We have experimental features, like websocket support, that we've explored for additional gains, and we're excited to keep refining our architecture to bring that time down further.
It's also worth mentioning that there's an inherent, but not zero-sum, tension between voice quality and responsiveness. During these tests, we also captured the corresponding audio, and when comparing Rime to Deepgram and other fast vendors, we were pleased with the quality and precision of Rime's TTS.
There's always room to lower latency by simplifying the text-to-speech model, but that's not the race we're interested in running right now. At the moment, we're pleased to be the fastest high-quality TTS provider, and we're pursuing ways to further improve both the speed and quality of our offerings.
See you out there.