Introducing Mist: Next-gen Conversational Voice Synthesis

February 28, 2024 | Launch, Conversational Voice

Today marks the launch of Rime's new, next-gen conversational voice model, Mist. With this launch, true conversational synthetic speech is at hand and enterprise-ready.

And now there came both mist and snow,
And it grew wondrous cold:
And ice, mast-high, came floating by,
As green as emerald.

The Rime of the Ancient Mariner, Samuel Taylor Coleridge, 1798

The Basics

  • Powered by LLMs, Voice UI is at its sharpest inflection point in history.
  • Rime is releasing Mist, the first TTS model to reproduce the genre-specific characteristics of voice, served via API at sub-200ms latencies for most input sequences.
  • Mist is available via API today.

The world of synthetic voice is taking its next evolutionary step forward with Mist. Not only does Mist offer ultra-fast, human-speed latency for large-scale voice-interaction systems, it does so with truly realistic conversational voices in a demographically diverse array of accents and voice types. Take a listen:

Conversational TTS Sample One
Conversational TTS Sample Two

Lowest Latency on the Market

In real-life human conversation, the gaps between speaking turns are usually less than 400 milliseconds. For a synthetic voice to sound natural in conversation, low latency is an absolute must. Mist offers the lowest latency on the market, with sub-200ms response times for many prompts, and we're driving it down week after week.

We are also working with a number of trusted partners to deliver bespoke compute solutions that provide sub-100ms latency. Contact us for more details if this sounds like something your team would be interested in.
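
To check the latency figures in this section against your own setup, one option is to time how long a streaming voice takes to return its first chunk of audio and compare that to the roughly 400 ms turn gap. The sketch below is only an illustration: `fake_synthesize` is a hypothetical stand-in for a streaming TTS client and is not Rime's API, and the 50 ms delay inside it is made up.

```python
import time
from typing import Callable, Iterable, Iterator

# Typical gap between speaking turns mentioned in this section.
TURN_GAP_BUDGET_MS = 400.0


def time_to_first_chunk_ms(
    synthesize: Callable[[str], Iterable[bytes]], text: str
) -> float:
    """Return the milliseconds elapsed until the first audio chunk arrives."""
    start = time.perf_counter()
    stream = iter(synthesize(text))
    next(stream)  # blocks until the first chunk is produced
    return (time.perf_counter() - start) * 1000.0


def fits_turn_gap(latency_ms: float, budget_ms: float = TURN_GAP_BUDGET_MS) -> bool:
    """True if the measured latency is shorter than the turn-gap budget."""
    return latency_ms < budget_ms


def fake_synthesize(text: str) -> Iterator[bytes]:
    """Hypothetical streaming TTS stand-in (assumption, not a real client)."""
    time.sleep(0.05)  # simulated model and network delay
    for _ in text.split() or [""]:
        yield b"\x00" * 320  # placeholder audio bytes


if __name__ == "__main__":
    latency = time_to_first_chunk_ms(fake_synthesize, "Hello there")
    print(f"first audio after {latency:.0f} ms; fits turn gap: {fits_turn_gap(latency)}")
```

Swapping `fake_synthesize` for a call to your real client gives a time-to-first-audio measurement that includes your own network path, which is the number that matters in a live conversation.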

True-to-life Conversational Style

Rime has collected the largest proprietary dataset of conversational speech in the world, and it's growing rapidly. This data allows us to create modeling tech that is both hyperrealistic and enterprise-ready.

Filler words like 'um' and 'like', as well as backchanneling affirmations like 'mhmm' and 'uh-huh', can be the difference between a synthetic voice seeming real and seeming uncanny. At Rime, we use our decades of linguistics expertise to sculpt voice offerings that sound the way real humans do.

Our unique data also allows us to imbue Mist voices with true-to-life breaths and false-start sounds, the way we all subconsciously speak. Keep a lookout for updates on new product features for toggling these on or off.

A General Philosophy of Voice

At Rime, we think the future of voice technology is not just scraping audiobook data and reproducing canned-sounding audio, but rather nuanced, curated voices that reflect the specific and unique ways that real people talk. Although with Mist we can clone your trusted voices and generate voice-actor-style speech, the real gains in Conversational AI are going to be found in replicating the subtleties and characteristics of the wide variety of actual people.

This means offering voices across the entire range of accents as well as demographics such as age, gender, ethnicity, sexuality, and so on. As part of our Mist launch, we will be periodically rolling out new voices and filling out our demographically and stylistically diverse roster. Along with our cloning abilities, these will allow you to tailor your voice to your particular task. Check our dashboard, where you can filter off-the-shelf voices for your particular use case.

Wrapping Up

Mist is the next step in the evolution of synthetic speech. We're confident that the combination of super-low latency and true-to-life conversational style is the perfect fit for countless voice tech applications. Contact us to join in on the tens of thousands of daily calls powered by Rime!