So-called AI hallucinations are real shortcomings of much recent generative AI, and text-to-speech is no exception. However, Rime will never hallucinate. Here we investigate TTS model architectures and how they navigate between sense and delusion...
What Is a TTS Hallucination?
There are plenty of implicit assumptions that lie behind the use of the word 'hallucination' for AI-related phenomena. So before diving into the TTS version of them, it might be helpful to lay out a few preliminaries.
If we take the traditional meaning of the term as it applies to humans, a hallucination refers to sensing things that "aren't there" in reality. When it comes to AI, this interpretation of the word finds ready analogues in classification, recognition, and text-generation systems. For example, the model says there's a dog in the picture when there actually isn't one; the model thinks the tomato is an apple, etc. This view of hallucinations takes the stance of a 3rd-party objective assessment: the system reports X, but reality says Y.
But for certain types of generative AI, the use of the term hallucination is novel and much more interesting. That objective 3rd-party stance is turned inward, and the output of the system is implicitly analogized to 1st-person human perception. Generative AI systems produce images or video that we perceive with our familiar human senses, and when they present things that don't correspond to reality, the effect is phenomenologically similar to when we as humans hallucinate. In short, certain generative AI hallucinations are basically inducing hallucinations in humans.
When it comes to text-to-speech hallucinations, these are an interesting mix of the two notions above. On the one hand, when the speech output of a TTS system deviates from the input text, that's effectively a misrecognition of the stipulated text "reality". On the other hand, the auditory experience mimics 1st-person human mis-perception in the same way a deviant image does.
The use of the term hallucination for any sort of AI failure might be a little tendentious and self-congratulating, but this is roughly the lay of the land. Next, let's look into some instances of TTS hallucinations and ways to avoid them.
When Speech Goes Wrong
The standard approach to TTS is to employ autoregressive models. These models predict each future output value from their own previous outputs, coupled with a stochastic component. It's this stochastic component in particular that makes each generation of the model different from the last, and also makes the output susceptible to ever-increasing deviation from what is sensible to humans.
For TTS, there is generally an alignment between the text and the speech output, but as the speech audio is being generated, the stochastic nature of the process can lead to increasingly large deviations or misalignments from the input text. As such, longer sentences can lead to some pretty striking "hallucinations".
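To make the intuition concrete, here is a toy sketch in Python. It is not a TTS model: it assumes the simplest possible autoregressive process, where each output is just the previous output plus Gaussian noise, and the function names and parameters are ours, chosen for illustration. Real alignment errors in speech models are not a simple random walk, but the sketch shows why feeding noisy output back into the next prediction tends to make the average drift grow with sequence length.

```python
import random


def autoregressive_drift(num_steps, noise_scale=1.0, seed=None):
    """Toy autoregressive process (an illustrative assumption, not a TTS model).

    Each output is the previous output plus Gaussian noise. The "target"
    is 0.0, so the absolute value of each output is its deviation.
    Returns the list of absolute deviations, one per step.
    """
    rng = random.Random(seed)
    value = 0.0
    deviations = []
    for _ in range(num_steps):
        # The next value depends on the model's own previous output
        # plus a stochastic component.
        value = value + rng.gauss(0.0, noise_scale)
        deviations.append(abs(value))
    return deviations


def mean_final_deviation(num_steps, num_runs=2000, noise_scale=1.0):
    """Average the final deviation over many independently seeded runs."""
    total = 0.0
    for run in range(num_runs):
        total += autoregressive_drift(num_steps, noise_scale, seed=run)[-1]
    return total / num_runs


if __name__ == "__main__":
    for length in (10, 50, 200):
        print(length, round(mean_final_deviation(length), 2))
```

Running the script prints the average final deviation for short, medium, and long sequences. In this toy setup the average grows roughly with the square root of the sequence length, which loosely mirrors why longer sentences give an autoregressive model more room to wander.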
Additionally, many TTS models are built upon scraped data annotated by speech-recognition models, which are themselves prone to hallucinations (for some reason, they seem especially prone to hallucinating the phrase 'please like and subscribe'. Hmmm.). This only compounds the propensity for models to hallucinate.
Altogether, this makes for an environment where hallucinations are a real danger.
For the text below, an autoregressive model can output the following disaster:
"I really hope that what we're talking about here extends, without harming any creature, to all the animals that we've seen so far, including dogs and cats and rats and frogs and gnats."
In the above sample, we hear some common types of TTS hallucinations: the repetition of words or phrases, and the invention of words that are phonetically similar to the target words. Here, we hear the repetition of 'and rats' and the mispronunciation of 'frogs' and 'gnats'. This might have something to do with the coordinations ('and') in the sentence, but beyond the length of the sentence, the causes are relatively opaque. At the end of the clip we can also hear a little blip of superfluous vocalization not indicated in the input text.
Take a listen to another sentence, with hallucinations from two different voices. The first includes not just repetitions and words that aren't in the input text, but also a sort of glitch sound in the middle. In the second example, the sentence is radically altered and just cuts out early.
"Commercial viability at scale demands certain things of text to speech, in particular the ability to remain faithful to every single word, every 'this' and every 'that', every 'but' and every 'and'."
This is a deep problem for any use of TTS for business purposes.
A Standard Approach to Avoid Hallucinations
The standard (hacky) way to avoid these wacky hallucinations is to artificially chop up the input text, generate speech for the separate smaller bits, and then stitch them back together. This nearly eliminates hallucinations of the sort we see above by making the generated clips too short for the autoregressive model's propensity to deviate from the text to arise.
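As a rough sketch of what this chop-and-stitch workaround looks like, consider the Python below. Everything in it is an assumption for illustration: `chunk_text`, `fake_synthesize`, and `synthesize_in_chunks` are hypothetical names, the splitting rule (break at punctuation or after a fixed number of words) is just one plausible choice, and `fake_synthesize` returns silent placeholder samples rather than calling any real TTS system.

```python
def chunk_text(text, max_words=8):
    """Greedily split text into short chunks.

    A chunk ends at a word carrying trailing punctuation, or once it
    reaches max_words, whichever comes first.
    """
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        at_boundary = word[-1] in ",;:.!?"
        if at_boundary or len(current) >= max_words:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


def fake_synthesize(chunk, samples_per_word=4):
    """Hypothetical stand-in for a TTS call: returns silent 'audio' samples."""
    return [0.0] * (samples_per_word * len(chunk.split()))


def synthesize_in_chunks(text, synthesize=fake_synthesize, max_words=8):
    """Synthesize each chunk independently, then stitch by concatenation."""
    audio = []
    for chunk in chunk_text(text, max_words):
        audio.extend(synthesize(chunk))
    return audio


if __name__ == "__main__":
    sentence = (
        "This breeze, which has traveled from the regions towards which "
        "I am advancing gives me a foretaste of those icy climes."
    )
    for chunk in chunk_text(sentence):
        print(repr(chunk))
    print(len(synthesize_in_chunks(sentence)), "samples")
```

Note that each chunk is synthesized with no knowledge of its neighbors, and the stitching step is plain concatenation. Nothing in this scheme carries intonation or rhythm across a chunk boundary.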
However, by breaking up the sentence and generating its parts separately, this method ends up introducing new types of errors (or hallucinations, if you will). Note the sentence below, which was generated by an autoregressive model that breaks sentences up this way. The break is extremely obvious, and the prosody is like that of two unrelated sentences. This results in the awkward audio we hear here:
"This breeze, which has traveled from the regions towards which I am advancing gives me a foretaste of those icy climes."
This is obviously not acceptable for business use cases, so we seem to be a little bit between a rock and a hard place. It also indicates the declining utility of the term "hallucination", which seems, if applicable here, to mostly just be a euphemism for error or failure.
The Rime Difference: Voices you can Trust
At Rime, we have developed means to avoid both of these problems and offer models that will never generate hallucinations. How? Well, that's our business. But as parting words, just let us say the following: