True-to-life TTS conversations are only possible with the correct use of filler words. We generally hardly even notice them in spontaneous speech, but their absence or misuse can be extremely glaring...
Uh, let me be clear
Um, uh, it might be a little strange to see words like um and uh in relatively professional writing like this, but in speech these words (along with others such as well, you know, and I mean) are ubiquitous, and speakers constantly depend on them to enact many important social niceties.
In this post we'll sketch a few of their characteristics, and most importantly, offer a rough guide on where to insert these to make text sound way more human-like.
These filler words are often scorned, much like countless other speech nuances that people understand poorly, but they are clear markers of important discourse-level moves that people use to communicate effectively.
Where do ums and uhs go?
Speakers can insert filler words anywhere in an utterance, but certain locations are more prone to have them than others.
The beginning of the utterance
For a variety of reasons, they are most likely to be found at the beginning of utterances. See this classic work by Donald Boomer on the subject. An utterance-initial filler functions as a sort of preparation for what the speaker is about to say, indicating to the listener that the speaker is going to take this conversational turn. Note that the effect differs subtly depending on whether one or two initial filler words are used.
After the first word of a phrase (which is then often repeated)
Stanford psycholinguist Herb Clark has done really fascinating work on the placement of these fillers. Another interesting and relevant location for them is similarly near the front of an utterance: in their 1998 article "Repeating Words in Spontaneous Speech", he and fellow Stanford linguist Tom Wasow explore an interesting effect that these words can have on the environments they arise in.
The main upshot is that when a filler word follows the first word in an utterance or phrase, that first word is often repeated, especially closer to the front of the utterance.
In the chart below we see the rates of repetition for the word the. Topics and Subjects are usually the first phrases of an utterance and direct objects and objects of prepositions come later. Similar facts hold for other words like a and I and so on.
Before infrequent, long, or complex words or phrases
Note in the above chart that repetitions of this sort are more likely when the following phrase is complex, but length and frequency can also indicate likely places for filler words. In this wide-ranging and fascinating paper, Sharon Goldwater, Dan Jurafsky, and Chris Manning show, among many other things, that word frequency correlates with disfluencies, of which filler words are prime examples.
Basically, filler words are more likely, and thus more natural, before an infrequent word like timorous than before a more frequent one like scared. The effect is that the um before timorous sounds relatively natural and neutral, whereas the um before scared is marked and rhetorical in effect.
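As a rough illustration of the frequency effect, here is a minimal Python sketch that places a filler before words falling below a frequency threshold. The function name, the threshold, and the frequency table are all hypothetical, and the numbers in the table are made-up placeholders rather than real corpus counts; in practice you would plug in frequencies from a corpus of your choosing.

```python
# Toy frequency table (occurrences per million words).
# These values are invented placeholders, NOT real corpus statistics.
TOY_FREQ_PER_MILLION = {
    "she": 3000.0,
    "felt": 200.0,
    "scared": 40.0,
    "timorous": 0.1,
}


def add_filler_before_rare_words(text, freq=TOY_FREQ_PER_MILLION,
                                 threshold=1.0, filler="um,"):
    """Insert `filler` before each word whose frequency is below `threshold`.

    Words missing from the table are left alone, so the sketch never
    guesses about words it has no frequency for.
    """
    out = []
    for token in text.split():
        # Strip surrounding punctuation and case before the lookup.
        key = token.strip(".,!?;:").lower()
        if key in freq and freq[key] < threshold:
            out.append(filler)
        out.append(token)
    return " ".join(out)


print(add_filler_before_rare_words("She felt timorous."))
print(add_filler_before_rare_words("She felt scared."))
```

With the placeholder table, only the sentence containing timorous receives a filler. Whether a given TTS engine pronounces the inserted "um," naturally is a separate question and will vary by engine.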
What do ums and uhs do?
There are many reasons people use filler words. On reflection, I'm sure the reader can confirm these and come up with more of their own, but here are a few important ones:
It's my turn to talk
Most relevant to TTS conversation is discourse-centric turn-taking. Filler words can indicate that the speaker wants to "hold the floor" in the conversation or, relatedly, to start their turn. Basically, the speaker is just making a semantically bleached vocalization (uhhhh) to indicate that they are about to start their turn. Using these allows the speaker a few precious milliseconds to plan and start producing their utterance.
Using TTS for conversation requires taking part in these subtle turn-taking cues naturalistically.
What I'm about to say is weighty
As noted above, filler words often show up in front of infrequent words. This can be due to lexical retrieval difficulties (the speaker just needs to try harder to find the word) or because the speaker wants the listener to pay extra attention to the forthcoming word, either because it's infrequent or for rhetorical effects of various types.
I'm not sure of myself
The other main motivation (and one of the reasons filler words are derided) is to indicate that the speaker is nervous or not confident about what they are about to say. This can be a nearly physiological response, but more interesting is its use as a rhetorical distancing device for the speaker.
Who uses them?
Well, everyone. But there are a number of subtleties and differences. This post from the Language Log goes into fascinating detail on the differences between men's and women's use of um and uh. Upshot: women use um a little more than men; men use uh vastly more than women.
But filler words in general are used at mostly the same rates across genders and ages, according to a study by Charlyn Laserna, Yi-Tai Seih, and James Pennebaker, as indicated in the chart below.
Lastly, there is ample cross-linguistic variation as well. We've only scratched the surface here with English. This work compares English, Spanish, and French and is a good place to start for investigating the variations across languages.
Wrapping up
The particulars of speech are extremely nuanced, and getting things wrong is readily noticeable. TTS is particularly prone to running afoul of these norms when these nuances are not indicated in text. This is the case with filler words, which are more creatures of speech than text. For this reason it's all the more important to be consciously aware of how they function when TTS is used to embody conversational norms.
But with a few simple heuristics, conversational AI can be made much more life-like. When creating text for conversational TTS consider adding filler words:
- At the beginning of your text utterances
- Between repetitions of small, functional words like the, a, I, etc.
- Before infrequent, long, or complex words or phrases
- Before any word you want to have a particular rhetorical effect
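The first two heuristics in the list above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small function-word set are hypothetical and do not come from any TTS library, and the choice of "Um," versus "uh," is arbitrary.

```python
# Small, illustrative set of function words that often get repeated.
FUNCTION_WORDS = {"the", "a", "i"}


def add_initial_filler(utterance, filler="Um,"):
    """Put a filler at the very beginning of the utterance.

    Casing of the original first word is left untouched.
    """
    stripped = utterance.lstrip()
    if not stripped:
        return utterance
    return f"{filler} {stripped}"


def repeat_first_function_word(utterance, filler="uh,"):
    """If the utterance opens with a function word, follow it with a
    filler and then repeat the word ("The uh, the ...")."""
    words = utterance.split()
    if not words or words[0].lower() not in FUNCTION_WORDS:
        return utterance
    first = words[0]
    # Keep the pronoun "I" capitalized; lowercase any other repeated word.
    repeat = first if first == "I" else first.lower()
    return " ".join([first, filler, repeat] + words[1:])


print(add_initial_filler("Let me check."))
print(repeat_first_function_word("The report is late."))
print(repeat_first_function_word("I think so."))
```

Applying either function to every utterance would quickly sound mechanical, so in practice you would likely apply them selectively, for example only on some turns or only where a hesitation fits the content.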
And check back here for future explorations of language nuances that are often missed in text(-to-speech).