Conversations are not just a simple turn-taking exercise. They involve subtle, frequent interjections that signal active listening. This is called back-channeling, and it's an important component of any naturalistic conversation.
How We Back-Channel
When your interlocutor is carrying out their half of the conversation, there are numerous ways to show them that you are paying attention. You can smile and nod your head, or shake it as the context calls for. But you can also indicate verbally that you are following, and this is all the more important in voice-only communication, where visual cues can't help.
Much like the filler words we discussed earlier, back-channel words are semantically vacuous little utterances that serve mainly to affirm and encourage the person talking and to indicate comprehension. They vary across cultures and languages, but in English the familiar ones are uh-huh, mmhm, and yeah. Check out this video of the fascinating and rare use of a sharp intake of breath to accomplish this in Norwegian.
Traditionally, TTS has not done a good job on these, because TTS models have historically been trained on read-speech data. Read speech is basically a monologue, and conversational niceties like back-channeling simply aren't in the data. At Rime, we're changing the way TTS works and can deliver conversationally realistic and appropriate back-channel utterances at the speed of conversation. Here are a few isolated examples:
Who Back-Channels and When
As noted above, there are multiple modalities through which people back-channel, but most relevant to TTS use-cases is the verbal one, both because of the S in TTS and because TTS is often used over audio-only media. Because of this, it's important to look at the use of back-channeling in that sort of context, for example, over the phone.
In a data-rich article from 2015, researchers from the UK and Italy investigated how people indicate that they are paying attention over the phone, by means of back-channeling along with other strategies such as laughter and filler words. They present many interesting results, but I'll note a few of the most relevant.
Women back-channel more than men
The researchers show a clear difference in the rates of back-channeling between men and women in their study. Women tend to employ laughter, back-channeling, and overlapping speech while their interlocutor is speaking, whereas men tend to use fillers (like um) or silence.
This should be of interest to engineers building hyperrealistic AI conversations. For the most true-to-life experience, the text (static or LLM-generated) should vary to some degree between female voices and male voices, as sketched below.
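As a rough illustration, here is a minimal sketch of how a dialogue layer might vary back-channel behavior by voice persona. The token inventories and rates below are illustrative assumptions for the sketch, not figures reported in the study.

```python
import random
from typing import Optional

# Illustrative profiles only: token lists and rates are assumptions,
# not values from the 2015 study.
BACKCHANNEL_PROFILES = {
    "female": {"tokens": ["mmhm", "uh-huh", "yeah"], "rate": 0.30},
    "male": {"tokens": ["um"], "rate": 0.15},  # lower rate, more filler-like
}

def maybe_backchannel(voice_gender: str) -> Optional[str]:
    """Return a back-channel token to speak at this pause, or None to stay silent."""
    profile = BACKCHANNEL_PROFILES[voice_gender]
    if random.random() < profile["rate"]:
        return random.choice(profile["tokens"])
    return None

# Usage: at each pause in the caller's speech, ask the persona layer
# whether to interject before handing the token to TTS.
token = maybe_backchannel("female")
if token:
    print(f"speak: {token}")
```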
Receivers back-channel more than callers
Another clear distinction the authors discovered was that the people being called used back-channeling more readily than the callers. This makes sense: the caller had the impetus to initiate the conversation and presumably has more to say, so the recipient has more opportunity to indicate that the message is being received.
This is relevant for TTS use-cases, which can generally be classified as "outbound" or "inbound" calls. It is more important for inbound TTS systems to employ back-channeling. For example, food ordering is an instance where the human user has the bulk of the message to convey (what they want to order), and the task of the responsive, realistic, and polite TTS system is to indicate that the customer is heard.
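As a sketch of where this could live in an inbound stack, here's a minimal event loop that interjects back-channels during the caller's pauses. The `stream_asr_events` and `play_tts` callables, the event fields, and the timing constant are all hypothetical placeholders for your own ASR and TTS integrations, not a real API.

```python
import random
import time

BACKCHANNELS = ["uh-huh", "mmhm", "yeah"]
MIN_GAP_SECONDS = 4.0  # illustrative: avoid back-channeling too often

def handle_inbound_call(stream_asr_events, play_tts):
    """Interject back-channels while the caller talks, then respond at the end.

    stream_asr_events yields hypothetical events with .type, .speaker, .text;
    play_tts speaks a string. Both are placeholders for real integrations.
    """
    last_backchannel = 0.0
    for event in stream_asr_events():
        if event.type == "pause" and event.speaker == "caller":
            now = time.monotonic()
            if now - last_backchannel > MIN_GAP_SECONDS:
                play_tts(random.choice(BACKCHANNELS))  # signal "you are heard"
                last_backchannel = now
        elif event.type == "final_transcript":
            # Hand the complete order to the dialogue logic for a real reply.
            play_tts(f"Got it: {event.text}. Anything else?")
```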
Forward-Looking Automatic Generation of Back-Channeling
At Rime, we're dedicated to delivering generative conversation products that are powerful and useful here and now, but we're also always thinking ahead to what the future will bring. And there is a future where even more automatic and data-driven cues will be used to make AI conversation delightful and productive.
In fascinating work out of Japan, researchers have noted that back-channeling correlates with the acoustic pitch of the interlocutor. That is, people time their back-channels to notably low-pitched utterances of the person they're talking to. The authors speculate a bit as to why, but they don't reach firm conclusions.
But with the increasing speed and power of AI conversation tech stacks, we see a future where ASR can detect these drops in pitch and respond with an appropriately timed "uh-huh" as a component of true-to-life AI conversation.
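As a rough sketch of what that detection could look like offline, here's one way to flag a speaker's notably low-pitched moments with librosa's pYIN pitch tracker. The percentile threshold and the trigger policy are illustrative assumptions, not the cited researchers' method.

```python
import librosa
import numpy as np

def low_pitch_moments(y, sr, low_percentile=15):
    """Return times (seconds) where voiced pitch falls in the speaker's low range."""
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    times = librosa.times_like(f0, sr=sr)
    # Speaker-relative threshold: the bottom slice of their own pitch range.
    threshold = np.nanpercentile(f0[voiced], low_percentile)
    return times[voiced & (f0 < threshold)]

# Demo on librosa's bundled example speech clip (available in recent librosa
# versions); a live system would instead run incrementally over streaming
# audio and cue an "uh-huh" at these points.
y, sr = librosa.load(librosa.ex("libri1"))
for t in low_pitch_moments(y, sr)[:5]:
    print(f"candidate back-channel point at {t:.2f}s")
```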
Wrapping Up
The conversational AI future is bright, and you can trust Rime to surface easily interpretable facts that can be integrated into your products and drive compelling solutions and value.