How to Change your Accent

A person's accent is one of their most personal and unique characteristics. At Rime, we think no one should have to change theirs. But synthetic people, on the other hand...

We've split the atom! (of timbre and accent)

Here in the Rime lab, we've developed the ability to swap out a speaker's original accent and replace it with a different one.

Audio: Baseline accent

Audio: Transferred Texas accent

This is a game-changer. Let's look into the details.

What Goes Into How You Speak

The way that an individual sounds when they speak is determined by two major factors:

The shape of their vocal tract, which contributes to their personally identifiable timbre.
The particular phonological rules and sound inventory of their idiolect, which contribute to their accent.

However, teasing these two apart in generative speech synthesis is pretty tricky. When a person speaks, the output is a complex sound wave, which comprises the phonetic particularities of both their timbre and accent. Seen below is the spectrogram of the baseline audio above. The way the synthetic speaker pronounces his words depends on both components, and it's not (currently) possible to isolate them.

"it's easier than you might expect to change your accent."

But note the blue line in the image. It traces the pitch of the speech, which is one fair proxy for the speaker's unique timbre.
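The post does not say which tool produced the pitch track. As a rough illustration of what a pitch measurement involves, here is a minimal sketch that estimates the fundamental frequency of a synthetic 98 Hz tone by autocorrelation. The function name, sample rate, and search range are assumptions made for this example, and real speech would need framing and voicing detection on top of this.

```python
import math

def estimate_pitch_hz(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of a signal by autocorrelation.

    Hypothetical helper for illustration; not the method used in the post.
    """
    n = len(samples)
    min_lag = int(sample_rate / fmax)  # shortest period considered, in samples
    max_lag = int(sample_rate / fmin)  # longest period considered, in samples
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Average product of the signal with a copy of itself shifted by `lag`.
        overlap = n - lag
        score = sum(samples[i] * samples[i + lag] for i in range(overlap)) / overlap
        if score > best_score:
            best_lag, best_score = lag, score
    # The lag with the strongest self-similarity is the period in samples.
    return sample_rate / best_lag

# Synthetic stand-in for a voiced sound: a pure 98 Hz tone, half a second long.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 98.0 * i / SAMPLE_RATE) for i in range(SAMPLE_RATE // 2)]

estimated = estimate_pitch_hz(tone, SAMPLE_RATE)
print(f"estimated pitch: {estimated:.1f} Hz")
```

Because lags are whole numbers of samples, the estimate lands within a couple of Hz of the true value rather than exactly on it. Practical pitch trackers interpolate around the peak and average over many short frames.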

Another good proxy for timbre is the fourth formant, the topmost thick black band in the above image. We can measure that formant for each transferred accent and compare the values across accents.

With that said, simply listening to the audio may be the best way to judge how well individual timbre is retained across accents.

The average pitch of the baseline audio above is 98 Hz, and that of the Texan-accented version of the same speaker is a very similar 100 Hz. The average fourth formant is 3731 Hz for the baseline and 3745 Hz for the Texan version, again extremely close. Let's look at some more below.
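To put "very similar" and "extremely close" into numbers, the short sketch below expresses each Texan measurement as a percentage difference from the baseline. The helper name is made up for this example; the four figures are the ones quoted above.

```python
def relative_difference_pct(value, baseline):
    """Absolute difference from the baseline, as a percentage of the baseline."""
    return abs(value - baseline) / baseline * 100.0

# Figures quoted in the paragraph above (Hz).
pitch_diff = relative_difference_pct(100, 98)
f4_diff = relative_difference_pct(3745, 3731)

print(f"pitch: {pitch_diff:.2f}% from baseline")  # pitch: 2.04% from baseline
print(f"F4:    {f4_diff:.2f}% from baseline")     # F4:    0.38% from baseline
```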

A Range of Accents for One Speaker

Below, we've taken that same speaker (whose baseline is a fairly standard Californian accent) and transferred other accents onto his voice. Of course, this process can go in any direction: from Texan to Australian, from Boston to AAVE. For each, I've included the average pitch (p) and average fourth formant (F4) measurements, both measured in Hz.

Audio: baseline, p=98, F4=3731
Audio: Texas, p=100, F4=3745
Audio: Australian, p=100, F4=3721
Audio: Boston, p=102, F4=3754
Audio: AAVE, p=99, F4=3769
Audio: Indian, p=105, F4=3738
Audio: British, p=106, F4=3773
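One way to summarize how tightly these measurements cluster is to compute the mean and range of each across the seven versions. The sketch below does that with the numbers listed above; the `spread` helper is a name invented for this example, and the choice of range as the summary statistic is an assumption, not something the measurements themselves dictate.

```python
from statistics import mean

# (accent, average pitch in Hz, average F4 in Hz), copied from the list above.
MALE_SPEAKER = [
    ("baseline", 98, 3731),
    ("Texas", 100, 3745),
    ("Australian", 100, 3721),
    ("Boston", 102, 3754),
    ("AAVE", 99, 3769),
    ("Indian", 105, 3738),
    ("British", 106, 3773),
]

def spread(values):
    """Return (mean, max - min, range as a percentage of the mean)."""
    avg = mean(values)
    value_range = max(values) - min(values)
    return avg, value_range, value_range / avg * 100.0

pitch_mean, pitch_range, pitch_range_pct = spread([p for _, p, _ in MALE_SPEAKER])
f4_mean, f4_range, f4_range_pct = spread([f4 for _, _, f4 in MALE_SPEAKER])

print(f"pitch: mean {pitch_mean:.1f} Hz, range {pitch_range} Hz ({pitch_range_pct:.1f}% of mean)")
print(f"F4:    mean {f4_mean:.1f} Hz, range {f4_range} Hz ({f4_range_pct:.1f}% of mean)")
```

Across all seven versions the pitch stays within an 8 Hz window and F4 within a 52 Hz window.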

Another Speaker

Of course, the same accent transfer can be done with female speakers as well. The main difference between male and female speakers again lies in pitch and the fourth formant.

Audio: baseline, p=192, F4=4078
Audio: Texas, p=197, F4=4029
Audio: Indian, p=200, F4=4089
Audio: British, p=190, F4=4096
Audio: Southern AAVE, p=184, F4=4085
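The numbers for the two speakers can be compared directly. The sketch below, using only the measurements listed in this post, expresses the gap between the two baselines (in semitones for pitch, as a ratio for F4) and checks whether any accent pushes one speaker's values into the other speaker's range. The helper names are hypothetical, and measuring the pitch gap in semitones is a choice made for this example.

```python
import math

# accent -> (average pitch in Hz, average F4 in Hz), copied from the lists in this post.
MALE = {
    "baseline": (98, 3731), "Texas": (100, 3745), "Australian": (100, 3721),
    "Boston": (102, 3754), "AAVE": (99, 3769), "Indian": (105, 3738),
    "British": (106, 3773),
}
FEMALE = {
    "baseline": (192, 4078), "Texas": (197, 4029), "Indian": (200, 4089),
    "British": (190, 4096), "Southern AAVE": (184, 4085),
}

def semitones(f_high, f_low):
    """Musical interval between two frequencies, in semitones."""
    return 12.0 * math.log2(f_high / f_low)

def value_range(speaker, index):
    """(min, max) of one measurement (0 = pitch, 1 = F4) across a speaker's accents."""
    values = [measurement[index] for measurement in speaker.values()]
    return min(values), max(values)

baseline_pitch_gap = semitones(FEMALE["baseline"][0], MALE["baseline"][0])
baseline_f4_ratio = FEMALE["baseline"][1] / MALE["baseline"][1]

male_pitch, female_pitch = value_range(MALE, 0), value_range(FEMALE, 0)
male_f4, female_f4 = value_range(MALE, 1), value_range(FEMALE, 1)

# True if any male value reaches into the female range for either measurement.
ranges_overlap = male_pitch[1] >= female_pitch[0] or male_f4[1] >= female_f4[0]

print(f"baseline pitch gap: {baseline_pitch_gap:.1f} semitones")
print(f"baseline F4 ratio:  {baseline_f4_ratio:.3f}")
print(f"ranges overlap:     {ranges_overlap}")
```

The two baselines sit almost an octave apart in pitch, and no transferred accent brings either speaker's pitch or F4 into the other's range. That is consistent with each voice keeping its own timbre across accents.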

Wrapping Up

There's more work to be done in this realm, but we're really excited about the possibilities here, whether it's for programmatic advertising, demographic-specific call automation, or anything else you might think up! Keep following the Rime blog for more updates!