How to Change Your Accent

    Rime's new tech swaps accents while preserving speaker identity, enabling lifelike voice transformations for AI, call automation, and more.

    A person's accent is one of their most personal and unique characteristics. At Rime, we think no one should have to change theirs. But synthetic people, on the other hand...

    We've split the atom! (of timbre and accent)

    Here in the Rime lab, we've developed the ability to swap out a speaker's original accent and replace it with a different one.

    [Audio: Baseline accent]
    [Audio: Transferred Texas accent]

    This is a game-changer. Let's look into the details.

    What Goes Into How You Speak

    The way that an individual sounds when they speak is determined by two major factors:

    1. The shape of their vocal tract, which contributes to their personally identifiable timbre.

    2. The particular phonological rules and sound inventory of their idiolect, which contributes to their accent.

    However, teasing these two apart in generative speech synthesis is pretty tricky. When a person speaks, the output is a complex sound wave, which comprises the phonetic particularities of both their timbre and accent. Seen below is the spectrogram of the baseline audio above. The way the synthetic speaker pronounces his words depends on both components, and it's not (currently) possible to isolate them.

    "it's easier than you might expect to change your accent."

    But note the blue line in the image. It represents the pitch of the speech, which is one fair proxy for his unique timbre.
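
    The post does not say how the pitch track was computed. As a rough illustration of what such a measurement involves, here is a minimal Python sketch of autocorrelation-based pitch estimation run on a synthetic tone. The helper name estimate_pitch_hz, the 60 to 400 Hz search range, and the test signal are all assumptions made for illustration, not Rime's method.

```python
import math


def estimate_pitch_hz(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency from the strongest autocorrelation lag.

    Hypothetical helper for illustration; real pitch trackers add windowing,
    voicing detection, and sub-sample interpolation.
    """
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    n = len(samples) - max_lag         # number of products summed per lag
    if n <= 0:
        raise ValueError("signal too short for the requested fmin")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic stand-in for a voiced sound: a 98 Hz fundamental plus two harmonics.
SAMPLE_RATE = 16000
F0 = 98.0
signal = [
    sum(math.sin(2 * math.pi * k * F0 * t / SAMPLE_RATE) / k for k in (1, 2, 3))
    for t in range(SAMPLE_RATE // 2)  # half a second of audio
]

pitch = estimate_pitch_hz(signal, SAMPLE_RATE)
print(f"estimated pitch: {pitch:.1f} Hz")
```

    Because the estimate is limited to whole-sample lags, the result can only approximate the true fundamental. Averaging such per-frame estimates over an utterance would give a single figure comparable to the averages quoted in this post.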

    Another good proxy for timbre is the fourth formant, the topmost thick black band in the above image. We can measure that formant for each transferred accent and compare the values across accents.

    With that said, simply listening to the audio may be the best way to judge how well individual timbre is retained across accents.

    The average pitch of the baseline audio above is 98 Hz, and that of the Texan-accented version of the same speaker is a very similar 100 Hz. The average fourth formant is 3731 Hz for the baseline and 3745 Hz for the Texan version, again extremely close. Let's look at some more below.
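
    To put "very similar" into numbers, the Python sketch below turns the four figures quoted above into relative differences. The helper names are illustrative, and expressing the pitch shift in semitones is a common convention rather than something the post itself does.

```python
import math


def percent_diff(reference_hz, value_hz):
    """Signed difference of value from reference, as a percentage."""
    return 100.0 * (value_hz - reference_hz) / reference_hz


def semitones(reference_hz, value_hz):
    """Musical interval between two frequencies, in semitones."""
    return 12.0 * math.log2(value_hz / reference_hz)


# Average measurements quoted in the post (Hz).
baseline = {"pitch": 98.0, "f4": 3731.0}
texas = {"pitch": 100.0, "f4": 3745.0}

pitch_pct = percent_diff(baseline["pitch"], texas["pitch"])
pitch_st = semitones(baseline["pitch"], texas["pitch"])
f4_pct = percent_diff(baseline["f4"], texas["f4"])

print(f"pitch: {pitch_pct:+.2f}% ({pitch_st:+.2f} semitones)")
print(f"F4:    {f4_pct:+.2f}%")
```

    On these numbers the pitch shift is about 2% (roughly a third of a semitone) and the F4 shift is under half a percent.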

    A Range of Accents for One Speaker

    Below, we've taken that same speaker (whose baseline is a fairly standard Californian accent) and transferred other accents onto his voice. Of course, this process can go in any direction: from Texan to Australian, from Boston to AAVE. For each, we've included the average pitch (p) and average fourth formant (F4) measurements, both measured in Hz.

    [Audio: Baseline] p=98, F4=3731
    [Audio: Texas] p=100, F4=3745
    [Audio: Australian] p=100, F4=3721
    [Audio: Boston] p=102, F4=3754
    [Audio: AAVE] p=99, F4=3769
    [Audio: Indian] p=105, F4=3738
    [Audio: British] p=106, F4=3773
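
    As a quick check on how far each transferred accent drifts from the baseline, the Python sketch below finds the accent with the largest relative shift in each measurement. It uses only the figures listed above. The helper name largest_shift is illustrative, and this is simple arithmetic on single-clip averages, not a statistical test.

```python
# (average pitch in Hz, average F4 in Hz), copied from the clips listed above.
MALE = {
    "baseline": (98, 3731),
    "Texas": (100, 3745),
    "Australian": (100, 3721),
    "Boston": (102, 3754),
    "AAVE": (99, 3769),
    "Indian": (105, 3738),
    "British": (106, 3773),
}


def largest_shift(measurements, index):
    """Return (accent, percent shift) for the accent farthest from baseline."""
    base = measurements["baseline"][index]
    accent, values = max(
        ((name, vals) for name, vals in measurements.items() if name != "baseline"),
        key=lambda item: abs(item[1][index] - base),
    )
    return accent, 100.0 * abs(values[index] - base) / base


pitch_accent, pitch_pct = largest_shift(MALE, 0)
f4_accent, f4_pct = largest_shift(MALE, 1)

print(f"largest pitch shift: {pitch_accent}, {pitch_pct:.1f}%")
print(f"largest F4 shift:    {f4_accent}, {f4_pct:.1f}%")
```

    For this speaker the British clip is the farthest from baseline on both measures: about 8% in pitch and about 1% in F4.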

    Another Speaker

    Of course, the same accent transfer can be done with female speakers as well. The main difference between male and female speakers again lies in pitch and the fourth formant.

    [Audio: Baseline] p=192, F4=4078
    [Audio: Texas] p=197, F4=4029
    [Audio: Indian] p=200, F4=4089
    [Audio: British] p=190, F4=4096
    [Audio: Southern AAVE] p=184, F4=4085
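
    One way to read these numbers is to compare how much accent transfer moves a single speaker against how far apart the two speakers are to begin with. The Python sketch below does that with the figures from this post. The helper names are illustrative, and the comparison is plain arithmetic on single-clip averages.

```python
# (average pitch in Hz, average F4 in Hz), copied from the clips in this post.
MALE_BASELINE = (98, 3731)
FEMALE = {
    "baseline": (192, 4078),
    "Texas": (197, 4029),
    "Indian": (200, 4089),
    "British": (190, 4096),
    "Southern AAVE": (184, 4085),
}


def within_speaker_spread(measurements, index):
    """Largest absolute shift (Hz) from the speaker's own baseline."""
    base = measurements["baseline"][index]
    return max(abs(vals[index] - base) for vals in measurements.values())


def between_speaker_gap(baseline_a, baseline_b, index):
    """Absolute difference (Hz) between two speakers' baselines."""
    return abs(baseline_a[index] - baseline_b[index])


for index, label in enumerate(("pitch", "F4")):
    spread = within_speaker_spread(FEMALE, index)
    gap = between_speaker_gap(FEMALE["baseline"], MALE_BASELINE, index)
    print(f"{label}: accent spread {spread} Hz, speaker gap {gap} Hz")
```

    The gap between the two speakers' baselines (94 Hz in pitch, 347 Hz in F4) is several times larger than the largest accent-induced shift for this speaker (8 Hz and 49 Hz).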

    Wrapping Up

    There's more work to be done in this realm, but we're really excited about the possibilities here, whether it's for programmatic advertising, demographic-specific call automation, or anything else you might think up! Keep following the Rime blog for more updates!