Caruso Recording His Voice for Victor (later RCA Victor)

Attacking the Clones

May 22, 2024 | Voice Clones, Synthetic Voices

We've all read the AI voice news recently, and it has raised interesting questions about identity and cloning in the generative AI space. At Rime, we have a vision for the future of AI voice, and of identity in general, that avoids the perils and pitfalls of clones and their unintended effects.

Voice Clones are a Counterintuitive Hint about the Future

Whether it's Joe Biden's voice or Scarlett Johansson's, deepfakes, clones, and near-clones of real people are easier than ever to create. In fact, before starting the company, Rime founder Lily Clifford made a Barack Obama clone just for fun in her spare time as a mini proof-of-concept (he's still taking calls for Lily's voicemail). Anyone can make a voice clone. It's not hard.

We don't want to entirely dismiss their importance. Clones are effectively impossible to detect (see this great blog post on the topic from Mirage Security to get a better sense of why). And because they're so easy to make, malign parties can and will use them going forward.

But the ease and simplicity of basic voice cloning is not the real story. It only hints at the real and exciting import of voice technology going forward.

Except for a few trusted partners, and then only with full consent, Rime doesn't do voice cloning. That's not where the real promise of this tech lies. The path forward is instead autogenerated, non-clone voices that fit the performance and demographic needs of every single use case.

But first, let's see how we got here.

The Early Era of Voice Fame

Born in 1873 to an impoverished family in what was then the Kingdom of Italy, singer Enrico Caruso would eventually become arguably the first voice made famous by recording. Made in the early days of recorded audio, when sound was captured on wax or foil cylinders, Caruso’s 1904 recording of "Vesti la giubba" is often cited as the first record to sell a million copies. The above image is Caruso's own sketched self-portrait of him recording his voice for the Victor Talking Machine Company, the label that later became RCA Victor.

This set the stage for the next hundred years of audio culture: because of the technical limitations of sound recording and playback, it was only financially viable to record and transmit a few select voices, and as a result, certain voices became what we now call "famous".

Over the next century, the cost and expertise required for voice recording limited the number of voices that could serve mass media. As a result, of the millions upon millions of voices that existed during that time, most Western audiences were strangely, intimately familiar with just a tiny few: Katharine Hepburn, James Earl Jones, Christopher Walken, and so on.

However, the increasing democratization of recording technology, together with social media, has led to a proliferation of voices (along with nearly every other form of content) and a fragmentation of the media landscape. It is no longer the case that a single musical act, movie, or voice can dominate the attention of a whole society.

The Waning Era of Voice Fame

Just as technological limitations until recently curtailed the number of human voices that could be cost-effectively recorded, so too have they prevented the flourishing of AI voices. And much as scarcity made certain real people's voices "famous", the small number of available synthetic voices created something like "fame" for them.

Before Susan Bennett provided the voice for Siri, there was "Fred", a voice available in SimpleText, the early Macintosh text editor. Though clunky, Fred was many people’s first introduction to the world of text-to-speech generation. And by virtue of being effectively the only voice available, Fred garnered some degree of what we might call "fame". He was memorably the lead voice in Radiohead’s 1997 track "Fitter Happier", where he read off the lyrics with disquieting aplomb.

Cloning the voices of celebrities belongs to this previous era, alongside the original Siri and Fred voices. The technological limitations that led to them no longer hold, and a new world, one without Voice Fame, is here.

The Future of Voice: Data-Driven Utilization of the Infinite Variation That AI Offers

The technical walls constraining the proliferation of voices are coming down. This era of voice clones is a remnant of these limitations, and the power to tap into the countless varieties of voice and performance is just starting to be understood.

People have gravitated toward exploiting celebrity voices because you can assume a lot of people know and like the voice. That is, people are still making decisions about what voice to use based on their personal assumptions about what a mass audience wants to hear, and celebrity is a proxy for that. But what if we could optimize a voice experience for each person and each customer?

Instead of a few famous voices for personal assistants or customer support or advertising, we're now able to generate any number of voices, tailored automatically for each individual user and use case. The true promise of AI voice is not the creation of a handful of highly tuned voices. That would be a tragic waste of the possibilities AI offers, and clones simply reinforce that uninspiring view. There is an astonishingly wide field of potential voice assets available, and the value lies in tapping into it.

The hubbub around celebrity voice clones is ephemeral, and Rime is unlocking the true power of voice technology. Stay tuned.