Is beauty in the ear of the beholder?
Most voice AI models today sound good—but what makes one voice better than another? And how do you actually measure that?
For years, researchers have relied on MOS (Mean Opinion Score), where people rate voices on a scale. But the recent paper, "Stuck in the MOS pit" by Kirkland et al., highlights a big problem: MOS is all over the place. The same voice can get wildly different scores depending on how the test is set up—what questions are asked, what scales are used, and even what "quality" means to different listeners.
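To make that concrete, here's a minimal sketch (with made-up ratings) of what a MOS actually is: just the mean of listener ratings. Because the number depends entirely on the scale, the instructions, and the listener pool, the same audio can land at very different scores under different setups.

```python
import statistics

# Hypothetical ratings for the *same* voice sample, collected under two
# different test setups (different scale wording, instructions, listeners).
# The numbers are illustrative only.
setup_a = [4, 5, 4, 3, 5, 4]  # "Rate overall quality, 1-5"
setup_b = [3, 3, 4, 2, 3, 4]  # "Rate naturalness, 1-5", different cohort

print(f"MOS under setup A: {statistics.mean(setup_a):.2f}")  # 4.17
print(f"MOS under setup B: {statistics.mean(setup_b):.2f}")  # 3.17
```

Same voice, two "true" MOS values. That's the reproducibility problem Kirkland et al. point to.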
And as new models get better and better, most of them can hit a "good" MOS score, so the number stops telling the models apart.
Latency, on the other hand, is easy to measure, and Rime consistently wins on speed.
At the end of the day, a single score won't tell you whether a voice will drive business impact or resonate with your audience. You shouldn't pick a voice based on an arbitrary evaluation or just because it feels right (which is still the state of the art at many companies); you should pick it because it measurably increases success or conversion.
What's the best way to evaluate a voice? Put it in front of real people and measure the impact. Does it increase engagement? Success? Drive conversions?
Those are the metrics that matter. And that is what Rime's most forward-thinking customers do.
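For illustration, here's a minimal sketch of what that comparison can look like in practice: give each candidate voice a slice of real traffic and compare conversion rates. The voice names and counts below are hypothetical.

```python
# Hypothetical A/B results: each candidate voice handled a slice of real
# traffic, and we compare conversion rates. Counts are made up for illustration.
results = {
    "voice_a": {"sessions": 1200, "conversions": 96},
    "voice_b": {"sessions": 1180, "conversions": 130},
}

for voice, r in results.items():
    rate = r["conversions"] / r["sessions"]
    print(f"{voice}: {rate:.1%} conversion over {r['sessions']} sessions")

# In a real rollout you'd also check statistical significance
# (e.g. a two-proportion z-test) before declaring a winner.
```

No opinion scale required: the voice that moves the metric you care about is the better voice for your business.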
So instead of asking which AI voice gets the highest MOS score, ask: which voice wins?
And may the best voice win!