Introducing Rimecaster

    Rimecaster is a new open source speaker representation model that helps make voice AI sound more human

    At Rime, we’re driven by the belief that foundational models should reflect the full richness and diversity of how people speak in the real world. 

    That’s why we’re launching Rimecaster, the first open source speaker representation model based on a massive dataset of full-duplex, multilingual speech data recorded from conversations with everyday people. Rimecaster represents a big improvement in how voice AI models will be trained, and it’s available on HuggingFace now.

    Rimecaster announcement video

    First, some background; then we’ll share how to get started, plus insights into our approach to data and model architecture.


    Speaker Representation Models: a critical piece in voice AI model training in need of improvement

    Speaker representation models take recorded voice data and break it down into vector embeddings. If two people sound similar, say construction workers from the south side of Boston or PhD startup founders who recently immigrated from India, that similarity can be represented mathematically by comparing their speech vectors.
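
    That comparison typically boils down to a distance metric between embedding vectors, with cosine similarity being the most common choice. Here’s a minimal sketch, assuming you already have two embeddings (for example, extracted with Rimecaster as shown later in this post); the helper function is ours, for illustration only:

    import numpy as np

    def cosine_similarity(emb_a, emb_b):
        # Higher values mean the two voices are closer in embedding space
        a = np.asarray(emb_a, dtype=np.float32).ravel()
        b = np.asarray(emb_b, dtype=np.float32).ravel()
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Values near 1.0 suggest the same (or a very similar-sounding) speaker;
    # values near 0.0 suggest unrelated voices.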

    When training new speech synthesis models, one of the first steps is converting recorded audio into vector embeddings. You want to capture both the characteristics of the speaker (accent, demographics, etc.) and how the content of the speech (tone, prosody, etc.) maps to the words spoken. This is how models can generate speech in someone’s voice even for novel words that were not in the training dataset.
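
    In practice, this is usually a preprocessing pass over the training corpus: compute a speaker embedding for each clip and store it alongside the transcript so the synthesis model can be conditioned on it. Below is a minimal sketch of that step using the NeMo API covered in the next section; the directory layout and manifest format are illustrative assumptions, not Rime’s actual pipeline:

    import json
    from pathlib import Path
    import nemo.collections.asr as nemo_asr

    speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("rimelabs/rimecaster")

    manifest = []
    for wav in sorted(Path("corpus").glob("*.wav")):
        transcript = wav.with_suffix(".txt").read_text().strip()
        emb = speaker_model.get_embedding(str(wav))  # shape [1, 768]
        manifest.append({
            "audio": str(wav),
            "text": transcript,
            "speaker_embedding": emb.squeeze(0).tolist(),
        })

    Path("manifest.json").write_text(json.dumps(manifest))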

    But the problem is that all of today’s widely-used open source speaker representation models are trained on incomplete sources of audio data, particularly biased towards podcast hosts and audiobook narrators. These speakers put on a performance voice that’s far from everyday speech. When you get vector embeddings from these models, they’re just not that accurate for natural conversation.

    So, unless you’re training models to generate podcasts or audiobooks, you’ll get far better performance when using embeddings from Rimecaster.

    Get Started: Rimecaster now available on HuggingFace

    Pretrained checkpoints for Rimecaster and example inference code, compatible with NVIDIA NeMo, can be found on HuggingFace here. And to further our goal of democratizing diverse speech modeling, we are releasing Rimecaster under a CC BY 4.0 license.

    To get started, you’ll need to install NVIDIA NeMo if you haven’t already:

    pip install nemo_toolkit['all']

    Then instantiate the model and extract the embeddings:

    import nemo.collections.asr as nemo_asr
    # Load the pretrained Rimecaster checkpoint from HuggingFace
    speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("rimelabs/rimecaster")
    # Extract a speaker embedding from a single audio file
    emb = speaker_model.get_embedding("an255-fash-b.wav")
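
    To sanity-check the embeddings, you can compare two recordings and measure how close they are; cosine similarity via PyTorch is one simple way to do it (the filenames below are placeholders):

    import torch.nn.functional as F

    emb_a = speaker_model.get_embedding("speaker_a_take1.wav")
    emb_b = speaker_model.get_embedding("speaker_a_take2.wav")
    print(f"cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")

    Two clips of the same person should score noticeably higher than clips from different speakers.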

    You can find the rest of the README on HuggingFace.

    Training Data: trained on in-studio conversations with everyday people

    Let’s talk about data. Rimecaster is trained on the world’s largest proprietary dataset of full-duplex, multilingual speech data recorded from conversations with everyday people in Rime’s downtown San Francisco studio and other locations around the US. Rime has an order of magnitude more full-duplex data than massive public companies.

    But it’s not just about collecting and recording studio-quality conversation data; getting the highest-quality transcriptions is just as critical. Many new voice model companies rely primarily on “silver-level” transcriptions, which are machine generated with little or no human review and are only 80-95% accurate. Rime employs a team of PhD annotators to produce “gold-level” transcriptions, which are 98-100% accurate.
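
    Transcription accuracy like this is commonly quantified as one minus the word error rate (WER) against a trusted reference transcript. If you want to spot-check a transcription source yourself, the open source jiwer library is a quick way to do it (our tool choice here for illustration, not necessarily what Rime uses internally):

    from jiwer import wer

    reference  = "yeah i mean we could probably meet up around noon if that works"
    hypothesis = "yeah i mean we could probably meet up or around new if that works"

    error_rate = wer(reference, hypothesis)
    print(f"WER: {error_rate:.2%}")  # word-level accuracy is roughly 1 - WER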

    With Rimecaster for speaker representation and the highest quality audio plus transcription dataset, Rime is able to offer the most natural and lifelike voices on the market. Learn more.

    Model Architecture: built on NVIDIA TitaNet and compatible with NeMo

    Our work builds on the foundation laid by NVIDIA’s TitaNet (2021), with critical enhancements tailored to our goals.

    Rimecaster architecture diagram

    Architecturally, Rimecaster mirrors TitaNet but expands the embedding size from 192 to 768 dimensions. This upgrade allows for significantly richer speaker representations, capturing subtle nuances in vocal identity and speaking style. Rimecaster was trained on diverse, real-world conversations collected by Rime, enabling it to generalize more effectively across speaker identities, accents, and languages.

    In internal evaluations, Rimecaster achieves lower Equal Error Rates (EER) compared to previous models. But most importantly, Rimecaster is already powering our multi-speaker text-to-speech systems, leading to substantial quality gains over earlier pretrained speaker encoders. These improvements are especially pronounced in low-resource speaker settings and in speech generation scenarios requiring high fidelity and speaker preservation.
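
    For context, EER is the operating point where the false accept rate equals the false reject rate over a set of same-speaker / different-speaker trial pairs, so lower is better. Here’s a minimal sketch of computing it from similarity scores, using scikit-learn for the ROC curve (the trial pairs themselves are whatever evaluation set you assemble):

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(scores, labels):
        # scores: similarity for each trial pair; labels: 1 = same speaker, 0 = different
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2

    And here’s the full extraction example, which also shows the 768-dimensional output shape: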

    from nemo.collections.asr.models.label_models import EncDecSpeakerLabelModel
    rimelabs_model = EncDecSpeakerLabelModel.from_pretrained("rimelabs/rimecaster")
    audio_path = "AUDIO_FILE_PATH"
    # Extract embeddings
    embeddings = rimelabs_model.get_embedding(audio_path)
    # Print the shape of the extracted embeddings
    print(f"Extracted embeddings shape: {embeddings.shape}") # [1, 768]

    In short, Rimecaster helps new voice AI models understand and represent real human voices, the messy, beautiful, everyday kind. This makes speech models more accurate, more inclusive, and better at mimicking individual voices with nuance and personality. It's like teaching the AI to listen more like a human.

    With Rimecaster, we're one step closer to making speech generation truly sound like everyone!