Faster Rime Voice Demos
Jan 6, 2026
To give builders the best possible demo experience of the Rime voices on our website and web app, we just made a bunch of small orchestration-level changes to shave 300ms of latency off our voice agents. Follow along for some easy wins!
If you read to the end, we’ll share some fun linguistics lessons on latency in human conversation as well.
Note: We are using LiveKit to orchestrate the demos on the Rime website and web application. These demos use a cascading architecture (STT > LLM > TTS). For some easy starter code, fork this LiveKit repo.
If you are building with Pipecat by Daily, the same principles apply. Here’s a forkable Pipecat repo too.
In general, most latency in voice agent applications comes from the LLM layer, so changes to the LLM disproportionately affect overall latency.
Without further ado, here are the changes we made.
STT model configuration
Note: We are using Deepgram for speech-to-text (STT).
Speech-to-text parameters. We adjusted several speech recognition parameters to minimize delays before processing audio, so the agent starts responding faster (see the sketch after the list below).
- endpointing_ms=10: How long to wait after speech stops before finalizing the transcript (10ms is very short, allowing faster detection of speech end)
- no_delay=True: Disables buffering delays in the speech recognition pipeline
- smart_format=False: Skips post-processing formatting steps that add latency
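For reference, here's a minimal sketch of that STT configuration in a LiveKit Agents worker, assuming the livekit-plugins-deepgram STT constructor exposes these keyword arguments (the model name is just an example; check your installed plugin version):

```python
from livekit.plugins import deepgram

# Low-latency Deepgram STT configuration. Parameter names follow this post;
# the model choice is only an example. Verify against your plugin version.
stt = deepgram.STT(
    model="nova-2",        # example model choice, not prescriptive
    endpointing_ms=10,     # finalize the transcript ~10ms after speech stops
    no_delay=True,         # skip buffering delays in the recognition pipeline
    smart_format=False,    # skip post-processing formatting that adds latency
)
```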
Endpointing parameters. We also enabled preemptive generation and tuned the endpointing parameters so the agent starts responding faster while avoiding false interruptions (see the sketch after the list below). There is a tradeoff between latency and accuracy here, but for these demos the latency win is worth it.
- min_endpointing_delay=0.15: Minimum time (150ms) before considering the user might be done speaking
- max_endpointing_delay=2.0: Maximum time (2s) to wait before assuming speech has ended
- false_interruption_timeout=0.5: Time window (500ms) to detect and ignore false interruptions
- min_interruption_duration=0.3: Minimum duration (300ms) of user speech for it to count as a real interruption
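And here's roughly how the session-level tuning looks, again as a sketch: the keyword names (including preemptive_generation) and the Rime TTS line assume the current LiveKit Agents and plugin APIs, so double-check them against your installed versions.

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, rime

# Session-level turn detection and interruption tuning. Values are from this
# post; the keyword names assume the current livekit-agents AgentSession API.
session = AgentSession(
    stt=deepgram.STT(endpointing_ms=10, no_delay=True, smart_format=False),
    llm=openai.LLM(model="gpt-4.1-mini"),
    tts=rime.TTS(),                    # placeholder Rime TTS configuration
    preemptive_generation=True,        # start LLM/TTS work before end-of-turn is final
    min_endpointing_delay=0.15,        # wait at least 150ms before closing the turn
    max_endpointing_delay=2.0,         # never wait more than 2s for end of speech
    false_interruption_timeout=0.5,    # ignore interruptions that resolve within 500ms
    min_interruption_duration=0.3,     # require 300ms of speech to count as an interruption
)
```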
LLM model configuration
Note: We are using OpenAI as the large language model (LLM) for natural language understanding and generation (NLU and NLG).
Model version. We switched from gpt-4o-mini to gpt-4.1-mini. After trying a lot of OpenAI API models, we found that gpt-4.1-mini was the best compromise between latency and quality. It was crucial to avoid reasoning models, because they take too long to start producing spoken output. For conversational speech, where latency is critical, we tend toward smaller, faster, often specialized models instead of large, generally capable ones.
LLM parameters. We set more deterministic parameters to reduce variability and improve response speed (see the sketch after the list below).
- temperature=0.0: Picks the highest-probability token at each step, minimizing randomness and producing predictable, deterministic results. This effectively enables greedy decoding and skips the overhead of probabilistic sampling.
- top_p=1: Considers the full token distribution rather than truncating it to a nucleus, which keeps the sampling path simple. In practice, sampling computation is negligible compared to overall inference, so the main win from both settings is predictability rather than raw speed.
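A minimal sketch of the LLM configuration, assuming the livekit-plugins-openai LLM constructor accepts temperature and top_p (if your version does not, pass them through the underlying client options instead):

```python
from livekit.plugins import openai

# Deterministic, low-variance LLM settings. Constructor keyword support for
# temperature/top_p is assumed; check your installed plugin version.
llm = openai.LLM(
    model="gpt-4.1-mini",  # non-reasoning model with fast time-to-first-token
    temperature=0.0,       # greedy decoding for predictable output
    top_p=1,               # no nucleus truncation on top of greedy decoding
)
```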
LLM prompts
Message prompts and context. We already keep message prompts as short as possible. This is best practice. You should ideally only send the content of the user’s speech (STT output) to the LLM, and if a user interrupts the agent, remove the unspoken portion of the response from the LLM context.
System prompts. We already limit response length in the system prompt instructions, but it’s worth calling out. In a cascading voice agent architecture, reducing the amount of text generated per LLM response significantly lowers overall latency, since LLM output tokens are produced sequentially and dominate response time. You should limit responses to one to three sentences at most.
To reduce LLM computation time, we dramatically shortened the portions of our system prompts with task/role and voice personality instructions, e.g. sample greetings and replies. We also shortened our style guidelines, e.g. which special characters to exclude from LLM output. With complex agents, it’s easy for a system prompt to run to hundreds or thousands of tokens. And with voice agents, we want shorter responses, i.e. fewer overall tokens, possibly only a few hundred tokens across a full multi-turn conversation.
You can still get desired results with concise instructions, and giving the LLM less context improves speed. There’s a mix of art and science here, and we recommend following prompt engineering best practices.
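To make this concrete, here's an illustrative system prompt in that short, voice-first style. It is not our production prompt, just a sketch of the shape we're describing:

```python
# Illustrative only (not our production prompt): a concise system prompt for a
# cascading voice agent, with a short role, tight style rules, and a hard
# response-length cap.
SYSTEM_PROMPT = (
    "You are a friendly voice assistant for a demo. "
    "Answer in one to three short sentences of plain, conversational English. "
    "Your output is spoken aloud, so never use lists, markdown, emojis, or special characters. "
    "Before calling a tool, say a brief acknowledgement like 'One sec, let me check.'"
)
```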
Note: The "long context problem" can also be ameliorated by effective use of the KV-cache, but (1) that won't help on the first LLM response, and (2) to make it 100% reliable you’ll end up having to own more of the stack, which increases complexity.
Finally, we ran regression tests to check whether the new, shorter prompts broke or degraded agent performance. This step is especially crucial for agents that use tools.
We’ll be updating the recommended prompts on our Voice Prompt Designer to reflect these system prompt changes.
Tool calling and MCP. It’s not relevant for these demos, but it’s worth mentioning that external API calls to slow, legacy systems can introduce a ton of latency. If you have to call slow APIs, think about how to run these calls in parallel with the rest of the conversation. Or you can instruct or fine-tune your LLM to generate some text before calling the tool, because many LLMs will write out their full tool call before they start streaming text output.
Another strategy addresses the user’s perception of latency in addition to the actual latency. In real-world conversation, people will use “discourse markers,” “filled pauses,” and “floor holding” to signal they are not done speaking (scroll down for more linguistics notes). Two concrete examples would be playing pre-recorded audio like “Uh got it, let me look that up for you” or keyboard press sounds to let the user know you are taking action and plan to respond.
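Here's a rough sketch of that pattern with asyncio; lookup_order() and play_prerecorded() are hypothetical stand-ins for your own slow API call and audio playback helper:

```python
import asyncio

# Sketch of masking a slow tool/API call with a filler cue. lookup_order() and
# play_prerecorded() are hypothetical stand-ins, not a specific framework API.
async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(1.5)  # stand-in for a slow legacy API call
    return {"order_id": order_id, "status": "shipped"}

async def play_prerecorded(path: str) -> None:
    print(f"(playing {path})")  # stand-in for streaming a cached audio clip

async def handle_order_lookup(order_id: str) -> dict:
    # Start the slow call immediately...
    lookup_task = asyncio.create_task(lookup_order(order_id))
    # ...and play a filler ("Uh got it, let me look that up for you") while it
    # runs, so the user hears progress instead of silence.
    await play_prerecorded("filler_looking_that_up.wav")
    return await lookup_task

if __name__ == "__main__":
    print(asyncio.run(handle_order_lookup("A1234")))
```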
Agent orchestration server
It’s especially important to reduce any perceived latency before the first message. On the front end, we pre-generate the credentials to connect to the LiveKit room; the room is the hosted session that connects one or more participants in conversation. Pre-generating these credentials makes the initial connection faster, which lowers the user’s perception of overall latency by getting audio playing sooner (see the sketch below).
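As a sketch, here's what pre-generating a room token server-side can look like with the livekit-api Python SDK (the key, secret, identity, and room name are placeholders):

```python
from livekit import api  # livekit-api server SDK

# Sketch of pre-generating a room access token server-side so the client can
# connect the moment the user starts the demo.
def pregenerate_room_token(identity: str, room_name: str) -> str:
    token = (
        api.AccessToken("LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room_name))
    )
    return token.to_jwt()
```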
Pre-recorded assets
We pre-recorded greeting phrases, so that the initial turn of the conversation would be faster. While this doesn’t impact subsequent turns, it does improve the overall user experience. If users have to wait at the start, even if subsequent turns are fast, they’ll already feel frustrated. In customer service calls, the highest dropoff is within the first few seconds of the call, so these moments are critical.
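A rough sketch of that first-turn shortcut; play_audio_file() is a hypothetical helper you would wire to whatever playback mechanism your orchestration stack provides:

```python
import random

# Sketch of serving a cached greeting on the first turn instead of running the
# full STT > LLM > TTS loop. play_audio_file() is a hypothetical helper.
GREETINGS = [
    "greetings/hey_there_how_can_i_help.wav",
    "greetings/hi_what_can_i_do_for_you.wav",
]

async def play_audio_file(session, path: str) -> None:
    """Hypothetical helper: stream a local audio file into the session."""
    ...

async def play_greeting(session) -> None:
    # No model round trips on turn one; just stream a pre-recorded clip.
    await play_audio_file(session, random.choice(GREETINGS))
```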
Linguistics
In the book “Because Internet: Understanding the New Rules of Language” by Gretchen McCulloch, there’s a great bit about phatic expressions, i.e. the parts of a conversation when you’re on autopilot (e.g. “morning” “oh, hey what’s up?” “doing good, how ‘bout you?”). Believe it or not these are so automatic, that most people won’t even notice if you switch from one pattern (“how’re you doing?”) to another (“what’re you up to?”) within a conversation. McCulloch goes on to explain how these vary in spoken and online written communication. It’s really insightful but outside of the scope of this post.
So when you’re designing the initial conversational turn(s) and greeting(s), pre-recorded assets can work well to reduce latency.
For more on latency and turn-taking in conversation (plus lots of other fun linguistics data), we highly recommend the book “How We Talk” by N.J. Enfield. In human conversation, people often start a response before the other person finishes speaking, e.g. with nonverbal body language or backchannelling (“mhmmm”).
Typical response time in English is around 250ms, and delays of more than a second are interpreted as a negative “no” message, so people use filler words like “um” and “uh” to respond more quickly while formulating their thoughts, and to soften negative responses. This is about signaling intent in conversation: discourse markers are used to manage turn-taking, i.e. to take the floor and hold the floor. For example, saying something like “oh yeah, one sec” is a discourse marker for establishing and confirming a clear channel, and for signaling your intent to take the floor and begin a period of uninterrupted speech. Following that with keyboard sounds, an “um,” or a “hmmm” would be called a filled pause, and it lets the other conversation participant know you are not done with your turn.
Finally, people expect shorter responses (and affirmative “yeses”) to have lower latency, and longer responses (whether someone is recalling complex information or delivering a “no”) to have slightly longer response times.
So if your AI voice agent is taking a complex action (e.g. placing an order for a customer), longer latency can be natural, provided you respond quickly in other parts of the conversation and signal with another sound that you will be responding soon.
In summary
While each of these changes is small, in aggregate they create a much snappier voice agent experience. We hope these suggestions help as you’re building your own voice agent applications, and if you have any other latency hacks, we’d love to hear them. Just reach out to hello@rime.ai or post in the community Slack channel.
Thanks and happy building!
