TTS + Voice Best Practices
Feb 26, 2026
Introduction
This paper sheds light on what it actually takes to build a successful voice AI system in 2026: not the demo, but the production deployment that handles real calls at real scale.
The voice AI market has moved fast. In 2024, most deployments were experimental. By mid-2025, we started seeing Independent Software Vendors (ISVs) cross the threshold from hundreds to thousands of concurrent production calls. The infrastructure decisions that worked at 50 concurrent calls break down completely at 2,000. This paper is about how to make the right decisions early so you're not rebuilding later.
The best place to start is with the inputs and variables. Before you write a single line of code or spin up a single container, you need to answer four questions:
Who is the voice system for? A startup building for themselves has fundamentally different needs than an ISV building for hundreds of customers.
Who is the end consumer? A patient calling a healthcare clinic behaves differently than a lead calling a solar company. The persona on the other end of the line shapes everything from voice selection to turn-taking behavior.
What is the expected concurrency? At any given point in the day, how many simultaneous calls does this system need to handle? A system designed for 20 concurrent calls and one designed for 2,000 are architecturally different animals. The answer changes your entire infrastructure strategy.
What is the goal? Containment? Sales conversion? Appointment booking? Issue resolution? This single question should drive every downstream decision, from which LLM you use to how you evaluate your TTS.
Too many teams skip these questions and jump straight to "which model sounds the most human." That's how you end up rebuilding six months later.
About Rime’s Perspective
At Rime, one of the most common types of customers we serve is what the industry calls ISV voice systems. ISV stands for Independent Software Vendor; in practice, these are companies building voice systems for other businesses. That means high-volume, complex infrastructure, multi-tenant architectures, and more points of failure than a single-use deployment.
Over the last couple years, we've done hundreds of voice deployments and have seen firsthand the evolution of AI voice from novelty to production infrastructure. Our expertise lies in general orchestration, LLMs, and TTS. Everything in this paper is colored by that lens. If you're building a single voice assistant for your own product, much of this will still apply, but the stakes and complexity are different.
Orchestration: The Foundation
For ISVs, the first problem to solve is orchestration infrastructure. This is the layer that coordinates the flow between speech recognition, the language model, text-to-speech, and the telephony layer. It manages streaming audio, handles interruptions, routes tool calls, and keeps the entire pipeline synchronized in real time. Get this wrong and nothing else matters.
The decision is almost always build vs. buy. Or more accurately: build from scratch vs. adopt an orchestration platform such as LiveKit or Pipecat.
LiveKit’s platform is built on its open-source Agents framework. This framework gives companies the granularity needed to build voice systems that scale. LiveKit is, first and foremost, a developer tool. The goal is to expose every dial and knob so you can build exactly what you need rather than contorting an out-of-the-box solution to fit your use case.
Pipecat by Daily occupies a similar space: open-source, pipeline-oriented, and developer-friendly, though with a different architectural philosophy. It's a strong option, particularly for teams that want tight control over the processing pipeline.
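To make concrete what the orchestration layer coordinates, here is a schematic single turn in Python. The `ASR`, `LLM`, and `TTS` classes are stubs standing in for real providers, and none of this is LiveKit's or Pipecat's actual API; it only illustrates the streaming hand-offs these frameworks manage for you.

```python
from typing import Iterator

# Stub components standing in for real ASR/LLM/TTS providers.
class ASR:
    def transcribe(self, audio: bytes) -> str:
        return "what time do you close today"

class LLM:
    def stream(self, transcript: str) -> Iterator[str]:
        # A real LLM streams tokens; here we yield a canned short reply.
        yield from "We close at 6pm today.".split()

class TTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()  # stand-in for synthesized audio frames

def run_turn(audio_in: bytes, asr: ASR, llm: LLM, tts: TTS) -> list[bytes]:
    """One conversational turn: ASR -> LLM -> TTS.

    The orchestrator forwards LLM tokens to TTS in sentence-sized
    chunks so synthesis starts before generation finishes.
    """
    transcript = asr.transcribe(audio_in)
    audio_out, buffer = [], []
    for token in llm.stream(transcript):
        buffer.append(token)
        if token.endswith((".", "?", "!")):  # flush at sentence boundary
            audio_out.append(tts.synthesize(" ".join(buffer)))
            buffer = []
    if buffer:  # flush any trailing partial sentence
        audio_out.append(tts.synthesize(" ".join(buffer)))
    return audio_out

print(run_turn(b"<caller audio>", ASR(), LLM(), TTS()))
```

Real frameworks add what this sketch omits: interruption handling, endpoint detection, tool-call routing, and backpressure between stages.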
The Most Common Misstep
The most common mistake we see at this stage is picking the wrong orchestration provider for your use case. ISVs in particular need granularity and flexibility. You need sub-accounts, complex data routing, custom event handling, and the ability to build a full multi-tenant system on top of a framework, not just configure one.
We've seen it time and time again: a large company picks a simple off-the-shelf solution marketed to startups, gets a demo working in an afternoon, and then runs into serious issues the moment they try to scale, customize, or hand control to their own customers. While many of these tools are incredible for building and validating your MVP or v1, the demo-to-production gap with these tools can be enormous. What works at 10 concurrent calls with a single use case falls apart when you need 50 different agent configurations, per-customer billing, and real-time monitoring across thousands of sessions.
In my experience, picking the right orchestration provider is one of the single most consequential decisions you'll make. You need it to be flexible, extensible, and built for the long haul.
Building Your Own
From time to time, we see companies build their own orchestration from the ground up. The long-term viability of this approach is highly dependent on the in-house technical capabilities of your company.
Pros: You're not reliant on an orchestration vendor's roadmap. You can scale at your own pace. You have ultimate flexibility and can optimize for your specific traffic patterns. For companies with deep real-time audio expertise, this can work.
Cons: It is genuinely hard to keep up with the market. Every month there are new models, new streaming protocols, new ways to handle interruptions and turn-taking. In the last year alone, we've seen the introduction of multiple new ASR streaming modes, new TTS streaming formats, speech-to-speech models, and changes to how major LLM providers handle function calling in streaming contexts. LiveKit and Pipecat have entire teams solely focused on these problems. Matching that pace of innovation while also building your core product is a tall order. Some teams that go this route end up spending more engineering time on orchestration maintenance than they anticipated, and then fall behind on model and protocol support within 3-6 months.
Our general advice: unless orchestration is your core product, or you have a very specific technical requirement that existing frameworks can't accommodate, you're better off building on top of LiveKit or Pipecat and focusing your engineering effort on the application layers where you are highly differentiated.
Hosting: API vs. Self-Hosted
Hosting is a hot topic, and the question is straightforward: should you consume models via API, or self-host, and what are the tradeoffs?
We strongly believe that as the market matures, more and more enterprises and ISVs are going to be self-hosting both the orchestration framework and the models themselves. If you have the infrastructure team to maintain it, the benefits are significant:
Performance: Self-hosted models eliminate the round trip to a third-party API. For TTS specifically, this can shave 50-100ms off your TTFB, which matters when you're trying to keep total turn time under 1 second. You also eliminate the variance that comes from shared API infrastructure; your p99 latency becomes much more predictable.
Cost: API pricing for TTS and LLM inference is designed for margin. When you self-host, you're paying for compute plus a nominal usage fee. The savings compound rapidly at scale.
Control: Data residency, uptime guarantees, and the ability to run in your own VPC. For healthcare, finance, and insurance ISVs, this often isn't optional; it's a requirement.
At Rime, a large majority of our volume runs through self-hosted deployments. We had early traction here and have doubled down. Customers get our Docker containers and run them in their preferred environment, whether that's AWS, GCP, Azure, or on-prem. Typical setup time is under a day for teams with container infrastructure already in place.
As for self-hosting the orchestration framework itself, this is fairly common with LiveKit and Pipecat. We've yet to see meaningful shortcomings. The cost savings are typically well worth the added infrastructure complexity.
The Economics
Here's where the numbers get concrete. Due to our model architecture, customers incur the cost of a new replica, on average, for every additional 200 concurrent calls. That sounds like a constraint until you compare it to the alternative: per-stream concurrency pricing at providers like ElevenLabs or Cartesia, where costs scale linearly with every additional concurrent call.
By moving to self-hosted TTS, we routinely see ISVs cut their TTS bill by 5x. To put a finer point on it: an ISV running 1,000 concurrent calls at peak on a per-stream API model might be spending $15,000-25,000/month on TTS alone. Self-hosted with Rime, that same volume typically runs at $3,000-5,000/month in compute costs. At 5,000 concurrent calls, the gap widens further. For ISVs whose entire margin depends on the cost-per-minute of each call, this is often the difference between a viable business and one that bleeds money at scale.
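To see how the two pricing models diverge, here is a toy cost model. The specific numbers are illustrative assumptions consistent with the ranges above (a $20/stream/month API rate and $700/month per self-hosted replica), not quoted prices from any provider:

```python
def per_stream_monthly_cost(concurrent_calls: int, usd_per_stream: float) -> float:
    """Per-stream API pricing: cost grows linearly with concurrency."""
    return concurrent_calls * usd_per_stream

def self_hosted_monthly_cost(concurrent_calls: int,
                             calls_per_replica: int = 200,
                             usd_per_replica: float = 700.0) -> float:
    """Replica-based self-hosting: cost steps up once per
    `calls_per_replica` additional concurrent calls
    (illustrative compute price, not a quote)."""
    replicas = -(-concurrent_calls // calls_per_replica)  # ceiling division
    return replicas * usd_per_replica

# At 1,000 peak concurrent calls with the assumed rates:
api = per_stream_monthly_cost(1000, 20.0)   # $20,000/month
hosted = self_hosted_monthly_cost(1000)     # 5 replicas -> $3,500/month
print(api, hosted, api / hosted)            # roughly a 5-6x gap
```

The step function is the point: self-hosted cost is flat until you cross a replica boundary, while per-stream cost rises with every single additional call.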
The Model Stack: Why Speech-to-Speech Doesn't Work at Scale
Let's say you've picked LiveKit for orchestration. Now you need to decide on your model stack. Somewhere, you heard that OpenAI Realtime is good. Maybe you saw the demo. It was impressive.
Here's the spoiler: speech-to-speech models do not work for any scaled production use case today. It's one of the most common patterns we see. An ISV walks in the door already using OpenAI Realtime or another speech-to-speech model, and it's just flat-out not working. I'd estimate that 30-40% of new ISVs we talk to have tried this path and are in the process of migrating off it. The reasons are consistent:
Observability: You can't see the granular pipeline. In a traditional ASR → LLM → TTS stack, you can inspect what was transcribed, what the model generated, and what was spoken. With speech-to-speech, it's a black box. When something goes wrong, and it will, you have no idea where the failure occurred. Was the caller misheard? Did the model hallucinate? Did the voice garble the output? You can't tell.
Control: You have no ability to audit or intervene between the text the model generates and what the agent actually says. For regulated industries, compliance-sensitive use cases, or frankly any ISV whose customers need to trust what the agent is saying, this is a non-starter. You can't log the "transcript" of what the agent said because no text transcript exists in the pipeline.
Latency with tool calls: The moment you need to do any external tool call (check a database, look up an appointment, hit a CRM), the average turn balloons to 3-4 seconds. For context, research on conversational turn-taking suggests that gaps over 700ms start to feel unnatural to callers, and anything over 1.5 seconds feels broken. 3-4 seconds is an eternity.
Cost: Speech-to-speech models are significantly more expensive per turn than a well-optimized ASR + LLM + TTS pipeline. OpenAI Realtime pricing, for example, runs substantially higher per minute than the equivalent pipeline approach, and you have far fewer levers to optimize.
The traditional pipeline, ASR → LLM → TTS, gives you modularity. You can swap any component independently, optimize each stage, and maintain visibility into the entire flow. That modularity is what makes production voice systems debuggable, auditable, and improvable over time.
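That swap-any-component property can be made explicit in code. The sketch below uses Python's `typing.Protocol` to show how a pipeline that depends only on component interfaces makes any vendor interchangeable; the vendor classes are hypothetical stand-ins, not real SDKs:

```python
from typing import Protocol

class TextToSpeech(Protocol):
    """Any TTS backend the pipeline accepts: one method, one contract."""
    def synthesize(self, text: str) -> bytes: ...

# Two interchangeable TTS backends; the pipeline only sees the protocol.
class VendorA:
    def synthesize(self, text: str) -> bytes:
        return b"A:" + text.encode()

class VendorB:
    def synthesize(self, text: str) -> bytes:
        return b"B:" + text.encode()

def speak(tts: TextToSpeech, text: str) -> bytes:
    # Swapping vendors is a one-line change at the call site,
    # which is what makes per-vendor A/B tests cheap to run.
    return tts.synthesize(text)

print(speak(VendorA(), "Hello"))
print(speak(VendorB(), "Hello"))
```

The same pattern applies to the ASR and LLM stages: keep the interface narrow and the implementations disposable.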
Evaluating Text-to-Speech
The TTS evaluation space is nascent, and frankly, most companies are doing it wrong.
The overwhelming default is subjective opinion: "Does this voice sound human?" Teams will sit in a room, listen to samples, and pick the one that feels best. We've surveyed how our customers initially approach TTS selection, and the vast majority start with some version of a subjective listening test. Some have started using LLM-as-a-judge approaches, which is a step in the right direction but still misses the fundamental point.
Here's my strong opinion on this: very few companies are evaluating TTS correctly, and the ones who do have a significant competitive advantage.
Goal-Driven Evaluation
To do TTS evals right, you have to go back to the fundamental question: what is the goal of the voice system?
Is the goal to contain someone, to resolve their issue without transferring to a human? Is the goal to solve their problem, measured by resolution rate or customer satisfaction? Is the goal to sell them something, measured by conversion rate or appointment set rate?
That goal is what should drive your evaluation. Not "does it sound human." The real question is: which voice accomplishes the system's goal most often?
You might be surprised by the results. We've seen cases where the voice that sounds most natural in a side-by-side listening test actually performs worse on conversion than a voice with different characteristics: slightly faster pacing, different vocal energy, a more authoritative tone. In one deployment, switching from a "warm and friendly" voice to a "bored and GenZ" voice increased containment rates by over 25%, even though the team unanimously preferred the warm voice in blind listening tests. The human ear is often a bad proxy for business outcomes.
Building an Eval Harness
Once you internalize this, the next step is building a testing harness that can A/B test voices against your actual KPIs in production or near-production conditions. This means routing a percentage of live calls to different TTS configurations, tracking outcomes per cohort, and running the test long enough to reach statistical significance (typically 1,000+ calls per variant, depending on your baseline conversion rate).
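The statistics behind "long enough to reach significance" are standard. A minimal check for a two-voice test is a two-proportion z-test, sketched below with made-up conversion counts:

```python
import math

def two_proportion_z(conv_a: int, n_a: int,
                     conv_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test for an A/B voice test.
    Returns (z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical cohorts: 1,200 calls per variant,
# voice A converts 18%, voice B converts 22%.
z, p = two_proportion_z(216, 1200, 264, 1200)
print(round(z, 2), round(p, 4))  # significant at the 5% level
```

In practice you would also pre-register the sample size and stop rule so you aren't peeking at the test until it says what you want.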
When you have that infrastructure, you unlock something powerful: you can move from evaluating voices to evaluating vendors. You can test new models from any provider against your real-world metrics and make data-driven decisions about your TTS stack.
When buying TTS, you want two things: a good partner who will work with you on the hard problems, and economics that work at your scale. The eval harness is how you prove out both.
The LLM
This is the part of the stack I find most interesting right now, because it's where I see the biggest shift happening.
Historically, the default has been OpenAI. GPT-4o or GPT-4o-mini for most voice use cases, with GPT-4o-mini being the more common choice due to its lower latency (typically 200-400ms TTFB vs. 400-800ms for GPT-4o). We see some teams using Claude, though latency has often been a barrier for real-time voice. But the more interesting trend is the move away from frontier models altogether.
The shift toward fine-tuned, lightweight, purpose-built models is real. Companies are realizing that they're paying for 100% of a frontier model's capabilities but using 0.01% of them. Voice workflows have fundamentally different constraints than a text-based chat interaction:
Responses need to be short. A voice agent that generates four paragraphs is useless. The ideal voice response is 1-3 sentences. You need concise, conversational output, and you're essentially paying frontier-model prices for 20-50 tokens of output per turn.
The use case is bounded. The agent is pulling from a specific database or knowledge set, not answering arbitrary questions about the world. A dental appointment booking agent doesn't need to know about quantum physics. The narrower the domain, the more a fine-tuned smaller model can match or exceed a frontier model's performance on that specific task.
Latency is a first-class concern. Every millisecond of LLM inference time is a millisecond the caller is waiting in silence. A fine-tuned 7B parameter model running on a single GPU can deliver TTFB under 100ms. That's 3-5x faster than a frontier API model, and that latency savings cascades through the entire user experience.
Turn-taking behavior matters. The model needs to understand when a response is complete in a conversational context, not just when it has finished generating. This is a learnable behavior that fine-tuning handles well.
With these constraints in mind, there's a strong case for fine-tuning an open-source model (Llama, Mistral, Qwen, etc.) on the customer's specific dataset and use case. In many cases, a well-tuned 7B or 13B parameter model will outperform GPT-4o on the specific task while being dramatically faster and cheaper, often at 10-20x lower cost per token when self-hosted.
The tradeoff is the upfront investment in fine-tuning infrastructure (typically 1-2 weeks of engineering time for the first model, less for subsequent ones) and the ongoing cost of maintaining and updating the model as the use case evolves. But for ISVs running high volume across many customers, this is increasingly the right move.
Pronunciation
Pronunciation may seem like a detail, but it's one of the highest-impact quality issues in production voice systems, and it's the most common value proposition we hear about at Rime.
The problem is straightforward: TTS models mispronounce things. Names, addresses, medical terms, product names, acronyms, anything domain-specific is at risk. In a text interface, misspelling is annoying. In a voice interface, a mispronunciation can be confusing, unprofessional, or even dangerous. Imagine a healthcare agent mispronouncing a medication name, or a financial services agent mangling a customer's name on every call. These aren't edge cases; they're the norm with most TTS providers on domain-specific content.
The severity scales with the use case. For a general customer service bot, the occasional mispronunciation is forgivable. For a medical triage line or a legal intake system, it's a liability.
The Determinism Problem
What makes this worse with many TTS providers, such as ElevenLabs, is that the mispronunciation isn't even consistent. The model will pronounce the same word differently across calls, sometimes correctly, sometimes not. This makes the problem incredibly hard to debug and impossible to trust. You can't just "fix" a pronunciation if the model might revert to the wrong version on the next call.
For our Mist model specifically, we built an observability and control layer for pronunciation from the ground up. This starts with a robust text normalization layer that gives customers deterministic pronunciation. The same input always produces the same phonetic output. To my knowledge, Mist is the only model on the market with this level of pronunciation control.
The workflow looks like this: if the model mispronounces something, we surface it to you. You correct it, typically in under a day through our Speech QA product. From that point forward, it pronounces it correctly, every time. No retraining, no hoping it fixes itself, no waiting for a model update.
When evaluating pronunciation across TTS providers, don't just test initial accuracy. Test the correctability. How fast can you fix a mispronunciation, and how reliably does the fix hold? A model with 95% initial pronunciation accuracy and a fast, deterministic correction mechanism is more valuable in production than a model with 98% accuracy that you can't fix when it gets something wrong.
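One way to picture a deterministic correction layer is a normalization pass that applies whole-word overrides before text reaches the model. This is a simplified illustration of the general idea, not Rime's actual implementation, and the lexicon entries are hypothetical:

```python
import re

# Deterministic pronunciation overrides: the same input text always
# maps to the same normalized output before it reaches the TTS model.
# These entries are hypothetical examples, not a real customer lexicon.
PRONUNCIATION_OVERRIDES = {
    "Xylocaine": "ZY-lo-kane",
    "Nguyen": "win",
    "SQL": "sequel",
}

def normalize_for_tts(text: str) -> str:
    """Apply whole-word overrides so a correction holds on every call."""
    for word, replacement in PRONUNCIATION_OVERRIDES.items():
        text = re.sub(rf"\b{re.escape(word)}\b", replacement, text)
    return text

print(normalize_for_tts("Dr. Nguyen prescribed Xylocaine."))
# -> "Dr. win prescribed ZY-lo-kane."
```

Because the mapping is a plain lookup applied before synthesis, the fix is auditable and cannot silently revert, which is the property that matters in production.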
Latency
Latency comes up in every voice deployment conversation. Everyone wants to reduce it, and rightfully so. It's the difference between a conversation that feels natural and one that feels like talking to a slow IVR.
Given that Rime only builds TTS models, we can't control the full end-to-end turn time. But we can share what we've learned from hundreds of deployments about where latency actually lives.
The Latency Budget
In a typical ASR → LLM → TTS pipeline, here's how the latency breaks down for a standard turn (caller asks a question, agent responds):
| Component | Typical Range | What Drives It |
| --- | --- | --- |
| ASR (endpoint detection + transcription) | 100-400ms | Model choice, endpoint detection sensitivity, audio chunking strategy |
| LLM (time to first token) | 150-800ms | Model size, prompt length, whether tool calls are needed, API vs. self-hosted |
| Tool calls (if any) | 200-2,000ms | External API latency, database query time, number of sequential calls |
| TTS (time to first byte) | 80-300ms | Model architecture, self-hosted vs. API, audio format |
| Network/telephony overhead | 50-150ms | Geographic distance, telecommunications API processing, codec transcoding |
| Total (no tool call) | ~400-1,200ms | |
| Total (with tool call) | ~800-3,000ms+ | |
Target for a natural-feeling conversation: Under 800ms total turn time for simple responses, under 1,500ms for responses requiring a tool call. Research on human conversational patterns shows the average gap between turns in natural speech is approximately 200ms, so anything under 500ms total feels remarkably fluid, and anything over 1,500ms starts to feel noticeably slow.
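As a sanity check on your own budget, it's worth summing the midpoints of these ranges. The numbers below are illustrative midpoint estimates, not measurements from any specific deployment:

```python
# Rough midpoints (ms) of the ranges in the table above; real numbers
# come from measuring your own pipeline, not from this sketch.
BUDGET_MS = {
    "asr": 250,
    "llm_ttft": 400,
    "tool_call": 800,   # only counted when a tool call happens
    "tts_ttfb": 150,
    "network": 100,
}

def turn_latency(with_tool_call: bool) -> int:
    """Total turn time: sum every stage, adding tool calls only if used."""
    total = sum(v for k, v in BUDGET_MS.items() if k != "tool_call")
    return total + (BUDGET_MS["tool_call"] if with_tool_call else 0)

print(turn_latency(False))  # 900 ms: already above the 800ms target
print(turn_latency(True))   # 1700 ms: above the 1,500ms tool-call target
```

Note that even the midpoints blow past the targets, which is the whole argument for the optimization strategies below: you have to claw back time at several stages at once, not just one.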
The LLM is almost always the dominant contributor, especially when tool calls are involved. A single tool call can add 500ms-2s depending on the model and the external service being called. This is why we see a lot of teams moving to pre-fetching strategies (anticipating what data the agent will need and fetching it before the caller finishes speaking) and parallel tool calls where possible.
The 200ms TTS Threshold
On the TTS side, we're obsessed with bringing TTFB down. Rime's Mist model currently delivers sub-150ms TTFB in self-hosted deployments, and our cloud API typically runs at 150-200ms depending on region.
That said, we've observed a practical threshold: once we got below 200ms TTFB at the machine level, we stopped seeing significant KPI improvements in conversion or containment metrics. The caller can't perceive the difference between 150ms and 100ms of TTS latency because it gets absorbed by the other components in the pipeline.
We continue to push the numbers lower because the real benefit is giving you more headroom elsewhere. If your TTS is fast enough to be invisible, you have more room for the LLM to think, more room for tool calls, and more room for network variability, all without the caller noticing. Think of TTS latency optimization as buying yourself budget that you can spend on richer LLM reasoning or more complex tool integrations.
Latency Optimization Strategies
Beyond choosing fast models, here are the most impactful latency reduction strategies we see working in production:
Streaming everywhere. The entire pipeline should be streaming. The LLM streams tokens to TTS, TTS streams audio to the telephony layer. No component should wait for a complete output before starting its work.
LLM prompt optimization. Shorter system prompts, fewer few-shot examples, and tighter instructions reduce both TTFB and total generation time. We've seen teams cut 100-200ms off LLM TTFB just by trimming their system prompt from 2,000 tokens to 500.
Pre-fetching. If you can predict what data the agent will need (e.g., the caller's account info once you have their phone number), fetch it before the conversation even starts or during the caller's utterance.
Connection pooling and keep-alive. Cold connections to API providers add 50-150ms. Keeping persistent connections eliminates this entirely.
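Of these strategies, pre-fetching is the easiest to show in code. The sketch below uses hypothetical helper functions, with `asyncio.sleep` standing in for real I/O, to overlap a CRM lookup with the caller's utterance so the lookup latency is hidden:

```python
import asyncio

async def fetch_account(phone_number: str) -> dict:
    await asyncio.sleep(0.3)  # stand-in for a CRM/database lookup
    return {"phone": phone_number, "name": "Jordan"}  # hypothetical record

async def listen_to_caller() -> str:
    await asyncio.sleep(0.5)  # stand-in for the caller's utterance + ASR
    return "when is my appointment"

async def handle_call(phone_number: str) -> tuple[str, dict]:
    # Kick off the CRM lookup the moment the call connects, in parallel
    # with listening. By the time the caller finishes speaking, the data
    # is already in hand, so the lookup adds zero perceived latency.
    account_task = asyncio.create_task(fetch_account(phone_number))
    transcript = await listen_to_caller()
    account = await account_task  # usually resolved already
    return transcript, account

transcript, account = asyncio.run(handle_call("+15555550100"))
print(transcript, account["name"])
```

Run sequentially, these two steps would take about 0.8s of wall time; overlapped, the turn is bounded by the 0.5s utterance alone.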
Telephony
The best options here are Telnyx, Twilio, and SignalWire. They've done telephony well for the past decade. Reliability is high (we consistently see 99.9%+ uptime on their trunks) and the SIP trunks are solid enough for most use cases. Their market share in programmable voice means you'll find community support, documentation, and battle-tested integrations for almost any scenario.
HD Voice and Wideband Trunks
There's been a notable rise in 16kHz (wideband) use cases, where customers procure wideband SIP trunks and deliver HD voice to callers. The difference is meaningful: 16kHz audio captures frequencies up to 7kHz compared to narrowband's 3.4kHz ceiling, resulting in a richer, clearer audio experience. Vowels sound fuller, fricatives (s, f, th sounds) are clearer, and the overall impression is dramatically more "present."
But it's also a different experience, and one that can actually expose TTS artifacts that are masked at 8kHz narrowband. Compression artifacts, unnatural breathiness, and robotic undertones that are inaudible at narrowband become noticeable at wideband. If you're moving to 16kHz, make sure you're testing your full pipeline at 16kHz, not just assuming it'll sound better. You may need to re-evaluate your TTS voice selection entirely.
New Entrants
Most recently, LiveKit added support for phone numbers and SIP trunks directly within their platform. This product isn't fully built out yet, but the direction is clear: they want to own more of the stack. If they execute well, it could simplify your voice application architecture.
What's Next
The voice AI infrastructure stack is maturing fast, but we're still early. The companies that will win in this space are the ones making principled infrastructure decisions today: choosing flexible orchestration, self-hosting where the economics make sense, evaluating TTS on business outcomes rather than vibes, and moving toward purpose-built LLMs that are optimized for conversational speed rather than general-purpose breadth.
If there's one takeaway from this paper, it's this: the decisions that seem like implementation details (which orchestration framework, how you host your models, how you evaluate voices) aren't details. They're the foundation. Get them right and you can iterate on everything else. Get them wrong and you'll spend the next year rebuilding.
