Truly realistic text-to-speech hinges on very subtle facets that everyone can hear, but only language specialists can ensure. This is all the more important with specific accents...
At rime, we employ our linguistic expertise to ensure that our accented voices faithfully recreate actual sociolinguistic characteristics. In this post, we delve into what makes our Boston accents pitch-perfect.
There are countless accents of English, both within the United States and across the globe. Each has a myriad of characteristics that combine to create a unique way of speaking. At rime, we use our linguistic expertise to make sure that our accents are as true-to-life as possible, whether that is an Upper Midwest accent like those found in Fargo or a traditional Boston accent. You can hear clips of these synthetic voices below:
Boston English
Let's focus on the Boston accent. The most salient characteristic of the Boston accent, to most Americans, is its 'lack of Rs'. In linguistics, we describe the presence or absence of Rs in terms of rhoticity: accents that pronounce them are rhotic, and accents that drop them are non-rhotic.
However, it's not the case that Bostonian English lacks rhoticity everywhere. In fact, the patterning of rhoticity is subject to a variety of conditions. Let's consider one, and see how the Boston voices offered by rime obey this dialectal rhoticity rule.
We all know the classic sentence "I parked the car in Harvard yard", but what is less known is that not all of the Rs in this sentence are created equal. In Boston English, Rs are normally dropped at the end of a word. However, when the following word starts with a vowel, the R is not dropped.
We can see this patterning below. Here's a clip of a synthetic voice saying the sentence "My father was as a father is". The first instance of "father" is followed by a consonant, 'W', and has the stereotypical R-dropping. The second instance of "father" is followed by a vowel, and actually has the R-sound fully realized.
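To make the rule concrete, here is a minimal sketch in Python. It is only an illustration of the pattern described above, not how rime's voices are built: it works on spelling rather than on actual sounds, it only looks at words spelled with a final "r", and the function names `tokenize` and `final_r_realized` are made up for this example.

```python
import re

# Assumption: we approximate "starts with a vowel" using spelling, not pronunciation.
VOWEL_LETTERS = set("aeiou")


def tokenize(sentence):
    """Split a sentence into words, ignoring punctuation."""
    return re.findall(r"[A-Za-z']+", sentence)


def final_r_realized(words):
    """For each word spelled with a final 'r', report whether the R is pronounced.

    Returns a list of (index, word, pronounced) tuples. The R counts as
    pronounced only when the next word begins with a vowel letter; before a
    consonant, or at the end of the sentence, it is treated as dropped.
    """
    results = []
    for i, word in enumerate(words):
        if not word.lower().endswith("r"):
            continue
        next_word = words[i + 1] if i + 1 < len(words) else None
        pronounced = next_word is not None and next_word[0].lower() in VOWEL_LETTERS
        results.append((i, word, pronounced))
    return results


print(final_r_realized(tokenize("My father was as a father is")))
# → [(1, 'father', False), (5, 'father', True)]
```

The first "father" comes out as R-less because "was" begins with a consonant, while the second keeps its R because "is" begins with a vowel. A real system would need to apply the rule to phonemes rather than letters, since spelling and pronunciation often disagree.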
Spectrogram Evidence
Linguists use a tool called a spectrogram to investigate the physical acoustics of human speech. The presence of an English R sound is characterized by the third frequency band dipping below 2 kHz at the end of the word. We call this frequency band the third formant. For example, the spectrogram for the word 'bar' is shown below. Note the drop-off of the dark-colored band across the word.
We can investigate our text-to-speech audio via spectrogram and see that the first "father" in that sentence has a flat third formant, indicating that there is no R-sound. The second instance of "father", because it is followed by a vowel, has the tell-tale drop-off in the third formant.
father followed by was
father followed by is
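The third-formant check described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the formant track is assumed to have already been extracted by some phonetics tool as a list of frequencies in Hz, the function name `has_r_coloring` and the `tail_fraction` parameter are hypothetical, and the sample numbers are invented for illustration rather than measured from any audio.

```python
def has_r_coloring(f3_track_hz, threshold_hz=2000.0, tail_fraction=0.3):
    """Return True if the third formant dips below the threshold near the end of the word.

    f3_track_hz: third-formant frequencies in Hz, one per analysis frame.
    threshold_hz: the 2 kHz cutoff mentioned in the text.
    tail_fraction: assumed share of the word treated as its "end".
    """
    if not f3_track_hz:
        raise ValueError("f3_track_hz must contain at least one frame")
    tail_len = max(1, round(len(f3_track_hz) * tail_fraction))
    tail = f3_track_hz[-tail_len:]
    return min(tail) < threshold_hz


# Invented example tracks (not real measurements).
flat_f3 = [2600, 2580, 2610, 2590, 2600, 2595]     # stays high: no R-sound
dipping_f3 = [2600, 2550, 2400, 2200, 1950, 1800]  # drops at the end: R-sound

print(has_r_coloring(flat_f3))     # → False
print(has_r_coloring(dipping_f3))  # → True
```

Looking only at the final portion of the track mirrors the observation that the dip appears at the end of the word. How large that portion should be is a judgment call.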
In a word like "square", a Boston English speaker will not only drop the R-sound, but also turn the vowel sound into a diphthong: two vowel sounds in a row. This is indicated by one of the lower formants, in this case the second, rising to mark the transition from one vowel to the other. Our rime synthetic Boston voices not only correctly drop the R (as seen in the lack of drop-off in the third formant below), but also show the second formant (the one below it) rising, indicating the diphthongization.
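A rising second formant can be checked in much the same way. The sketch below is only an illustration under assumptions: the formant track is taken as an already-extracted list of frequencies in Hz, the function name `f2_rises` is hypothetical, the 200 Hz minimum rise is an arbitrary cutoff chosen for this example, and the sample values are invented rather than measured.

```python
from statistics import mean


def f2_rises(f2_track_hz, min_rise_hz=200.0):
    """Return True if the second formant ends clearly higher than it starts.

    Compares the average of the first third of the frames with the average
    of the last third. min_rise_hz is an assumed cutoff, not an established value.
    """
    if not f2_track_hz:
        raise ValueError("f2_track_hz must contain at least one frame")
    third = max(1, len(f2_track_hz) // 3)
    start = mean(f2_track_hz[:third])
    end = mean(f2_track_hz[-third:])
    return (end - start) >= min_rise_hz


# Invented example tracks (not real measurements).
rising_f2 = [1700, 1750, 1850, 1950, 2050, 2100]  # glide from one vowel to another
steady_f2 = [1800, 1810, 1795, 1805, 1800, 1790]  # a single steady vowel

print(f2_rises(rising_f2))  # → True
print(f2_rises(steady_f2))  # → False
```

Averaging over the first and last thirds, rather than comparing single frames, makes the check less sensitive to frame-to-frame jitter in the formant estimates.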
Wrapping up
This is just a quick glimpse into the subtleties of English accents, but there are countless other patterns that make up each and every accent that rime offers. Getting these small things right is important not only to the believability of each accent, but also as a way to do justice to the millions of individual speakers of these accents. Keep an eye out for more deep dives into rime's ultra-realistic language offerings!