Howdy

When Y'all is Said and Done

April 16, 2024 | Dialects, Conversation

Southern US English is far and away the largest and most commonly spoken regional dialect in the United States. And its spoken characteristics get right to the heart of conversational AI challenges. Getting the difference between text and speech right is of prime importance for speech synthesis.

Southern US English has many salient features, and most of us are aware of them, whether implicitly or explicitly, and whether in realistic or exaggerated form. Some characteristics are employed by some Southern speakers and not others, and some characteristics, like the second-person plural pronoun y'all, have even been self-consciously adopted by non-Southern speakers to some extent.

This stuff is especially important as a case study in the differences between text and speech. For one, a handy feature of text is that it collapses differences in pronunciation, but this also means that text-to-speech needs to restore those pronunciation differences when it turns the text into speech, and do so correctly.

An additional factor is that written text (both online and off) overwhelmingly conforms to standard English patterns. But in largely spoken dialects like Southern US English, there are robust and subtle patterns that basically never show up in a text corpus. So having text as the foundation of a conversational model demands careful attention to these patterns, which aren't going to be in the data that LLMs are generally trained on, and aren't going to be in most script-writers' toolbelts either.

If you ask ChatGPT to produce some Southern-style sentences, it mostly just fills the output with home-spun sounding phrases and plenty of Eye Dialect spellings. This is basically just a cartoon for the entertainment of non-Southern speakers. True Southern Syntax is far more interesting.

What ChatGPT thinks Southern English looks like

Southern Syntax

Before any conversational AI can hope to get Southern English pronunciation right, it has got to get Southern Syntax right. Let's look into a few salient features here:

One of the best resources for any foray into North American English is the Yale Grammatical Diversity Project: English in North America, founded by Raffaella Zanuttini. Over the years, our colleagues at Yale have compiled the largest and most in-depth collection of American English variation around, and we strongly urge everyone to take a look around their extremely user-friendly website.

The Rough Extent of Southern US English

Multiple Modals

An important feature of Southern US English is the so-called multiple modal construction. A modal verb (like can, should, might, will, etc.) generally indicates the degree of possibility, necessity, or obligation of some predicate. Multiple modal constructions are the notable feature of Southern English where two or more modals are strung together in a single verb phrase:

"I might could use some more coffee."

"I reckon I might should better try to get me a little bit more sleep."

might should

What Pairs Arise? As noted in the cited article (written by fellow UMD ling alum Nick Huang), the most common instances of these multiple modals are might could, might can, and might would, according to Mishoe and Montgomery 1994, and they predominantly involve may or might as the first of the 2+ modals.

How Do Questions and Negation Work? In any variety of English, we move our modals to the front of the sentence when forming a question: "she might do that" becomes "might she do that?". When forming a question with double modals, it is, interestingly, the 2nd of the two modals that moves to the front of the sentence:

Could you might __ go to the store for me? (Tennessee, Hasty 2011)

Similarly, it is the 2nd modal that gets negated in negative sentences:

I was afraid you might couldn't find it. (Texas, Di Paolo 1989)
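The two patterns above can be written down as a toy rule. The sketch below is only an illustration: the modal list, the table of negated forms, and the function names are our own assumptions, and it works on plain word lists rather than a real parse.

```python
# Toy modal inventory (an assumption for this sketch, not a full list).
MODALS = {"may", "might", "can", "could", "will", "would", "should", "must"}

# Contracted negative forms for the second modal (illustrative subset only).
NEGATED = {"can": "can't", "could": "couldn't",
           "would": "wouldn't", "should": "shouldn't"}


def _second_modal_index(words):
    """Return the index of the 2nd modal in the first adjacent modal pair."""
    for i in range(len(words) - 1):
        if words[i].lower() in MODALS and words[i + 1].lower() in MODALS:
            return i + 1
    raise ValueError("no double modal found")


def to_question(statement):
    """Front the 2nd modal and leave the 1st one in place."""
    words = statement.rstrip(".").split()
    second = words.pop(_second_modal_index(words))
    # Lowercase the old sentence-initial word unless it is the pronoun "I".
    # (A proper-noun subject would need smarter handling than this.)
    if words[0] != "I":
        words[0] = words[0].lower()
    return " ".join([second.capitalize()] + words) + "?"


def negate(statement):
    """Attach the negation to the 2nd modal, not the 1st."""
    words = statement.split()
    i = _second_modal_index(words)
    words[i] = NEGATED[words[i].lower()]  # KeyError if the form is not in the toy table
    return " ".join(words)


print(to_question("You might could go to the store for me."))
print(negate("I was afraid you might could find it."))
```

Run on the affirmative versions of the two attested examples, this reproduces the question and the negative sentence quoted above.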

Personal and Presentative Datives

Two more, arguably related, phenomena are Personal Datives and Dative Presentatives. Both are characteristic of Southern US English and both involve personal dative pronouns like me, her, you, etc.

Personal Datives are pronouns that occur right after the verb and refer to the subject of the sentence:

I'm gonna write me a letter to my cousin. (Christian 1991)

She made her a cake for her sister.

Some interesting constraints on the Personal Dative are:

  • They must occur adjacent to the main verb of the clause
  • They must refer to the subject of the sentence
  • They can involve any dative personal pronoun except it
  • They must occur in a transitive sentence ("I'm gonna sing me a song" is OK, "I'm gonna sing me" is not)
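These constraints are concrete enough to check mechanically. Here is a minimal sketch, assuming a hand-built clause record with a pronoun subject; the `Clause` type and the `violations` function are hypothetical names for illustration, and adjacency to the verb is assumed by construction rather than checked.

```python
from dataclasses import dataclass
from typing import List, Optional

# Subject pronoun -> object-form pronoun that can serve as its Personal Dative.
SUBJECT_TO_DATIVE = {
    "i": "me", "you": "you", "he": "him", "she": "her",
    "we": "us", "they": "them", "it": "it",
}


@dataclass
class Clause:
    subject: str                  # pronoun subject, e.g. "I" or "she"
    verb: str                     # main verb; the dative is assumed to follow it directly
    dative: str                   # candidate Personal Dative pronoun
    direct_object: Optional[str]  # e.g. "a song", or None if there is no object


def violations(clause: Clause) -> List[str]:
    """List which of the constraints above the clause breaks (empty list = OK)."""
    problems = []
    if clause.dative.lower() == "it":
        problems.append("dative pronoun cannot be 'it'")
    if SUBJECT_TO_DATIVE.get(clause.subject.lower()) != clause.dative.lower():
        problems.append("dative must refer to the subject")
    if not clause.direct_object:
        problems.append("clause must be transitive")
    return problems


print(violations(Clause("I", "sing", "me", "a song")))  # well-formed
print(violations(Clause("I", "sing", "me", None)))      # missing direct object
```

A real system would need a parser to find the subject, verb, and object; the point here is only that each bullet maps onto a simple, testable condition.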

A similar phenomenon is the Dative Presentative, which also involves dative personal pronouns, but in this case they follow a presentative word like here's or there's:

Here's you some water.

There's me an idea.

here's you an example

Here's you a map of where these are attested, created by Yale linguist Jim Wood:

Dative Presentatives Map

Southern Pronunciation

As said above, text flattens accents. This is of course extremely useful for text communication: regardless of whether the writer/reader is from Dublin, Jamaica, Bangalore, Peoria, or Pretoria, they can communicate using the same text even though they might have difficulty understanding each other if they spoke the same text aloud. But for TTS applications, we need to re-imbue the silent text with the appropriate pronunciation.

That said, no accent is monolithic. Speakers are grouped together as sharing an accent based on some commonly shared features, but not all speakers use all characteristic features all the time. Nevertheless, there are some major features that strongly indicate one accent versus another.

The pin/pen Merger

pin/pen merger map from Wikipedia

One of the most salient characteristics of Southern US English is the so-called pin/pen merger. The vowels in pin and pen are pronounced the same when they precede "nasal" sounds (like 'm' and 'n'). The Southern voices at Rime accurately reproduce this classic phenomenon, as can be heard below. In this example the text inputs were 'ten', 'tin', 'pen', and 'pin', in that order:

pin/pen merger
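As a rough illustration of the merger as a rule over phonemes, the sketch below rewrites "ɛ" as "ɪ" whenever the next sound is a nasal. The list-of-symbols input, the nasal set, and the function name are assumptions made for this example; we also assume the merged vowel lands on "ɪ", which is the usual description but not the only possible outcome.

```python
# Nasal consonants that trigger the merger in this sketch.
NASALS = {"m", "n", "ŋ"}


def apply_pin_pen_merger(phones):
    """Return a copy of `phones` with 'ɛ' raised to 'ɪ' before a nasal."""
    merged = list(phones)
    for i in range(len(merged) - 1):
        if merged[i] == "ɛ" and merged[i + 1] in NASALS:
            merged[i] = "ɪ"
    return merged


pen = apply_pin_pen_merger(["p", "ɛ", "n"])
pin = apply_pin_pen_merger(["p", "ɪ", "n"])
print(pen == pin)                             # pen and pin come out identical
print(apply_pin_pen_merger(["p", "ɛ", "t"]))  # pet is untouched: no nasal follows
```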

Monophthongization

Another rather noticeable feature of Southern US English is "monophthongization" of the vowel in eye, which essentially means rendering a diphthong (which we discussed in this blog post on Boston English) as a single vowel. In the case of Southern English, the Standard American English "aɪ" diphthong becomes a pure, elongated "aː". In the Rime samples below, we can hear this in the words ride and aisle (note also the common "g-dropping" at the end of walking).

ride/rahd

aisle/ahl
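On a broad IPA transcription, both effects heard in these samples can be approximated with two string rewrites. This is a deliberately simplified sketch: the function name and the example transcriptions are our own assumptions, and it applies the "aɪ" rule everywhere, whereas real speakers vary by phonetic context and region.

```python
import re


def southernize(ipa):
    """Apply two toy rewrites to a broad IPA string."""
    # Monophthongization: the diphthong "aɪ" becomes a long "aː".
    ipa = ipa.replace("aɪ", "aː")
    # "g-dropping": word-final "ɪŋ" becomes "ɪn".
    return re.sub(r"ɪŋ\b", "ɪn", ipa)


print(southernize("raɪd"))   # ride
print(southernize("aɪl"))    # aisle
print(southernize("wɔkɪŋ"))  # walking
```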

Even Subtler Features: eɪ versus ɛɪ

At Rime, we're not satisfied with accurately reproducing only the broad and obvious features of a given accent. Rather, we want to capture an accent so accurately that even features that are only subconsciously noticeable to native speakers are correctly produced. Our accents are improving all the time, but we're already able to capture extremely subtle phenomena.

For example, there's a phenomenon, related to the "aɪ" --> "aː" monophthongization noted above, where the vowel in words like day is pronounced differently: the "eɪ" vowel in Standard English day becomes "ɛɪ" in Southern English. This difference is non-trivial for non-Southern speakers to hear, but we can measure it the way we normally measure vowels.

The first part of the diphthong, "ɛ", has average formant measurements of F1=610 Hz, F2=1900 Hz for male speakers, as opposed to F1=390 Hz, F2=2300 Hz for "e". And when we measure the first vowel in hey from the above clip, the results are right in line with the standard for the "ɛ" vowel: F1=~650 Hz, F2=~1850 Hz (represented as the lowest two dotted lines).

The derived formants for a Rime southern accent *hey*, 1895 Hz reference for F2
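The comparison can be made explicit with a nearest-reference lookup. The sketch below takes the reference formants quoted above, treats all values as Hz, and picks whichever vowel is closest to the measurement. Plain Euclidean distance in Hz and the names used here are simplifying assumptions for illustration; perceptual scales such as Bark or mel would weight F1 and F2 differently.

```python
import math

# Average male-speaker reference formants (F1, F2) in Hz, as quoted in the text.
REFERENCE_FORMANTS_HZ = {
    "ɛ": (610.0, 1900.0),
    "e": (390.0, 2300.0),
}


def nearest_vowel(f1, f2, references=REFERENCE_FORMANTS_HZ):
    """Return the reference vowel whose (F1, F2) point is closest to the measurement."""
    return min(references, key=lambda v: math.dist((f1, f2), references[v]))


# Approximate measurement of the first vowel in "hey" from the clip above.
print(nearest_vowel(650, 1850))
```

With the measured values, the distance to the "ɛ" reference is about 64 Hz versus roughly 520 Hz to "e", so the lookup lands on "ɛ".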

The Bigger Picture

The particulars of Southern US speech are very important for getting conversational AI for this major accent to sound right. But the bigger picture is that text is very different than speech. Text is to language sort of the way clothes are to human anatomy: a helpful tool that corresponds to a logically prior natural phenomenon, but one that also simplifies and standardizes it.

Text conflates and obscures countless features of spoken language and is a blunt but effective tool for silent communication. The data that standard generative AI is trained on generally does not include these spoken-language patterns. At Rime, we're building TTS that faces these facts head-on. Only this way will conversational AI truly deliver on its promise.