In pursuit of the perfect AI voice
How developers are humanizing their virtual personal assistants.
The virtual personal assistant is romanticized in utopian portrayals of the future from The Jetsons to Star Trek. It's the cultured, disembodied voice at humanity's beck and call, eager and willing to do any number of menial tasks.
In its early real-world implementations, a virtual receptionist directed customers ('To hear more menu options, press 9'). Voice-typing software transcribed audio recordings. It wasn't until 2011 that Apple released Siri and the public had its first interactions with a commercially viable, dynamic personal assistant. Since Siri's debut with the release of the iPhone 4S, Apple's massive customer base has only gotten larger; the company estimates that more than 700 million iPhones are currently in use worldwide.
Amazon's Alexa and Microsoft's Cortana debuted in 2014; Google Assistant followed in 2016. IT research firm Gartner predicts that many touch-required tasks on mobile apps will become voice-activated within the next several years. The voices of Siri, Alexa and other virtual assistants have become globally ubiquitous. Siri can speak 21 languages and includes male and female settings; Cortana speaks eight languages, Google Assistant four and Alexa two.
But until fairly recently, voice -- and the ability to form words, sentences and complete thoughts -- was a uniquely human attribute. It's a complex mechanical task, and yet nearly every human is an expert at it. Human response to voice is deeply ingrained, beginning when children hear their mother's voice in the womb.
What constitutes a pleasant voice? A trustworthy voice? A helpful voice? How does human culture influence machines' voices, and how will machines, in turn, influence the humans they serve? We are in the infancy of developing a seamless facsimile of human interaction. But in creating it, developers will face ethical dilemmas. It's becoming increasingly clear that for a machine to seamlessly stand in for a human being, its users must surrender a part of their autonomy to teach it. And those users should understand what they stand to gain from such a surrender and, more importantly, what they stand to lose.
Teri Danz is a vocal coach who was named by entertainment-industry publication Backstage as one of the top eight in the United States. Her clients include singers, news anchors and stand-up comedians looking to improve their technique and range and to steady their nerves. Among her most high-profile clients are comedian Greg Fitzsimmons and actor Taylor Handley.
Danz believes that current VPA voices lack resonance -- the vocal quality most associated with warmth.
When I asked Danz to listen to three Siri voice samples from three different eras -- iOS 9 (2015), iOS 10 (2016) and iOS 11 (2017) -- she connected their differences to Apple's target audience.
"As the versions progress from iOS 9, the actual pitch of the voice becomes much higher and lighter," said Danz. "By raising the pitch, what people hear in iOS 11 is a more energized, optimistic-sounding voice. It is also a younger sound.
"The higher pitch is less about the woman's voice being commanding and more about creating a warmer, friendlier vocal presence that would appeal to many generations, especially millennials," continued Danz. "With advances in technology, it is becoming easier to adapt quickly to a changing marketplace. Even a few years ago, things we now take for granted in vocal production may not have been developed, used or adopted."
There is research to support Danz's conclusions: The book Wired for Speech: How Voice Activates and Advances the Human–Computer Relationship by Clifford Nass and Scott Brave explores the relationships among technology, gender and authority. When it was published in 2005, Nass was a professor of communications at Stanford University and Brave was a postdoctoral scholar at Stanford. Wired for Speech documents 10 years' worth of research into the psychological and design elements of voice interfaces and the preferences of users who interact with them.
According to their research, men like a male computer voice more than a female computer voice. Women, correspondingly, like a female voice more than a male one.
But regardless of this social identification, Nass and Brave found that both men and women are more likely to follow instructions from a male computer voice, even if a female computer voice relays the same information. This, the authors theorize, is due to learned social behaviors and assumptions.
Elsewhere, the book reports another, similar finding: A "female-voiced computer [is] seen as a better teacher of love and relationships and a worse teacher of technical subjects than a male-voiced computer." Although computers do not have genders, the mere representation of gender is enough to trigger stereotyped assumptions. According to Wired for Speech, a sales company might implement a male or female voice depending on the task.
"While a male voice would be a logical choice for [an] initial sales function, the complaint line might be 'staffed' by a female voice, because women are perceived as more emotionally responsive, people-oriented, understanding, cooperative and kind.
However, if the call center has a rigid policy of 'no refunds, no returns,' the interface would benefit from a male voice as females are harshly evaluated when they adopt a position of dominance."
Rebecca Kleinberger, a research assistant and PhD candidate at the MIT Media Lab, added some scientific context to Nass and Brave's findings. Her primary academic interest is voice and what people can learn about themselves by listening to their voice.
"Unlike a piano note, which, when looking at a spectrogram, will be centered around a single main frequency peak, a human voice has a more complex spectrum," Kleinberger said. "Vocal sounds contain several peaks that are called formants, and the position of those formants roughly corresponds to the vowel pronounced. So the human voice might be seen more as playing a chord on the piano rather than a single note. Sometimes, these formants are going to have a musically harmonious relationship between themselves, like a musical chord, and sometimes, they have an inharmonious relationship and the chord sounds 'off' according to the rules of western harmony."
"Interestingly, in the lower frequencies, those formants have a more harmonious relationship than in the higher," Kleinberger continued. "Because of bone conduction, we each individually hear the lower part of our own voice better or louder than the higher parts. This seems to play a role in the fact that most of us dislike hearing our own voice recorded and also why generally we might prefer lower voices to higher voices."
It might also be why Siri's 2013 voice, according to communications-analytics company Quantified Communications, had a pitch that was 21 percent lower than the average woman's -- not only to reflect "masculine" qualities but also to sound acoustically pleasing.
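Kleinberger's "chord, not a single note" picture is easy to see for yourself. Below is a minimal sketch in Python -- assuming a hypothetical mono recording named vowel.wav -- that computes the magnitude spectrum of a sustained vowel and prints its strongest peaks below 4 kHz, the region where the vowel-defining formants sit. Real formant analysis uses more careful methods, such as linear predictive coding, but the peaks this prints are a rough stand-in for the "notes" of the vocal chord she describes.

```python
# Minimal sketch: find the strongest spectral peaks of a sustained vowel.
# "vowel.wav" is a hypothetical mono recording; real formant estimation
# uses methods such as linear predictive coding (e.g., in Praat).
import numpy as np
from scipy.io import wavfile
from scipy.signal import find_peaks

rate, samples = wavfile.read("vowel.wav")
samples = samples.astype(float)

spectrum = np.abs(np.fft.rfft(samples))              # magnitude spectrum
freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)  # bin frequencies, Hz

# Keep the region below 4 kHz, where the vowel-defining formants live.
keep = freqs < 4000
freqs, spectrum = freqs[keep], spectrum[keep]

# Peaks at least 100 Hz apart and above 10 percent of the maximum.
bin_hz = freqs[1] - freqs[0]
peaks, _ = find_peaks(spectrum, distance=max(1, int(100 / bin_hz)),
                      height=0.1 * spectrum.max())
for i in sorted(peaks, key=lambda p: spectrum[p], reverse=True)[:4]:
    print(f"peak near {freqs[i]:.0f} Hz")
```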
What might we learn from all of this? Users want technology to assist them, not tell them what to do. And a fledgling technology company, eager to gain a foothold in a competitive marketplace, might rather play into cultural assumptions -- create a feminine voice that is, ironically, low in pitch -- than challenge deeply ingrained biases. It's more expedient to uphold the status quo than to change it.
Engadget reached out to several technology companies and asked how they determined the voices they use. Amazon was the only one to respond, saying in an email to Engadget: "To select Alexa's voice, we tested several voices and found that this voice was preferred by customers." In an article in Wired, writer David Pierce interviewed Apple executive Alex Acero, who is in charge of Siri's technology. The company's designers and user-interface team sifted through hundreds of voices to find the right ones for Siri.
"This part skews more art than science," Pierce writes. "They're listening for some ineffable sense of helpfulness and camaraderie, spunky without being sharp, happy without being cartoonish."
The common retort to concerns about bias and subjectivity is that technology does not determine culture but merely reflects it. But in an interview with the Australian Broadcasting Corporation, Miriam Sweeney, a feminist researcher and digital-media scholar at the University of Alabama, discusses how digital assistants are often subject to verbal abuse and sexual solicitation. The VPA will respond in a moderate, even apologetic tone, regardless of the user's treatment. And when feminine-voiced VPAs are programmed to flirt back or answer with sassy repartee, that bad behavior is rendered acceptable.
No real human should be subject to this sort of treatment. If developers' quest is to create a relatable, digital stand-in, they may have to imbue their creations with a basic sense of dignity and respect.
Anyone who has given a public speech knows that voices change depending on the environment you're in.
In an auditorium, for instance, sweat collects. Muscles in the shoulders, neck and throat tighten. And much of the resulting physical pressure goes to the throat's vocal folds, which bear increased tension and vibrate at a faster rate. That's why so many people sound strained and high-pitched when speaking to a crowd. Combine this with irregular, quickened breathing that can cause a voice to shake or crack and even the most practiced orator can fall victim to nerves.
In the course of her research, Kleinberger has also observed that your voice -- the musicality, the tempo, the accent and especially the pitch -- changes depending on the person you're talking to. Kleinberger notes that when women are in a professional setting, they typically use a lower voice than when they speak to their friends.
These variables are ingrained in the human experience because learning to produce sound is an inverse problem that begins with mimicry: a child hears a target sound and must work backward to the mouth and tongue positions that reproduce it. There are many ways, for example, to shape one's mouth to make the "ma ma" sound, and positive reinforcement from parents and peers shapes people's vocal techniques from a young age.
A major part of VPAs' appeal is that they replicate human interaction -- they respond to their users with predetermined jokes, apparently offhand remarks and short, verbal affirmations. Yet unlike humans, they are unerringly consistent in the way they sound.
Wired for Speech co-author Scott Brave, who is now CTO of FullContact (which helps businesses analyze their customers for marketing and data purposes), expressed enthusiasm for a more "neurological" layer of insight -- insight he did not have when he and Nass were conducting their experiments. He also discussed what surprised him most while writing Wired for Speech.
"One of the studies I was involved in [years ago] was related to emotions in cars, and what was the 'right' emotion for a car to represent as a co-pilot," said Brave, who earned a PhD in human-computer interaction at Stanford. "It turns out that matching the user's emotions is more important that what the emotion is. It makes the user think, 'Hey, this entity is responding to me.'"
"If a person isn't feeling calm, is it always going to be the case that a calming [computer] voice will be the most effective?" asked Brave. "The best way to get someone to change his or her state is to first match that state emotionally and then bring that person to a place that's soothing."
Perhaps pinpointing a perfect voice was the wrong goal all along. The future is no longer about developing a single ideal voice that appeals to the widest possible audience. That's just a stopgap on the path toward the real goal: a voice that, like ours, changes in response to the human beings around it.
"Technologies have individuality of voice, but they lack prose of voice."
"Ideally, the machine acknowledges context to what is being said," said Brave. "Because the needs of a user get expressed over the course of a conversation. A user cannot always express what he wants in a few words.
"Some of that context is linguistic: What does a person mean when he says a particular word? Some of that is emotional, and some of that is historical," said Brave. "There are many types of context. And our current systems are aware of very few."
Kleinberger agreed with this sentiment.
"[When technologies speak to us currently], they're doing so in voices that are uncanny, still slightly robotic and non-contextual," said Kleinberger. "Technologies have individuality of voice, but they lack diversity and responsivity of the prosody, vocal posture and authenticity. An individual's prosody changes all the time and is very responsive to the context."
Today, technology can pick out a voice's subtleties with a precision that may be discomforting. Hormone levels, for example, can affect the texture of a person's voice.
"Our voice reveals a lot about our physical health and mental state," Kleinberger said. "Changes in tempo in sentences can be used as a marker of depression, breathiness in the voice can be an indicator of heart or lung disease, and acoustic information about the nonlinearity of air turbulences could even predict early stages of Parkinson's disease.
"Smart home devices are listening to us all the time, and soon, they might be able to detect those physical and mental conditions, and as the voice is also very dependent on our hormone levels, even one day detect if someone is pregnant before the mother knows it," Kleinberger continued.
An always-on AI can act as a fly on the wall. It can extract metadata as it listens to partners and family members talk to one another. It can detect the social dynamics among people solely from the acoustic information it gathers. What it cannot do, as of now, is explicitly act upon this information. It will not change its phrasing to match a person's unstated preference; it will not raise or lower its pitch depending upon who is requesting its help -- yet. Kleinberger believes we may be as few as five years from this.
Could a personal assistant someday listen to its users, detect stress, suss out power imbalances in relationships and match its voice, phrasing and tempo accordingly? If so, the "ideal" voice is specific to each person, and like a human's voice, it should adjust itself in real time throughout the day.
If this is successfully implemented, it has enormous societal potential. Imagine an AI that adapts to its user's manner of speaking -- that raises its voice or reacts sharply in response to its user's tone, not just the content of her speech.
"Could Siri mimic the voice of the user to be more likable? Absolutely. We humans do that all the time unconsciously, adapting our vocal timbre to the people we talk to."
There is an uncanny, morally gray area that comes with this territory. The goal of many developers is to create a seamless illusion of sentience, yet if users are being monitored and judged beyond their control or consent, the technology can easily read as insidious or manipulative. Kleinberger mentioned Microsoft's infamous Clippy as a cautionary example: users want to be catered to, not intruded upon uninvited.
"There are many tangible benefits from collecting data from the voice, but I believe that creating a 'truly caring dialogue' between Siri and a user is not one of them," said Kleinberger. "But could Siri mimic the voice of the user to be more likable? Absolutely. We humans do that all the time unconsciously, adapting our vocal timbre to the people we talk to.
"It would be great," Kleinberger concluded, "as long as the whole process and the data are transparent for and controlled by the user."
On the issue of privacy, Apple is more discreet than competitors like Google or Facebook. Rather than pulling data off a server to customize its assistant, Apple emphasizes the less intrusive power of on-device machine learning and AI.

But Apple's competitors already have that, and recently Siri has fallen behind other VPAs in overall ability and diversity of features. It's become apparent that the more personal information a user surrenders, the more the VPA can learn and the better it can serve the user.
Human-to-human relationships, after all, require openness and transparency. Perhaps for humans to create that sort of dialogue with technology -- whether through voice or another avenue -- they must be similarly open and transparent to the technology. And a trusting relationship cuts both ways: Companies need to be more explicit and aboveboard about the type of data they collect from consumers and how that data is used.
How much of yourself are you willing to sacrifice in search of symbiotic perfection?