Advertisement

Baidu’s text-to-speech system mimics a variety of accents ‘perfectly'

What used to take hours of neural net training now takes under 30 minutes.

Chinese tech giant Baidu's text-to-speech system, Deep Voice, is making a lot of progress toward sounding more human. The latest news about the tech are audio samples showcasing its ability to accurately portray differences in regional accents. The company says that the new version, aptly named Deep Voice 2, has been able to "learn from hundreds of unique voices from less than a half an hour of data per speaker, while achieving high audio quality." That's compared to the 20 hours hours of training it took to get similar results from the previous iteration, for a single voice, further pushing its efficiency past Google's WaveNet in a few months time.

Baidu says that unlike previous text-to-speech systems, Deep Voice 2 finds shared qualities between the training voices entirely on its own, and without any previous guidance. "Deep voice 2 can learn from hundreds of voices and imitate them perfectly," a blog post says.

In a research paper (PDF), Baidu concludes that its neural network can create voice pretty effectively even from small voice samples from hundreds of different speakers. All of which to say, it might not be long before we start hearing digital assistants that are more representative of the voices users encounter in their day-to-day lives.

To hear how far the tech has come and for more information of how the team got to this point, hit the source links below.