A more human way of communicating with machines

10 October 2017


When we humans communicate with each other, we use our voices. By the age of six months, most of us understand the sounds of our native language. By the age of two, we know a word for almost everything, and by the age of five we use adult grammar almost perfectly. This is impressively fast for acquiring a skill as complex and difficult as speaking a language.
Speech allows us to communicate with each other efficiently and reasonably quickly – far faster than using text as an input medium. Yet today, when we communicate via and especially with machines more than ever before in history, we still mainly use text, even though text input is slow – much slower than direct speech. This is why we need a more human way of communicating when interacting with machines. And for that we need really good speech recognition.

The History of Speech Recognition
In 1952, Bell Laboratories invented “Audrey”, which could only comprehend spoken numbers. Later, in the 1970s, Carnegie Mellon came up with “Harpy”, which could understand 1,011 words by breaking them up into smaller units and focusing on features such as the vowels they contained. However, because people pronounce words differently, Harpy struggled and was not very accurate. To tackle this issue, speech recognition systems started to use statistical models during the 1980s, where the Hidden Markov Model laid the foundation for today’s speech recognition.
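To make the Hidden Markov Model idea concrete, here is a minimal sketch of the forward algorithm, which computes how likely a sequence of acoustic observations is under a given HMM. The states, probabilities and observation symbols below are invented toy values for illustration, not taken from any real recognizer.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) by summing over all hidden state paths."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[p] * trans_p[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())

# Toy model: two hidden "phone" states emitting two acoustic symbols
states = ("A", "B")
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}

likelihood = forward(("x", "y"), states, start_p, trans_p, emit_p)
```

A real recognizer would run this over thousands of states and pick the word sequence whose model explains the audio best, but the core computation is the same.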
In 2001, speech recognition reached an accuracy of about 80%, and not much progress was made for years afterwards. But the development of deep neural networks, and specifically of recurrent neural networks (suited to data that changes over time), has led to incredible performance improvements in recent years. In May 2017, Google claimed a word error rate of just 4.9% – over 95% accuracy in sterile, non-noisy environments – approaching human-level recognition.
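Accuracy figures like Google's 95% are usually stated via word error rate (WER): the word-level edit distance between the recognizer's output and a reference transcript, divided by the reference length, so accuracy ≈ 100% − WER. A small sketch, with made-up example sentences:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six reference words → WER of 1/6
error = wer("the cat sat on the mat", "the cat sat on a mat")
```

A 4.9% WER means roughly one word in twenty is inserted, deleted or substituted relative to what was actually said.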

Garbage in – garbage out
Speech recognition models rely heavily on training data. This labelled data needs to be of high quality to really train and improve the underlying algorithms. Google, Apple, Microsoft and Amazon gather vast amounts of such data from real-life usage through their hardware devices and feed it to their neural networks.
Nonetheless, the question remains how to gather high-quality data that will help neural networks properly detect feelings and emotions in human speech. This is what will really allow us to communicate with machines authentically and to create services that adapt to the emotions detected in human speech.

References
Google. (2014, October 14). Behind the Mic: The Science of Talking with Computers. Retrieved from https://www.youtube.com/watch?v=yxxRAHVtafI

National Institutes of Health. (n.d.). Speech and Language Developmental Milestones | NIDCD. Retrieved from https://www.nidcd.nih.gov/health/speech-and-language

Protalinski, E. (2017, May 17). Google’s speech recognition technology now has a 4.9% word error rate | VentureBeat. Retrieved from https://venturebeat.com/2017/05/17/googles-speech-recognition-technology-now-has-a-4-9-word-error-rate/

Stanford. (n.d.). Emotion Detection from Speech. Retrieved from http://cs229.stanford.edu/proj2007/ShahHewlett%20-%20Emotion%20Detection%20from%20Speech.pdf

Van der Straten, S. (2017, October 4). Voice is the next big thing – Point Nine Land – Medium. Retrieved from https://medium.com/point-nine-news/voice-is-the-next-big-thing-913b9bbf9016
