Speech-Based Emotion Recognition

Author ORCID: orcid.org/0000-0003-3979-8710
Gao, Ye, Computer Science - School of Engineering and Applied Science, University of Virginia
Stankovic, John, Department of Computer Science, University of Virginia

Many machine learning algorithms for speech-based emotion detection have been published. They are typically trained and tested on datasets of audio clips in which the speaker emulates emotions such as anger, happiness, neutrality, and sadness. Despite the high accuracy these algorithms achieve, they are not suitable for real-life deployment for two reasons. First, the datasets are often collected in strictly controlled environments where noise is minimal and the microphone is placed very close to the speaker. This is not representative of real-life environments, in which background noise is present and people cannot be expected to remain adjacent to the acoustic sensor(s) at all times. Second, each audio clip is usually uttered by an actor and labeled with the emotion the actor attempts to simulate. However, research provides no evidence that the acoustic features of acted emotion are representative of the acoustic features of authentic, spontaneous emotion. As a result, algorithms trained on acted speech may not achieve the same excellent performance when deployed in real-life environments to detect emotions in people's speech. This thesis explores approaches to address the problem that high-performing machine learning classifiers for speech-based emotion recognition may not be fit for real-life deployment, and proposes an acoustic classifier for emotion detection that is fit for such deployment. The classifier is intended to be part of a smart healthcare system that monitors users' emotions.

MS (Master of Science)
Convolutional neural networks, Machine learning, Affective computing, Cyber-physical systems
Issued Date: