Human Vocal Event Detection for Realistic Health-Care Applications

Author:
Salekin, Asif, Computer Science - School of Engineering and Applied Science, University of Virginia

Advisor:
Stankovic, John, EN-Comp Science Dept, University of Virginia

Abstract:

Supported by rapid innovations in machine learning, signal processing, and Internet of Things technologies, passive sensing is reshaping almost every aspect of our lives. Developing novel, low-cost, and noninvasive sensing techniques to model and identify human events (e.g., emotions and mental disorders) has become a core research interest. Advances in passive sensing have made the development and operation of complex human health monitoring systems technically feasible. Automated, passive sensing of human events can improve the assessment and treatment of mental disorders, support the monitoring and care of patients with agitation or dementia or in stroke rehabilitation, substantially reduce the workload of caregivers, and provide more timely and accurate responses to crises.
Sound is ubiquitous in the expression of human events and their surrounding environment. According to multiple studies, sound as a modality conveys biomarkers of our mental and behavioral states or events. The major areas of research on human audio event detection are detection of speech emotion, assessment of mental disorders, and detection of behavioral and ambient human events. Despite the rapid growth of interest in audio sensing for health applications in recent years, the accuracy of detecting or modeling human verbal events is still far from what practical use requires.
This is due to several open challenges, such as distortion of acoustic features as the speaker-to-microphone distance varies, the unavailability of strongly labeled audio data, the expression of verbal events through a combination of prosody and speech context, ambiguity in lexical speech content, and limited training data.
In this dissertation, I present my recent and ongoing research to demonstrate that developing and applying novel, adaptive feature engineering approaches, such as adaptive feature selection, synthetic data generation, and effective feature representation generation, can address the open challenges of human vocal event detection for health monitoring. With this goal in mind, we have built four automated vocal event detection frameworks that address the open challenges in the four major areas of interest.
Our Distant Emotion Recognition (DER) approach addresses the challenge of acoustic feature distortion due to distance through a novel distant feature selection approach and a novel feature modeling/engineering approach named Emo2vec. A comprehensive evaluation, conducted on two acted datasets (with an artificially generated distance effect) as well as on a new emotional dataset of spontaneous family discussions (38 participants) with audio recorded from multiple microphones placed at different distances, showed that the presented DER approach achieves a 16% average increase in accuracy compared to the best baseline method.
This thesis presents a novel weakly supervised learning framework for detecting individuals high in symptoms of mental disorders, addressing the challenge of weakly labeled (i.e., not well annotated) audio data. Our solution presents a novel feature modeling/engineering technique named NN2Vec, which generates low-dimensional, continuous, and meaningful representations of speech from such audio samples, and achieves F1 scores of 90.1% and 85.44% in detecting speakers high in social anxiety and depression symptoms, respectively.
We then present DAVE, a comprehensive set of verbal behavioral event detection techniques that combines acoustic signal processing with three different text mining paradigms to detect verbal events that require both lexical content and acoustic variations for accurate results. Additionally, it adapts a novel word sense disambiguation approach to detect verbal context with multiple ambiguous meanings. Next, the thesis presents a novel framework for ambient human event detection (AHED), which generates robust models for audio monitoring applications with limited available data. The solution presents Audio2Vec, a novel computationally effective feature modeling/engineering technique, and an approach for generating synthetic training data from limited audio samples.
Finally, I discuss the limitations of the presented solutions and lay out my plans for future improvements.

Degree:
PHD (Doctor of Philosophy)

Keywords:
Deep Learning, Ubiquitous Computing, Internet of Things (IoT), Health, Voice, Sound, Health monitoring, Emotion, Mental disorder, Anxiety, Depression, Social anxiety, Health care, Mobile Health, Cyber Physical Systems
