Data-Diverse Redundant Processing for Noise-Robust Automatic Speech Recognition

Hotaki, Mustafa, Computer Engineering - School of Engineering and Applied Science, University of Virginia
Williams, Ronald, EN-Elec/Computer Engr Dept, University of Virginia
Alemzadeh, Homa, EN-Elec/Computer Engr Dept, University of Virginia

Robustness to acoustic noise remains a challenge in automatic speech recognition (ASR). In this work, we take a fault-tolerance approach to noise robustness by applying data diversity to the input speech signal of an ASR system. Motivated by the observation that ASR systems are sensitive to input perturbations under noisy conditions, our proposed framework, termed data-diverse redundant processing, creates a diverse set of variants of the input speech signal by applying label-preserving transformations such as time warping and speed modulation. Treating a given ASR system as a black box, we process the variants to generate a list of transcripts, termed hypotheses. Our experiments show that we can generate diverse hypotheses in noisy environments, with diversity quantified by the average of pair-wise word error rate (WER) values, known as the Cross-WER. We demonstrate the error-correcting potential, or complementarity, of these hypotheses using an oracle combination: the best possible combination of the hypotheses guided by the ground-truth reference transcripts, for which we provide an algorithm. The results show the potential for consistent (idealized) WER reductions on noisy speech. To combine multiple hypotheses into a single hypothesis, we implement a modified version of the ROVER algorithm. We evaluate our framework on clean and realistic noisy speech from the CHiME-3 dataset using the Google Cloud Speech-to-Text, IBM Watson Speech to Text, and Microsoft Azure Speech to Text systems. Our framework maintains the original performance on clean data while reducing the baseline WERs by 2.31%, 3.88%, and 3.5%, respectively, with a simple majority voting mechanism using as few as five transformations. We further establish empirical lower bounds on WER over the generated confusion networks (CNs), promising even greater WER reductions of 8.6%, 8.36%, and 6.51% for the Google, IBM, and Microsoft systems, respectively.
We point to existing work on more sophisticated combination mechanisms, such as confusion network rescoring with language understanding models, that could bring WER closer to these lower bounds. We conclude that data diversity is a viable, orthogonal method for noise robustness, but its efficacy is limited by the underlying ASR system, and its use is encouraged only when the computational overhead of redundant processing is not a concern.
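The Cross-WER diversity metric described above can be illustrated with a short sketch: compute the word error rate between every unordered pair of hypothesis transcripts and average the results. This is a minimal illustration, not the thesis implementation; the function names and the symmetrization of each pair (averaging both reference directions, since WER is not symmetric) are assumptions made here for clarity.

```python
# Sketch of the Cross-WER metric: the average of pair-wise word error
# rate (WER) values over a set of hypothesis transcripts. Function
# names are illustrative, not taken from the thesis.
from itertools import combinations


def wer(ref: str, hyp: str) -> float:
    """WER of hyp against ref: word-level Levenshtein distance
    divided by the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)


def cross_wer(hypotheses: list[str]) -> float:
    """Average pair-wise WER over all unordered hypothesis pairs.
    Each pair is symmetrized (an assumption here) by averaging the
    WER with either member taken as the reference."""
    pairs = list(combinations(hypotheses, 2))
    return sum((wer(a, b) + wer(b, a)) / 2 for a, b in pairs) / len(pairs)
```

Identical hypotheses yield a Cross-WER of 0, while disagreement among the variants drives the value up, which is how the metric quantifies hypothesis diversity in noisy conditions.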

MS (Master of Science)
Automatic speech recognition, noise-robustness, data diversity, hypothesis combination, ROVER
Sponsoring Agency:
U.S. Department of Commerce, National Institute of Standards and Technology (NIST)
All rights reserved (no additional license for public reuse)
Issued Date: