Towards Improving Adversarial Robustness of NLP Models

Yoo, Jin Yong, Computer Science - School of Engineering and Applied Science, University of Virginia
Qi, Yanjun, EN-Comp Science Dept, University of Virginia

Adversarial training has been extensively studied in computer vision as a way to improve a model's adversarial robustness. In NLP, by contrast, little attention has been paid to how adversarial training affects a model's robustness. Within NLP, there is a significant disconnect between recent work on adversarial training and recent work on adversarial attacks: most recent work on adversarial training has studied it as a means of improving a model's generalization capability rather than as a defense against adversarial attacks.

In this thesis, we investigate how adversarial training can be used to improve a model's adversarial robustness as well as its standard accuracy, cross-domain generalization, and interpretability. In the first half of this thesis, we perform a comprehensive benchmarking of the search algorithms used in NLP adversarial attacks and propose a new, simple, and efficient search algorithm that can speed up adversarial attacks for adversarial training. Then, using the findings from the benchmark experiments, we create two new adversarial attacks optimized for adversarial training and use them to train BERT and RoBERTa models on the IMDB, Rotten Tomatoes, and Yelp datasets. We demonstrate that adversarial training can not only improve a model's robustness to the adversarial attack it was originally trained with, but also defend the model against other types of attacks. We also show that adversarial training can improve a model's standard accuracy, cross-domain generalization, and interpretability.

MS (Master of Science)
Machine Learning, Natural Language Processing
Issued Date: