Identifying the Source of Vulnerability in Fragile Interpretations: A Case Study in Neural Text Classification

Tang, Ruixuan, Computer Engineering - School of Engineering and Applied Science, University of Virginia
Ji, Yangfeng, EN-Comp Science Dept, University of Virginia

As machine learning models have developed, prediction performance has improved, but the complexity of neural networks has made it harder to understand why a model makes a given prediction. This raises concerns about the trustworthiness and reliability of these models. One concrete way to improve the trustworthiness and reliability of a neural model is to generate explanations that show the relationship between the input and the output prediction and help people understand the model. However, recent observations have raised concerns about the stability of explanation methods themselves.

Previous works (Ghorbani, Abid, and Zou 2019; Subramanya, Pillai, and Pirsiavash 2019; Xinyang Zhang et al. 2020; Lakkaraju, Arsov, and Bastani 2020) observed that explanations are vulnerable to perturbations applied at the model input. Some of them reach conflicting conclusions about what makes explanations fragile and whether explanation methods are stable. This thesis explores the potential source of fragile explanations: is it the neural model itself or the explanation method? I focus on model-agnostic methods that generate explanations in a post-hoc manner (e.g., from the model's output probabilities), known as post-hoc explanation methods.

To investigate the cause of fragile explanations, I propose a simple output probability perturbation method, which adds slight noise to the output probability of the neural model. Compared to prior perturbation methods applied at the model's inputs, the output probability perturbation method can circumvent the neural model's potential influence on the explanation. The proposed method is evaluated with three post-hoc explanation methods (LIME, Kernel Shapley, and Sample Shapley) and three neural network classifiers (CNN, LSTM, and BERT). The results suggest that the neural model is likely the primary source of fragile explanations. Furthermore, I analyze the kernel calculation inside one post-hoc explanation method (LIME); the results indicate that this explanation method is stable.
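The idea of perturbing the output probability rather than the input can be illustrated with a minimal sketch. The abstract does not specify the noise distribution, its magnitude, or how the perturbed vector is renormalized, so the Gaussian noise, the `scale` parameter, the clamping constant, and the function names `perturb_output_probs` and `with_output_noise` below are all assumptions made for illustration, not the thesis's implementation.

```python
import random


def perturb_output_probs(probs, scale=0.01, rng=None):
    """Add small noise to a vector of class probabilities and renormalize.

    Assumption: zero-mean Gaussian noise with standard deviation `scale`.
    """
    if rng is None:
        rng = random.Random()
    # Perturb each class probability; clamp so every value stays positive.
    noisy = [max(p + rng.gauss(0.0, scale), 1e-12) for p in probs]
    total = sum(noisy)
    # Renormalize so the result is again a probability distribution.
    return [p / total for p in noisy]


def with_output_noise(predict_proba, scale=0.01, seed=0):
    """Wrap a probability function so a post-hoc explainer sees noisy outputs.

    `predict_proba` is a hypothetical callable mapping a list of texts to a
    list of per-class probability vectors. The model input is left untouched.
    """
    rng = random.Random(seed)

    def noisy_predict(texts):
        return [perturb_output_probs(row, scale, rng) for row in predict_proba(texts)]

    return noisy_predict
```

Under these assumptions, the wrapped function could be passed to a post-hoc explainer in place of the original classifier. Because the input text never changes, any change in the resulting explanation would come from the explanation method's sensitivity to the output noise rather than from the neural model's response to a perturbed input.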

MS (Master of Science)
Stability, Post-hoc Explanation Methods, Machine Learning, Text Classification
Issued Date: