Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

Authors:
Sinha, Sanchit, Computer Science - School of Engineering and Applied Science, University of Virginia
Qi, Yanjun, Department of Computer Science, University of Virginia

Interpretability methods such as Integrated Gradients and LIME are popular choices for explaining natural language model predictions through relative word-importance scores. These interpretations need to be robust for trustworthy applications of NLP in high-stakes areas like medicine or finance. Our work demonstrates that model interpretations can be fragile even when the model's predictions are robust. By adding simple, minor word perturbations to an input text, we cause Integrated Gradients and LIME to generate substantially different explanations, yet the perturbed input receives the same prediction label as the seed input. Because only a few word-level swaps are made (all following language constraints), the perturbed input remains semantically and spatially similar to its seed, and therefore the interpretations should have been similar too. Empirically, we observe that the average rank-order correlation between a seed input's interpretation and those of its perturbed variants drops by over 20% when fewer than 10% of words are perturbed, and the correlation keeps decreasing as more words are perturbed. We demonstrate these results on two models, DistilBERT and RoBERTa, across three datasets (SST-2, AG News, and IMDB), using two distinct interpretation methods: the white-box Integrated Gradients and the black-box LIME.
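The fragility measure described above, rank-order (Spearman) correlation between word-importance scores of a seed input and a perturbed variant, can be sketched in plain Python. The importance scores below are hypothetical stand-ins; in the actual evaluation they would come from Integrated Gradients or LIME attributions.

```python
# Minimal sketch of the fragility metric: Spearman rank-order correlation
# between two word-importance vectors. Ties are broken by position for
# simplicity; a real evaluation would use a library routine such as
# scipy.stats.spearmanr on attribution scores.

def rankdata(scores):
    # Assign 1-based ranks to scores (lowest score gets rank 1).
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman(a, b):
    # Spearman rho computed as the Pearson correlation of the rank vectors.
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical word-importance scores for a 5-word input, before and
# after a small word-level perturbation that keeps the prediction fixed.
seed_scores = [0.9, 0.1, 0.4, 0.05, 0.3]
perturbed_scores = [0.2, 0.8, 0.35, 0.1, 0.5]

print(spearman(seed_scores, perturbed_scores))
```

A correlation near 1 would mean the explanation survived the perturbation; a low value, as the thesis reports for small word swaps, indicates a fragile interpretation even though the predicted label is unchanged.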

MS (Master of Science)
Natural Language Processing, Explainability, Interpretability, Machine Learning
Issued Date: