Detoxification of Pretrained Language Models through Subspace Transformation

Author:
Sudhakar, Mohit, Computer Science - School of Engineering and Applied Science, University of Virginia
Ji, Yangfeng, EN-Comp Science Dept, University of Virginia

Large pre-trained language models have achieved strong performance on a multitude of language understanding and inference tasks. However, because they are trained on large-scale unfiltered text from the internet, they often encode unexpected information, such as toxic or abusive language, which limits their real-world usage. Our work investigates the existence of a low-dimensional bias subspace in the latent space of pre-trained language models. Empirically, we show that a common bias subspace exists across four popular models (BERT, GPT-2, RoBERTa, and XLNet) and conduct layer-wise subspace analyses of these models. We provide a method for constructing a bias subspace by manipulating principal components with linear transformations. Our debiasing method does not require fine-tuning and can be applied to existing pre-trained language models at inference time. We also provide a methodology for constructing parallel datasets using word-level bias scores; this method can be used to construct parallel datasets for any kind of bias. We show that when the bias subspace is removed, toxic sentence classification accuracy drops, and this result is consistent across pre-trained models such as BERT and its variants.
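The general pattern described in the abstract (estimating a bias subspace from principal components of paired embeddings, then projecting it out at inference) can be illustrated with a minimal sketch. This is not the thesis's exact method; it assumes access to sentence embeddings as NumPy arrays, and the function names and the use of paired (toxic, neutral) difference vectors are illustrative:

```python
import numpy as np

def bias_subspace(pairs, k=1):
    """Estimate a k-dimensional bias subspace.

    pairs: array of shape (n, 2, d) holding embeddings of matched
    (toxic, neutral) sentence pairs (illustrative input format).
    Returns a (k, d) orthonormal basis for the subspace.
    """
    diffs = pairs[:, 0, :] - pairs[:, 1, :]
    diffs = diffs - diffs.mean(axis=0)
    # Top principal components of the difference vectors span the
    # candidate bias subspace.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]

def remove_subspace(x, basis):
    """Project the embedding x off the bias subspace (no fine-tuning)."""
    return x - basis.T @ (basis @ x)
```

After `remove_subspace`, the returned embedding is orthogonal to every basis direction, so any downstream classifier sees no component along the estimated bias subspace.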

MS (Master of Science)
NLP, Subspace exploration, Pre-trained Language Models, PCA
Issued Date: