Sentence Level Embedding Detoxification via Toxic Component Removal

Wang, Andrew, Computer Science - School of Engineering and Applied Science, University of Virginia
Ji, Yangfeng, EN-Comp Science Dept, University of Virginia

When state-of-the-art pre-trained language models are trained on internet discourse, they may unintentionally encode and perpetuate some of the hateful and toxic behavior known to exist in the training data. We hypothesize that this encoding of toxicity can be explained by a "toxic subspace" within sentence-level language model embeddings. The existence of such a subspace would suggest that toxic features follow an underlying pattern and can therefore be removed. To identify the toxic subspace, we propose a method that finds patterns in toxic lexical markers. Our experiments demonstrate that the subspace we find separates toxic from non-toxic examples in the latent space. We use this subspace to extract a toxic component from sentence embeddings and show that removing this component reduces the number of toxic representations. These results are promising and suggest that encoders can be reasonably detoxified.
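
The idea of extracting and removing a toxic component can be illustrated with a small linear-algebra sketch. This is not the thesis's procedure: the abstract does not specify how the subspace is estimated, so the sketch assumes, purely for illustration, that it is spanned by the top singular directions of the differences between paired toxic and non-toxic embeddings. The function names (`estimate_subspace`, `toxic_component`, `remove_toxic_component`), the parameter `k`, and the synthetic data are all hypothetical.

```python
import numpy as np

def estimate_subspace(toxic, nontoxic, k=1):
    """Assumed estimator (not from the thesis): top-k right singular
    vectors of the paired difference vectors, returned as rows."""
    diffs = toxic - nontoxic                      # shape (n, d)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                 # orthonormal rows, shape (k, d)

def toxic_component(embeddings, basis):
    """Project each embedding onto the subspace spanned by the basis rows."""
    return embeddings @ basis.T @ basis

def remove_toxic_component(embeddings, basis):
    """Subtract the projection, leaving the part orthogonal to the subspace."""
    return embeddings - toxic_component(embeddings, basis)

# Synthetic stand-in for sentence embeddings: a single planted direction
# is added to the "toxic" copies, plus a little noise.
rng = np.random.default_rng(0)
dim = 8
planted = np.zeros(dim)
planted[0] = 1.0
nontoxic = rng.normal(size=(50, dim))
toxic = nontoxic + 3.0 * planted + 0.05 * rng.normal(size=(50, dim))

basis = estimate_subspace(toxic, nontoxic, k=1)
cleaned = remove_toxic_component(toxic, basis)

# After removal, the embeddings have (numerically) no component left
# along the estimated subspace.
residual = np.abs(cleaned @ basis.T).max()
```

Because the basis rows are orthonormal, the removal is an orthogonal projection: applying it twice changes nothing, and every direction outside the estimated subspace is left untouched. Whether the remaining embedding still supports downstream tasks is an empirical question that this sketch does not address.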

MS (Master of Science)
Representation Learning, Language Models, NLP for Social Good
All rights reserved (no additional license for public reuse)
Issued Date: