Automating the Clustering and Classification Process for Text-Based Documents; An Investigation Into Removing Inherent Biases from Machine Learning Algorithms

Kim, Alexander, School of Engineering and Applied Science, University of Virginia
Jacques, Richard, EN-Engineering and Society, University of Virginia

Machine learning techniques such as natural language processing (NLP) are becoming essential for working with the large-scale datasets on which we increasingly rely. One of the most promising uses of NLP algorithms is identifying potentially hostile or dangerous online users. Detecting these individuals from their message history may help prevent future crimes if identification occurs early enough. My technical project focused on designing a system to detect malicious messages using natural language processing and community detection algorithms. My STS research complemented this project by analyzing how algorithmic bias might affect the proposed system and by outlining potential mitigation efforts.

The technical portion of my project developed a proof of concept for this message detection system. The first segment of the project used the Doc2Vec natural language processing algorithm in conjunction with the cosine similarity metric to build a network graph visualization of similar texts. The second segment analyzed the resulting network graphs using community detection algorithms such as Louvain to reorganize text nodes into previously unseen clusters. Using this visualization, users could customize the model parameters and explore the correlation strength between documents. Analysts may use this system as a filtering mechanism to focus their limited time and energy on the most relevant sources of information.
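As a rough illustration of the pipeline described above, the following Python sketch builds a cosine-similarity graph from document vectors and groups the nodes with Louvain community detection. It is a minimal sketch under stated assumptions, not the project's actual code: the Doc2Vec step is omitted and the vectors are assumed to be precomputed, the use of NetworkX is an assumption about tooling, and the names `build_similarity_graph`, `detect_communities`, `min_similarity`, and the toy vectors are hypothetical.

```python
import math
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import louvain_communities


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def build_similarity_graph(doc_vectors, min_similarity=0.5):
    """Link documents whose vectors are at least `min_similarity` alike.

    `doc_vectors` maps a document id to its embedding. Here the vectors are
    assumed to come from a model such as Doc2Vec, but any fixed-length
    numeric vectors work.
    """
    graph = nx.Graph()
    graph.add_nodes_from(doc_vectors)
    for a, b in combinations(doc_vectors, 2):
        similarity = cosine_similarity(doc_vectors[a], doc_vectors[b])
        if similarity >= min_similarity:
            # Edge weight carries the similarity so Louvain can use it.
            graph.add_edge(a, b, weight=similarity)
    return graph


def detect_communities(graph, seed=0):
    """Run Louvain on the weighted graph; returns a list of node sets."""
    return louvain_communities(graph, weight="weight", seed=seed)


# Hypothetical toy vectors standing in for Doc2Vec output.
toy_vectors = {
    "doc_a": [1.0, 0.1, 0.0],
    "doc_b": [0.9, 0.2, 0.0],
    "doc_c": [0.0, 0.1, 1.0],
    "doc_d": [0.1, 0.0, 0.9],
}
graph = build_similarity_graph(toy_vectors, min_similarity=0.8)
communities = detect_communities(graph)
```

The `min_similarity` threshold plays the role of a user-adjustable model parameter: raising it removes weaker edges and tends to yield smaller, tighter clusters, while lowering it links more loosely related documents.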

In my STS research, I first defined algorithmic bias and analyzed several known historical cases. I then presented a hypothetical scenario in which my technical project could be viewed as exhibiting algorithmic bias. In this scenario, the system attempted to cluster user messages written in multiple languages. In datasets with large volumes of English text, the unmodified Doc2Vec algorithm unintentionally clustered together all of the non-English text, regardless of whether the topics were similar. Consequently, if a hypothetical malicious individual were being investigated, and that individual spoke a language other than English, the system would be more likely to erroneously identify other individuals who spoke that language as similar people of interest. Given how blatantly this scenario unfairly targets ethnic minorities, I then proposed several methods by which algorithmic bias could be mitigated, both for this system and in general, such as compiling more robust and diverse datasets, adopting self-regulatory practices and consumer-focused approaches, and updating legal codes to include regulatory measures addressing algorithmic bias.
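One way to make the scenario above concrete is to compare each cluster's language mix with the language mix of the whole dataset. The Python sketch below is a hypothetical audit, not part of the original system: the function name `language_overrepresentation`, the `(cluster_id, language)` input format, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def language_overrepresentation(assignments):
    """Compare each cluster's language mix with the dataset-wide mix.

    `assignments` is a list of (cluster_id, language) pairs, one per message.
    Returns {cluster_id: {language: cluster_share / overall_share}}.
    A ratio well above 1 means the language is concentrated in that cluster.
    """
    overall = Counter(language for _, language in assignments)
    total = len(assignments)

    by_cluster = defaultdict(Counter)
    for cluster_id, language in assignments:
        by_cluster[cluster_id][language] += 1

    ratios = {}
    for cluster_id, counts in by_cluster.items():
        size = sum(counts.values())
        ratios[cluster_id] = {
            language: (count / size) / (overall[language] / total)
            for language, count in counts.items()
        }
    return ratios


# Hypothetical labels: eight English messages split across two clusters,
# and two Spanish messages that both land in a third cluster.
toy_assignments = [(0, "en")] * 4 + [(1, "en")] * 4 + [(2, "es")] * 2
ratios = language_overrepresentation(toy_assignments)
```

A high ratio does not prove bias on its own, since messages in one language may genuinely share a topic. It only flags clusters whose contents should be checked for topical similarity before they are treated as meaningful.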

The technical project was developed as part of my summer internship with the Defense Intelligence Agency. However, it was only after undertaking the STS research that I began to reassess my work and saw how ethically ignorant I had been when I developed the system. I had been aware of how serious an issue algorithmic bias was, but observing and analyzing it within the framework of my own code showed me how easy it is to unintentionally discriminate against certain groups. I hope that both the system developed and the retrospective analysis of it can shed light on how powerful machine learning is and on how to avoid the ethical pitfalls that come with it.

BS (Bachelor of Science)
Machine Learning, NLP, Algorithms, Algorithmic Bias

School of Engineering and Applied Science
Bachelor of Science in Computer Science
Technical Advisor: Daniel Graham
STS Advisor: Richard Jacques

All rights reserved (no additional license for public reuse)
Issued Date: