EY Internship: Using Natural Language Processing to Vet Clients; Hidden Bias in Natural Language Processing

Zimmerman, Evan, School of Engineering and Applied Science, University of Virginia
Vrugtman, Rosanne, EN-Comp Science Dept, University of Virginia
Francisco, Pedro Augusto, EN-Engineering and Society, University of Virginia

In our era of technological progress, great advances in Natural Language Processing (NLP), an interdisciplinary field combining linguistics and computer science, have enabled computers to understand, interpret, and generate human language, whether text or speech, at a seemingly human level. The fundamental building block of NLP systems is the word embedding: a numerical representation of a word that captures its semantic meaning, relationships, and context within a multi-dimensional vector space. My technical research report details a project from my summer internship at EY where I developed a non-traditional, computationally efficient method for generating word embeddings as part of a larger project to build an NLP system to detect evidence of financial crime. Powerful NLP applications hold immense promise but face significant challenges with bias. My sociotechnical research explored bias in NLP through the framework of actor-network theory to uncover gaps in the current network of NLP systems that could be filled to address bias. Together, my technical and sociotechnical research provide a holistic approach to improving NLP technologies by offering solutions that enhance their efficiency as well as their fairness and inclusivity.

The technical report outlines the development and implementation of a novel, computationally efficient method for generating word embeddings as part of a broader project to create an NLP tool to streamline the detection of financial crime indicators in textual data. The tool’s approach to generating word embeddings diverged from traditional neural network-based methods by employing a probability-based technique that integrated dimensionality reduction. The methodology leveraged a combination of pointwise mutual information and Singular Value Decomposition (SVD) to produce meaningful word embeddings that effectively capture the association between terms and the context in which they appear. By reducing computational demands without significantly sacrificing accuracy, this method enhanced the practicality of deploying NLP systems in business environments where speed and resource efficiency are crucial. 
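To make the general idea concrete, the following is a minimal sketch of how pointwise mutual information and SVD can be combined to produce word embeddings. It is not the implementation built at EY, which this abstract does not describe in detail. The function name `ppmi_svd_embeddings`, the sliding context window, the clipping of negative PMI values to zero (the common "positive PMI" variant), and the toy sentences are all assumptions made for illustration.

```python
import numpy as np


def ppmi_svd_embeddings(sentences, window=2, dim=2):
    """Illustrative PMI + SVD embeddings (hypothetical helper, not the EY tool).

    sentences: list of tokenized sentences (lists of strings).
    window:    assumed number of context words on each side of a target word.
    dim:       number of embedding dimensions kept after SVD.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    # Count how often each pair of words appears within the context window.
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1.0

    # PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ), estimated from the counts.
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))

    # Assumption: keep only positive, finite PMI values (positive PMI).
    pmi = np.nan_to_num(pmi, nan=0.0, posinf=0.0, neginf=0.0)
    ppmi = np.maximum(pmi, 0.0)

    # Dimensionality reduction: keep the top-k singular directions.
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    k = min(dim, len(vocab))
    return vocab, u[:, :k] * s[:k]


if __name__ == "__main__":
    # Toy corpus, invented for this example only.
    toy = [
        ["bank", "fraud", "probe"],
        ["bank", "loan", "fraud"],
        ["company", "fraud", "probe"],
    ]
    words, vectors = ppmi_svd_embeddings(toy, window=2, dim=2)
    for word, vec in zip(words, vectors):
        print(word, np.round(vec, 3))
```

Because the only heavy step is a single matrix factorization rather than iterative neural network training, a sketch like this runs on modest hardware, which is consistent with the efficiency motivation stated above.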

The NLP model successfully connected key terms associated with financial crimes to relevant companies within a corpus of business articles. Despite this success, the model also produced some false associations, suggesting a need for further refinement in data preprocessing to filter out irrelevant content. The success of the internship project highlights the potential of this streamlined NLP system to enhance the efficiency of financial crime detection processes at EY. The NLP model provided a valuable first-level filter that could significantly reduce the manual effort required in initial screening, thereby allowing human analytical skills to be focused where they are most needed. This project not only contributed a functional tool to EY’s arsenal for risk management but also showcased the innovative potential of applying simplified machine learning techniques in the professional services sector, where computational resources are sometimes limited.

The sociotechnical research paper examines the pervasive issue of hidden bias within NLP technologies, exploring both its societal impacts and the technological underpinnings that contribute to its propagation. The research was driven by two questions: how do biases embedded in NLP systems affect society and individuals, and what measures could mitigate these biases to foster more ethical NLP applications? A comprehensive literature review and case studies of various instances of bias in NLP applications were the primary research methods used to answer these questions. The actor-network theory framework was applied to examine the roles of the various stakeholders in the NLP ecosystem, with the aim of uncovering how biases manifest and how different stakeholders in the network are affected.

The evidence gathered in the sociotechnical research highlights significant instances where NLP biases have negatively influenced social dynamics, such as in employment through biased resume screening tools. The findings emphasize that bias in NLP not only mirrors societal prejudices but potentially amplifies them. The paper concludes with a critical discussion of the necessity for a multifaceted mitigation strategy that combines technical solutions with broader sociotechnical interventions. It emphasizes the need for an ongoing collaborative effort among all stakeholders involved in the development, use, and regulation of NLP technologies. The research also proposes cultivating a new type of stakeholder that could implement effective solutions that address both the technical and ethical dimensions of bias in a manner that benefits all stakeholders in the network of NLP systems.

BS (Bachelor of Science)
NLP, Bias, Ethics, AI

School of Engineering and Applied Science

Bachelor of Science in Computer Science

Technical Advisor: Rosanne Vrugtman

STS Advisor: Pedro Augusto Francisco

All rights reserved (no additional license for public reuse)
Issued Date: