Large Language Model Natural Language Processing and Synthetic Data Production; A Comparative Study of Social Media Moderation Policies in the U.S. and China

Cassatt, Benjamin

Author

Cassatt, Benjamin, School of Engineering and Applied Science, University of Virginia

Advisors

Vrugtman, Rosanne, EN-Comp Science Dept, University of Virginia
Francisco, Pedro Augusto, EN-Engineering and Society, University of Virginia

Abstract

Software engineering has been at the forefront of society over the past decade. From powering the social media platforms we scroll through daily to applying artificial intelligence (AI) to modern problems, it has reshaped nearly every aspect of our lives. My capstone research and my STS research both reflect on this transformation, showing how innovative software solutions can address real-world challenges while also prompting critical reflection on the broader societal impacts of these technologies. In my capstone research, I reflected on a project from my internship this past summer, which involved producing synthetic data and applying AI-based natural language processing. In my STS research, I explored the contrasting approaches to social media moderation in the United States and China. By examining the cultural, political, and societal factors that shape each country's policies, I aimed to understand the deeper motivations behind platform regulation and content control. While these projects examine different areas of software engineering, both show the importance of a deeper understanding of the engineering behind the software we use every day. Together, they highlight how technical innovation and ethical awareness must go hand in hand to build responsible, impactful technologies.

In my capstone research, I reflected on a project from my internship this past summer. The project aimed to help Internal Revenue Service (IRS) software testers quickly generate realistic, large-scale test data for evaluating tax processing systems. It centered on using an existing large language model (LLM) to translate natural language requests into database queries, and it also explored using the LLM to produce synthetic data.

Our final implementation, DataFetch, addressed key challenges in IRS software testing by generating synthetic personas with Python scripts and census data, eliminating the need for sensitive real-world data. This improved production efficiency: the scripts generated around 10,000 records every 30 seconds, which reduced wait times for testers. Testers could make natural language requests to retrieve tailored data via SQL, but we encountered limitations with the LLM when editing complex forms or creating nuanced test cases involving logic, errors, or fraud.
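
As a rough illustration of the persona-generation idea described above, the following minimal Python sketch samples synthetic personas from weighted distributions and loads them into a SQL table that a tester could then query. It is not the DataFetch code: the function names, the table schema, and especially the weights are hypothetical placeholders chosen for illustration and are not real census figures.

```python
import random
import sqlite3

# Placeholder weights for illustration only; these are NOT real census figures.
FILING_STATUS_WEIGHTS = {
    "single": 0.5,
    "married_joint": 0.4,
    "head_of_household": 0.1,
}
AGE_BRACKET_WEIGHTS = {(18, 34): 0.3, (35, 64): 0.5, (65, 90): 0.2}


def make_persona(rng, persona_id):
    """Build one synthetic persona by sampling each field independently."""
    status = rng.choices(
        list(FILING_STATUS_WEIGHTS),
        weights=list(FILING_STATUS_WEIGHTS.values()),
    )[0]
    low, high = rng.choices(
        list(AGE_BRACKET_WEIGHTS),
        weights=list(AGE_BRACKET_WEIGHTS.values()),
    )[0]
    return {
        "id": persona_id,
        "age": rng.randint(low, high),
        "filing_status": status,
        # Hypothetical wage range; a real generator would use sourced data.
        "wages": round(rng.uniform(15_000, 150_000), 2),
    }


def generate_personas(count, seed=0):
    """Generate `count` personas; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [make_persona(rng, i) for i in range(1, count + 1)]


def load_personas(conn, personas):
    """Create a hypothetical `personas` table and bulk-insert the records."""
    conn.execute(
        "CREATE TABLE personas ("
        "id INTEGER PRIMARY KEY, age INTEGER, filing_status TEXT, wages REAL)"
    )
    conn.executemany(
        "INSERT INTO personas VALUES (:id, :age, :filing_status, :wages)",
        personas,
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_personas(conn, generate_personas(10_000, seed=42))
    # The kind of parameterized query a natural language request might map to.
    row = conn.execute(
        "SELECT COUNT(*) FROM personas WHERE filing_status = ? AND age >= ?",
        ("single", 65),
    ).fetchone()
    print(f"Matching personas: {row[0]}")
```

Because no record is derived from a real taxpayer, data of this kind can be shared with testers freely. Sampling each field independently is a simplification; realistic test data would also need correlations between fields, which this sketch does not attempt.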

In my STS research, my goal was to examine the contrast between the U.S. and China and answer the following question: How do social media platform and moderation policies in the United States and China differ, and what factors drive these differences? Examining the motivations behind social media moderation in these two countries reveals the forces that shape their policies, which is essential for understanding how moderation standards emerge from different cultural and political environments. I used two frameworks: the Social Construction of Technology (SCOT) to explore how social values influence technology design, and Actor-Network Theory (ANT) to map interactions between users, platforms, and governments. I analyzed sources from both countries to keep the comparison balanced. Together, these methods helped me understand why moderation looks so different in each country and what that says about their digital societies.

In my STS research, I found that while platforms in both countries use AI and human moderation, their goals differ: U.S. platforms prioritize profit and free expression, while Chinese platforms focus on government control and censorship. I used SCOT to show how societal values shape moderation tools and ANT to highlight the different actors involved, from users to governments. My research shows that moderation systems reflect each country's deeper social and political priorities.

Degree

BS (Bachelor of Science)

Keywords

Natural Language Processing; Content Moderation; Large Language Model; Social Media

Notes

School of Engineering and Applied Science

Bachelor of Science in Computer Science

Technical Advisor: Rosanne Vrugtman

STS Advisor: Pedro Francisco

Language

English

Rights

Attribution 4.0 International (CC BY)

Issued Date

2025-05-08

Persistent Link

https://doi.org/10.18130/avqp-2p02

Suggested Citation

Cassatt, Benjamin. Large Language Model Natural Language Processing and Synthetic Data Production; A Comparative Study of Social Media Moderation Policies in the U.S. and China. University of Virginia, School of Engineering and Applied Science, BS (Bachelor of Science), 2025-05-08, https://doi.org/10.18130/avqp-2p02.