Large Language Models and Web Scraping: A Flawed Commensalism; Data Ownership: Is the Internet Free?
Gajjala, Hari, School of Engineering and Applied Science, University of Virginia
Vrugtman, Rosanne
Francisco, Pedro
Novel AI tools are built on vast amounts of data, and how we collect that data is as important as how we use it. In this summary I describe both my Capstone and STS research projects, which examine different aspects of collecting data from the Internet for use in Large Language Models (LLMs). In my Capstone project, I developed a technical methodology for improving data extraction from websites using a fine-tuned LLM under human supervision. In my STS paper, I analyzed why technology companies have historically disregarded legal frameworks when scraping the Internet. The two papers are connected in that they treat Internet data as, respectively, a technical and an ethical challenge.
My Capstone project addresses a core technical challenge of web scraping: it is difficult to cleanly extract textual data from the code payload of a web page, and inconsistent, dynamic content makes extraction unreliable and hard to automate. In my paper I proposed a training protocol in which an LLM is fine-tuned on correctly parsed HTML web pages while a human editor, known as the Refiner, uses an interface called PARSEDIT to correct the fine-tuned model's outputs; these corrections are then fed back into the model for further fine-tuning, as sketched below. The goal is an LLM-based web scraper that is not only more effective than current tools but also more flexible.
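The fine-tune/correct/feedback loop can be illustrated with a minimal sketch. The function and class names below (fine_tune, extract_text, refiner_correct, Example) are hypothetical placeholders rather than the Capstone implementation or the PARSEDIT interface itself; they only show how the Refiner's corrections could be folded back into each training cycle.

```python
from dataclasses import dataclass

@dataclass
class Example:
    """One training pair: raw HTML in, clean extracted text out."""
    html: str
    clean_text: str

def fine_tune(model, examples):
    """Placeholder for one fine-tuning pass over the labeled examples."""
    # In practice this would invoke an LLM fine-tuning API or trainer.
    return model

def extract_text(model, html):
    """Placeholder for running the fine-tuned model on a raw page."""
    return "model-extracted text for: " + html[:30]

def refiner_correct(html, model_output):
    """Placeholder for the human Refiner editing the output in PARSEDIT."""
    return model_output  # the Refiner's corrected extraction

def training_cycle(model, pages, dataset):
    """One cycle: the model extracts, the Refiner corrects, corrections feed back."""
    for html in pages:
        draft = extract_text(model, html)
        corrected = refiner_correct(html, draft)
        dataset.append(Example(html=html, clean_text=corrected))
    return fine_tune(model, dataset)

# Hypothetical driver: repeat cycles until model outputs match the Refiner's corrections.
model, dataset = object(), []
for cycle in range(25):  # roughly the 25 training epochs suggested in the paper
    model = training_cycle(model, pages=["<html>...</html>"], dataset=dataset)
```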
After approximately 25 training epochs, the model's outputs are expected to closely match the human-corrected text extractions that the Refiner produces at the end of each training cycle. The fine-tuned LLM could then be used for automated scraping of allowable websites on the Internet, and the labeled data produced by the Refiner can be reused for other machine learning applications.
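To illustrate how the fine-tuned scraper might be deployed against allowable websites, the sketch below checks a site's robots.txt before fetching a page and handing its HTML to the extractor from the previous sketch. The scrape helper and the user-agent string are hypothetical; the paper does not specify how "allowable" is determined, so robots.txt compliance is assumed here only as a stand-in.

```python
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

def is_allowed(url, user_agent="llm-scraper"):
    """Consult the site's robots.txt to keep scraping to permitted pages."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def scrape(url, model):
    """Fetch a permitted page and pass its HTML to the fine-tuned extractor."""
    if not is_allowed(url):
        return None
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_text(model, html)  # extractor from the sketch above
```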
In my STS paper I researched why technology companies collect data from Internet sources in ways that conflict with legal guidelines. My research question was: Why do technology companies overlook legal structures when scraping data from the Internet? Using Actor-Network Theory (ANT), I mapped the relationships among actors such as AI developers, publishers, governments, and the Internet, and I also drew on the concept of technological momentum to track the pace of AI development and examine my research question through a second lens.
I concluded in my STS paper that lack of enforcement, weak legal precedent, and political pressure are the main reasons for these companies' behavior. I also examined frameworks such as data stewardship as a potential solution and found it promising, but only in an environment of strict enforcement.
BS (Bachelor of Science)
Data mining, Internet, Artificial intelligence
School of Engineering and Applied Science
Bachelor of Science in Computer Science
Technical Advisor: Rosanne Vrugtman
STS Advisor: Pedro Francisco
English
2025/04/29