The Inclusion of Low-Resource Language Communities in the Development of Natural Language Processing Technology; How many words does it take to understand a low-resource language?
Chang, Emily, School of Engineering and Applied Science, University of Virginia
Basit, Nada, University of Virginia
Francisco, Pedro Augusto, EN-Engineering and Society, University of Virginia
To stave off language extinction, the United Nations has declared 2022-2032 the International Decade of Indigenous Languages. Such languages, underrepresented in digital text, are called low-resource or endangered. Linguists and language technology researchers have raced to preserve these languages in models, hoping that this technology can support language education and revitalization. Little is known, however, about the theory behind this process. I designed an empirical study to explore how low-resource languages can be represented in embedding spaces, specifically focusing on the minimal data required to generate effective embeddings. I acknowledge that such a theoretical question follows a common trend in language research, in which the communities we seek to serve are often excluded from research discourse. I explore how to bridge this gap through the lens of community-based research. If this decade is truly to preserve languages, researchers, linguists, and community members must all participate in language model development.
Because low-resource languages lack the data necessary to train a model, language models leverage cross-lingual transfer learning to learn information from a "source" language and apply it to a low-resource "target" language. Even under this methodology, data scarcity remains an issue. I seek to identify a lower limit on the amount of data required to perform cross-lingual transfer learning, namely, the smallest vocabulary size needed to create a sentence embedding space. I used varying amounts of language documentation to generate embedding spaces and tested these embeddings on the Semantic Textual Similarity task. Experiments show that the relationship between a sentence embedding space's vocabulary size and its performance is logarithmic, with performance leveling off at a vocabulary size of 25,000. It should be noted that this relationship cannot be replicated across all languages, and this level of documentation does not exist for many low-resource languages. I do observe, however, that performance accelerates at a vocabulary size of 1,000 or less, a quantity that is present in most low-resource language documentation. In establishing this lower limit, researchers can better assess whether a low-resource language has enough documentation to support the creation of a sentence embedding space and language model.
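The experimental setup above can be illustrated with a minimal sketch. This is not the thesis code: it assumes a simple bag-of-words embedding and a frequency-capped vocabulary (the `build_vocab`, `embed`, and `cosine` helpers are hypothetical), and it shows only how capping vocabulary size shapes sentence vectors that are then compared by cosine similarity, as in a Semantic Textual Similarity evaluation.

```python
# Hypothetical sketch: cap the vocabulary at a fixed size, build
# bag-of-words sentence vectors over that vocabulary, and score
# sentence pairs with cosine similarity (an STS-style comparison).
from collections import Counter
import math

def build_vocab(corpus, max_size):
    """Keep only the max_size most frequent word types in the corpus."""
    counts = Counter(w for sent in corpus for w in sent.lower().split())
    return {w for w, _ in counts.most_common(max_size)}

def embed(sentence, vocab):
    """Bag-of-words vector restricted to the capped vocabulary."""
    return Counter(w for w in sentence.lower().split() if w in vocab)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    "the cat sat on the mat",
    "a dog sat on the rug",
    "stocks fell sharply today",
]
vocab = build_vocab(corpus, max_size=1000)
s1 = embed("the cat sat on the mat", vocab)
s2 = embed("a dog sat on the rug", vocab)
s3 = embed("stocks fell sharply today", vocab)

# Related sentences should score higher than unrelated ones; an
# STS evaluation correlates such scores with human judgments.
assert cosine(s1, s2) > cosine(s1, s3)
```

In the actual study, the vocabulary cap (`max_size`) is the quantity varied, and performance is measured by how well the similarity scores track human similarity judgments on STS sentence pairs.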
Low-resource language communities rarely share in the profits of language preservation efforts. The nonprofit organization Lakota Language Consortium promised to preserve the Lakota language of the Standing Rock Indian Reservation and spent years gathering recordings from tribal elders, only to sell the tribe's intellectual property back to it in the form of textbooks. As a field centered on language preservation, low-resource language technology runs the same risk of alienating the communities it is meant to serve. This raises the question: how are individuals who are part of low-resource language communities contributing to the technologies that are meant to serve their communities? I conducted a literature review on community-based participatory research (CBPR) in low-resource language (LRL) revitalization through the lens of Actor-Network Theory.
The language community participates in the inception of language technology, contributing invaluable information to data curation and model evaluation. Despite this, researchers are the primary actors in the development of language technology, often overseeing the training and fine-tuning of such efforts. A false binary has emerged in developing language technology, dividing the researcher from the language community. Most case studies fail to achieve all four stages of translation described by Actor-Network Theory: problematization, interessement, enrollment, and mobilization. This false binary can be dismantled through the formation of mutual mentorships in which researchers and language communities learn from each other.
BS (Bachelor of Science)
low-resource languages, community-based participatory research, sentence embedding, cross-lingual transfer learning, Hughes Award 2025 Finalist
School of Engineering and Applied Science
Bachelor of Science in Computer Science
Technical Advisor: Professor Nada Basit
STS Advisor: Professor Pedro Francisco
English
2025/05/01