Applying Data Science Techniques to Promote Equity and Mobility in Education and Public Policy

Author:
Kim, Brian, Education - School of Education and Human Development, University of Virginia
Castleman, Benjamin, CU-Leadshp Fndns & Pol Studies, University of Virginia

Data science continues to gain prominence in education policy research for three reasons: (1) the wide variety of promising methodological tools that data science has recently produced; (2) the rapid growth of "big data" across many domains of education policy research where data science methodologies can be applied fruitfully; and (3) the speed at which barriers to entry are falling thanks to open-source libraries, educational materials, and low-cost cloud computing. Yet while the power and potential value of newer data science methods can be alluring, these approaches carry important assumptions and limitations that must be carefully considered and evaluated in the context of each individual application. Education researchers and policy analysts thus have a growing opportunity to enhance the measurement, analysis, and advancement of equity and mobility in our society with these methods, but also a heavy responsibility to do so carefully, thoughtfully, and with a keen awareness of when the benefits of these methods may be outweighed by their unintended consequences.

In this dissertation, I offer three concrete examples of how education and policy researchers may navigate this tension: showing how new tools from data science can be harnessed for greater social good, while also demonstrating the diligence, methodological rigor, and critical awareness necessary for that to happen. I frame my contribution through these papers in four parts. First, while I do not seek to advance the frontiers of data science by developing new methodologies (in my reckoning, a realm likely better left to statisticians and computer scientists), I refine the procedures and checks necessary to usefully apply existing data science approaches in the policy analysis context. Second, I strive to make my methodologies, code, and documentation publicly available and open-source, so that this dissertation may serve as a resource for future analysts similarly concerned with the tensions we navigate in these research endeavors. Third, I use the narrative of each paper to make these methodologies more accessible through translational and intuitive explanations; paired with technical appendices, footnotes, and my open-source codebases, this approach lets me pursue that pedagogical aim without sacrificing technical precision. Lastly, I argue that each of the papers also makes a substantive contribution to its respective literature, offering timely new insights into dynamics previously difficult or impossible to ascertain.

In the first chapter, I present a peer-reviewed paper published in Development Engineering and co-authored with fellow graduate student Daniel Rodriguez-Segura, in which we showcase how researchers and policymakers in developing countries can leverage novel geospatial data to precisely identify “education deserts” – localized areas where families lack physical access to education – at unprecedented scale, detail, and cost-effectiveness. In the second chapter, I present a paper with Katharine Meyer and Alice Choe, in which we use data from a large-scale, two-way text advising experiment focused on improving college completion to explore variation in student engagement using nuanced interaction metrics and natural language processing. In the third chapter, I present a sole-authored paper in which I conduct the first system-wide text analysis of teacher recommendation letters in U.S. postsecondary applications, using data from 1.6 million students, 540,000 teachers, and 800 postsecondary institutions. I examine whether teachers describe students in systematically different ways across race and gender groups, even after accounting for salient confounding factors such as student academic and extracurricular qualifications, teacher fixed effects, and institution fixed effects. I hope that, taken together, these chapters offer a useful base on which other analysts might build in the pursuit of rigorous applications of data science methodologies that support equity and mobility in our society.

Degree:
PhD (Doctor of Philosophy)
Keywords:
data science, education policy, natural language processing
Sponsoring Agency:
National Academy of Education and Spencer Foundation (Dissertation Fellowship Program)
Institute of Education Sciences (Virginia Education Science Training Pre-Doctoral Fellowship Program)
All rights reserved (no additional license for public reuse)
Issued Date: