Abstract
Over the past three decades of explosive growth in machine learning research, AI has evolved from an experimental discipline into global infrastructure. In 1998, the state-of-the-art (or frontier) model, LeNet, was built to recognize handwritten digits and consisted of tens of thousands of parameters. Today, frontier models (such as those behind ChatGPT) are capable of complex reasoning and problem-solving, consist of trillions of parameters, and are a core tool in many people’s lives. As a result, training costs have exploded, to the point where training a model can cost tens of millions of dollars. This creates a “compute divide,” in which only large entities such as megacorporations can contribute meaningfully to AI research, leaving academic institutions (and the rest of the world) out. In my technical report, I attack the compute divide by proposing algorithms and network infrastructure that maximally utilize academic resources. In my STS paper, I analyze how the compute divide has affected academic institutions, shaped machine learning research directions, and perpetuated the exploitation of underprivileged countries.
In our technical report, we demonstrate that the compute divide is driven not only by unequal access to hardware but also by poor utilization of existing resources. In modern AI infrastructure, resources are usually scaled up and down automatically to serve an AI model. This typically incurs a “cold-start penalty”: starting a new instance requires waiting for memory to be initialized from disk, which can take minutes or even hours. We introduce a new method called pyRDMA, which, instead of loading resources from disk, loads them directly over the network from other GPUs, reducing the cold-start penalty from minutes or hours to milliseconds or seconds. By doing so, we show that we can narrow the performance (and therefore compute) divide between academic institutions and large companies by fully utilizing resources and maximizing available infrastructure.
In the STS portion, I argue that the compute divide is not just an economic disparity but a structural one that reshapes how knowledge, power, and participation are distributed in AI. As the cost of training increases, research authority is concentrated in a small set of compute-rich institutions, which has shifted knowledge production from open, academic priorities to commercial ones. I explore not only how research directions in academia are affected, but also how transparency and external validation and scrutiny are limited. I also explore how, on a global scale, the compute divide breaks countries into “compute north,” “compute south,” and “compute desert.” “Compute north” countries define how AI systems are built, while “compute south” and “compute desert” regions contribute data, labor, and resources without seeing meaningful benefits. I draw on technological politics to show that these outcomes are embedded in the infrastructure of modern AI, building a hierarchy in AI R&D that is strongly political.