Abstract
With the growth of the internet and the rise of digital technologies such as cloud computing, artificial intelligence, and big data, facilities housing thousands of computers and servers, called data centers, have been constructed to support these advances in computing. However, data centers can greatly affect the communities that surround them economically, culturally, and environmentally. My STS research examined how these communities are affected by the development of data centers, whereas my capstone project compared algorithms that could improve the efficiency of caching within these data centers.
For my STS topic, I took two research approaches: a meta-analysis of published papers examining how individuals feel about the development of data centers and what environmental effects these facilities may have; and an analysis of Census data comparing demographic changes between 2010 and 2020 across six communities, three with data centers and three without. Specifically, I chose an experimental group of Loudoun County, Virginia; Prineville, Oregon; and Quincy, Washington, because each hosts data centers. As a control group, I chose neighboring counties without data centers: Clarke County, Virginia; Wasco County, Oregon; and Douglas County, Washington. I compared how median household income, total population, and the number of college-educated individuals changed over time across the six communities to see whether there was a correlation between the development of data centers and the demographics of a given community. For this analysis, I used Python and its data analysis libraries, specifically “NumPy”, “pandas”, “seaborn”, and “matplotlib”.
My STS research found correlations between the demographics of a community and the presence of a data center there. For example, communities with data centers tended to have more college-educated individuals as well as higher median household incomes. Despite the energy costs that data centers can impose, they seem to attract a more educated and wealthier population to the area. The contrast was especially clear between Clarke County and Loudoun County in Northern Virginia, whose income and population figures differed markedly.
For my technical topic, I implemented two cache eviction algorithms in the programming language Go in order to compare their performance under parallel processing. Caching is the process of storing data in an intermediary data store between end users and the full database, allowing rapid retrieval of frequently used data. Cache eviction occurs when the cache is full and must decide what to remove before new data from the database can be added. Specifically, I focused on the Least Recently Used (LRU) and SIEVE algorithms. LRU removes the data point with the oldest access time whenever the cache is full. SIEVE marks each data point as “visited” or “unvisited” since the previous eviction and uses a “hand” pointer that sifts through the cache until it finds an “unvisited” node to evict. The simulation itself involved creating a “cache” data structure and a data store. Both stored key-value pairs: each data point was a “value” and was given a “key” that indicates its location in the data store or the cache. Two simulations were then run with each algorithm: one with high temporal locality, where the requested data points were typically close to each other, and one with low temporal locality, where the requested data points were spread throughout the data store. To make the project reflect real-world circumstances, where multiple machines interact with the cache, I used the “channel” feature in Go to allow multiple CPU cores to access the simulation at once.
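The SIEVE eviction rule described above can be illustrated with a minimal, single-threaded sketch. This is not the code from the project: the names SieveCache, NewSieveCache, Get, and Put are hypothetical, string keys and values are assumed for simplicity, and the sketch follows the commonly described form of SIEVE in which new entries enter at the head of a list and the hand moves from the tail toward the head.

```go
package main

import "fmt"

// node is one cache entry in a doubly linked list (head = newest entry).
type node struct {
	key, value string
	visited    bool
	prev, next *node // prev points toward the head, next toward the tail
}

// SieveCache is a hypothetical, single-threaded sketch; capacity is assumed >= 1.
type SieveCache struct {
	capacity   int
	items      map[string]*node
	head, tail *node
	hand       *node // where the next eviction scan resumes
}

func NewSieveCache(capacity int) *SieveCache {
	return &SieveCache{capacity: capacity, items: make(map[string]*node)}
}

// Get returns the cached value and marks the entry as visited on a hit.
func (c *SieveCache) Get(key string) (string, bool) {
	n, ok := c.items[key]
	if !ok {
		return "", false
	}
	n.visited = true
	return n.value, true
}

// Put inserts a new entry at the head, evicting one entry first if the cache is full.
func (c *SieveCache) Put(key, value string) {
	if n, ok := c.items[key]; ok {
		n.value, n.visited = value, true
		return
	}
	if len(c.items) >= c.capacity {
		c.evict()
	}
	n := &node{key: key, value: value, next: c.head}
	if c.head != nil {
		c.head.prev = n
	}
	c.head = n
	if c.tail == nil {
		c.tail = n
	}
	c.items[key] = n
}

// evict moves the hand from the tail toward the head, clearing visited
// flags as it goes, and removes the first unvisited entry it finds.
func (c *SieveCache) evict() {
	n := c.hand
	if n == nil {
		n = c.tail
	}
	if n == nil {
		return // nothing to evict
	}
	for n.visited {
		n.visited = false
		n = n.prev
		if n == nil {
			n = c.tail // wrap around to the oldest entry
		}
	}
	c.hand = n.prev // the next scan resumes at the next-newer entry
	if n.prev != nil {
		n.prev.next = n.next
	} else {
		c.head = n.next
	}
	if n.next != nil {
		n.next.prev = n.prev
	} else {
		c.tail = n.prev
	}
	delete(c.items, n.key)
}

func main() {
	c := NewSieveCache(3)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Put("c", "3")
	c.Get("a")      // marks "a" as visited
	c.Put("d", "4") // the hand skips "a" and evicts "b"
	_, okA := c.Get("a")
	_, okB := c.Get("b")
	fmt.Println(okA, okB) // true false
}
```

In this sketch a hit only flips a flag and never reorders the list, which is one reason SIEVE is often considered easy to share between concurrent readers; the parallel access through channels used in the project is not shown here.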
After conducting the simulations in Go, I found that LRU and SIEVE each performed well in different situations. LRU performed much better under high temporal locality, whereas SIEVE did better under low temporal locality. I suspect this is because LRU is slow to demote a data point, which lets it keep nearby data in the cache for longer. SIEVE, on the other hand, demotes nodes quickly, which frees space in the cache for far-apart data points to be promoted.
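To make the LRU behavior discussed here concrete, the following is a minimal sketch of an LRU cache built on Go's standard container/list package. It is an illustration under stated assumptions rather than the project's implementation: LRUCache, NewLRUCache, Get, and Put are hypothetical names, keys and values are assumed to be strings, and no concurrency control is included.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is the key-value pair stored in each list element.
type entry struct{ key, value string }

// LRUCache is a hypothetical, single-threaded sketch; capacity is assumed >= 1.
type LRUCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func NewLRUCache(capacity int) *LRUCache {
	return &LRUCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached value and moves the entry to the front on a hit.
func (c *LRUCache) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(entry).value, true
}

// Put adds or updates an entry, evicting the least recently used one if full.
func (c *LRUCache) Put(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value = entry{key, value}
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		if oldest := c.order.Back(); oldest != nil {
			delete(c.items, oldest.Value.(entry).key)
			c.order.Remove(oldest)
		}
	}
	c.items[key] = c.order.PushFront(entry{key, value})
}

func main() {
	c := NewLRUCache(2)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Get("a")      // "a" becomes the most recently used entry
	c.Put("c", "3") // evicts "b", the least recently used entry
	_, okA := c.Get("a")
	_, okB := c.Get("b")
	fmt.Println(okA, okB) // true false
}
```

Because every hit moves an entry back to the front, an entry that keeps being requested has to drift all the way to the back of the list before it can be evicted, which is consistent with the slow demotion described above.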
Over the past school year, I gained a deeper understanding of the infrastructure that supports the technology we use, on both a physical and an algorithmic level. Using real Census data to draw conclusions about data center development was very interesting, and building my own cache simulation proved to be a great way to further understand how engineers store data at large scale.