Using Machine Learning K-Means Clustering to Comprehend California Housing Prices; Analyzing the Social Implications of Outlier Removal on Predictive Models

Nguyen, Anh

Using Machine Learning K-Means Clustering to Comprehend California Housing Prices; Analyzing the Social Implications of Outlier Removal on Predictive Models 329 views

Author

Nguyen, Anh, School of Engineering and Applied Science, University of Virginia

Advisors

Francisco, Pedro Augusto , EN-Engineering and Society , University of Virginia
Morrison, Briana , EN-Comp Science Dept , University of Virginia

Abstract

California is known for its high housing prices and its high-demand for housing, and my capstone research addresses how people can better understand housing price trends in California. To better understand the housing price trends in California, I made a program that uses the K- means machine learning algorithm to cluster different characteristics in the California Housing dataset from Kaggle. This program gives insight on what houses in different regions of California have in common and how these factors can affect the prices of the houses. This program also explores the K-means clustering algorithm.
It is important to consider the human and social dimensions of my program. While my program can help give insight into characteristics that determine housing prices in California, it generalizes a lot of the characteristics and focuses more on clustering the houses into different regions. When looking at these limitations, we can see how my program would not be a technology that should be readily available to the public since it is lacking in different aspects. It also does not reference actual social cases where pricing is unequal and only looks at the data as numbers. To further analyze my capstone research, ethics of care and the relational view theoretical frameworks can be applied. Actor network theory would also work for analyzing since it can address human and non-human factors associated with housing prices and with the technology itself.
For my STS research, I am analyzing the social implications of outlier removal on predictive models. For my methodology, I conducted literature reviews and used ethics of care and the relational view theoretical frameworks to further analyze the different studies. Through my STS research, I explored how predictive models clean, pre-process, and process data. I also saw how outlier removal, specifically during pre-processing, is managed and the role of outliers
or how they are viewed in this field. When looking at both topics of research, we will be able to see how these predictive models end up with skewed or biased results based on how the models’ training data is cleaned. We can also see what social groups end up being excluded from these models, the researchers’ view on data in general, and whether or not the way this data is viewed needs to change.

Degree

BS (Bachelor of Science)

Keywords

Machine learning; K-means clustering; Data pre-processing; Ethics of Care; Relational view

Notes

School of Engineering and Applied Science
Bachelor of Science in Computer Science
Technical Advisor: Briana Morrison
STS Advisor: Pedro Augusto Francisco

Language

English

Rights

Attribution 4.0 International (CC BY)

Issued Date

2023-05-09

Persistent Link

https://doi.org/10.18130/m6sb-f275

Suggested Citation

Nguyen, Anh. Using Machine Learning K-Means Clustering to Comprehend California Housing Prices; Analyzing the Social Implications of Outlier Removal on Predictive Models. University of Virginia, School of Engineering and Applied Science, BS (Bachelor of Science), 2023-05-09, https://doi.org/10.18130/m6sb-f275.

Files

Nguyen_Anh_Prospectus.pdf

Downloads: 1127

Download

Nguyen_Anh_STS Research Paper.pdf

Downloads: 1043

Download

Nguyen_Anh_Sociotechnical Synthesis.pdf

Downloads: 745

Download

Nguyen_Anh_Technical Report.pdf

Downloads: 1603

Download