On Security and Privacy Implications of Modifying Machine Learning Datasets
Gao, Ji, Computer Science - School of Engineering and Applied Science, University of Virginia
Mahmoody, Mohammad, Computer Science - School of Engineering and Applied Science, University of Virginia
Machine learning is increasingly applied to security- and privacy-sensitive tasks. Datasets, as the raw material of the machine learning workflow, determine the properties of learned models. In this dissertation, we aim to understand the security and privacy implications of different types of modifications made to training datasets. Specifically, we focus on the “small change” regime, where the modification to the training dataset is minimal. We study the following two settings.
1. Security implications of malicious modifications to the dataset (also called poisoning attacks). We study this scenario from both the view of the learner and the view of the adversary.
- From the view of the learner, we formally define learnability under instance-targeted poisoning attacks. Our main result shows that PAC learning and certification are achievable if the adversary’s budget scales sub-linearly with the sample complexity.
- From the view of the adversary, we prove the existence of polynomial-time adversaries that can amplify any vulnerability that already occurs with non-negligible probability.
2. Privacy implications of benign modifications to the dataset. Motivated by new legal requirements, many machine learning methods have recently been extended to support machine unlearning, i.e., updating models as if certain examples were removed from their training sets. We formalize (various forms of) deletion inference and deletion reconstruction attacks, in which the adversary aims either to identify which record was deleted or to reconstruct the deleted records. We then present successful deletion inference and reconstruction attacks for a variety of machine learning models and tasks.
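As a rough illustration of the deletion inference setting described above, the sketch below shows a toy adversary that sees only the predictions of a model before and after one record is deleted, and guesses which of two candidate records was removed. This is not the dissertation's attack: the 1-nearest-neighbor model, the function names (`predict_1nn`, `infer_deleted`), and the data are all hypothetical choices made for illustration.

```python
# Toy deletion-inference sketch. The 1-NN model, all names, and the data
# are illustrative assumptions, not the method from the dissertation.
from typing import Callable, List, Tuple

Record = Tuple[float, float]  # (feature, label)


def predict_1nn(train: List[Record], x: float) -> float:
    """Return the label of the training record whose feature is closest to x."""
    nearest = min(train, key=lambda r: abs(r[0] - x))
    return nearest[1]


def infer_deleted(model_before: Callable[[float], float],
                  model_after: Callable[[float], float],
                  cand_a: Record, cand_b: Record) -> Record:
    """Guess which candidate was deleted, using only black-box predictions.

    Heuristic: the prediction at the deleted record's feature tends to
    change after deletion, while the other candidate's stays the same.
    """
    def shift(c: Record) -> float:
        return abs(model_before(c[0]) - model_after(c[0]))

    return cand_a if shift(cand_a) >= shift(cand_b) else cand_b


# Hypothetical training set and a single deletion request.
data: List[Record] = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0), (3.0, 5.0)]
deleted = data[1]
updated = [r for r in data if r != deleted]

# The adversary only gets query access to the two models.
guess = infer_deleted(lambda x: predict_1nn(data, x),
                      lambda x: predict_1nn(updated, x),
                      data[1], data[3])
```

The heuristic works here because a 1-nearest-neighbor model memorizes its training set, so deleting a record visibly changes the prediction at that record's feature. Models that generalize more smoothly would likely leak a weaker signal and call for a more careful test.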
PHD (Doctor of Philosophy)
machine learning, security, privacy, dataset, theory
English
2022/04/26