Towards Efficient Distributed Machine Learning Systems
Chai, Zheng, Computer Science - School of Engineering and Applied Science, University of Virginia
Cheng, Yue, DS-Faculty Affairs, University of Virginia
The exponential increase in data generation across a multitude of domains, including healthcare, finance, social media, and IoT, necessitates the development of advanced computational systems to efficiently process this information. Due to limitations in computational power and memory, as well as privacy concerns in data transfer, traditional centralized machine learning systems struggle to handle these workloads. As datasets continue to grow in size and complexity, there is an urgent need for efficient distributed machine learning systems that can leverage the power of multiple computing nodes to handle large-scale datasets. This dissertation explores the design, implementation, and optimization of such systems from two critical perspectives: federated learning and distributed Graph Neural Network (GNN) training. Each approach addresses unique challenges and opportunities in improving computational efficiency, scalability, and model performance.
From the federated learning perspective, this dissertation aims to mitigate both data heterogeneity and resource heterogeneity, which commonly exist in federated learning. Key contributions include an adaptive tier selection strategy, a weighted aggregation algorithm, and a combination of synchronous and asynchronous training, which collectively enhance the scalability and robustness of federated learning systems.
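As one illustration of how tier-aware selection and weighted aggregation might fit together, the following Python sketch selects a tier based on recently observed accuracy and aggregates client updates weighted by local data size. All names, data structures, and parameters here (e.g., `credits`, `max_staleness`-free selection, FedAvg-style weighting) are assumptions for exposition, not the dissertation's actual implementation.

```python
import random
import numpy as np

def select_tier(tiers, credits, accuracies, clients_per_round=5):
    """Pick a tier for the next round, favoring tiers whose models lag behind.

    tiers: tier id -> list of client ids (grouped by response latency)
    credits: tier id -> remaining times this tier may be selected
    accuracies: tier id -> latest evaluation accuracy for that tier
    (Hypothetical interfaces for illustration only.)
    """
    eligible = [t for t in tiers if credits[t] > 0]
    # Weight slower-improving (lower-accuracy) tiers more heavily.
    errors = np.array([1.0 - accuracies[t] for t in eligible], dtype=float)
    probs = errors / errors.sum() if errors.sum() > 0 else None
    chosen = eligible[np.random.choice(len(eligible), p=probs)]
    credits[chosen] -= 1
    k = min(clients_per_round, len(tiers[chosen]))
    return chosen, random.sample(tiers[chosen], k=k)

def weighted_aggregate(client_updates, client_sizes):
    """FedAvg-style aggregation: weight each client's parameters by its data size."""
    total = sum(client_sizes)
    agg = {name: np.zeros_like(p) for name, p in client_updates[0].items()}
    for update, n in zip(client_updates, client_sizes):
        for name, p in update.items():
            agg[name] += (n / total) * p
    return agg
```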
From the distributed GNN training perspective, we address the unique challenges posed by graph-based data structures, which involve complex dependencies and irregular computation patterns. This research introduces novel methods for efficient distributed GNN training, focusing on how to leverage the node representations that each partition's subgraph is missing. We design a key-value storage system to push and pull stale representations. This approach mitigates the information loss caused by graph partitioning while avoiding the communication overhead of fetching out-of-subgraph node representations in real time. This work optimizes resource utilization and reduces training time while maintaining model accuracy and stability.
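A minimal sketch of the push/pull pattern such a key-value store could expose is shown below. The class name, staleness bound, and zero-vector fallback are illustrative assumptions rather than the system described in the dissertation.

```python
import torch

class StaleEmbeddingStore:
    """In-memory key-value cache for out-of-subgraph node embeddings (illustrative).

    Each worker pushes the embeddings it computes for its boundary nodes and
    pulls possibly stale embeddings for neighbors owned by other partitions,
    avoiding a synchronous cross-partition fetch at every layer.
    """

    def __init__(self, dim):
        self.dim = dim
        self.table = {}  # node_id -> (embedding tensor, epoch it was written)

    def push(self, node_ids, embeddings, epoch):
        # Store detached copies so cached values carry no autograd history.
        for nid, emb in zip(node_ids, embeddings):
            self.table[int(nid)] = (emb.detach().clone(), epoch)

    def pull(self, node_ids, epoch, max_staleness=2):
        out = []
        for nid in node_ids:
            emb, written = self.table.get(int(nid), (torch.zeros(self.dim), -1))
            # Fall back to a zero vector if the cached value is missing or too stale.
            if epoch - written > max_staleness:
                emb = torch.zeros(self.dim)
            out.append(emb)
        return torch.stack(out)
```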
Experimental results demonstrate significant improvements in training efficiency and scalability for both federated learning and distributed GNN training. We show the effectiveness of our approaches on various machine learning tasks, including image classification, natural language processing and graph-based learning.
PHD (Doctor of Philosophy)
distributed machine learning, distributed systems, distributed graph neural networks, federated learning, synchronous training, data heterogeneity, resource heterogeneity, asynchronous training
English
All rights reserved (no additional license for public reuse)
2024/07/29