Enhancing Performance of Deep Learning Training and Inference on Resource-Constrained Edge Devices

Author:
Sen, Tanmoy, Computer Science - School of Engineering and Applied Science, University of Virginia
Advisor:
Shen, Haiying, EN-Comp Science Dept, University of Virginia
Abstract:

As edge computing gains prominence for its local computation advantages, deep learning (DL) training and model updating, alongside inference, on edge devices become increasingly relevant. Existing training methods for DL models either rely on centralized scheduling or involve the remote cloud. However, scaling DL models and managing large datasets are challenging in the edge scenario due to the resource constraints of edge devices. At the same time, DL inference is increasingly performed on-device, taking advantage of accelerators such as GPUs, NPUs, and DSPs. However, because high-precision accelerators are energy-consuming, their floating-point precision is often limited. Moreover, many edge-device users are in markets where it is prohibitively expensive for manufacturers to include high-fidelity accelerators, so low-cost edge devices are equipped with low-precision accelerators, sacrificing accuracy.

Meanwhile, large language models (LLMs) have transformed natural language processing, enabling advanced tasks such as automated customer support, text generation, and real-time translation. However, deploying these models in real-world edge environments runs into performance bottlenecks, especially in handling the key-value cache (KVC) during inference. The KVC bottleneck can lead to frequent preemptions, increased queuing, and high response latencies, all of which are particularly problematic in time-sensitive applications such as autonomous systems and healthcare diagnostics. Consequently, achieving low latency while managing memory effectively is essential for running LLMs in edge settings, where constrained resources require careful KVC handling to support real-time processing demands.

This dissertation focuses on minimizing the time of both DL training and inference on edge devices by addressing the above issues. We propose systematic heuristic- and reinforcement learning (RL)-based approaches that reduce training or inference time without significant loss of accuracy. The work first introduces a distributed training system, DMP, emphasizing Data and Model Parallelism. DMP optimizes the training structure by clustering geographically close edge devices, which both sense data and run model partitions, reducing overall training time. Next, the dissertation introduces SROLE, which employs shielded RL for decentralized scheduling in the same data- and model-parallel training scenario to reduce the load on any single node in a cluster of edge devices. SROLE enables autonomous job scheduling at each edge node, mitigating resource overloading and action collisions, with the same goal of reducing training time. Additionally, we build a system for fast, accurate DNN inference of already trained models on low-cost edge devices, which dynamically determines layer assignments across the CPU and the accelerator using heuristic and RL methods. Finally, we introduce CacheOPT (Mitigating KV Cache Competition to Enhance User Experience in LLM Inference) for efficient LLM inference, a critical stepping stone toward LLM deployment on edge devices. CacheOPT addresses the KVC bottleneck through confidence-guided KVC allocation based on response-length predictions and dynamically adjusts padding to better utilize GPU memory. It also uses a profile-based method to decide in real time between recomputation and swapping, based on sequence length. We evaluate our approaches using well-known ML models for various applications to show their generality and effectiveness in the real world.
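To make the CPU/accelerator layer-assignment idea concrete, below is a minimal Python sketch of one plausible greedy heuristic: offload to the low-precision accelerator the layers that save the most latency per unit of accuracy lost, subject to an accuracy budget. The LayerProfile fields, profiling numbers, and budget are illustrative assumptions, not the dissertation's actual algorithm.

from dataclasses import dataclass

@dataclass
class LayerProfile:
    name: str
    cpu_ms: float      # profiled latency on the full-precision CPU path
    acc_ms: float      # profiled latency on the low-precision accelerator
    prec_loss: float   # estimated accuracy drop if run at low precision

def assign_layers(layers, accuracy_budget):
    # Rank offload candidates by latency saved per unit of accuracy lost,
    # then greedily offload until the accuracy budget is exhausted.
    candidates = sorted(
        (l for l in layers if l.acc_ms < l.cpu_ms),
        key=lambda l: (l.cpu_ms - l.acc_ms) / max(l.prec_loss, 1e-9),
        reverse=True,
    )
    assignment = {l.name: "cpu" for l in layers}
    spent = 0.0
    for l in candidates:
        if spent + l.prec_loss <= accuracy_budget:
            assignment[l.name] = "accelerator"
            spent += l.prec_loss
    return assignment

layers = [
    LayerProfile("conv1", cpu_ms=12.0, acc_ms=3.0, prec_loss=0.002),
    LayerProfile("conv2", cpu_ms=20.0, acc_ms=4.0, prec_loss=0.010),
    LayerProfile("fc",    cpu_ms=5.0,  acc_ms=4.5, prec_loss=0.0005),
]
print(assign_layers(layers, accuracy_budget=0.005))
# -> {'conv1': 'accelerator', 'conv2': 'cpu', 'fc': 'accelerator'}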
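Similarly, the sketch below illustrates the two CacheOPT-style decisions described above: allocating KVC blocks for the predicted response length plus an uncertainty-scaled padding, and choosing between recomputation and swapping using profiled per-token costs. All constants, cost models, and function names here are assumptions for illustration only.

import math

BLOCK_TOKENS = 16  # tokens stored per KVC block (assumed)

def kvc_blocks_to_allocate(pred_len, pred_std, confidence_z=1.28):
    # Allocate for the predicted response length plus a padding term
    # scaled by prediction uncertainty (z=1.28 ~ a 90% one-sided bound).
    padded = pred_len + confidence_z * pred_std
    return math.ceil(padded / BLOCK_TOKENS)

def recompute_or_swap(seq_len, recompute_ms_per_token=0.08,
                      swap_ms_per_token=0.05, swap_fixed_ms=4.0):
    # Pick the cheaper way to restore a preempted sequence's KV cache:
    # recomputation cost grows with sequence length, while swapping pays
    # a fixed setup cost plus per-token transfer time.
    recompute_ms = recompute_ms_per_token * seq_len
    swap_ms = swap_fixed_ms + swap_ms_per_token * seq_len
    return "recompute" if recompute_ms <= swap_ms else "swap"

print(kvc_blocks_to_allocate(pred_len=200, pred_std=40))  # -> 16 blocks
print(recompute_or_swap(seq_len=64))    # short sequence -> "recompute"
print(recompute_or_swap(seq_len=1024))  # long sequence  -> "swap"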

Degree:
PhD (Doctor of Philosophy)
Keywords:
Edge device, ML training and inference
Language:
English
Rights:
All rights reserved (no additional license for public reuse)
Issued Date:
2024/12/09