Abstract
Large language models (LLMs) have transitioned from research prototypes to production-critical infrastructure, yet the systems that serve them face compounding pressures from explosive model growth, bursty workloads, and the computational complexity of sparse architectures. This dissertation argues that these pressures can be systematically addressed by treating key system resources (storage, memory, interconnect, and compute) not as fixed constraints but as adaptable, schedulable substrates whose configurations are co-designed with workload characteristics. We instantiate this thesis through three complementary systems, each operating at a different scale of adaptation: structural, runtime, and architectural.
Structural adaptation. We present ELF, an error-bounded, near-lossless floating-point compression method for pre-trained model (PTM) storage. A comprehensive study of 900 real-world PTMs reveals that 98.91% of parameters fall within the open interval (−1, 1); ELF exploits this narrow range to eliminate the IEEE 754 exponent field. The ELVES framework, which combines ELF with deduplication and dictionary coding, achieves a 1.52× compression ratio with near-zero accuracy loss.
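To make the exponent-elimination idea concrete, the following is a minimal sketch and not ELF's actual implementation. It assumes one plausible realization: a value in (−1, 1) is shifted by its magnitude into [1, 2), where every float32 shares the same biased exponent (127), so only the sign bit and the leading fraction bits need to be stored. The function names, the keep_bits parameter, and the clamping rule are all hypothetical choices made for illustration.

```python
import struct

FRACTION_BITS = 23     # width of the float32 fraction field
EXPONENT_OF_ONE = 127  # biased exponent shared by every float32 in [1, 2)


def _f32_to_bits(x: float) -> int:
    """Round x to float32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_to_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def encode(x: float, keep_bits: int) -> int:
    """Pack x in (-1, 1) into 1 sign bit plus keep_bits fraction bits.

    Hypothetical scheme (an assumption, not taken from the source):
    |x| + 1 lies in [1, 2), so its exponent is constant and can be dropped.
    """
    if not -1.0 < x < 1.0:
        raise ValueError("this sketch only handles values in (-1, 1)")
    if not 1 <= keep_bits <= FRACTION_BITS:
        raise ValueError("keep_bits must be between 1 and 23")
    sign = 1 if x < 0 else 0
    bits = _f32_to_bits(abs(x) + 1.0)
    if (bits >> FRACTION_BITS) != EXPONENT_OF_ONE:
        # float32 rounding pushed the shifted value up to 2.0;
        # clamp to the largest fraction representable in [1, 2).
        fraction = (1 << FRACTION_BITS) - 1
    else:
        fraction = bits & ((1 << FRACTION_BITS) - 1)
    # Truncate the fraction to its leading keep_bits bits.
    return (sign << keep_bits) | (fraction >> (FRACTION_BITS - keep_bits))


def decode(code: int, keep_bits: int) -> float:
    """Invert encode(): restore the constant exponent, then shift back."""
    sign = code >> keep_bits
    fraction = (code & ((1 << keep_bits) - 1)) << (FRACTION_BITS - keep_bits)
    magnitude = _bits_to_f32((EXPONENT_OF_ONE << FRACTION_BITS) | fraction) - 1.0
    return -magnitude if sign else magnitude
```

Under these assumptions, each value occupies 1 + keep_bits bits instead of 32, and the absolute reconstruction error is bounded by roughly 2^(−keep_bits), which is the sense in which such a scheme is error-bounded. Values outside (−1, 1) would need a separate path that this sketch does not cover.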
Runtime adaptation. We present MorphServe, a dynamic LLM serving framework that reversibly morphs model precision and KV cache capacity during ongoing inference. Guided by sensitivity-based profiling and executed asynchronously on CUDA streams, MorphServe's LayerSwapper and KVResizer mechanisms reduce service-level objective (SLO) violations by 92.45% and improve P95 time-to-first-token by 2.2–3.9× relative to full-precision serving, while preserving generation quality.
Architectural adaptation. We present ZeRO-Prefill, a prefill-only serving system for Mixture-of-Experts (MoE) models. Its AsyncEP execution paradigm replaces synchronous per-layer activation routing with background expert-weight streaming, eliminating triple redundancy in computation, memory, and communication. ZeRO-Prefill delivers 1.35–1.37× throughput over the strongest baseline across four hardware configurations, while sustaining 29.8–36.2% per-GPU Model FLOPs Utilization.
Together, these contributions span the full LLM serving lifecycle—storage, runtime, and distributed execution—and demonstrate that resource adaptation is a unifying and effective design principle for building efficient, elastic, and high-throughput serving systems.