Designing Efficient, Elastic, and High-Throughput LLM Serving Systems

Su, Zhaoyuan

Author

Su, Zhaoyuan, Computer Science - School of Engineering and Applied Science, University of Virginia (ORCID: 0000-0003-4058-2343)

Advisors

Cheng, Yue

Abstract

Large language models (LLMs) have transitioned from research prototypes to production-critical infrastructure, yet the systems that serve them face compounding pressures from explosive model growth, bursty workloads, and the computational complexity of sparse architectures. This dissertation argues that these pressures can be systematically addressed by treating key system resources—storage, memory, interconnection, and compute—not as fixed constraints but as adaptable, schedulable substrates whose configurations are co-designed with workload characteristics. We instantiate this thesis through three complementary systems that operate at three scales of adaptation.
Structural adaptation. We present ELF, an error-bounded, near-lossless floating-point compression method for pre-trained model (PTM) storage. A comprehensive study of 900 real-world PTMs reveals that 98.91% of parameters fall within (−1, 1); ELF exploits this by eliminating the IEEE 754 exponent field. The ELVES framework, combining ELF with deduplication and dictionary coding, achieves a 1.52× compression ratio, with near-zero accuracy loss.
Runtime adaptation. We present MorphServe, a dynamic LLM serving framework that reversibly morphs model precision and KV cache capacity during ongoing inference. MorphServe's LayerSwapper and KVResizer mechanisms, guided by sensitivity-based profiling and asynchronous CUDA-stream execution, reduce SLO violations by 92.45% and P95 time-to-first-token by 2.2–3.9× compared to full-precision serving, while preserving generation quality.
Architectural adaptation. We present ZeRO-Prefill, a prefill-only serving system for Mixture-of-Experts (MoE) models. Its AsyncEP execution paradigm replaces synchronous per-layer activation routing with background expert-weight streaming, eliminating triple redundancy in computation, memory, and communication. ZeRO-Prefill delivers 1.35–1.37× throughput over the strongest baseline across four hardware configurations, while sustaining 29.8–36.2% per-GPU Model FLOPs Utilization.
Together, these contributions span the full LLM serving lifecycle—storage, runtime, and distributed execution—and demonstrate that resource adaptation is a unifying and effective design principle for building efficient, elastic, and high-throughput serving systems.
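The observation behind ELF, that almost all parameters fall within (−1, 1), can be illustrated with a small sketch. This is not the dissertation's algorithm. It assumes float32 weights and a hypothetical layout (the names `encode_unit_floats` and `decode_unit_floats` are invented here) in which each magnitude is shifted by 1 so that it lands in [1, 2). Every float32 in that interval has the same exponent bits, so only the sign and the 23 mantissa bits need to be stored, and the rounding introduced by the shift is bounded by 2^-24.

```python
import numpy as np

MANTISSA_MASK = 0x7FFFFF        # low 23 bits of an IEEE 754 float32
ONE_BITS = 0x3F800000           # bit pattern of 1.0f: sign 0, exponent 127, mantissa 0

def encode_unit_floats(weights):
    """Pack float32 values in (-1, 1) into 24-bit codes (1 sign bit + 23 mantissa bits).

    Hypothetical layout for illustration only, not the ELF format.
    """
    w = np.asarray(weights, dtype=np.float32)
    if not np.all(np.abs(w) < 1.0):
        raise ValueError("all values must lie strictly inside (-1, 1)")
    sign = np.signbit(w).astype(np.uint32)
    # Shift magnitudes into [1, 2); the addition may round by at most 2**-24.
    shifted = np.abs(w) + np.float32(1.0)
    # A magnitude just below 1 can round up to 2.0; clamp to the largest float32 below 2.
    largest_below_two = np.nextafter(np.float32(2.0), np.float32(1.0))
    shifted = np.minimum(shifted, largest_below_two).astype(np.float32)
    mantissa = shifted.view(np.uint32) & MANTISSA_MASK
    return (sign << 23) | mantissa

def decode_unit_floats(codes):
    """Rebuild approximate float32 values from the 24-bit codes."""
    codes = np.asarray(codes, dtype=np.uint32)
    mantissa = codes & MANTISSA_MASK
    sign = codes >> 23
    # Reattach the constant exponent of [1, 2), then undo the shift (exact in float32).
    shifted = (np.uint32(ONE_BITS) | mantissa).view(np.float32)
    magnitude = shifted - np.float32(1.0)
    return np.where(sign == 1, -magnitude, magnitude).astype(np.float32)

if __name__ == "__main__":
    w = np.array([0.5, -0.25, 0.001, -0.73], dtype=np.float32)
    restored = decode_unit_floats(encode_unit_floats(w))
    print(np.max(np.abs(restored - w)))  # absolute error stays within 2**-24
```

In this sketch each code fits in 24 bits rather than 32. A practical format would also need a path for the small fraction of parameters outside (−1, 1), which the sketch simply rejects.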

Degree

PHD (Doctor of Philosophy)

Keywords

Large language model serving; Model compression; Mixture-of-Experts; Dynamic resource adaptation; Distributed inference; LLM serving

Language

English

Issued Date

2026-04-26

Suggested Citation

Su, Zhaoyuan. Designing Efficient, Elastic, and High-Throughput LLM Serving Systems. University of Virginia, Computer Science - School of Engineering and Applied Science, PHD (Doctor of Philosophy), 2026-04-26, https://doi.org/10.18130/krnh-1d62.