Serverless Deployment for Large Language Models: Reducing Latency, Cost, and GPU Idle Time; The Political Economy of AI Infrastructure: Sustainability, Access, and Inequality in Large Language Models

Chang, Jae-Hyuk

Author

Chang, Jae-Hyuk, School of Engineering and Applied Science, University of Virginia

Advisors

Shen, Haiying, EN-Comp Science Dept, University of Virginia
Wayland, Kent, EN-Engineering and Society, University of Virginia

Abstract

The rapid growth of large language models (LLMs) has created a critical tension between efficiency and sustainability in current computing systems. As LLMs grow, serving them requires ever more high-performance GPU resources to be available around the clock, even when there is no immediate demand. Although this approach ensures low latency and high throughput, it results in wasteful energy use and high costs.
The increasing need for advanced infrastructure for LLM usage poses additional problems, such as the environmental and social impacts of building, maintaining, and operating computing facilities, as well as unequal access to such resources. In addition, the rising demand for computing facilities highlights the sociotechnical question of how to develop and manage AI models in a way that is both effective and socially sustainable.
This project focuses specifically on deployment and computing infrastructure, while other related concerns remain outside its scope.
The technical portion of this thesis examines ways to deploy LLM inference so that resource utilization improves without sacrificing performance. Three methods are compared: always-on deployment, a naive serverless strategy, and a keep-warm serverless strategy. To test them, a prototype was developed on an NVIDIA L4 GPU.
Testing shows that the always-on method delivers the lowest latency, with warm latency of about 0.02 seconds, and strong throughput. However, it wastes resources because it continues to consume GPU memory and power even when idle. The naive serverless deployment, by contrast, eliminates idle resource usage and frees GPU memory when not in use, but suffers from significant cold-start latency, which may exceed tens of seconds.
The keep-warm serverless strategy strikes a balance by keeping latency close to that of the always-on deployment, but it does not fully eliminate the inefficiency of idle resource usage.
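The trade-off among the three strategies can be illustrated with a small simulation. This is a minimal sketch and not the thesis prototype: the `DeploymentPolicy` class, the `simulate` function, the 30-second cold start, and the 60-second keep-warm window are all hypothetical values chosen for illustration. Only the 0.02-second warm latency comes from the abstract. The sketch assumes requests are served one at a time and that service time is negligible compared with the gaps between requests.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DeploymentPolicy:
    """Hypothetical policy: how long the model stays loaded after a request."""
    name: str
    keep_warm_s: float  # inf = always-on, 0 = naive serverless


WARM_S = 0.02   # warm latency reported in the abstract
COLD_S = 30.0   # assumed cold-start penalty ("tens of seconds"), illustrative only

ALWAYS_ON = DeploymentPolicy("always-on", math.inf)
NAIVE = DeploymentPolicy("naive-serverless", 0.0)
KEEP_WARM = DeploymentPolicy("keep-warm", 60.0)  # assumed 60 s window


def simulate(policy: DeploymentPolicy,
             arrivals: List[float]) -> Tuple[List[float], float]:
    """Return per-request latencies and seconds the model sat loaded but idle."""
    latencies: List[float] = []
    idle_resident_s = 0.0
    # Always-on is assumed to be loaded at time 0, before any request arrives.
    last: Optional[float] = 0.0 if math.isinf(policy.keep_warm_s) else None
    for t in arrivals:
        if last is None:
            warm = False  # nothing loaded yet, so the first request is cold
        else:
            gap = t - last
            warm = gap < policy.keep_warm_s
            # The model occupies the GPU until the window expires or a request arrives.
            idle_resident_s += min(gap, policy.keep_warm_s)
        latencies.append(WARM_S if warm else COLD_S + WARM_S)
        last = t
    return latencies, idle_resident_s


if __name__ == "__main__":
    arrivals = [10.0, 20.0, 200.0]  # hypothetical request times in seconds
    for policy in (ALWAYS_ON, NAIVE, KEEP_WARM):
        lat, idle = simulate(policy, arrivals)
        print(f"{policy.name}: latencies={lat}, idle_resident_s={idle}")
```

Under these assumptions, the always-on policy never pays a cold start but accumulates idle time across every gap, the naive policy accumulates no idle time but pays a cold start on every request, and the keep-warm policy caps idle time at the window length per gap while avoiding cold starts for closely spaced requests.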
The STS research examines large language models through infrastructure-related issues of access, sustainability, and control. Rather than viewing AI development only in technical terms, it considers how development depends on the availability of computational infrastructure.
Specifically, the study demonstrates that AI development has increasingly come to depend on access to large-scale computational infrastructure, leading to a “compute divide” between organizations and individuals who have access to such resources and those who do not. Evidence indicates that the operation of AI systems has a significant environmental footprint, characterized by high energy consumption and intensive data center operations. For instance, AI infrastructure is predicted to consume hundreds of terawatt-hours of electricity globally, raising concerns about sustainability.
At the same time, control over such infrastructure largely remains in the hands of major technology firms, limiting participation and shaping the direction of development. In this regard, serverless computing can be seen as an alternative approach to infrastructure design. Although it is not a complete solution, it shows that different infrastructure models are possible.
Taken together, the two works give a better understanding of LLMs as both technical and sociotechnical systems. The technical work reveals the technical constraints of existing deployment approaches, while the STS research highlights structural constraints in AI infrastructure development. Despite the advantages of a serverless approach, challenges remain in balancing efficiency, sustainability, and accessibility.
Future work should explore more complex infrastructure designs, such as partial model loading and GPU sharing.

Degree

BS (Bachelor of Science)

Keywords

Large Language Models; Serverless Computing; Cloud Computing; AI Infrastructure

Language

English

Issued Date

2026-05-08

Persistent Link

https://doi.org/10.18130/k8b9-y786

Suggested Citation

Chang, Jae-Hyuk. Serverless Deployment for Large Language Models: Reducing Latency, Cost, and GPU Idle Time; The Political Economy of AI Infrastructure: Sustainability, Access, and Inequality in Large Language Models. University of Virginia, School of Engineering and Applied Science, BS (Bachelor of Science), 2026-05-08, https://doi.org/10.18130/k8b9-y786.

Files

Chang_Jae-Hyuk_Prospectus.pdf

Chang_Jae-Hyuk_STSResearchPaper.pdf

Chang_Jae-Hyuk_SociotechnicalSynthesis.pdf

Chang_Jae-Hyuk_TechnicalReport.pdf
