Abstract
The rapid growth of large language models (LLMs) has created a critical tension between efficiency and sustainability in current computing systems. As LLMs grow, serving them requires ever more high-performance GPU resources to be available 24/7, even when there is no immediate demand. Although this approach ensures low latency and high throughput, it results in wasteful energy use and high costs.
The increasing need for advanced infrastructure for LLM usage poses additional problems, such as the environmental and social impacts of building, maintaining, and operating computing facilities, as well as unequal access to such resources. In addition, the rising demand for computing facilities highlights the sociotechnical question of how to develop and manage AI models in a way that is both effective and socially sustainable.
This project focuses specifically on deployment and computing infrastructure, while other related concerns remain outside its scope.
The technical section of this thesis examines different ways to deploy LLM inference. It focuses on how resource utilization can be optimized without sacrificing performance. The three methods considered in this project are always-on deployment, a naive serverless strategy, and the keep-warm serverless method. To test them, a prototype was developed using an NVIDIA L4 GPU.
Testing shows that the always-on method delivers the lowest latency, with warm latency of about 0.02 seconds and strong throughput. However, it wastes resources because it continues to hold GPU memory and draw power even when idle. By contrast, the naive serverless deployment eliminates idle resource usage by releasing GPU memory when not in use, but it suffers from significant cold-start latency, which can run to tens of seconds.
The keep-warm serverless strategy strikes a balance: while the model stays loaded, it keeps latency close to that of the always-on deployment, but it does not fully eliminate the inefficiency associated with idle resource usage.
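The trade-off among the three strategies can be illustrated with a small simulation. This is a minimal sketch, not the prototype's implementation: the function name `simulate`, its parameters, the 20-second cold start, and the 60-second keep-warm window are all assumptions chosen for illustration; only the 0.02-second warm latency comes from the measurements above.

```python
import math
from typing import List


def simulate(arrivals: List[float], cold_start_s: float, warm_latency_s: float,
             keep_warm_s: float, preloaded: bool = False) -> List[float]:
    """Return per-request latency (seconds) for requests at the given arrival times.

    keep_warm_s is how long the model stays loaded after a request finishes:
    0 models naive serverless, a finite window models keep-warm, and
    math.inf with preloaded=True models always-on. All values are hypothetical.
    """
    latencies = []
    # Time at which the model was last known to be loaded and idle.
    last_done = 0.0 if preloaded else None
    for t in sorted(arrivals):
        # The model is warm if it was loaded and the idle gap is inside the window.
        warm = last_done is not None and (t - last_done) < keep_warm_s
        latency = warm_latency_s if warm else cold_start_s + warm_latency_s
        latencies.append(latency)
        last_done = t + latency
    return latencies


ARRIVALS = [0.0, 30.0, 200.0]   # assumed request times in seconds
COLD, WARM = 20.0, 0.02         # assumed cold start; warm latency from the text

always_on = simulate(ARRIVALS, COLD, WARM, keep_warm_s=math.inf, preloaded=True)
naive = simulate(ARRIVALS, COLD, WARM, keep_warm_s=0.0)
keep_warm = simulate(ARRIVALS, COLD, WARM, keep_warm_s=60.0)
```

Under these assumed numbers, every always-on request is warm, every naive serverless request pays the cold start, and the keep-warm variant pays it only when the gap between requests exceeds the window. The sketch models latency only; the idle GPU memory and power held during the keep-warm window are not accounted for.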
The science and technology studies (STS) analysis examines large language models through infrastructure-related issues of access, sustainability, and control. Instead of viewing AI development only through its technical dimensions, this research considers how the development process depends on the availability of computational infrastructure.
Specifically, the study demonstrates that AI development has increasingly come to depend on access to large-scale computational infrastructure, leading to a “compute divide” between organizations and individuals who have access to such resources and those who do not. Evidence indicates that the operation of AI systems has a significant environmental footprint, characterized by high energy consumption and intensive data center operations. For instance, AI infrastructure is predicted to consume hundreds of terawatt-hours of electricity globally, raising concerns about sustainability.
At the same time, control over such infrastructure largely remains in the hands of major technology firms, limiting participation and shaping the direction of development. In this regard, serverless computing can be seen as an alternative approach to infrastructure design. Although it is not a complete solution, it shows that different infrastructure models are possible.
Together, these two works give a better understanding of LLMs as both technical and sociotechnical systems. While the technical work reveals the constraints of existing deployment approaches, the STS analysis highlights structural constraints in AI infrastructure development. Despite the advantages of a serverless approach, challenges remain in balancing efficiency, sustainability, and accessibility.
Future work should explore more complex infrastructure designs, such as partial model loading and GPU sharing.