Fair LLM Serving Through Balanced GPU KV Cache Allocation

Author:
Peng, Weihe, Computer Science - School of Engineering and Applied Science, University of Virginia (ORCID: orcid.org/0009-0003-6135-5841)
Advisor:
Shen, Haiying, EN-Comp Science Dept, University of Virginia
Abstract:

Fairness is an important aspect of large language model (LLM) serving systems. Existing systems and research efforts employ metrics such as request rate and estimated output length to monitor and regulate service consumption, preventing users from unfairly monopolizing system resources. A key resource bottleneck for token generation rates in LLM serving is the GPU key-value (KV) cache. While vLLM recognizes the importance of efficient GPU KV cache utilization for improving LLM performance, its default job scheduler fails to maintain balanced KV cache allocation among tenants, resulting in unfair token generation rates. To address this issue, we introduce the cumulative block counter, a measure that tracks per-tenant GPU memory consumption across scheduling iterations. Based on this measure, we propose four improvements that can be adopted progressively to enhance fairness in GPU KV cache usage: the Running Priority Scheduler, which prioritizes tasks from tenants with lower historical GPU KV cache usage; Periodic Swapping, which ensures that swapped-out tasks regain GPU access at regular intervals; the Swap-In Priority Scheduler, which selects which swapped-out task to bring back into GPU memory so that no tenant suffers unfair delays; and Counter Decay, which gradually reduces the accumulated block counts of inactive tenants, preventing long-term penalization. By dynamically balancing access to the GPU KV cache, our approach reduces disparities in token generation rates across tenants while maintaining high scheduling efficiency and without introducing significant overhead. Our evaluation shows up to a 4% improvement in both mean and 99th-percentile fairness and up to an 18.71% reduction in task completion latency across varied scheduling environments.
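
To make the bookkeeping concrete, the short Python sketch below illustrates one way a per-tenant cumulative block counter with Counter Decay could drive Running Priority ordering. It is an illustrative assumption only: the names (FairnessTracker, DECAY_FACTOR, record_iteration) are hypothetical and do not correspond to vLLM's actual scheduler API or to the exact design evaluated in this thesis.

    # Hypothetical sketch: per-tenant cumulative block counter with decay.
    # All identifiers are illustrative, not vLLM internals.
    from collections import defaultdict

    DECAY_FACTOR = 0.9  # assumed per-iteration decay for inactive tenants


    class FairnessTracker:
        """Tracks per-tenant GPU KV-cache block usage across scheduling iterations."""

        def __init__(self) -> None:
            self.cumulative_blocks: dict[str, float] = defaultdict(float)

        def record_iteration(self, blocks_held: dict[str, int]) -> None:
            """Add each tenant's currently held KV-cache blocks to its counter;
            decay the counters of tenants holding no blocks this iteration."""
            for tenant, counter in list(self.cumulative_blocks.items()):
                if tenant not in blocks_held:
                    self.cumulative_blocks[tenant] = counter * DECAY_FACTOR
            for tenant, blocks in blocks_held.items():
                self.cumulative_blocks[tenant] += blocks

        def priority_key(self, tenant: str) -> float:
            """Lower cumulative usage maps to higher scheduling priority."""
            return self.cumulative_blocks[tenant]


    # Usage: order runnable requests so tenants with lower historical
    # KV-cache usage are scheduled first (Running Priority ordering).
    tracker = FairnessTracker()
    tracker.record_iteration({"tenant_a": 120, "tenant_b": 40})
    runnable = ["tenant_a", "tenant_b"]
    runnable.sort(key=tracker.priority_key)  # tenant_b is scheduled first

Under this assumed design, the same priority key could also order swap-in candidates, which is the role the Swap-In Priority Scheduler plays in the abstract.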

Degree:
MS (Master of Science)
Keywords:
LLM Fairness, KV Cache
Language:
English
Issued Date:
2025/04/15