Impact of Data Center Infrastructure on Server Availability - Characterization, Management and Optimization

Sankar, Sriram, Computer Science - School of Engineering and Applied Science, University of Virginia
Gurumurthi, Sudhanva, Department of Computer Science, University of Virginia

As cloud computing, online services, and user storage continue to grow, large companies keep building data center facilities to meet end-user demand. These facilities are warehouse-scale computers in their own right, and their cost efficiency is critical for both cloud and enterprise businesses. Data center infrastructure can be partitioned logically into IT infrastructure (servers and network), Critical Environment infrastructure (power and cooling), and Management infrastructure, which coordinates all the others. Although the IT component is crucial for applications to run, almost one-third of a data center's total cost of ownership is spent on building and operating the critical environment infrastructure. Data center operators strive to reduce the cost of the critical environment infrastructure in order to increase the server portion of the investment. However, reducing this cost usually comes at the expense of more failures or lower availability of the server infrastructure. In this work, we explore the impact of data center infrastructure on server availability. We first characterize server component failures with respect to temperature (Cooling System), evaluating in detail the relationship between server hard disk drive failures and temperature. We then evaluate power availability events and their impact on data center power provisioning (Power System). Next, we focus on the critical management infrastructure that coordinates all of the other infrastructure, and propose a novel, low-cost, wireless-based solution for data center management (Management System). We also present a new class of data center failures (Soft Failures) that result in service unavailability but do not require actual hardware replacement.

PHD (Doctor of Philosophy)
Data Center, Availability, Server, Infrastructure
All rights reserved (no additional license for public reuse)