Addressing Processor Over-Provisioning on Large-Scale Multi-Core Platforms
Wang, Wei, Computer Science - School of Engineering and Applied Science, University of Virginia
Davidson, Jack, Department of Computer Science, University of Virginia
Soffa, Mary, Department of Computer Science, University of Virginia
Modern micro-architectures have embraced multi-core processors and thread-level parallelism for performance growth, because of the difficulty of increasing single-core performance without significantly increasing processor power consumption. To meet the ever-growing need for speed, current large-scale computing platforms are Non-Uniform Memory Access (NUMA) architectures equipped with dozens of cores, and future large-scale systems are predicted to have hundreds or even thousands of cores. The applications executing on these platforms are usually multi-threaded applications that create large numbers of threads to simultaneously utilize the massive numbers of cores.
When executing multi-threaded applications on large-scale platforms, users and run-time systems typically allocate all available cores to their applications. However, because of the insufficient memory bandwidth on these large-scale platforms, many multi-threaded applications achieve their best performance when using only a portion of the available cores. Allocating all cores over-provisions these applications, degrades performance, and reduces energy efficiency and system throughput. Therefore, it is desirable to execute multi-threaded applications with the minimal core allocation that achieves the best performance. We call this allocation the optimal core allocation. However, determining the optimal core allocation for a given application on given hardware is very difficult.
Because memory bandwidth is the primary determining factor of optimal core allocations, we chose to predict optimal core allocations based on predictions of the memory bandwidth usage of multi-threaded applications. Accurately predicting memory bandwidth usage faces three major challenges: the random contention and concurrency in DRAM; the inter-processor connections with unknown properties; and the heterogeneity within the memory system. We thoroughly analyzed the memory system on large-scale NUMA platforms and discovered three important insights. First, DRAM contention and concurrency have stable statistical distributions. Second, inter-processor connections act like networks and have linear properties. Third, the memory system can be modeled as an integer programming problem. These insights allowed us to utilize probability theory and Mixed Integer Programming to predict memory bandwidth usage with high accuracy and low overhead. Based on the bandwidth prediction, we are able to predict optimal core allocations with high accuracy in a short amount of time.
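The core idea above can be illustrated with a minimal sketch: once an application's per-core bandwidth demand and the platform's peak bandwidth are known, throughput scales linearly with cores until DRAM saturates, and the optimal allocation is the smallest core count on that plateau. All numbers and function names here are hypothetical, and the brute-force search only stands in for the dissertation's probabilistic and Mixed Integer Programming models.

```python
# Toy model: throughput grows linearly with cores until the aggregate
# memory bandwidth demand saturates DRAM, then flattens. The optimal
# core allocation is the smallest count that reaches peak throughput.
# All parameters are illustrative, not measured values from the thesis.

def predicted_throughput(cores, demand_per_core_gbs, peak_bw_gbs):
    """Relative throughput for a given core allocation."""
    requested = cores * demand_per_core_gbs
    if requested <= peak_bw_gbs:
        return float(cores)                    # compute-bound region
    return peak_bw_gbs / demand_per_core_gbs   # bandwidth-bound plateau

def optimal_core_allocation(total_cores, demand_per_core_gbs, peak_bw_gbs):
    """Smallest core count achieving the best predicted throughput."""
    best = max(predicted_throughput(n, demand_per_core_gbs, peak_bw_gbs)
               for n in range(1, total_cores + 1))
    for n in range(1, total_cores + 1):
        if predicted_throughput(n, demand_per_core_gbs, peak_bw_gbs) >= best:
            return n
    return total_cores

# A memory-hungry workload on a 64-core machine with 100 GB/s of peak
# DRAM bandwidth saturates at 25 cores (25 * 4 GB/s = 100 GB/s), so
# allocating the remaining 39 cores yields no additional throughput.
print(optimal_core_allocation(64, 4.0, 100.0))  # -> 25
```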
Our models predict optimal core allocations during application execution. Therefore, applying our models requires an efficient technique for dynamically adapting an application to its optimal core allocation at run time. Moreover, our models require run-time information about application memory behavior and hardware configuration to make predictions. We designed a run-time technique that allows multi-threaded applications to efficiently adapt to their optimal core allocations during execution. To ensure low overhead and achieve near-ideal load balancing on any core allocation, this run-time technique employs massive numbers of concurrent threads and distributed synchronization primitives. We also designed a low-overhead framework that supports memory-behavior profiling and hardware-configuration detection at run time by querying hardware performance counters and registers.
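One way to picture the adaptation technique is over-decomposition: spawning many more fine-grained work chunks than cores, so that whatever core allocation is chosen, idle workers simply claim the next chunk and the load stays balanced. The sketch below is a simplified illustration under that assumption; the shared chunk counter merely stands in for the dissertation's distributed synchronization primitives, and all names and sizes are hypothetical.

```python
# Sketch: run a parallel loop under a given core allocation by letting
# each worker repeatedly claim small chunks of the iteration space.
# Load balancing emerges from the fine chunk granularity rather than a
# static partition, so any worker count divides the work evenly.
import threading
import itertools

def run_with_allocation(work_items, num_workers, chunk_size=4):
    next_chunk = itertools.count()       # thread-safe in CPython
    results = [None] * len(work_items)

    def worker():
        while True:
            start = next(next_chunk) * chunk_size
            if start >= len(work_items):
                return                   # no chunks left; worker exits
            for i in range(start, min(start + chunk_size, len(work_items))):
                results[i] = work_items[i] ** 2   # stand-in computation

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# The same code runs unchanged on 1, 3, or 48 workers; only the core
# allocation passed in changes.
print(run_with_allocation(list(range(10)), num_workers=3))
```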
Combining the models and run-time techniques, we developed the OptiCore run-time system to automatically execute multi-threaded applications with their optimal core allocations on large-scale NUMA platforms. Compared to use-all-cores allocations, OptiCore provides a maximum speedup of 4.84. Additionally, the smallest core allocation used by OptiCore allocates only 12.5% of all cores. On average, OptiCore allocates only 88.7% of all cores and achieves a 34.6% performance improvement.
PhD (Doctor of Philosophy)
NUMA, Large-scale Computing, Core Allocation, DRAM Modeling, NUMA Memory System Modeling, Performance Optimization, Run-time Execution Management, Distributed Synchronization, Multi-threading
All rights reserved (no additional license for public reuse)