Scrutinizing Resource Utilization for High Performance and Low Energy Computation
Zhang, Wei, Computer Engineering - School of Engineering and Applied Science, University of Virginia
Lach, John, Department of Electrical and Computer Engineering, University of Virginia
Modern processors often suffer from inefficient resource utilization, which leads to inferior performance and energy efficiency. This dissertation scrutinizes the utilization of datapath and cache resources in superscalar processors for opportunities to improve performance and energy efficiency.
Traditional superscalar processors usually employ a one-size-fits-all design approach that allocates a fixed amount of resources for all applications at all times to deliver the best overall performance. However, this approach is not always energy efficient, because application behavior and use scenarios change over time, and the demand for processor resources changes with them.
To improve the utilization of datapath resources, this dissertation proposes an adaptive processor that dynamically allocates datapath resources based on the needs of applications and use scenarios. The adaptive processor is applied to two use cases to improve energy efficiency. In the first use case, front-end throttling (FET), the adaptive processor dynamically throttles the front-end instruction delivery bandwidth as program behavior changes to optimize a target metric, which can be performance, energy, or an arbitrary tradeoff between them. In the second use case, dynamic core scaling (DCS), the adaptive processor extends the performance-energy tradeoff capabilities of superscalar processors by scaling datapath resources rather than voltage. It ensures that programs run at a given percentage of their maximum speed while minimizing energy consumption by dynamically adjusting the active superscalar datapath resources. DCS is more effective than dynamic voltage and frequency scaling (DVFS) for performance-energy tradeoffs at the high-performance end. When used together with DVFS, DCS significantly extends the range of performance-energy tradeoffs.
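The DCS goal stated above (meet a given percentage of maximum speed at minimum energy) can be illustrated with a minimal Python sketch. This is not the dissertation's mechanism: the configuration table, the field names `width`, `speed`, and `energy`, and all numbers are hypothetical values chosen only to show the selection rule.

```python
def pick_config(configs, target_fraction):
    """Return the lowest-energy configuration whose speed is at least
    target_fraction of the fastest configuration's speed."""
    max_speed = max(c["speed"] for c in configs)
    feasible = [c for c in configs if c["speed"] >= target_fraction * max_speed]
    # The fastest configuration is always feasible when target_fraction <= 1.
    return min(feasible, key=lambda c: c["energy"])


# Hypothetical datapath configurations (made-up relative speed and energy).
CONFIGS = [
    {"width": 4, "speed": 1.00, "energy": 1.00},
    {"width": 3, "speed": 0.93, "energy": 0.80},
    {"width": 2, "speed": 0.80, "energy": 0.65},
    {"width": 1, "speed": 0.50, "energy": 0.55},
]

choice_90 = pick_config(CONFIGS, 0.90)  # must run at >= 90% of max speed
choice_75 = pick_config(CONFIGS, 0.75)  # must run at >= 75% of max speed
```

In a real adaptive processor the speed and energy of each configuration would have to be estimated at runtime as program behavior changes; the static table here only stands in for that information.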
Caches also suffer from inefficient utilization in modern processors. To minimize the access latency of set-associative caches, the data in all ways are read out in parallel with the tag lookup. However, this is energy inefficient, as only the data from the matching way is used and the rest are discarded. To improve the utilization of the L1 instruction cache, this dissertation proposes an early tag lookup (ETL) technique that determines the matching way one cycle before the cache access, so that only the matching data way needs to be accessed. ETL incurs no performance penalty and negligible hardware overhead, but dramatically reduces the read energy of the L1 instruction cache.
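The source of the energy saving can be shown with a back-of-the-envelope Python sketch. It assumes a cache hit, a 4-way cache, and hypothetical per-way energy costs in arbitrary units; none of these numbers come from the dissertation.

```python
def read_energy(ways, data_way_energy, tag_way_energy, early_tag_lookup):
    """Energy of one cache read that hits, in arbitrary units.

    All tags are compared in both schemes. A conventional parallel access
    also reads every data way; with early tag lookup the matching way is
    already known, so only one data way is read.
    """
    tag_energy = ways * tag_way_energy
    data_ways_read = 1 if early_tag_lookup else ways
    return tag_energy + data_ways_read * data_way_energy


# Hypothetical costs: one data way = 1.0 unit, one tag way = 0.1 unit.
conventional = read_energy(4, 1.0, 0.1, early_tag_lookup=False)
with_etl = read_energy(4, 1.0, 0.1, early_tag_lookup=True)
saving = 1 - with_etl / conventional
```

Under these assumed costs the data array dominates, so reading one way instead of four removes most of the read energy; the actual saving depends on the real ratio of tag to data energy and on the associativity.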
For memory-intensive workloads, caches often suffer from thrashing, i.e., high-reuse blocks evicting each other from the cache due to lack of space. To reduce thrashing, only a fraction of the working set should be kept in the cache, so that at least this fraction stays in the cache long enough to be reused before eviction. However, prior insertion policies take an ad hoc approach to selecting that fraction, e.g., inserting blocks with high priority at fixed or randomly determined fractions, which limits their performance benefit. This dissertation observes that the optimal fraction of the working set that should be kept in the cache is related to the cache block reuse distance distribution (RDD) of the application. Based on this observation, it provides an oracle analytical model that determines the optimal fraction assuming that the reuse distance (RD) of each block is known in advance, and a practical model that is applicable without the oracle RDD. It then proposes simple runtime mechanisms to determine the optimal fraction for each workload dynamically. These models are orthogonal to prior insertion policies and can significantly improve the performance of prior state-of-the-art insertion policies when applied on top of them.
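The thrashing effect, and the benefit of keeping only a fraction of the working set, can be reproduced with a small Python simulation. This is a simplified sketch and not the dissertation's model: it assumes a fully associative LRU cache, a cyclic access pattern, and a static `keep` predicate that bypasses the blocks outside the kept fraction (a stand-in for low-priority insertion). The names `hit_rate` and `keep` are illustrative.

```python
from collections import OrderedDict


def hit_rate(trace, capacity, keep):
    """Simulate a fully associative LRU cache over a block trace.

    keep(block) decides whether a missing block is inserted at all;
    blocks for which it returns False bypass the cache.
    """
    cache = OrderedDict()  # ordered from least to most recently used
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        elif keep(block):
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits / len(trace)


# Cyclic accesses over 8 blocks with room for only 4: a classic thrashing case.
working_set, capacity, rounds = 8, 4, 100
trace = [b for _ in range(rounds) for b in range(working_set)]

lru_all = hit_rate(trace, capacity, lambda b: True)            # insert everything
keep_half = hit_rate(trace, capacity, lambda b: b < capacity)  # keep half the set
```

Inserting every block yields no hits at all here, because each block is evicted just before it is reused, while keeping only the half of the working set that fits lets that half hit on every round after the first. Which fraction is best for a real workload is the question the RDD-based models address.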
Evaluating new energy-efficient processors requires accurate area, delay, and power information for the new architectures. Prior work on energy-efficient processor architectures either fails to obtain such information or relies on modeling frameworks such as Wattch and McPAT, which are of limited accuracy. In contrast, this dissertation uses FabScalar, a circuit-level infrastructure, to accurately evaluate the area, delay, and power of the new architectures.
PhD (Doctor of Philosophy)
High performance, Low energy, Cache, CPU, Adaptive processing, Low power, Dynamic voltage and frequency scaling, Processor, Performance energy tradeoff