Improving System Performance via Design and Configuration Space Exploration

Author:
Tang, Chong, Computer Science - School of Engineering and Applied Science, University of Virginia
Sullivan, Kevin, En-Comp Science Dept, University of Virginia
Ray, Baishakhi, En-Comp Science Dept, University of Virginia

The runtime performance of complex software systems often depends on the settings of a large number of static system configuration and design parameters. For example, big data systems like Hadoop present hundreds of configuration parameters to engineers. Many of them influence runtime performance, and some interact in complex ways, making the configuration spaces for such systems large and complex. It is hard for engineers to understand how all these parameters affect performance, and thus hard to find combinations of parameter values that achieve the best achievable levels of performance. The result in practice is that engineers often just accept default settings or design decisions made by tools, leading such systems to underperform significantly relative to their potential. This problem, in turn, has impacts on cost, revenue, customer satisfaction, business reputation, and mission effectiveness. To improve the overall performance of such systems, we addressed three sub-problems.

The first is that we lack adequate concepts, methods, and tools for tradeoff space analysis to find high-performing designs, where performance is assessed in multiple dimensions, particularly time and space utilization. To address this problem, we conducted the first systematic evaluation of an approach, previously developed by Bagheri et al., to finding Pareto-optimal system designs in such tradeoff spaces. First, this approach starts with a relational logic specification of the system for which a design is to be found. This specification is conjoined with a separate specification of the mapping from system specifications to tradeoff spaces, comprising sets of designs, each consistent with a given specification. Second, the approach uses additional synthesis techniques to implement each design in the tradeoff space as a running system. The framework that we developed then uses large-scale distributed computing based on Spark to parallelize the profiling of each such implementation, using a synthesized benchmark test suite to measure its relevant performance characteristics. Our framework then selects the subset of Pareto-optimal designs for presentation to the engineer, who makes the final tradeoff decisions. Building on the prior work of Bagheri et al., we evaluated the feasibility and benefits of the tradeoff space analysis approach with experiments in the domain of object-relational mapping (ORM) problems. The challenge, given an object-oriented (OO) data model, is to find a relational database schema for storing and searching such data efficiently. In general, there are many such schemas, with varying performance characteristics. Our results show that our approach can find schemas for OO data models (with tens of classes and relationships) that significantly outperform, in both time and space performance, those produced by widely used ORM tools, such as Ruby on Rails and Django.
This work demonstrates the potential for formal tradeoff space synthesis and profiling to significantly improve the performance of component-level design problems in industrially meaningful systems.
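The Pareto-optimal subset selection step can be sketched as follows. This is a minimal illustrative sketch only: the design names and (time, space) measurements are invented, and the function is not the dissertation's actual framework code.

```python
# Hypothetical sketch: given profiled (time, space) measurements for
# candidate designs, keep only the designs that no other design beats
# in both dimensions (lower is better in each).

def pareto_optimal(designs):
    """Return the subset of (name, time, space) tuples not dominated
    in both time and space by any other design."""
    front = []
    for name, time, space in designs:
        dominated = any(
            t <= time and s <= space and (t < time or s < space)
            for n, t, s in designs if n != name
        )
        if not dominated:
            front.append((name, time, space))
    return front

# Invented profiling results for three candidate ORM schemas.
profiled = [
    ("schema_a", 120.0, 40.0),  # slower queries, compact storage
    ("schema_b", 80.0, 55.0),   # faster queries, larger storage
    ("schema_c", 130.0, 60.0),  # worse than schema_a in both dimensions
]
print(pareto_optimal(profiled))  # schema_c is dominated and dropped
```

Only the non-dominated designs reach the engineer, who then makes the final time-versus-space tradeoff decision.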

Second, we lack approaches for developing accurate models that predict the performance of highly configurable big-data systems as a function of their configurations and the complexity of the jobs running on them. Having such predictive models could significantly reduce the cost of finding good configurations for such systems. This is a hard problem because such mappings are not straightforward in general. Our overall approach is to derive such models by applying machine learning techniques to training data obtained from industrial big-data systems, specifically Hadoop. The particular problem that we addressed is that learning such mappings tends to produce inaccurate models. Our approach was to find new ways to eliminate noisy values from training data. The first step is simply to ignore parameters that could have no bearing on performance. Our additional novel technique was to exploit the semantic meanings of configuration parameters and of job complexity to eliminate further noisy values. Our results show significant improvement in the prediction accuracy of models learned from such semantically cleaned data. We evaluated our approach by building models for the Hadoop MapReduce framework, with and without our semantic data cleaning approach, on data obtained from Walmart Labs. In our experiments, our approach improved model accuracy from under 0.75 to over 0.88 for CPU time prediction. These results suggest that semantic cleaning of data provides real value for training performance prediction models for large big-data systems.
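The idea of semantics-based cleaning can be sketched as below. This is an illustrative sketch, not the dissertation's rule set: the parameter names are simplified variants of real Hadoop settings, and the two consistency rules shown are invented examples of the kind of semantic constraint involved.

```python
# Hypothetical sketch: drop training rows whose configuration values are
# semantically inconsistent, since such settings cannot take effect as
# written and only add noise to the learned performance model.

def semantically_valid(row):
    """Keep a training row only if its settings are mutually consistent."""
    # Example rule: a map task's JVM heap must fit inside its container.
    if row["map_java_heap_mb"] > row["map_container_mb"]:
        return False
    # Example rule: the sort buffer cannot exceed the map task's heap.
    if row["io_sort_mb"] > row["map_java_heap_mb"]:
        return False
    return True

training_rows = [
    {"map_container_mb": 2048, "map_java_heap_mb": 1638,
     "io_sort_mb": 100, "cpu_seconds": 412.0},
    {"map_container_mb": 1024, "map_java_heap_mb": 2048,
     "io_sort_mb": 100, "cpu_seconds": 388.0},  # heap exceeds container
]
cleaned = [r for r in training_rows if semantically_valid(r)]
print(len(cleaned))  # the inconsistent row is removed before training
```

The surviving rows are then used as the training set for the performance prediction model.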

The third problem that we addressed was to find high-performing configurations for systems like Hadoop without the need to learn generalized performance prediction models. Learning such models is costly. Moreover, we observed that industrial uses of such systems tend to under-sample configuration spaces, making it hard to learn generalized models from such data. The challenge we addressed is to find a more effective method: one that does not rely on predictive models at all. Our overall approach is to employ a direct meta-heuristic search over performance-relevant configuration parameter values. For each candidate configuration, we profile a running system using the well-known HiBench benchmark suite. Novel aspects of our approach include: 1) the use of an evolutionary Markov chain Monte Carlo (MCMC) configuration space search technique; 2) profiling performance on small HiBench datasets as proxies for the performance objective function on much larger datasets; and 3) validating the usefulness of configurations found for one MapReduce problem on classes of problems with similar resource usage patterns, as judged using operating system call monitoring tools.
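The shape of such a model-free search can be sketched with a basic Metropolis-style MCMC loop. Everything here is invented for illustration: the two-parameter space, the cheap stand-in cost function (a real run would profile a small HiBench job), and the acceptance schedule; the dissertation's evolutionary MCMC technique is more sophisticated than this sketch.

```python
import math
import random

# Minimal Metropolis-style search over a small discrete configuration
# space. A cheap stand-in objective replaces an actual profiling run.
random.seed(0)

SPACE = {
    "io.sort.mb": [50, 100, 200, 400],
    "reduce.tasks": [1, 2, 4, 8, 16],
}

def runtime(cfg):
    """Stand-in for profiling a small benchmark job; lower is better."""
    return abs(cfg["io.sort.mb"] - 200) / 100 + abs(cfg["reduce.tasks"] - 8) / 4

def mutate(cfg):
    """Propose a neighbor by resampling one randomly chosen parameter."""
    new = dict(cfg)
    key = random.choice(list(SPACE))
    new[key] = random.choice(SPACE[key])
    return new

def mcmc_search(steps=200, temperature=0.5):
    current = {k: random.choice(v) for k, v in SPACE.items()}
    best = current
    for _ in range(steps):
        candidate = mutate(current)
        delta = runtime(candidate) - runtime(current)
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temperature) to escape local optima.
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            current = candidate
        if runtime(current) < runtime(best):
            best = current
    return best

print(mcmc_search())
```

In the dissertation's setting, each call to the objective function corresponds to profiling a real job on a small proxy dataset, which is what makes keeping the number of evaluations low so important.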

We show that, at an acceptable cost, this combination of ideas can find configurations for complex systems that significantly improve their runtime performance. To evaluate our approach, we applied it in the domain of big-data systems, particularly Hadoop and Spark. We measured the performance improvements achieved by our approach and by competing approaches by comparing performance under the configurations each approach discovered against performance under the widely employed default configurations distributed with these big-data systems. We compared our approach against three competing approaches in the extent to which they improved system performance: 1) a naive random search strategy; 2) a basic genetic algorithm; and 3) the cutting-edge, model-learning-based approach of Nair et al. Our approach improved the performance of five canonical Hadoop jobs by 14% to 25%, relative to default configurations, and of another five Spark jobs by 2% to 17%, significantly outperforming the three competing approaches.
These experimental results support the claim that our approach has significant potential to improve the performance of industrial big-data systems.

This dissertation addresses three problems in improving system performance, with novel software synthesis combined with large-scale distributed profiling, performance-model learning aided by the semantic meanings of configuration parameters, and meta-heuristic search combined with cost-saving tactics. The results show that we can significantly improve system performance in critical engineering domains, such as ORM design and big-data infrastructure tuning.

PHD (Doctor of Philosophy)
System optimization, Performance optimization, Design space, Configuration space, Space search, Search-based optimization
All rights reserved (no additional license for public reuse)
Issued Date: