Applying Genetic Algorithms to the Problem of Variable Selection in Large Datasets with Interaction Terms
Gan, Chee Chun, Systems Engineering - School of Engineering and Applied Science, University of Virginia
Learmonth, Gerard, Frank Batten School of Leadership & Public Policy, University of Virginia
Variable selection is a key step in the development of predictive models. When the dataset is relatively small, greedy algorithms such as stepwise selection perform well in selecting informative variables. However, as the size of the dataset increases, the challenges faced by such variable selection methods increase rapidly. The addition of interaction terms drastically increases the complexity of the variable selection problem, rendering greedy stepwise selection ineffective.
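To give a sense of scale, the short sketch below counts candidate terms under the assumption that only pairwise (two-way) interactions are added to the main effects; the abstract does not state the interaction order, and `candidate_term_count` is a hypothetical helper introduced here for illustration.

```python
from math import comb


def candidate_term_count(p: int) -> int:
    """Main effects plus all pairwise interaction terms for p predictors.

    Assumption: only two-way interactions are considered.
    """
    return p + comb(p, 2)


for p in (10, 100, 1000):
    terms = candidate_term_count(p)
    # Each term is either in or out of the model, so there are 2**terms subsets.
    print(f"p={p}: {terms} candidate terms, 2^{terms} possible subsets")
# p=10 gives 55 terms, p=100 gives 5050, p=1000 gives 500500
```

Under this assumption the number of candidate terms grows quadratically in the number of predictors, and the number of candidate models grows exponentially in that count, which is why an exhaustive or purely greedy search becomes impractical.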
Past research on the topic has seldom included the effect of interaction terms on predictive modeling. Part of the reason may be the aforementioned difficulty of variable selection in a large dataset. Another possibility is the tradeoff between model accuracy and complexity: the benefits from including interaction terms may be marginal. However, in certain applications such as medical diagnosis models, even a marginal increase in predictive ability may lead to significant improvements in terms of lives saved. In addition, information obtained during the variable selection process, such as which interaction terms are significant, may guide future research into why such interactions exist among certain primary predictors.
A genetic algorithm (GA) is developed in this study to handle the expanded search space of primary and interaction terms for variable selection. While GAs have been used for variable selection in the past, the chromosome formulation and selection process must be modified to accommodate interaction terms in large datasets. The GA framework is highly flexible and can handle a wide variety of models simply by choosing the appropriate fitness function. Experimental runs show that there is a benefit to including interaction terms in large datasets in addition to main effects.
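The general idea of GA-based variable selection over main effects and interaction terms can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract does not describe its modified chromosome formulation or selection process, so the sketch falls back on a plain binary mask over all main effects and pairwise products, tournament selection, uniform crossover, bit-flip mutation, and an assumed fitness (negative BIC of an ordinary least squares fit). The names `build_terms`, `bic_fitness`, and `run_ga` and all parameter defaults are hypothetical.

```python
import itertools

import numpy as np


def build_terms(X):
    """Design matrix of main effects plus all pairwise products (needs >= 2 columns)."""
    p = X.shape[1]
    pairs = list(itertools.combinations(range(p), 2))
    inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    labels = [(i,) for i in range(p)] + pairs
    return np.hstack([X, inter]), labels


def bic_fitness(mask, Z, y):
    """Assumed fitness: negative BIC of an OLS fit on the selected columns."""
    n = len(y)
    A = np.column_stack([np.ones(n), Z[:, np.flatnonzero(mask)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = max(float(np.sum((y - A @ beta) ** 2)), 1e-12)
    return -(n * np.log(rss / n) + A.shape[1] * np.log(n))


def run_ga(Z, y, pop_size=40, generations=30, mutation_rate=0.02, seed=0):
    """Plain binary-mask GA; returns the best mask evaluated and its fitness."""
    rng = np.random.default_rng(seed)
    m = Z.shape[1]
    pop = rng.random((pop_size, m)) < 0.1  # sparse random initial population
    best_mask, best_fit = None, -np.inf
    for _ in range(generations):
        fits = np.array([bic_fitness(ind, Z, y) for ind in pop])
        top = int(np.argmax(fits))
        if fits[top] > best_fit:
            best_fit, best_mask = float(fits[top]), pop[top].copy()
        # Binary tournament selection: keep the fitter of two random individuals.
        a = rng.integers(pop_size, size=pop_size)
        b = rng.integers(pop_size, size=pop_size)
        parents = pop[np.where(fits[a] >= fits[b], a, b)]
        # Uniform crossover against a shuffled copy of the parents.
        partners = parents[rng.permutation(pop_size)]
        take = rng.random((pop_size, m)) < 0.5
        children = np.where(take, parents, partners)
        # Bit-flip mutation, then elitism: reinsert the best mask found so far.
        pop = children ^ (rng.random((pop_size, m)) < mutation_rate)
        pop[0] = best_mask
    return best_mask, best_fit
```

Swapping `bic_fitness` for another scoring function (for example, cross-validated accuracy of a classifier) is how a sketch like this would be adapted to a different model family, in line with the flexibility the abstract attributes to the fitness function.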
PHD (Doctor of Philosophy)
genetic algorithm, variable selection, interaction terms, high-dimensional data
All rights reserved (no additional license for public reuse)