Variable Selection on Compositional Data Based on Variable Deletion

Perez-Suarez, David, Statistics - Graduate School of Arts and Sciences, University of Virginia
Zhou, Jianhui, Statistics, University of Virginia

Compositional data consist of proportions or percentages of the parts of a whole. Each observation is typically a vector of positive components, and the relevant information lies in the ratios between those components. The defining feature of compositional data is that the observed values of the compositional variables sum to 1 for each subject. This constraint makes the selection of informative variables challenging when the dimensionality is high, because many existing variable selection methods cannot accommodate this data structure. Compositional data appear in a wide range of fields, such as geology, consumer demand analysis, and forensic science, so an effective variable selection method for such data is highly desirable. In this work, we develop a variable selection method for compositional data in a linear regression model. The method is based on deleting subsets of the variables and measuring the corresponding changes in the coefficient of determination. The deletion procedure can be computed efficiently. The method performs satisfactorily in simulation studies, and it can also be generalized to more complicated models.
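The deletion idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the regression model or the deletion algorithm, so everything below is an assumption made for illustration: the response is regressed on additive log-ratios of the composition (equivalent to a log-contrast model with zero-sum coefficients), a subset is deleted by dropping its components and re-closing the remaining ones to sum to 1, and the importance of the subset is scored by the resulting drop in the coefficient of determination. The function names `r_squared_log_ratio` and `deletion_r2_drop` are hypothetical, and this naive refit is not the efficient computation the thesis refers to.

```python
import numpy as np


def r_squared_log_ratio(X, y):
    """R^2 of an ordinary least-squares fit of y on log-ratios of X.

    X is an (n, p) array of strictly positive compositions whose rows sum to 1.
    Assumption for this sketch: additive log-ratios with the last component
    as the reference, plus an intercept.
    """
    log_x = np.log(X)
    Z = log_x[:, :-1] - log_x[:, [-1]]          # additive log-ratio transform
    A = np.column_stack([np.ones(len(y)), Z])   # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / total


def deletion_r2_drop(X, y, subset):
    """Drop in R^2 when the components indexed by `subset` are deleted.

    The remaining components are re-closed so that each row sums to 1 again.
    At least two components must remain after deletion.
    """
    dropped = set(subset)
    keep = [j for j in range(X.shape[1]) if j not in dropped]
    X_sub = X[:, keep]
    X_sub = X_sub / X_sub.sum(axis=1, keepdims=True)  # re-close the subcomposition
    return r_squared_log_ratio(X, y) - r_squared_log_ratio(X_sub, y)
```

Under these assumptions the reduced model is nested in the full one, so the drop is non-negative up to rounding error. A large drop suggests that the deleted subset carries information about the response, while a drop near zero suggests it does not.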

PHD (Doctor of Philosophy)
All rights reserved (no additional license for public reuse)
Issued Date: