Efficient and Interpretable learning on Graph Structures

Author: ORCID icon orcid.org/0009-0006-0360-4928
Qi, Mingyu, Statistics - Graduate School of Arts and Sciences, University of Virginia
Li, Tianxi, Statistic, University of Virginia

Advances in data collection and the proliferation of social media have brought network data into various fields, including the social sciences, economics, transportation, and biology. Efficient structural learning on networks (e.g., node groupings and the frequency of specific patterns, such as triangles, within graphs) is crucial for understanding complex system dynamics, predicting behavior, and devising interventions to enhance system performance. This thesis presents novel principled statistical tools for structural learning on networks, emphasizing appealing statistical properties and computational efficiency.

The first project features a non-overlapping and separable penalty that approximates the overlapping group lasso penalty. Overlapping group lasso penalty is often used to introduce structured sparsity in statistical learning given the penalty’s ability to eliminate predefined groups of parameters. However, when groups overlap, solving the group lasso problem can be time consuming in high-dimensional settings due to groups’ non-separability. This issue has constrained the penalty's relevance to cutting-edge computational areas, such as gene pathway selection and graphical model estimation. The proposed approximation greatly improves the computational efficiency of optimization, especially for large-scale and high-dimensional problems. This project confirms the proposed penalty as the tightest separable relaxation of the overlapping group lasso norm within the family of $\ell_{q_1}/\ell_{q_2}$ norms. Furthermore, estimators based on the proposed norm are statistically equivalent to those derived from overlapping group lasso in terms of estimation error, support recovery, and the minimax rate under the squared loss.

Scientists have broadly leveraged differential co-expression analysis to clarify diseases’ biological mechanisms. Yet the unknown differential patterns tend to be complicated. As such, models based on simplified parametric assumptions may not pinpoint all discrepancies. Meanwhile, biological theory further indicates that, rather than existing in isolation, genes operate in groups to perform biological functions. Thus, differential co-expression analysis becomes more meaningful when genes’ co-functioning structure is taken into account. The second project develops a new means of identifying differentially correlated gene groups. This technique enjoys computational efficiency and accuracy while integrating group information in analysis. As a by-product, a parameter-tuning procedure is put forth which considers the structural assumption and outperforms standard methods. Multiple simulation examples and a differential analysis of vascular smooth muscle cell gene expression data substantiate its utility.

Network moments measure the frequency of specific patterns and are instrumental in network analysis. Subsampling techniques serve as tools for approximating network moments' distribution in scenarios where limited networks are available. Although the univariate distribution of network moments can be well approximated, fairly little is known about the consistency for their joint distribution. The third and final project in this thesis focus on estimating the joint distribution of network moments through network node subsampling, thereby aiding in the characterization of the network's distribution. Our contribution opens the door for computationally efficient and interpretable network analysis innovations. For example, a multivariate inference approach is illustrated by analyzing collaboration networks and guppy gene expression networks.

PHD (Doctor of Philosophy)
Network analysis, High-dimensional data, Structural learning
Issued Date: