Abstract
Modern data science increasingly confronts the challenge of heterogeneity—systematic variation in parameters, structures, and distributions across individuals, time, and institutions. This dissertation develops new statistical methodologies for modeling, analyzing, and learning from heterogeneous data across three domains: longitudinal dynamics, high-dimensional genomics, and decentralized federated systems.
First, we introduce a dynamic heterogeneous regression framework for longitudinal data that captures time-varying subgroup structure. Heterogeneous effect models have become increasingly prevalent in precision medicine and market segmentation, yet modeling heterogeneity that evolves over time remains a significant challenge. We propose a dynamic subgrouping framework that identifies temporal patterns in covariate effects across latent subgroups, and we develop it from two complementary perspectives. In the regularization-based approach, we employ the MDSP penalty to characterize smooth temporal trajectories of subgroup-specific effects. In the model-based approach, we integrate a fusion penalty with an EM algorithm to recover local homogeneity within contiguous time intervals.
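For intuition only, a generic fusion-penalized objective over time can be written as follows. The notation is assumed here rather than taken from the dissertation: \beta_{k,t} is the effect vector of subgroup k at time t, \ell is a log-likelihood, \lambda is a tuning parameter, and K and T are the numbers of subgroups and time points. This is the standard fused-lasso form and may differ from the exact penalty used in this work.

```latex
\min_{\beta}\; -\ell(\beta) \;+\; \lambda \sum_{k=1}^{K} \sum_{t=1}^{T-1} \left\lVert \beta_{k,t+1} - \beta_{k,t} \right\rVert_{1}
```

Because the L1 norm shrinks many successive differences exactly to zero, each subgroup's effect becomes piecewise constant in time, which is one way to express local homogeneity within contiguous time intervals.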
Second, we address the double-dipping problem in single-cell RNA sequencing (scRNA-seq), where using the same data for clustering and differential expression analysis inflates false discoveries. We develop an integrated framework for joint high-dimensional clustering, automatic feature selection, and valid post-selection inference. Our method filters nuisance genes, identifies stable clusters via penalized likelihood, and controls the finite-sample false discovery rate (FDR) using selective inference theory. Simulations confirm valid error control and improved statistical power, and applications to real scRNA-seq datasets reveal novel cell-type-specific gene programs validated by independent transcriptomic experiments.
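The double-dipping problem can be illustrated with a small simulation. The sketch below is not the method developed in this dissertation: it only shows, on pure-noise data, how testing for differences between clusters estimated from the same data inflates the rejection rate. The function name, the crude leading-component split used as a stand-in for clustering, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def double_dipping_rate(n_cells=200, n_genes=500, threshold=1.96, seed=0):
    """Compare rejection rates on null data with and without double dipping.

    Hypothetical illustration: every "gene" is pure noise, so any rejection
    is a false discovery.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_cells, n_genes))  # no true cluster structure

    # Crude stand-in for clustering: split cells at the median of their
    # score on the leading principal component.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    score = Xc @ vt[0]
    labels = score > np.median(score)

    def rejection_rate(data):
        # Per-gene Welch t-statistic between the two estimated clusters.
        a, b = data[labels], data[~labels]
        se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
        t = (a.mean(axis=0) - b.mean(axis=0)) / se
        return float(np.mean(np.abs(t) > threshold))

    # Naive analysis: test on the same data that produced the clusters.
    naive = rejection_rate(X)
    # Reference: apply the same labels to independent noise, so the labels
    # carry no information about the tested data.
    reference = rejection_rate(rng.standard_normal((n_cells, n_genes)))
    return naive, reference


naive, reference = double_dipping_rate()
print(f"naive rejection rate: {naive:.3f}, reference rate: {reference:.3f}")
```

The naive rate exceeds the nominal level because the clusters were chosen to separate the very noise that is then tested, whereas the reference rate stays near the nominal level. Selective inference aims to correct for this without requiring independent data.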
Third, we propose a personalized decentralized federated learning (DFL) framework for heterogeneous distributed systems. Unlike classical federated averaging, which imposes a single global model, our method allows clients to maintain personalized local models while benefiting from cross-client information sharing. We employ random graph sampling to enable scalable, server-free peer-to-peer communication with substantially reduced bandwidth requirements. Each client performs locally adaptive updates regularized toward neighborhood consensus, achieving a principled balance between personalization and global knowledge aggregation.
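A minimal sketch of this kind of update is given below, under stated assumptions. It is not the algorithm of this dissertation: the squared-error local loss, the Erdos-Renyi graph resampled each round, the quadratic pull toward the neighborhood average, and the names `personalized_dfl`, `lam`, `lr`, and `edge_prob` are all illustrative choices.

```python
import numpy as np


def personalized_dfl(X, y, lam=0.5, lr=0.05, edge_prob=0.3, rounds=200, seed=0):
    """Hypothetical personalized decentralized learning for linear models.

    X has shape (clients, samples, features); y has shape (clients, samples).
    Returns one parameter vector per client, shape (clients, features).
    """
    rng = np.random.default_rng(seed)
    m, n, d = X.shape
    W = np.zeros((m, d))  # row i is client i's personalized model
    for _ in range(rounds):
        # Sample a symmetric random communication graph for this round
        # (no central server; each client talks only to sampled peers).
        upper = np.triu(rng.random((m, m)) < edge_prob, k=1)
        A = (upper | upper.T).astype(float)
        deg = A.sum(axis=1, keepdims=True)
        # Average of neighbors' models; isolated clients fall back to their own.
        neigh_mean = np.where(deg > 0, A @ W / np.maximum(deg, 1.0), W)
        # Gradient of each client's local mean squared error.
        resid = np.einsum("mnd,md->mn", X, W) - y
        grad = np.einsum("mnd,mn->md", X, resid) / n
        # Local step, regularized toward the neighborhood consensus.
        W = W - lr * (grad + lam * (W - neigh_mean))
    return W
```

In this sketch `lam` governs the trade-off: `lam = 0` gives fully independent local fits, while larger values pull each client's model toward the average of its sampled neighbors. A lower `edge_prob` means fewer messages per round at the cost of slower information sharing.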
Collectively, this work advances the statistical foundations of heterogeneous data analysis and provides practical tools for understanding variation in complex systems. These methods enable researchers to move beyond population-average effects toward personalized, dynamic, and structured inference at scale. The proposed frameworks have immediate applications in precision medicine, genomics, finance, and privacy-preserving machine learning, while opening new research directions at the interface of statistics, optimization, and distributed computation.