Network Models with Alternative Sampling Methods
Chan, Ga Ming Angus, Statistics - Graduate School of Arts and Sciences, University of Virginia
Lubberts, Zachary, AS-Statistics (STAT), University of Virginia
Networks are a powerful tool for modeling complex, dependent data and recovering insights
that would otherwise remain hidden. As data engineering technology advances, both the
abundance and the complexity of data are growing rapidly. Interest is rising in the
relationships between entities in systems such as social networks, citation networks,
and neural systems, to name a few. Motivated by this demand, numerous network
models have been proposed and shown to adapt to various scenarios.
Yet models are not the only concern. The sampling mechanism plays a crucial role
in truly understanding data. While modeling is mainly about describing how data are
generated, properly recognizing how data are observed is just as important. It is not an
exaggeration to say that if the sampling mechanism is misspecified, the model
fundamentally misrepresents the truth. In two projects, we study data obtained under
two specific sampling mechanisms and present models, together with their theoretical
properties, tailored to these situations.
Egocentric sampling is the backbone of Chapter 2. This sampling design assumes
that we observe only a subset of the entities in the community, together with the
interactions involving at least one of them. It is designed to resemble the data
collection process of social networks. As data privacy receives growing emphasis, we
believe that this sampling setting will see increasing use. We show that our model
outperforms benchmark methods in both simulation studies and real networks, and we
present theoretical guarantees on the accuracy of our predictions.
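As a rough illustration of the egocentric design described above, the following minimal sketch filters a toy edge list down to the interactions that involve at least one sampled entity. The function name `egocentric_sample`, the toy edge list, and the chosen set of sampled entities are all hypothetical; this is not the model or estimator developed in the thesis, only a picture of which data would be observed.

```python
def egocentric_sample(edges, egos):
    """Keep only the edges that touch at least one sampled entity (ego)."""
    egos = set(egos)
    return [(u, v) for (u, v) in edges if u in egos or v in egos]


# Hypothetical undirected network on nodes 0..4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

# Hypothetical sampled subset of entities.
egos = {0, 3}

observed = egocentric_sample(edges, egos)
print(observed)  # [(0, 1), (0, 2), (2, 3), (3, 4)]
```

In this toy case the interaction between nodes 1 and 2 is never observed, because neither endpoint was sampled.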
Interaction hypergraphs in Chapter 3 are inspired by edge-exchangeable sampling.
Contrary to the common assumption that entities are the sampling units, our model
assumes that edges are the sampling units. This choice is motivated by
real-life networks, such as brain connectivity networks, in which synapses rather than
neurons are observed, and email networks, in which emails rather than users are observed.
This difference in assumptions leads to different probability distributions and requires
separate modeling strategies. Furthermore, instead of requiring each edge to involve
exactly two nodes, hypergraphs allow an interaction to involve more than two nodes and
to occur multiple times on the same set of nodes. Even though higher-order interactions
can be reduced to edges between all pairs of the nodes involved, the two representations
remain conceptually different, as we illustrate in the thesis. Finally, we present the
theoretical properties of our method and evaluate its clustering performance.
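One way to see why reducing a multi-node interaction to pairwise edges loses information is the small sketch below. It assumes a simple "all pairs" reduction with edge multiplicities; the helper name `clique_expansion` and the toy interactions are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def clique_expansion(hyperedges):
    """Replace each interaction by all pairs of its nodes, counting multiplicity."""
    pairs = Counter()
    for edge in hyperedges:
        # Sort the distinct nodes so each unordered pair has one canonical key.
        for u, v in combinations(sorted(set(edge)), 2):
            pairs[(u, v)] += 1
    return pairs


# One three-way interaction versus three separate two-way interactions.
one_triple = [("a", "b", "c")]
three_pairs = [("a", "b"), ("a", "c"), ("b", "c")]

print(clique_expansion(one_triple) == clique_expansion(three_pairs))  # True
```

Both inputs collapse to the same weighted pairwise graph, so the pairwise representation alone cannot tell a single joint interaction among three nodes from three separate two-node interactions.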
PHD (Doctor of Philosophy)
Spectral clustering, Nonuniform hypergraphs, Random hypergraphs, Network inference, Random graphs
English
2025/05/01