Tensor Modeling of High-dimensional Distributions and Its Applications in Machine Learning

Amiridi, Magda, Electrical Engineering - School of Engineering and Applied Science, University of Virginia
Sidiropoulos, Nikolaos, EN-Elec & Comp Engr Dept, University of Virginia

Effective non-parametric modeling of probability distributions is a central problem in statistics and machine learning. The large amounts of data generated by real-world systems have created unprecedented opportunities to apply such models to critical machine learning tasks that require quick and informed decisions. However, difficulties inherent in modern data, such as high dimensionality, incomplete data (vector realizations with missing entries), and complex high-order interactions between features, pose a key challenge and call for expressive methods that are efficient in both computation and memory. The main contribution of this dissertation is to introduce principled methods for non-parametric estimation of the data-generating distribution function using low-rank tensor models and to show their potential in various real-world applications.

The first part of the dissertation focuses on non-parametric density estimation through the lens of complex Fourier series approximation and low-rank tensor modeling (a low-rank characteristic function approach). We show that any smooth, compactly supported multivariate probability density function (PDF) can be approximated by a finite tensor model and that, under relatively mild assumptions, the proposed model can approximate any high-dimensional PDF with approximation guarantees. We also show that, under a low-rank assumption in the Fourier domain, the underlying multivariate density is identifiable by virtue of the uniqueness of low-rank tensor decomposition. A promising extension of this work, suitable for even higher-dimensional data such as images, is a joint dimensionality reduction and density estimation framework, in which dimensionality reduction is carried out by deep autoencoders and the latent density is modeled by a non-parametric low-rank tensor model in the Fourier domain.
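As a rough illustration of the low-rank characteristic function idea, the sketch below evaluates a density on [0, 1]^N from a rank-H factorization of its truncated Fourier coefficients. It is a toy under stated assumptions, not the dissertation's estimator: the sign convention of the series, the function name evaluate_density, the Beta-distributed toy components, and the values of N, H, K, and M are all illustrative choices, and the factors are built directly from samples instead of being fitted.

```python
import numpy as np


def evaluate_density(x, factors, weights):
    """Evaluate f(x) = sum_h w_h * prod_n sum_k A_n[k, h] * exp(-2j*pi*k*x_n).

    x       : (P, N) array of points in [0, 1]^N
    factors : list of N complex arrays, each (2K+1, H), rows indexed k = -K..K
    weights : (H,) nonnegative array summing to one
    (All names and the convention are assumptions made for this sketch.)
    """
    x = np.atleast_2d(x)
    K = (factors[0].shape[0] - 1) // 2
    k = np.arange(-K, K + 1)
    prod = np.ones((x.shape[0], weights.size), dtype=complex)
    for n, A in enumerate(factors):
        basis = np.exp(-2j * np.pi * np.outer(x[:, n], k))  # (P, 2K+1)
        prod *= basis @ A                                   # (P, H)
    # Conjugate-symmetric factors give a real-valued result.
    return (prod @ weights).real


# Toy factors: empirical characteristic-function samples of H product
# components, each a Beta distribution per dimension (hypothetical data).
rng = np.random.default_rng(0)
N, H, K = 2, 3, 5
k = np.arange(-K, K + 1)
weights = np.array([0.5, 0.3, 0.2])
factors = []
for n in range(N):
    cols = []
    for h in range(H):
        samples = rng.beta(2 + h, 5 - h, size=2000)
        # A_n[k, h] approximates E[exp(2j*pi*k*X_n)] under component h.
        cols.append(np.exp(2j * np.pi * np.outer(k, samples)).mean(axis=1))
    factors.append(np.stack(cols, axis=1))

# Evaluate on a uniform grid and approximate the integral over [0, 1]^2.
M = 32
g = np.arange(M) / M
grid = np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1).reshape(-1, N)
vals = evaluate_density(grid, factors, weights)
integral = vals.mean()
```

The full coefficient tensor has (2K+1)^N entries, while the factored form stores only N(2K+1)H numbers and evaluates a point in O(NKH) operations. A truncated Fourier series can dip slightly below zero, so the sketch checks only that the model integrates to one.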

However, for some applications, such as problems involving hybrid random variables (continuous and discrete) or tasks requiring multivariate integration of the high-dimensional PDF (e.g., corresponding to finite or semi-infinite "box" events), modeling the joint cumulative distribution function (CDF) seems more appropriate. The second part of the dissertation introduces a novel parametrization of grid-sampled multivariate CDFs using tensor factorization in the data domain or in the copula domain. Furthermore, the abundance of time-series data in machine learning systems, such as sensor signals collected from wrist-worn devices, brings new demands in density estimation: it requires estimating time-varying, high-dimensional distributions that accurately capture both inter- and intra-series dependencies. Our proposed model combines the temporal modeling power of recurrent neural networks (RNNs) with the parsimony of principled low-rank density models, yielding a versatile high-dimensional distribution model for time series. As a practical demonstration, we apply our models to various machine learning tasks, where they perform competitively with the state of the art. In particular, we apply the proposed low-rank CDF method to predict the enrollment rate (ER) in clinical trials and demonstrate significant improvements over previous methods used in the health informatics industry. Accurate ER prediction is key to successful and timely clinical trials and has potential for strong societal impact on human health.
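To make the appeal of a factored CDF for "box" events concrete, the sketch below computes a box probability from a grid-sampled CDF tensor held in low-rank form. It assumes, purely for illustration, that the CDF tensor is a weighted sum of H outer products of monotone univariate CDF columns; the dissertation's actual parametrization and constraints may differ. The function names, the random toy factors, and the box indices are hypothetical.

```python
import itertools

import numpy as np


def cdf_entry(idx, factors, weights):
    """F[i_1, ..., i_N] = sum_h w_h * prod_n C_n[i_n, h] (assumed model)."""
    prod = np.ones_like(weights)
    for C, i in zip(factors, idx):
        prod = prod * C[i]
    return float(prod @ weights)


def box_prob_inclusion_exclusion(lo, hi, factors, weights):
    """P(lo < X <= hi) by summing signed CDF values over the 2^N corners."""
    total = 0.0
    for corner in itertools.product((0, 1), repeat=len(lo)):
        idx = [u if c else l for c, l, u in zip(corner, lo, hi)]
        sign = (-1) ** (len(lo) - sum(corner))
        total += sign * cdf_entry(idx, factors, weights)
    return total


def box_prob_factored(lo, hi, factors, weights):
    """Same probability, using the factorization: O(N*H) instead of O(2^N)."""
    prod = np.ones_like(weights)
    for C, l, u in zip(factors, lo, hi):
        prod = prod * (C[u] - C[l])
    return float(prod @ weights)


# Toy factors: each column is a monotone univariate CDF sampled on I grid
# points and normalized to end at 1 (hypothetical data).
rng = np.random.default_rng(0)
N, H, I = 4, 3, 20
weights = np.array([0.2, 0.5, 0.3])
factors = []
for _ in range(N):
    C = np.cumsum(rng.random((I, H)), axis=0)
    factors.append(C / C[-1])

lo, hi = [2, 5, 0, 7], [10, 19, 4, 15]  # grid indices of the box corners
p_ie = box_prob_inclusion_exclusion(lo, hi, factors, weights)
p_fact = box_prob_factored(lo, hi, factors, weights)
```

Under this assumed model the two routines agree, but the factored version avoids enumerating the 2^N corners, which is what makes box-event queries cheap in high dimensions.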

PHD (Doctor of Philosophy)
probabilistic modeling, low-rank tensor models, identifiable models, CPD decomposition, probabilistic inference, density estimation, generative models
All rights reserved (no additional license for public reuse)
Issued Date: