Probabilistic Modeling and Analysis Methods for Large-Scale Data Compression

Author:
Lou, Hao, Electrical Engineering - School of Engineering and Applied Science, University of Virginia
Farnoud, Farzad, EN-Elec & Comp Engr Dept, University of Virginia

As a result of the rapid increase in the size of digital data, addressing current data storage needs and provisioning for future growth have become challenging tasks. Data stored in computing systems, however, is redundant, containing blocks that are repeated one or more times, such as an image embedded in multiple documents. Data deduplication, which eliminates redundant data at the file or subfile level, has gained increasing attention and popularity in large-scale storage systems.

However, there are significant shortcomings in the current approaches to deduplication, which are mainly due to the lack of a mathematical framework and analytical methods. In particular, our knowledge of the fundamental limits of data deduplication is severely limited. Information-theoretic analysis of data deduplication has been considered by only a limited number of works. Moreover, in these works, deduplication algorithms are evaluated over information sources with known statistical properties, while in reality, given a particular data sequence, we rarely know the specific distribution of the source that produced it. In this dissertation, we study data deduplication theoretically by developing and analyzing different models for information sources and various deduplication algorithms, as well as their performance evaluated over source models. Moreover, we extend the study of data deduplication to the context of universal compression, where we assume only that a family of distributions to which the source distribution belongs is known, by drawing an equivalence between encoding chunks and the problem of encoding pattern sequences.
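To make the chunk-based view of deduplication concrete, the following is a minimal sketch, not the dissertation's actual scheme: data is split into fixed-size chunks, each chunk is fingerprinted with a hash, and only one copy per distinct chunk is stored while the original sequence is kept as a list of fingerprint references. The function name, chunk size, and choice of SHA-256 are illustrative assumptions.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 8):
    """Fixed-size chunk deduplication sketch (illustrative, not the
    dissertation's algorithm). Splits data into chunks, stores each
    distinct chunk once, and represents the data as fingerprint refs."""
    store = {}   # fingerprint -> chunk contents (one copy per distinct chunk)
    refs = []    # sequence of fingerprints reconstructing the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep only the first copy
        refs.append(digest)
    return refs, store

# Repeated chunks are stored once: three chunks, two of them identical.
refs, store = deduplicate(b"abcdefghabcdefghXYZWXYZW", chunk_size=8)
```

Here the 24-byte input yields three chunk references but only two stored chunks, illustrating how repetition at the subfile level translates into storage savings.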

One of the significant contributors to the increase in the size of digital data is genomic sequencing data, which, due to the development of high-throughput sequencing methods, is growing at a rate much faster than the decrease in the cost of storage media. Hence, developing compression methods tailored to genomic data can be effective in addressing data storage challenges. Developing and leveraging models for the biological processes that generate this type of data are also crucial steps in this direction. Biological sequences are created over billions of years by mutation processes that shape their statistical properties. In this dissertation, we present a framework for modeling biological sequences as outcomes of such evolutionary processes driven by random mutations. The statistics of $k$-mer frequencies are studied both asymptotically and in finite time. Based on these findings, the entropy of evolutionary processes is bounded, thus providing a lower bound for compression tasks. Additionally, estimates for waiting time problems, with potential applications to the study of diseases such as cancer, are derived.
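As a small illustration of the $k$-mer statistics mentioned above, the sketch below counts $k$-mer frequencies in a sequence and computes the plug-in (empirical) entropy of the resulting distribution, a standard quantity related to compression lower bounds. The function names and the toy sequence are assumptions for illustration, not results from the dissertation.

```python
import math
from collections import Counter

def kmer_frequencies(seq: str, k: int) -> Counter:
    """Count occurrences of each length-k substring (k-mer) in seq,
    using overlapping windows."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def empirical_entropy(freqs: Counter) -> float:
    """Plug-in Shannon entropy (bits) of the empirical k-mer distribution."""
    total = sum(freqs.values())
    return -sum((c / total) * math.log2(c / total) for c in freqs.values())

# Toy DNA sequence: 2-mer frequencies and their empirical entropy.
freqs = kmer_frequencies("ACGTACGT", 2)
h = empirical_entropy(freqs)
```

For the toy sequence "ACGTACGT", the seven overlapping 2-mers are AC, CG, GT, TA, AC, CG, GT, so AC, CG, and GT each occur twice and TA once.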

PHD (Doctor of Philosophy)
Data compression, Data deduplication, Information theory, Probabilistic modeling
Issued Date: