Error-Correcting Codes for Data Storage in DNA
Tang, Yuanyuan, Electrical Engineering - School of Engineering and Applied Science, University of Virginia
Farnoud, Farzad, EN-Elec & Comp Engr Dept, University of Virginia
Recent advances in DNA sequencing, synthesis, and editing technologies have made DNA a promising alternative to conventional storage media. Compared to traditional media, DNA has the advantages of high data density, longevity, and ease of copying information. However, a diverse set of errors, including deletions, insertions, substitutions, and duplications may arise at different stages of the data storage and retrieval process. The dissertation focuses on constructing error-correcting codes to correct multiple sources of error in DNA data storage.
First, duplications and various types of edit errors (including insertions, deletions, and substitutions) may affect the integrity of data stored in DNA. While recent research has addressed correcting each of these error types in isolation, there is a notable gap in correcting them simultaneously, to the best of our knowledge. This is problematic as the presence of one type of error does not preclude another, especially in long sequences. In this dissertation, we present families of codes that simultaneously correct any number of duplications and a bounded number of edit errors, with small impact on code rate compared to the state-of-the-art codes for duplications only.
Second, to correct various types of localized errors, we construct error-correcting codes for substring edit errors. A localized error occurs when several errors occur in a bounded window of the input data. In the literature, localized deletions, insertions, and substitutions have been studied. A substring edit, defined as replacing a substring with another string, both with bounded length, encompasses all aforementioned localized errors. In this work, we first show using statistical analysis of errors that substring edits are common and viewing errors as substring edits in principle will require less redundancy. Then error-correcting codes are constructed to correct a substring edit with low redundancy, providing a universal solution to correct a diverse set of localized errors.
Third, we develop codes for improving the reliability of an emerging DNA synthesis method that due its cost-effectiveness can alleviate one of the main obstacles for wide spread use of data storage in DNA, namely terminator-free template-independent enzymatic DNA synthesis (simplified as enzymatic synthesis). While more economical, enzymatic synthesis suffers from a higher error rate. Specifically, the number of times each base is synthesized cannot be controlled precisely and also deletion errors are likely. Existing encoding methods have either a writing rate upper bounded by $\log_2 3$ bits per unit time or cannot combat against deletions. In this dissertation, we present an error-correcting code and a decoding algorithm to combat deletions and achieve a writing rate higher than the state of the art. The error probability of the proposed method is analyzed and the strategies for tuning the parameters to achieve desirable tradeoffs between different requirements are presented.
Together, all the error correction algorithms, tools, and techniques for duplications and edits, substring edits, and the errors from the enzymatic synthesis presented in this work have the potential to contribute to the development of DNA data storage systems with increased capacity, higher reliability, and lower cost.
PHD (Doctor of Philosophy)
error-correcting codes, edit errors, duplications, substring edits, noisy run lengths, low redundancy, statistical analysis, data storage in DNA
All rights reserved (no additional license for public reuse)