Advancing Reproducibility in Environmental Modeling: Integration of Open Repositories, Process Containerizations, and Seamless Workflows
Choi, Young Don, Civil Engineering - School of Engineering and Applied Science, University of Virginia
Goodall, Jonathan, EN-Eng Sys and Environment, University of Virginia
There is growing acknowledgment and awareness of the reproducibility challenge facing computational environmental modeling. To overcome this challenge, data sharing using open, online repositories that meet the FAIR (Findable, Accessible, Interoperable, and Reusable) guiding principles is recognized as a minimum standard to reproduce computational research. Even with these data sharing guidelines and well-documented workflows, it remains challenging to reproduce computational models due to complexities like inconsistent computational environments or difficulties in dealing with large datasets that prevent seamless, end-to-end modeling. Containerization technologies have been put forward as a means for addressing these problems by encapsulating computational environments, yet domain science researchers are often unclear about which containerization approach and technology is best for achieving a given modeling objective. Thus, to meet FAIR principles, researchers need clear guidelines for encapsulating seamless modeling workflows, especially for environmental modeling use cases that require large datasets. Toward these aims, this dissertation presents three studies to address current limitations of reproducibility in environmental modeling. The first study presents a framework for integrating three key components to improve reproducibility within modern computational environmental modeling: 1) online repositories for data and model sharing, 2) computational environments along with containerization technology and Jupyter notebooks for capturing reproducible modeling workflows, and 3) Application Programming Interfaces (APIs) for intuitive programmatic control of simulation models. The second study focuses on approaches for containerizing computational processes and suggests best practices and guidance for which approach is most appropriate to achieve specific modeling objectives when simulating environmental systems. 
The third study focuses on open and reproducible seamless environmental modeling workflows, especially when creating and sharing interoperable and reusable large-extent spatial datasets as model input. Key research contributions across these three studies are as follows. 1) Integration of online repositories for data and model sharing, computational environments along with containerization technology for capturing software dependencies, and workflows using model APIs and notebooks for model simulations creates a powerful system for more open and reproducible environmental modeling. 2) Considering the needs and purposes of research and educational projects, and applying the appropriate containerization approach for each use case, makes computational research more reliable and efficient. 3) Sharing interoperable and reusable large-extent spatial datasets through open data repositories for model input supports seamless environmental modeling where data and processes can be reused across multiple applications. Finally, the methods developed and insights gained in this dissertation not only advance reliable and efficient computational reproducibility in environmental modeling, but also serve as best practices and guidance for achieving reproducibility in engineering practice and other scientific fields that rely on computational modeling.
PHD (Doctor of Philosophy)
Reproducibility, Open Hydrology