Advancing Cyberinfrastructure for Reproducible Hydrologic Modeling

Author: orcid.org/0000-0003-2295-7981
Essawy, Bakinam, Civil Engineering - School of Engineering and Applied Science, University of Virginia
Advisor:
Goodall, Jonathan, Civil & Env Engr, University of Virginia
Abstract:

Scientists have created a significant and growing collection of software tools for data manipulation, analysis, and simulation. This software includes not only computational models, but also a large collection of data pre- and post-processing tools used to support computational modeling and data analysis. This large and diverse set of scientific modeling software presents challenges to the hydrologic science community, including (1) the difficulty of making scientific workflows reproducible in support of an end-to-end hydrologic modeling use case, especially when data processing pipelines require large dataset transfers, (2) providing adequate and accessible component-level metadata for legacy hydrologic software, and (3) ensuring scientific reproducibility when using legacy hydrologic processing software, which often has complex software dependencies. This research uses the Variable Infiltration Capacity (VIC) and MODFLOW hydrologic models as use cases in examining these challenges.
This research addresses these challenges through three studies. The first study explores approaches for leveraging data grid technology in hydrologic modeling to support reproducible workflows using large datasets. Its primary contribution is a general methodology for analyzing large, distributed data collections. This is accomplished by moving processing resources to the large datasets, in contrast to the typical approach of moving the datasets to the processing resources. Data grid technology is used to automate data transfers and staging, in combination with automated formal publication of generated data assets.
The second study advances prior efforts to formalize model metadata in hydrology by evaluating the OntoSoft Ontology as a means of formally structuring model metadata for lower-level scientific software components. The metadata evaluated describes a data pre-processing workflow for the VIC hydrologic model. This workflow consists of multiple software components written by different scientists over time. The analysis begins by exploring what metadata hydrologists have already captured in unstructured forms. It then shows how this metadata could be organized into structured, machine-readable metadata using the OntoSoft Ontology.
Finally, the third study explores the creation of containers using Docker to more easily execute hydrologic modeling software in a computing environment. By containerizing a model with all of its dependencies, the model is self-contained and portable. This work contributes a methodology for using HydroShare and GeoTrust, two new cyberinfrastructure tools under active development, to improve reproducibility in computational hydrology. HydroShare is a web-based system for sharing hydrologic data and model resources. GeoTrust allows scientists to document their computational workflows as containers called sciunits, a type of Docker container. Sciunits include the required software dependencies, making execution of computational workflows more consistent across computing environments. HydroShare and GeoTrust can be used together to create open, reusable data analysis and model execution services. The services are created using GeoTrust and can be integrated with HydroShare as web apps that operate on HydroShare resources. The MODFLOW groundwater model is used as an example to show the functionality provided by this cyberinfrastructure for creating open and reusable data analysis and model execution services.

Degree:
PHD (Doctor of Philosophy)
Keywords:
Hydrologic modeling, Workflows, Computational reproducibility, Federation, iRODS, Scientific workflows, Metadata, MODFLOW
Language:
English
Issued Date:
2017/07/18