Advancing Open and Reproducible Water Data Science by Integrating Data
Analytics with an Online Data Repository
Abstract
Scientific and related management challenges in the water domain require
synthesis of data from multiple domains. Many data analysis tasks are
difficult because datasets are large and complex; standard formats for
data types are not always agreed upon or mapped to efficient
structures for analysis; water scientists may lack training in methods
needed to efficiently tackle large and complex datasets; and available
tools can make it difficult to share, collaborate around, and reproduce
scientific work. Overcoming these barriers to accessing, organizing, and
preparing datasets for analysis will help transform scientific
inquiry. Building on the HydroShare repository’s
established cyberinfrastructure, we have advanced two packages for the
Python language that make data loading, organization, and curation for
analysis easier, reducing time spent in choosing appropriate data
structures and writing code to ingest data. These packages enable
automated retrieval of data from HydroShare and the USGS’s National
Water Information System (NWIS); loading of data into performant
structures that are keyed to specific scientific data types and that
integrate with existing visualization, analysis, and data science
capabilities available in Python; and writing of analysis results back
to HydroShare for sharing and eventual publication. Because the packages
are installed and maintained within CUAHSI’s HydroShare-linked
JupyterHub server, scientists also face less technical burden in
creating a computational environment for executing analyses.
HydroShare users can leverage these tools to build, share, and
publish more reproducible scientific workflows. The HydroShare Python
Client and USGS NWIS Data Retrieval packages can be installed within a
Python environment on any computer running Microsoft Windows, Apple
macOS, or Linux from the Python Package Index using the pip utility.
They can also be used online via the CUAHSI JupyterHub server
(https://jupyterhub.cuahsi.org/) or other Python notebook environments
such as Google Colaboratory (https://colab.research.google.com/). Source
code, documentation, and examples for the software are freely available
on GitHub at https://github.com/hydroshare/hsclient/ and
https://github.com/USGS-python/dataretrieval.
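
The retrieve, analyze, and write-back workflow described above might look
roughly like the following minimal sketch. It assumes both packages have
been installed (for example with "pip install hsclient dataretrieval") and
that the user has a HydroShare account. The helper names (date_window,
fetch_daily_discharge, publish_csv, run_example), the site number, the
file name, and the resource title are illustrative assumptions and do not
come from the paper; the package calls reflect our reading of the public
documentation and should be checked against the installed versions.

```python
# Sketch only: assumes `pip install hsclient dataretrieval` has been run.
from datetime import date, timedelta


def date_window(end: date, days: int) -> tuple[str, str]:
    """Return (start, end) ISO date strings spanning `days` days up to `end`."""
    start = end - timedelta(days=days)
    return start.isoformat(), end.isoformat()


def fetch_daily_discharge(site: str, start: str, end: str):
    """Retrieve daily discharge values (USGS parameter code 00060) from NWIS.

    Returns a pandas DataFrame indexed by date.
    """
    # Imported lazily so the date helper above works without the package.
    import dataretrieval.nwis as nwis

    df, _metadata = nwis.get_dv(
        sites=site, parameterCd="00060", start=start, end=end
    )
    return df


def publish_csv(df, csv_path: str, title: str) -> str:
    """Save `df` as CSV, upload it to a new HydroShare resource, return its id."""
    from hsclient import HydroShare

    df.to_csv(csv_path)
    hs = HydroShare()
    hs.sign_in()  # prompts interactively for HydroShare credentials
    resource = hs.create()  # creates a new, empty resource
    resource.metadata.title = title
    resource.save()  # pushes the metadata change to HydroShare
    resource.file_upload(csv_path)
    return resource.resource_id


def run_example() -> None:
    """Hypothetical end-to-end run; requires network access and an account."""
    start, end = date_window(date(2021, 1, 31), days=30)
    flows = fetch_daily_discharge("03339000", start, end)  # illustrative site
    print(flows.describe())
    resource_id = publish_csv(flows, "daily_flow.csv", "Example daily discharge")
    print("Uploaded to HydroShare resource", resource_id)
```

Calling run_example() from a notebook on the CUAHSI JupyterHub server or
a local Python environment would exercise all three steps; nothing runs
on import, so the individual functions can also be reused separately.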