NHGF STAC Catalog - CONUS404 Example using UserCatData

NHGF STAC Catalog - CONUS404 Example using UserCatData#

In this example, we use a SpatioTemporal Asset Catalog (STAC) developed by Amelia Synder [Snyder, 2024] to access a gridded dataset of interest. This catalog provides a structured way to store and access metadata for various geospatial datasets, making it easier to retrieve and process data with gdptools.

The STAC includes multiple JSON files formatted according to the STAC specification, which can be accessed using the Python library PyStac. With this library, users can:

Read the catalog structure
Select different datasets
Locate metadata and endpoints for accessing gridded data

In this example, we aggregate CONUS404 Precipitable Water data for the Delaware River Basin [Rasmussen et al., 2023].

STAC Catalog#

The STAC catalog consists of a root JSON file, which acts as an index listing all available datasets. Each dataset entry points to a specific JSON file that contains metadata, access endpoints, and attributes for the corresponding gridded dataset.

Using PyStac, we can load the root catalog file and explore its dataset listings.

GDPtools#

gdptools is a Python package designed for computing area-weighted statistics for grid-to-polygon intersections. It simplifies the process of extracting and aggregating data from gridded datasets using STAC metadata.

Key features of gdptools include:

Methods for incorporating JSON-based metadata and grid entries for streamlined processing
Support for xarray-compatible datasets, even when JSON metadata is not available (though users will need to manually define some parameters)

This notebook demonstrates how to use gdptools to process CONUS404 data efficiently.

NLDI Integration#

We also leverage the U.S. Geological Survey’s Hydro Network-Linked Data Index (NLDI) [Blodgett, 2022] using pynhd, a Python client from the HyRiver package suite [Chegini et al., 2021]. NLDI provides hydrologic network navigation and linked data services, enabling efficient watershed and stream network analyses.

References#

[Blo22]

David Blodgett. Hydro-Network Linked Data Index. 2022. URL: https://waterdata.usgs.gov/blog/nldi-intro/.

[CLL21]

Taher Chegini, Hong-Yi Li, and L. Ruby Leung. Hyriver: hydroclimate data retriever. Journal of Open Source Software, 6(66):3175, 2021. URL: https://doi.org/10.21105/joss.03175, doi:10.21105/joss.03175.

[RCL+23]

R.M. Rasmussen, F. Chen, C.H. Liu, K. Ikeda, A. Prein, J. Kim, T. Schneider, A. Dai, D. Gochis, A. Dugger, Y. Zhang, A. Jaye, J. Dudhia, and C. He. Conus404: the ncar–usgs 4-km long-term regional hydroclimate reanalysis over the conus. Bulletin of the American Meteorological Society, 104(8):E1382, 2023. URL: https://doi.org/10.5066/P9PHPK4F, doi:doi:10.5066/P9PHPK4F.

[Sny24]

Amelia Snyder. NHGF STAC. 2024. URL: https://code.usgs.gov/wma/nhgf/stac/.

import warnings

import hvplot.pandas
import hvplot.xarray
import numpy as np
import pystac
import xarray as xr
from pynhd import NLDI, WaterData
from pygeohydro import watershed

from gdptools import AggGen, UserCatData, WeightGen

warnings.filterwarnings("ignore")

Use HyRiver pynhd package to find the HUC12 basins comprising the Delaware River upstream of gage # 01482100 at Delaware Memorial Bridge

# Delaware River Basin HUC12s
# --------------------------------
wbd = watershed.WBD("huc4")
delaware_basin = wbd.byids(field="huc4", fids="0204")
huc12_basins = WaterData("wbd12").bygeom(delaware_basin.iloc[0].geometry)
huc12_basins = huc12_basins[huc12_basins["huc12"].str.startswith(("020401", "020402"))]
huc12_basins.head()

from holoviews.element.tiles import EsriTerrain

drb = huc12_basins.hvplot(
    geo=True, coastline='50m', alpha=0.2, c='r', frame_width=300,
    xlabel="longitude", ylabel="latitude", clabel="monthly water evaporation (mm)",
    title="Delaware River HUC12 basins", xlim=(-76.8, -73.8), aspect='equal'
)
EsriTerrain() * drb

Read in the datasets available in STAC#

# Read in the NHGF STAC catalog
cat = pystac.read_file("https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/")

Access the CONUS404 dataset#

# Choose a collection from the catalog
collection = cat.get_child("conus404-daily")

# List the assets
collection.assets

# Read the the zarr collection from the open storage network as a xarray dataset
asset = collection.assets["zarr-s3-osn"]
ds = xr.open_zarr(
    store=asset.href,
    storage_options=asset.extra_fields["xarray:storage_options"]
)

# Look thru the 'Data Variables' to pick which ones to aggregrate to the HUC polygons
ds

Initialize a gpdtools UserCatData class#

To do this, some attribute from the gridded and vector dataset need to be defined.

## CONUS404 crs:
# c404_proj = "+proj=lcc +lat_0=39.1000061035156 +lon_0=-97.9000015258789 +lat_1=30 +lat_2=50 +x_0=0 +y_0=0 +R=6370000 +units=m +no_defs"
c404_proj = ds.crs.attrs['crs_wkt']

data_crs = c404_proj
x_coord = "x"
y_coord = "y"
t_coord = "time"
sdate = "1999-01-01"
edate = "1999-01-07"
var = ["PWAT"]
shp_crs = 4326
shp_poly_idx = "huc12"
wght_gen_crs = 6931

user_data = UserCatData(
    source_ds=ds,
    source_crs=data_crs,
    source_x_coord=x_coord,
    source_y_coord=y_coord,
    source_t_coord=t_coord,
    source_var=var,
    target_gdf=huc12_basins,
    target_crs=shp_crs,
    target_id=shp_poly_idx,
    source_time_period=[sdate, edate]
)

Calculate the weights#

The WeightGen class finds the gridded data cells which intersect/overlay the polygons and determines which portions of every cell belong to every polygon.

wght_gen = WeightGen(
    user_data=user_data,
    method="serial",
    weight_gen_crs=6931,
)

wdf = wght_gen.calculate_weights()

Calculate the weight values#

The AggGen class uses the weights from WeightGen to calculate the values of the gridded data within each polygon geometry.

agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",
    agg_engine="serial",
    agg_writer="none",
    weights=wdf
)
ngdf, ds_out = agg_gen.calculate_agg()

Prep the data for display#

# Get a subset of the gridded data for displaying
ds_subset = user_data.get_source_subset('PWAT')

var = "PWAT"
time_step = "1999-01-07"
tmin = np.nanmin(ds_subset.sel(time=time_step).values)
tmax = np.nanmax(ds_subset.sel(time=time_step).values)
print(tmin,tmax)

# Create and display some data plots

conus_precip_raw = ds_subset.sel(time=time_step).hvplot.quadmesh(
    'lon', 'lat', 'PWAT', cmap='viridis',  alpha=1, grid=True, geo=True, coastline='50m',
    frame_width=300, xlabel="longitude", ylabel="latitude", clabel="Precipitable Water (m)",
    title="CONUS404 Precipitable Water", aspect='equal', clim=(tmin, tmax)
)

ngdf[var] = ds_out[var].sel(time=time_step)
conus_precip_agg = ngdf.to_crs(4326).hvplot(
    c='PWAT', geo=True, coastline='50m', frame_width=300, frame_height=450, alpha=1, cmap='viridis',
    clim=(tmin, tmax), xlabel="longitude", ylabel="latitude", clabel="Precipitable Water (m)",
    title="HUC12 area-weighted average", aspect='equal'
)

(EsriTerrain() * conus_precip_raw )  + (EsriTerrain() * conus_precip_agg)