NHGF STAC Catalog - CONUS404 Aggregation Example#
In this example, a SpatioTemporal Asset Catalog (STAC) developed by Amelia Synder [Snyder, 2024] is used to access a gridded dataset of interest. This STAC catalog is an evolving collection of gridded data sources, along with the necessary metadata for processing with gdptools.
The catalog consists of multiple STAC-formatted JSON files, which can be accessed using the Python library pystac. With this library, users can:
Read the catalog
Select different datasets
Locate metadata and endpoints for accessing gridded data
In this example, we aggregate CONUS404 Precipitable Water data for the Delaware River Basin [Rasmussen et al., 2023].
STAC Catalog#
The STAC catalog is structured around a root JSON file, which serves as an index listing all available datasets. Each dataset has a corresponding JSON file containing its metadata, access endpoint, and attributes.
Using pystac, we can load the root catalog file and explore the dataset listings.
GDPtools#
gdptools is a Python package designed for computing area-weighted statistics for grid-to-polygon intersections. It simplifies the process of extracting and aggregating data from gridded datasets using STAC metadata.
Key Features:#
Built-in methods for incorporating JSON metadata and grid entries for streamlined processing
Support for xarray-compatible datasets, even if JSON metadata is unavailable (users will need to manually define parameters)
This notebook demonstrates how gdptools can be used to process CONUS404 data efficiently.
NLDI Integration#
We also use the U.S. Geological Survey’s Hydro Network-Linked Data Index (NLDI) [Blodgett, 2022], accessed via pynhd, a Python client from the HyRiver package suite [Chegini et al., 2021].
NLDI provides tools for hydrologic network navigation and linked data services, enabling efficient watershed and stream network analyses.
References#
David Blodgett. Hydro-Network Linked Data Index. 2022. URL: https://waterdata.usgs.gov/blog/nldi-intro/.
Taher Chegini, Hong-Yi Li, and L. Ruby Leung. Hyriver: hydroclimate data retriever. Journal of Open Source Software, 6(66):3175, 2021. URL: https://doi.org/10.21105/joss.03175, doi:10.21105/joss.03175.
R.M. Rasmussen, F. Chen, C.H. Liu, K. Ikeda, A. Prein, J. Kim, T. Schneider, A. Dai, D. Gochis, A. Dugger, Y. Zhang, A. Jaye, J. Dudhia, and C. He. Conus404: the ncar–usgs 4-km long-term regional hydroclimate reanalysis over the conus. Bulletin of the American Meteorological Society, 104(8):E1382, 2023. URL: https://doi.org/10.5066/P9PHPK4F, doi:doi:10.5066/P9PHPK4F.
Amelia Snyder. NHGF STAC. 2024. URL: https://code.usgs.gov/wma/nhgf/stac/.
import warnings
import hvplot.pandas
import hvplot.xarray
import numpy as np
import pystac
import xarray as xr
from pygeohydro import watershed
from pynhd import NLDI, WaterData
from gdptools import AggGen, NHGFStacData, WeightGen
from gdptools.helpers import get_stac_collection
warnings.filterwarnings("ignore")
Load the Target Polygons#
Before accessing gridded data, we need polygons to aggregate the data onto. For this example, we use the HUC12 basins within the Delaware River Basin.
We retrieve these polygons using pygeohydro.watershed to query the Watershed Boundary Dataset (WBD) via the Hydro Network Linked Data Index (NLDI). We filter to HUC12 basins whose IDs start with 020401 or 020402 (the Delaware Basin).
# Delaware River Basin HUC12s
# --------------------------------
wbd = watershed.WBD("huc4")
delaware_basin = wbd.byids(field="huc4", fids="0204")
huc12_basins = WaterData("wbd12").bygeom(delaware_basin.iloc[0].geometry)
huc12_basins = huc12_basins[huc12_basins["huc12"].str.startswith(("020401", "020402"))]
huc12_basins.head()
from holoviews.element.tiles import EsriTerrain
drb = huc12_basins.hvplot(
geo=True, coastline='50m', alpha=0.2, c='r', frame_width=300,
xlabel="longitude", ylabel="latitude", clabel="monthly water evaporation (mm)",
title="Delaware River HUC12 basins", xlim=(-76.8, -73.8), aspect='equal'
)
EsriTerrain() * drb
Search the STAC Catalog for a Data Collection#
The STAC catalog is hosted at: https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/. You can view the catalog in a web browser or use the pystac library to access it programmatically.
Important: The catalog is hierarchical—some collections are container collections that only group related datasets (e.g., conus404), while others are data collections that contain actual data assets (e.g., conus404_daily, conus404_hourly). When using gdptools, you must select a data collection, not a container collection.
The catalog contains two types of data collections:
Zarr collections (e.g.,
conus404_daily) – identified by azarr-s3-*asset, containing gridded time-series data dimensioned as(time, y, x).GeoTIFF collections (e.g.,
nlcd-LndCov) – identified bytif-s3-*assets at the item level, with one item per time step (e.g., one TIFF per year for NLCD land cover).
NHGFStacData auto-detects the collection format and returns the appropriate handler. See the NHGFStacData_NLCD example for a GeoTIFF workflow.
We provide a helper function gdptools.helpers.get_stac_collection() that searches the catalog recursively, so you can request nested collections directly by their ID (e.g., "conus404_daily") without manually traversing the hierarchy.
Note: For Zarr collections, variables must be dimensioned as (time, y, x). GeoTIFF collections support single-item (e.g., single-year) access for zonal statistics.
collection_id = "conus404_daily"
collection = get_stac_collection(collection_id)
print(collection.assets)
Open the Dataset as an xarray Dataset#
Once we have the STAC collection, we can open the zarr asset directly with xarray.open_zarr(). The collection’s assets dictionary contains the zarr endpoint and any required storage options.
Browse the Data Variables in the returned Dataset to choose which variables to aggregate onto the HUC polygons.
# Read the the zarr collection from the open storage network as a xarray dataset
asset = collection.assets["zarr-s3-osn"]
ds = xr.open_zarr(
asset.href,
storage_options=asset.extra_fields["xarray:storage_options"],
)
# Look thru the 'Data Variables' to pick which ones to aggregrate to the HUC polygons
ds
Initialize an NHGFStacData Object#
The NHGFStacData class wraps the STAC collection and target polygons into a single object that gdptools uses for weight generation and aggregation.
Key parameters:
source_collection– the pystac Collection object retrieved abovesource_var– list of variable names to aggregate (must be dimensioned as(time, y, x))target_gdf– GeoDataFrame of polygons to aggregate ontotarget_id– column intarget_gdfthat uniquely identifies each polygonsource_time_period–[start_date, end_date]defining the temporal subset
sdate = "1999-01-01" # Start date of the time dimension of the gridded data
edate = "1999-01-07" # End data
var = ["PWAT"] # Gridded data to query
shp_poly_idx = "huc12" # Column from the HUC geodataframe used to distingiush between polygons
user_data = NHGFStacData(
source_collection=collection,
source_var=var,
target_gdf=huc12_basins,
target_id=shp_poly_idx,
source_time_period=[sdate, edate]
)
Calculate Intersection Weights#
The WeightGen class computes how each grid cell intersects the target polygons and assigns area-based weights to each cell–polygon pair.
Key parameters:
user_data– theNHGFStacDataobjectmethod–"serial","parallel", or"dask"for different execution backendsweight_gen_crs– an equal-area CRS (e.g., EPSG:6931) for accurate area calculations
The returned DataFrame contains columns for polygon ID, grid cell indices, and the fractional weight of each cell within each polygon.
wght_gen = WeightGen(
user_data=user_data,
method="serial",
weight_gen_crs=6931,
)
wdf = wght_gen.calculate_weights()
Aggregate Gridded Data to Polygons#
The AggGen class applies the intersection weights to compute area-weighted statistics for each polygon and time step.
Key parameters:
stat_method– aggregation statistic (e.g.,"masked_mean","median","sum")agg_engine–"serial","parallel", or"dask"agg_writer– output format ("none","csv","netcdf","zarr")weights– the DataFrame returned byWeightGen.calculate_weights()
Returns a tuple of (GeoDataFrame, xarray.Dataset) containing the aggregated values.
agg_gen = AggGen(
user_data=user_data,
stat_method="masked_mean",
agg_engine="serial",
agg_writer="none",
weights=wdf
)
ngdf, ds_out = agg_gen.calculate_agg()
Visualize the Results#
Finally, we compare the original gridded data with the polygon-aggregated values side by side.
We extract a spatial subset of the raw CONUS404 data for display, then plot both the raw grid and the HUC12 area-weighted averages using hvplot.
# Get a subset of the gridded data for displaying
ds_subset = user_data.get_source_subset('PWAT')
var = "PWAT"
time_step = "1999-01-07"
tmin = np.nanmin(ds_subset.sel(time=time_step).values)
tmax = np.nanmax(ds_subset.sel(time=time_step).values)
print(tmin,tmax)
from holoviews.element.tiles import EsriTerrain
# Create and display some data plots
conus_precip_raw = ds_subset.sel(time=time_step).hvplot.quadmesh(
'lon', 'lat', 'PWAT', cmap='viridis', alpha=1, grid=True, geo=True, coastline='50m',
frame_width=300, xlabel="longitude", ylabel="latitude", clabel="Precipitable Water (m)",
title="CONUS404 Precipitable Water", aspect='equal', clim=(tmin, tmax)
)
ngdf[var] = ds_out[var].sel(time=time_step)
conus_precip_agg = ngdf.to_crs(4326).hvplot(
c='PWAT', geo=True, coastline='50m', frame_width=300, frame_height=450, alpha=1, cmap='viridis',
clim=(tmin, tmax), xlabel="longitude", ylabel="latitude", clabel="Precipitable Water (m)",
title="HUC12 area-weighted average", aspect='equal'
)
(EsriTerrain() * conus_precip_raw ) + (EsriTerrain() * conus_precip_agg)