NHGF STAC Catalog - CONUS404 Grid-to-Line Example#
In this example, a SpatioTemporal Asset Catalog (STAC) developed by Amelia Synder [Snyder, 2024] is used to access a gridded dataset of interest. This STAC catalog is an evolving collection of gridded data sources, along with the necessary metadata for processing with gdptools.
The catalog consists of multiple STAC-formatted JSON files, which can be accessed using the Python library pystac. With this library, users can:
Read the catalog
Select different datasets
Locate metadata and endpoints for accessing gridded data
In this example, we aggregate CONUS404 Precipitable Water data for the Delaware River Basin [Rasmussen et al., 2023].
STAC Catalog#
The STAC catalog is structured around a root JSON file, which serves as an index listing all available datasets. Each dataset has a corresponding JSON file containing its metadata, access endpoint, and attributes.
Using pystac, we can load the root catalog file and explore the dataset listings.
GDPtools#
gdptools is a Python package designed for computing area-weighted statistics for grid-to-polygon intersections. It simplifies the process of extracting and aggregating data from gridded datasets using STAC metadata.
Key Features:#
Built-in methods for incorporating JSON metadata and grid entries for streamlined processing
Support for xarray-compatible datasets, even if JSON metadata is unavailable (users will need to manually define parameters)
This notebook demonstrates how gdptools can be used to process CONUS404 data efficiently.
NLDI Integration#
We also use the U.S. Geological Survey’s Hydro Network-Linked Data Index (NLDI) [Blodgett, 2022], accessed via pynhd, a Python client from the HyRiver package suite [Chegini et al., 2021].
NLDI provides tools for hydrologic network navigation and linked data services, enabling efficient watershed and stream network analyses.
References#
David Blodgett. Hydro-Network Linked Data Index. 2022. URL: https://waterdata.usgs.gov/blog/nldi-intro/.
Taher Chegini, Hong-Yi Li, and L. Ruby Leung. Hyriver: hydroclimate data retriever. Journal of Open Source Software, 6(66):3175, 2021. URL: https://doi.org/10.21105/joss.03175, doi:10.21105/joss.03175.
R.M. Rasmussen, F. Chen, C.H. Liu, K. Ikeda, A. Prein, J. Kim, T. Schneider, A. Dai, D. Gochis, A. Dugger, Y. Zhang, A. Jaye, J. Dudhia, and C. He. Conus404: the ncar–usgs 4-km long-term regional hydroclimate reanalysis over the conus. Bulletin of the American Meteorological Society, 104(8):E1382, 2023. URL: https://doi.org/10.5066/P9PHPK4F, doi:doi:10.5066/P9PHPK4F.
Amelia Snyder. NHGF STAC. 2024. URL: https://code.usgs.gov/wma/nhgf/stac/.
import warnings
import geopandas as gpd
import hvplot.pandas
import hvplot.xarray
import pystac
import xarray as xr
from gdptools import InterpGen, NHGFStacData
from gdptools.helpers import get_stac_collection
warnings.filterwarnings("ignore")
Read in a shapefile of lines to interpolate grid data#
gdf = gpd.read_file('../ClimateR-Catalog/test_lines/test_lines_4326.shp')
Search the STAC Catalog for a Data Collection#
The STAC catalog is hosted at: https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/. You can view the catalog in a web browser or use the pystac library to access it programmatically.
Important: The catalog is hierarchical—some collections are container collections that only group related datasets (e.g., conus404), while others are data collections that contain actual data assets (e.g., conus404_daily, conus404_hourly). When using gdptools, you must select a data collection, not a container collection.
The catalog contains two types of data collections:
Zarr collections (e.g.,
conus404_daily) – identified by azarr-s3-*asset, containing gridded time-series data dimensioned as(time, y, x).GeoTIFF collections (e.g.,
nlcd-LndCov) – identified bytif-s3-*assets at the item level, with one item per time step (e.g., one TIFF per year for NLCD land cover).
NHGFStacData auto-detects the collection format and returns the appropriate handler. See the NHGFStacData_NLCD example for a GeoTIFF workflow.
We provide a helper function gdptools.helpers.get_stac_collection() that searches the catalog recursively, so you can request nested collections directly by their ID (e.g., "conus404_daily") without manually traversing the hierarchy.
Note: For Zarr collections, variables must be dimensioned as (time, y, x). GeoTIFF collections support single-item (e.g., single-year) access for zonal statistics. Grid-to-line interpolation (this notebook) currently applies to Zarr collections only.
collection_id = "conus404_daily"
collection = get_stac_collection(collection_id)
print(collection.assets)
Open the Dataset as an xarray Dataset#
Once we have the STAC collection, we can open the zarr asset directly with xarray.open_zarr(). The collection’s assets dictionary contains the zarr endpoint and any required storage options.
Browse the Data Variables in the returned Dataset to choose which variables to aggregate onto the HUC polygons.
Read the dataset in as as a Xarray Dateset#
# Read the the zarr collection from the open storage network as a xarray dataset
asset = collection.assets["zarr-s3-osn"]
ds = xr.open_zarr(
store=asset.href,
storage_options=asset.extra_fields["xarray:storage_options"]
)
# Look thru the 'Data Variables' to pick which ones to aggregrate to the HUC polygons
ds
Initialize an NHGFStacData Object#
The NHGFStacData class wraps the STAC collection and target polygons into a single object that gdptools uses for weight generation and aggregation.
Key parameters:
source_collection– the pystac Collection object retrieved abovesource_var– list of variable names to aggregate (must be dimensioned as(time, y, x))target_gdf– GeoDataFrame of polygons to aggregate ontotarget_id– column intarget_gdfthat uniquely identifies each polygonsource_time_period–[start_date, end_date]defining the temporal subset
sdate = "1999-01-01" # Start date of the time dimension of the gridded data
edate = "1999-01-07" # End data
var = ["PWAT"] # Gridded data to query
shp_line_idx = "permanent_" # Column from the geodataframe used to distingiush between lines
user_data = NHGFStacData(
source_collection=collection,
source_var=var,
target_gdf=gdf,
target_id=shp_line_idx,
source_time_period=[sdate, edate]
)
Extract Grid Values Along Line Geometries#
The InterpGen class interpolates gridded data values at evenly spaced points along line geometries and computes summary statistics for each line.
Key parameters:
user_data– theNHGFStacData(or otherUserData) object containing the source grid and target linespt_spacing– distance (in CRS units) between interpolated points along each linestat– statistic(s) to compute per line ("mean","min","max","std", or"all")
Returns a tuple of two DataFrames:
Line statistics – summary statistics (mean, min, max, etc.) for each line geometry and time step
Point values – the interpolated grid values at each sample point along the lines
# InterpGen object
interp_object = InterpGen(
user_data,
pt_spacing=100,
stat='all'
)
# calculate the stats and interpolated points
stats, pts = interp_object.calc_interp()
stats
stats[ (stats["date"] == "1999-01-02") & (stats["permanent_"] == "155842105") & (stats["varname"] == "PWAT")]
pts
pts[(pts["permanent_"] == "155842105") & (pts["varname"] == var[0])]