---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.18.1
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---

# NHGF STAC Catalog - CONUS404 Aggregation Example

In this example, a **SpatioTemporal Asset Catalog (STAC)** developed by Amelia Synder {cite:p}`stac_catalog` is used to access a gridded dataset of interest. This STAC catalog is an evolving collection of gridded data sources, along with the necessary metadata for processing with `gdptools`.

The catalog consists of multiple **STAC-formatted JSON files**, which can be accessed using the Python library **pystac**. With this library, users can:

- Read the catalog
- Select different datasets
- Locate metadata and endpoints for accessing gridded data

In this example, we aggregate **CONUS404 Precipitable Water** data for the **Delaware River Basin** {cite:p}`conus404_2023`.

## STAC Catalog

The STAC catalog is structured around a **root JSON file**, which serves as an index listing all available datasets. Each dataset has a corresponding JSON file containing its metadata, access endpoint, and attributes.

Using **pystac**, we can load the root catalog file and explore the dataset listings.

## GDPtools

`gdptools` is a Python package designed for computing **area-weighted statistics** for grid-to-polygon intersections. It simplifies the process of extracting and aggregating data from gridded datasets using STAC metadata.

### Key Features:

- Built-in methods for incorporating JSON metadata and grid entries for streamlined processing
- Support for **xarray-compatible** datasets, even if JSON metadata is unavailable (users will need to manually define parameters)

This notebook demonstrates how `gdptools` can be used to process CONUS404 data efficiently.

## NLDI Integration

We also use the U.S. Geological Survey’s **Hydro Network-Linked Data Index (NLDI)** {cite:p}`nldi`, accessed via `pynhd`, a Python client from the **HyRiver** package suite {cite:p}`Chegini2021`.

NLDI provides tools for hydrologic network navigation and linked data services, enabling efficient watershed and stream network analyses.

## References

```{bibliography}
:filter: docname in docnames
```

```python
import warnings

import hvplot.pandas
import hvplot.xarray
import numpy as np
import pystac
import xarray as xr
from pygeohydro import watershed
from pynhd import NLDI, WaterData

from gdptools import AggGen, NHGFStacData, WeightGen
from gdptools.helpers import get_stac_collection

warnings.filterwarnings("ignore")

```


## Load the Target Polygons

Before accessing gridded data, we need polygons to aggregate the data onto. For this example, we use the **HUC12 basins** within the Delaware River Basin.

We retrieve these polygons using `pygeohydro.watershed` to query the **Watershed Boundary Dataset (WBD)** via the Hydro Network Linked Data Index (NLDI). We filter to HUC12 basins whose IDs start with `020401` or `020402` (the Delaware Basin).

```python
# Delaware River Basin HUC12s
# --------------------------------
wbd = watershed.WBD("huc4")
delaware_basin = wbd.byids(field="huc4", fids="0204")
huc12_basins = WaterData("wbd12").bygeom(delaware_basin.iloc[0].geometry)
huc12_basins = huc12_basins[huc12_basins["huc12"].str.startswith(("020401", "020402"))]
huc12_basins.head()

```

```python
from holoviews.element.tiles import EsriTerrain

drb = huc12_basins.hvplot(
    geo=True, coastline='50m', alpha=0.2, c='r', frame_width=300,
    xlabel="longitude", ylabel="latitude", clabel="monthly water evaporation (mm)",
    title="Delaware River HUC12 basins", xlim=(-76.8, -73.8), aspect='equal'
)
EsriTerrain() * drb

```

### Search the STAC Catalog for a Data Collection

The STAC catalog is hosted at: https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/. You can view the catalog in a web browser or use the `pystac` library to access it programmatically.

**Important:** The catalog is hierarchical—some collections are *container collections* that only group related datasets (e.g., `conus404`), while others are *data collections* that contain actual data assets (e.g., `conus404_daily`, `conus404_hourly`). When using `gdptools`, you must select a **data collection**, not a container collection.

The catalog contains two types of data collections:
- **Zarr collections** (e.g., `conus404_daily`) – identified by a `zarr-s3-*` asset, containing gridded time-series data dimensioned as `(time, y, x)`.
- **GeoTIFF collections** (e.g., `nlcd-LndCov`) – identified by `tif-s3-*` assets at the item level, with one item per time step (e.g., one TIFF per year for NLCD land cover).

`NHGFStacData` auto-detects the collection format and returns the appropriate handler. See the **NHGFStacData_NLCD** example for a GeoTIFF workflow.

We provide a helper function `gdptools.helpers.get_stac_collection()` that searches the catalog recursively, so you can request nested collections directly by their ID (e.g., `"conus404_daily"`) without manually traversing the hierarchy.

> **Note:** For Zarr collections, variables must be dimensioned as **(time, y, x)**. GeoTIFF collections support single-item (e.g., single-year) access for zonal statistics.

```python
collection_id = "conus404_daily"
collection = get_stac_collection(collection_id)
print(collection.assets)

```

### Open the Dataset as an xarray Dataset

Once we have the STAC collection, we can open the zarr asset directly with `xarray.open_zarr()`. The collection's `assets` dictionary contains the zarr endpoint and any required storage options.

Browse the **Data Variables** in the returned Dataset to choose which variables to aggregate onto the HUC polygons.

```python
# Read the the zarr collection from the open storage network as a xarray dataset
asset = collection.assets["zarr-s3-osn"]
ds = xr.open_zarr(
    asset.href,
    storage_options=asset.extra_fields["xarray:storage_options"],
)

# Look thru the 'Data Variables' to pick which ones to aggregrate to the HUC polygons
ds

```

### Initialize an NHGFStacData Object

The `NHGFStacData` class wraps the STAC collection and target polygons into a single object that `gdptools` uses for weight generation and aggregation.

**Key parameters:**
- `source_collection` – the pystac Collection object retrieved above
- `source_var` – list of variable names to aggregate (must be dimensioned as `(time, y, x)`)
- `target_gdf` – GeoDataFrame of polygons to aggregate onto
- `target_id` – column in `target_gdf` that uniquely identifies each polygon
- `source_time_period` – `[start_date, end_date]` defining the temporal subset

```python
sdate = "1999-01-01"      # Start date of the time dimension of the gridded data
edate = "1999-01-07"      # End data
var = ["PWAT"]            # Gridded data to query
shp_poly_idx = "huc12"    # Column from the HUC geodataframe used to distingiush between polygons

user_data = NHGFStacData(
    source_collection=collection,
    source_var=var,
    target_gdf=huc12_basins,
    target_id=shp_poly_idx,
    source_time_period=[sdate, edate]
)

```

### Calculate Intersection Weights

The `WeightGen` class computes how each grid cell intersects the target polygons and assigns **area-based weights** to each cell–polygon pair.

**Key parameters:**
- `user_data` – the `NHGFStacData` object
- `method` – `"serial"`, `"parallel"`, or `"dask"` for different execution backends
- `weight_gen_crs` – an equal-area CRS (e.g., EPSG:6931) for accurate area calculations

The returned DataFrame contains columns for polygon ID, grid cell indices, and the fractional weight of each cell within each polygon.

```python
wght_gen = WeightGen(
    user_data=user_data,
    method="serial",
    weight_gen_crs=6931,
)

wdf = wght_gen.calculate_weights()

```

### Aggregate Gridded Data to Polygons

The `AggGen` class applies the intersection weights to compute area-weighted statistics for each polygon and time step.

**Key parameters:**
- `stat_method` – aggregation statistic (e.g., `"masked_mean"`, `"median"`, `"sum"`)
- `agg_engine` – `"serial"`, `"parallel"`, or `"dask"`
- `agg_writer` – output format (`"none"`, `"csv"`, `"netcdf"`, `"zarr"`)
- `weights` – the DataFrame returned by `WeightGen.calculate_weights()`

Returns a tuple of `(GeoDataFrame, xarray.Dataset)` containing the aggregated values.

```python
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",
    agg_engine="serial",
    agg_writer="none",
    weights=wdf
)
ngdf, ds_out = agg_gen.calculate_agg()

```

### Visualize the Results

Finally, we compare the **original gridded data** with the **polygon-aggregated values** side by side.

We extract a spatial subset of the raw CONUS404 data for display, then plot both the raw grid and the HUC12 area-weighted averages using `hvplot`.

```python
# Get a subset of the gridded data for displaying
ds_subset = user_data.get_source_subset('PWAT')

var = "PWAT"
time_step = "1999-01-07"
tmin = np.nanmin(ds_subset.sel(time=time_step).values)
tmax = np.nanmax(ds_subset.sel(time=time_step).values)
print(tmin,tmax)

```

```python
from holoviews.element.tiles import EsriTerrain

# Create and display some data plots

conus_precip_raw = ds_subset.sel(time=time_step).hvplot.quadmesh(
    'lon', 'lat', 'PWAT', cmap='viridis',  alpha=1, grid=True, geo=True, coastline='50m',
    frame_width=300, xlabel="longitude", ylabel="latitude", clabel="Precipitable Water (m)",
    title="CONUS404 Precipitable Water", aspect='equal', clim=(tmin, tmax)
)

ngdf[var] = ds_out[var].sel(time=time_step)
conus_precip_agg = ngdf.to_crs(4326).hvplot(
    c='PWAT', geo=True, coastline='50m', frame_width=300, frame_height=450, alpha=1, cmap='viridis',
    clim=(tmin, tmax), xlabel="longitude", ylabel="latitude", clabel="Precipitable Water (m)",
    title="HUC12 area-weighted average", aspect='equal'
)

(EsriTerrain() * conus_precip_raw )  + (EsriTerrain() * conus_precip_agg)

```