---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.18.1
  kernelspec:
    display_name: readthedocs
    language: python
    name: python3
---

# NHGF STAC Catalog - CONUS404 Example using UserCatData

In this example, we use a **SpatioTemporal Asset Catalog (STAC)** developed by Amelia Synder {cite:p}`stac_catalog` to access a gridded dataset of interest. This catalog provides a structured way to store and access metadata for various geospatial datasets, making it easier to retrieve and process data with `gdptools`.

The STAC includes multiple JSON files formatted according to the STAC specification, which can be accessed using the Python library **PyStac**. With this library, users can:

- Read the catalog structure
- Select different datasets
- Locate metadata and endpoints for accessing gridded data

In this example, we aggregate **CONUS404 Precipitable Water** data for the **Delaware River Basin** {cite:p}`conus404_2023`.

## STAC Catalog

The STAC catalog consists of a **root JSON file**, which acts as an index listing all available datasets. Each dataset entry points to a specific JSON file that contains metadata, access endpoints, and attributes for the corresponding gridded dataset.

Using **PyStac**, we can load the root catalog file and explore its dataset listings.

## GDPtools

`gdptools` is a Python package designed for computing **area-weighted statistics** for grid-to-polygon intersections. It simplifies the process of extracting and aggregating data from gridded datasets using STAC metadata.

Key features of `gdptools` include:

- Methods for incorporating JSON-based metadata and grid entries for streamlined processing
- Support for **xarray-compatible** datasets, even when JSON metadata is not available (though users will need to manually define some parameters)

This notebook demonstrates how to use `gdptools` to process CONUS404 data efficiently.

## NLDI Integration

We also leverage the U.S. Geological Survey’s **Hydro Network-Linked Data Index (NLDI)** {cite:p}`nldi` using `pynhd`, a Python client from the **HyRiver** package suite {cite:p}`Chegini2021`. NLDI provides hydrologic network navigation and linked data services, enabling efficient watershed and stream network analyses.

## References

```{bibliography}
:filter: docname in docnames

```

```python
import warnings

import hvplot.pandas
import hvplot.xarray
import numpy as np
import pystac
import xarray as xr
from pynhd import NLDI, WaterData
from pygeohydro import watershed

from gdptools import AggGen, UserCatData, WeightGen

warnings.filterwarnings("ignore")
```

Use HyRiver pynhd package to find the HUC12 basins comprising the Delaware River upstream of gage # 01482100 at Delaware Memorial Bridge

```python
# Delaware River Basin HUC12s
# --------------------------------
wbd = watershed.WBD("huc4")
delaware_basin = wbd.byids(field="huc4", fids="0204")
huc12_basins = WaterData("wbd12").bygeom(delaware_basin.iloc[0].geometry)
huc12_basins = huc12_basins[huc12_basins["huc12"].str.startswith(("020401", "020402"))]
huc12_basins.head()
```

```python
from holoviews.element.tiles import EsriTerrain

drb = huc12_basins.hvplot(
    geo=True, coastline='50m', alpha=0.2, c='r', frame_width=300,
    xlabel="longitude", ylabel="latitude", clabel="monthly water evaporation (mm)",
    title="Delaware River HUC12 basins", xlim=(-76.8, -73.8), aspect='equal'
)
EsriTerrain() * drb
```

### Read in the datasets available in STAC

```python
# Read in the NHGF STAC catalog
cat = pystac.read_file("https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/")
```

### Access the CONUS404 dataset

```python
# Choose a collection from the catalog
collection = cat.get_child("conus404-daily")

# List the assets
collection.assets
```

```python
# Read the the zarr collection from the open storage network as a xarray dataset
asset = collection.assets["zarr-s3-osn"]
ds = xr.open_zarr(
    store=asset.href,
    storage_options=asset.extra_fields["xarray:storage_options"]
)

# Look thru the 'Data Variables' to pick which ones to aggregrate to the HUC polygons
ds
```

### Initialize a gpdtools UserCatData class

To do this, some attribute from the gridded and vector dataset need to be defined.

```python
## CONUS404 crs:
# c404_proj = "+proj=lcc +lat_0=39.1000061035156 +lon_0=-97.9000015258789 +lat_1=30 +lat_2=50 +x_0=0 +y_0=0 +R=6370000 +units=m +no_defs"
c404_proj = ds.crs.attrs['crs_wkt']

data_crs = c404_proj
x_coord = "x"
y_coord = "y"
t_coord = "time"
sdate = "1999-01-01"
edate = "1999-01-07"
var = ["PWAT"]
shp_crs = 4326
shp_poly_idx = "huc12"
wght_gen_crs = 6931

user_data = UserCatData(
    source_ds=ds,
    source_crs=data_crs,
    source_x_coord=x_coord,
    source_y_coord=y_coord,
    source_t_coord=t_coord,
    source_var=var,
    target_gdf=huc12_basins,
    target_crs=shp_crs,
    target_id=shp_poly_idx,
    source_time_period=[sdate, edate]
)
```

### Calculate the weights

The WeightGen class finds the gridded data cells which intersect/overlay the polygons and determines which portions of every cell belong to every polygon.

```python
wght_gen = WeightGen(
    user_data=user_data,
    method="serial",
    weight_gen_crs=6931,
)

wdf = wght_gen.calculate_weights()
```

### Calculate the weight values

The AggGen class uses the weights from WeightGen to calculate the values of the gridded data within each polygon geometry.

```python
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",
    agg_engine="serial",
    agg_writer="none",
    weights=wdf
)
ngdf, ds_out = agg_gen.calculate_agg()
```

### Prep the data for display

```python
# Get a subset of the gridded data for displaying
ds_subset = user_data.get_source_subset('PWAT')

var = "PWAT"
time_step = "1999-01-07"
tmin = np.nanmin(ds_subset.sel(time=time_step).values)
tmax = np.nanmax(ds_subset.sel(time=time_step).values)
print(tmin,tmax)
```

```python
# Create and display some data plots

conus_precip_raw = ds_subset.sel(time=time_step).hvplot.quadmesh(
    'lon', 'lat', 'PWAT', cmap='viridis',  alpha=1, grid=True, geo=True, coastline='50m',
    frame_width=300, xlabel="longitude", ylabel="latitude", clabel="Precipitable Water (m)",
    title="CONUS404 Precipitable Water", aspect='equal', clim=(tmin, tmax)
)

ngdf[var] = ds_out[var].sel(time=time_step)
conus_precip_agg = ngdf.to_crs(4326).hvplot(
    c='PWAT', geo=True, coastline='50m', frame_width=300, frame_height=450, alpha=1, cmap='viridis',
    clim=(tmin, tmax), xlabel="longitude", ylabel="latitude", clabel="Precipitable Water (m)",
    title="HUC12 area-weighted average", aspect='equal'
)

(EsriTerrain() * conus_precip_raw )  + (EsriTerrain() * conus_precip_agg)
```