---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.18.1
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---

# NHGF STAC Catalog - CONUS404 Grid-to-Line Example

In this example, a **SpatioTemporal Asset Catalog (STAC)** developed by Amelia Synder {cite:p}`stac_catalog` is used to access a gridded dataset of interest. This STAC catalog is an evolving collection of gridded data sources, along with the necessary metadata for processing with `gdptools`.

The catalog consists of multiple **STAC-formatted JSON files**, which can be accessed using the Python library **pystac**. With this library, users can:

- Read the catalog
- Select different datasets
- Locate metadata and endpoints for accessing gridded data

In this example, we aggregate **CONUS404 Precipitable Water** data for the **Delaware River Basin** {cite:p}`conus404_2023`.

## STAC Catalog

The STAC catalog is structured around a **root JSON file**, which serves as an index listing all available datasets. Each dataset has a corresponding JSON file containing its metadata, access endpoint, and attributes.

Using **pystac**, we can load the root catalog file and explore the dataset listings.

## GDPtools

`gdptools` is a Python package designed for computing **area-weighted statistics** for grid-to-polygon intersections. It simplifies the process of extracting and aggregating data from gridded datasets using STAC metadata.

### Key Features:

- Built-in methods for incorporating JSON metadata and grid entries for streamlined processing
- Support for **xarray-compatible** datasets, even if JSON metadata is unavailable (users will need to manually define parameters)

This notebook demonstrates how `gdptools` can be used to process CONUS404 data efficiently.

## NLDI Integration

We also use the U.S. Geological Survey’s **Hydro Network-Linked Data Index (NLDI)** {cite:p}`nldi`, accessed via `pynhd`, a Python client from the **HyRiver** package suite {cite:p}`Chegini2021`.

NLDI provides tools for hydrologic network navigation and linked data services, enabling efficient watershed and stream network analyses.

## References

```{bibliography}
:filter: docname in docnames
```

```python
import warnings

import geopandas as gpd
import hvplot.pandas
import hvplot.xarray
import pystac
import xarray as xr

from gdptools import InterpGen, NHGFStacData
from gdptools.helpers import get_stac_collection

warnings.filterwarnings("ignore")

```

### Read in a shapefile of lines to interpolate grid data

```python
gdf = gpd.read_file('../ClimateR-Catalog/test_lines/test_lines_4326.shp')

```

### Search the STAC Catalog for a Data Collection

The STAC catalog is hosted at: https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/. You can view the catalog in a web browser or use the `pystac` library to access it programmatically.

**Important:** The catalog is hierarchical—some collections are *container collections* that only group related datasets (e.g., `conus404`), while others are *data collections* that contain actual data assets (e.g., `conus404_daily`, `conus404_hourly`). When using `gdptools`, you must select a **data collection**, not a container collection.

The catalog contains two types of data collections:
- **Zarr collections** (e.g., `conus404_daily`) – identified by a `zarr-s3-*` asset, containing gridded time-series data dimensioned as `(time, y, x)`.
- **GeoTIFF collections** (e.g., `nlcd-LndCov`) – identified by `tif-s3-*` assets at the item level, with one item per time step (e.g., one TIFF per year for NLCD land cover).

`NHGFStacData` auto-detects the collection format and returns the appropriate handler. See the **NHGFStacData_NLCD** example for a GeoTIFF workflow.

We provide a helper function `gdptools.helpers.get_stac_collection()` that searches the catalog recursively, so you can request nested collections directly by their ID (e.g., `"conus404_daily"`) without manually traversing the hierarchy.

> **Note:** For Zarr collections, variables must be dimensioned as **(time, y, x)**. GeoTIFF collections support single-item (e.g., single-year) access for zonal statistics. Grid-to-line interpolation (this notebook) currently applies to Zarr collections only.

```python
collection_id = "conus404_daily"
collection = get_stac_collection(collection_id)
print(collection.assets)

```

### Open the Dataset as an xarray Dataset

Once we have the STAC collection, we can open the zarr asset directly with `xarray.open_zarr()`. The collection's `assets` dictionary contains the zarr endpoint and any required storage options.

Browse the **Data Variables** in the returned Dataset to choose which variables to aggregate onto the HUC polygons.


### Read the dataset in as as a Xarray Dateset

```python
# Read the the zarr collection from the open storage network as a xarray dataset
asset = collection.assets["zarr-s3-osn"]
ds = xr.open_zarr(
    store=asset.href,
    storage_options=asset.extra_fields["xarray:storage_options"]
)

# Look thru the 'Data Variables' to pick which ones to aggregrate to the HUC polygons
ds

```

### Initialize an NHGFStacData Object

The `NHGFStacData` class wraps the STAC collection and target polygons into a single object that `gdptools` uses for weight generation and aggregation.

**Key parameters:**
- `source_collection` – the pystac Collection object retrieved above
- `source_var` – list of variable names to aggregate (must be dimensioned as `(time, y, x)`)
- `target_gdf` – GeoDataFrame of polygons to aggregate onto
- `target_id` – column in `target_gdf` that uniquely identifies each polygon
- `source_time_period` – `[start_date, end_date]` defining the temporal subset

```python
sdate = "1999-01-01"      # Start date of the time dimension of the gridded data
edate = "1999-01-07"      # End data
var = ["PWAT"]            # Gridded data to query
shp_line_idx = "permanent_"    # Column from the geodataframe used to distingiush between lines

user_data = NHGFStacData(
    source_collection=collection,
    source_var=var,
    target_gdf=gdf,
    target_id=shp_line_idx,
    source_time_period=[sdate, edate]
)

```

### Extract Grid Values Along Line Geometries

The `InterpGen` class interpolates gridded data values at evenly spaced points along line geometries and computes summary statistics for each line.

**Key parameters:**
- `user_data` – the `NHGFStacData` (or other `UserData`) object containing the source grid and target lines
- `pt_spacing` – distance (in CRS units) between interpolated points along each line
- `stat` – statistic(s) to compute per line (`"mean"`, `"min"`, `"max"`, `"std"`, or `"all"`)

**Returns** a tuple of two DataFrames:
1. **Line statistics** – summary statistics (mean, min, max, etc.) for each line geometry and time step
2. **Point values** – the interpolated grid values at each sample point along the lines

```python
# InterpGen object
interp_object = InterpGen(
    user_data,
    pt_spacing=100,
    stat='all'
)

# calculate the stats and interpolated points
stats, pts = interp_object.calc_interp()

```

```python
stats

```

```python
stats[ (stats["date"] == "1999-01-02") & (stats["permanent_"] == "155842105") & (stats["varname"] == "PWAT")]

```

```python
pts

```

```python
pts[(pts["permanent_"] == "155842105") & (pts["varname"] == var[0])]

```