---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.18.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# NHGF STAC Catalog - NLCD Land Cover Zonal Statistics Example

This example demonstrates how to use **GeoTIFF-backed** STAC collections with
`gdptools`. While the [CONUS404 example](NHGFStacData_CONUS404.ipynb) works
with Zarr time-series data, the NHGF STAC catalog also contains collections
stored as individual GeoTIFF files — one per time step.

Here we compute **categorical zonal statistics** of
[NLCD](https://www.mrlc.gov/) (National Land Cover Database) land cover classes
for HUC12 basins in the Delaware River Basin.

## Key Differences from Zarr Workflow

| | Zarr (e.g., CONUS404) | GeoTIFF (e.g., NLCD) |
|---|---|---|
| Asset type | Single `zarr-s3-osn` asset on the collection | One `tif-s3-osn` asset per item (per year) |
| Time dimension | `(time, y, x)` — multiple time steps in one file | No time dimension — one raster per year |
| `source_time_period` | Selects a time range from the Zarr store | Selects which STAC item (year) to use |
| Typical analysis | Area-weighted temporal aggregation | Zonal statistics (categorical or continuous) |

`NHGFStacData` auto-detects the collection format and returns the appropriate
handler (`NHGFStacZarrData` or `NHGFStacTiffData`) transparently.

```{code-cell} ipython3
import warnings

import geopandas as gpd
import hvplot.pandas
from pygeohydro import watershed
from pynhd import WaterData

from gdptools import NHGFStacData, ZonalGen
from gdptools.helpers import get_stac_collection

warnings.filterwarnings("ignore")
```

## Load the Target Polygons

We use the same Delaware River Basin HUC12 polygons as the CONUS404 example.
These are retrieved from the Watershed Boundary Dataset (WBD) via `pygeohydro`.

```{code-cell} ipython3
wbd = watershed.WBD("huc4")
delaware_basin = wbd.byids(field="huc4", fids="0204")
huc12_basins = WaterData("wbd12").bygeom(delaware_basin.iloc[0].geometry)
huc12_basins = huc12_basins[
    huc12_basins["huc12"].str.startswith(("020401", "020402"))
]
huc12_basins.head()
```

## Fetch the NLCD Land Cover Collection

The NLCD data is organized under the `nlcd` parent collection in the STAC
catalog, with sub-collections for each product (e.g., `nlcd-LndCov` for land
cover). The helper `get_stac_collection()` searches recursively, so we can
request `"nlcd-LndCov"` directly.

Unlike Zarr collections, GeoTIFF collections have **no collection-level data
assets**. Instead, each STAC *item* represents a single year and contains a
`tif-s3-osn` asset pointing to the GeoTIFF on S3.

```{code-cell} ipython3
collection = get_stac_collection("nlcd-LndCov")
print(f"Collection: {collection.id}")
print(f"Collection-level assets: {list(collection.assets.keys())}")

# List the first few items (each item is one year of NLCD)
items = list(collection.get_items())
print(f"\nNumber of items: {len(items)}")
for item in items[:5]:
    print(f"  {item.id}  -  {item.properties.get('datetime')}")
```

## Create the NHGFStacData Object

We pass the STAC collection to `NHGFStacData` just like the Zarr workflow.
The factory auto-detects that this is a GeoTIFF collection and returns an
`NHGFStacTiffData` instance.

- `source_time_period` selects the STAC item whose datetime falls in the
  given range (here, the 2021 NLCD product).
- `source_var` is the variable name used to label results.

```{code-cell} ipython3
user_data = NHGFStacData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf=huc12_basins,
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)

print(f"Returned type: {type(user_data).__name__}")
print(f"Class type:    {user_data.get_class_type()}")
print(f"Source CRS:    {user_data.source_ds.rio.crs}")
```

## Compute Categorical Zonal Statistics

NLCD land cover values are integers representing land cover classes (e.g.,
11 = Open Water, 21 = Developed Open Space, 41 = Deciduous Forest, etc.).
We use `ZonalGen` with `categorical=True` to compute the proportion of each
class within every HUC12 polygon.

```{code-cell} ipython3
zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="exactextract",
    zonal_writer="csv",
    out_path=".",
    file_prefix="drb_nlcd_2021",
)

stats = zonal_gen.calculate_zonal(categorical=True)
stats.head()
```

## Identify the Dominant Land Cover Class

For each HUC12 basin, find the NLCD class with the highest proportion and
merge the result into the target GeoDataFrame for mapping.

```{code-cell} ipython3
# NLCD Land Cover class names
nlcd_classes = {
    "11": "Open Water",
    "12": "Perennial Ice/Snow",
    "21": "Developed, Open Space",
    "22": "Developed, Low Intensity",
    "23": "Developed, Medium Intensity",
    "24": "Developed, High Intensity",
    "31": "Barren Land",
    "41": "Deciduous Forest",
    "42": "Evergreen Forest",
    "43": "Mixed Forest",
    "52": "Shrub/Scrub",
    "71": "Grassland/Herbaceous",
    "81": "Pasture/Hay",
    "82": "Cultivated Crops",
    "90": "Woody Wetlands",
    "95": "Emergent Herbaceous Wetlands",
}

# NLCD official color palette
nlcd_colors = {
    "Open Water": "#466b9f",
    "Perennial Ice/Snow": "#d1def8",
    "Developed, Open Space": "#dec5c5",
    "Developed, Low Intensity": "#d99282",
    "Developed, Medium Intensity": "#eb0000",
    "Developed, High Intensity": "#ab0000",
    "Barren Land": "#b3ac9f",
    "Deciduous Forest": "#68ab5f",
    "Evergreen Forest": "#1c5f2c",
    "Mixed Forest": "#b5c58f",
    "Shrub/Scrub": "#ccb879",
    "Grassland/Herbaceous": "#dfdfc2",
    "Pasture/Hay": "#dbd83d",
    "Cultivated Crops": "#ab6c28",
    "Woody Wetlands": "#b8d9eb",
    "Emergent Herbaceous Wetlands": "#6c9fb8",
}

# Drop the 'count' column before finding the dominant class
class_cols = [c for c in stats.columns if c != "count"]
dominant = stats[class_cols].idxmax(axis=1).reset_index()
dominant.columns = ["huc12", "dominant_class"]

# Column names may be int or str — normalize to str before mapping
dominant["dominant_class"] = dominant["dominant_class"].astype(str).map(nlcd_classes)

gdf = huc12_basins.merge(dominant, on="huc12", how="left")

# Build a sorted color list matching the categories present in the data
categories = sorted(gdf["dominant_class"].dropna().unique())

gdf.head()
```

## Visualize the Results

Map the dominant NLCD land cover class for each HUC12 basin in the Delaware
River Basin.

```{code-cell} ipython3
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

fig, ax = plt.subplots(figsize=(10, 12))
plot_gdf = gdf.to_crs(4326)

# Color each polygon by its NLCD class
colors = plot_gdf["dominant_class"].map(nlcd_colors).fillna("#cccccc")
plot_gdf.plot(ax=ax, color=colors, alpha=0.8, edgecolor="gray", linewidth=0.2)

# Build legend from classes present in the data
legend_handles = [
    Patch(facecolor=nlcd_colors[c], edgecolor="gray", label=c)
    for c in categories
]
ax.legend(
    handles=legend_handles,
    loc="lower left",
    fontsize=8,
    title="Land Cover Class",
)
ax.set_title("Dominant NLCD 2021 Land Cover Class\nDelaware River Basin HUC12s")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.tight_layout()
```