NHGF STAC Catalog - NLCD Land Cover Zonal Statistics Example

NHGF STAC Catalog - NLCD Land Cover Zonal Statistics Example#

This example demonstrates how to use GeoTIFF-backed STAC collections with gdptools. While the CONUS404 example works with Zarr time-series data, the NHGF STAC catalog also contains collections stored as individual GeoTIFF files — one per time step.

Here we compute categorical zonal statistics of NLCD (National Land Cover Database) land cover classes for HUC12 basins in the Delaware River Basin.

Key Differences from Zarr Workflow#

	Zarr (e.g., CONUS404)	GeoTIFF (e.g., NLCD)
Asset type	Single `zarr-s3-osn` asset on the collection	One `tif-s3-osn` asset per item (per year)
Time dimension	`(time, y, x)` — multiple time steps in one file	No time dimension — one raster per year
`source_time_period`	Selects a time range from the Zarr store	Selects which STAC item (year) to use
Typical analysis	Area-weighted temporal aggregation	Zonal statistics (categorical or continuous)

NHGFStacData auto-detects the collection format and returns the appropriate handler (NHGFStacZarrData or NHGFStacTiffData) transparently.

import warnings

import geopandas as gpd
import hvplot.pandas
from pygeohydro import watershed
from pynhd import WaterData

from gdptools import NHGFStacData, ZonalGen
from gdptools.helpers import get_stac_collection

warnings.filterwarnings("ignore")

Load the Target Polygons#

We use the same Delaware River Basin HUC12 polygons as the CONUS404 example. These are retrieved from the Watershed Boundary Dataset (WBD) via pygeohydro.

wbd = watershed.WBD("huc4")
delaware_basin = wbd.byids(field="huc4", fids="0204")
huc12_basins = WaterData("wbd12").bygeom(delaware_basin.iloc[0].geometry)
huc12_basins = huc12_basins[
    huc12_basins["huc12"].str.startswith(("020401", "020402"))
]
huc12_basins.head()

	geometry	tnmid	metasourceid	sourcedatadesc	sourceoriginator	sourcefeatureid	loaddate	gnis_id	areaacres	areasqkm	states	huc12	name	hutype	humod	tohuc	globalid
0	MULTIPOLYGON (((-75.02421 39.44397, -75.02353 ...	{001055CD-7B5E-4F96-ABB1-22923FD2B454}	NaN	None	None	None	2013-01-18T07:08:41Z	None	16178.10	65.47	NJ	020402060505	White Marsh Run-Maurice River	S	NM	020402060507	{A70A97FA-E29C-11E2-8094-0021280458E6}
1	MULTIPOLYGON (((-74.62113 41.00565, -74.62139 ...	{009DA440-393A-46A0-BBA3-C38139823414}	{ED602145-9201-4827-9CE1-05D252484579}	None	None	None	2017-10-03T20:11:06Z	None	29107.93	117.80	NJ	020401050502	Lubbers Run-Musconetcong River	S	NM	020401050503	{B830B4F8-E29C-11E2-8094-0021280458E6}
2	MULTIPOLYGON (((-75.75805 40.65996, -75.75783 ...	{00F73634-FA00-464F-93F0-F0BCCB464567}	{ED602145-9201-4827-9CE1-05D252484579}	None	None	None	2017-10-03T20:10:58Z	None	35229.84	142.57	PA	020402030305	Upper Maiden Creek	S	NM	020402030307	{BAF5E77A-E29C-11E2-8094-0021280458E6}
3	MULTIPOLYGON (((-75.13279 39.88601, -75.13439 ...	{017B8A58-E959-4600-B212-8EB921F0F9F7}	{E9F5C988-2313-440E-A05E-C5871E2773A6}	None	None	None	2017-10-03T20:11:04Z	None	24920.90	100.85	NJ,PA	020402020507	Woodbury Creek-Delaware River	S	TF	020402020607	{B122D6B7-E29C-11E2-8094-0021280458E6}
4	MULTIPOLYGON (((-74.92817 40.55256, -74.92833 ...	{01B4598B-0623-407F-9133-3040D99A8D1A}	{ED602145-9201-4827-9CE1-05D252484579}	None	None	None	2017-10-03T20:10:52Z	None	15086.80	61.05	NJ	020401050905	Lockatong Creek	S	NM	020401050908	{A6AB04BE-E29C-11E2-8094-0021280458E6}

Fetch the NLCD Land Cover Collection#

The NLCD data is organized under the nlcd parent collection in the STAC catalog, with sub-collections for each product (e.g., nlcd-LndCov for land cover). The helper get_stac_collection() searches recursively, so we can request "nlcd-LndCov" directly.

Unlike Zarr collections, GeoTIFF collections have no collection-level data assets. Instead, each STAC item represents a single year and contains a tif-s3-osn asset pointing to the GeoTIFF on S3.

collection = get_stac_collection("nlcd-LndCov")
print(f"Collection: {collection.id}")
print(f"Collection-level assets: {list(collection.assets.keys())}")

# List the first few items (each item is one year of NLCD)
items = list(collection.get_items())
print(f"\nNumber of items: {len(items)}")
for item in items[:5]:
    print(f"  {item.id}  -  {item.properties.get('datetime')}")

Collection: nlcd-LndCov
Collection-level assets: []

Number of items: 40
  Annual_NLCD_LndCov_1985_CU_C1V1  -  1985-01-01T00:00:00Z
  Annual_NLCD_LndCov_1986_CU_C1V1  -  1986-01-01T00:00:00Z
  Annual_NLCD_LndCov_1987_CU_C1V1  -  1987-01-01T00:00:00Z
  Annual_NLCD_LndCov_1988_CU_C1V1  -  1988-01-01T00:00:00Z
  Annual_NLCD_LndCov_1989_CU_C1V1  -  1989-01-01T00:00:00Z

Create the NHGFStacData Object#

We pass the STAC collection to NHGFStacData just like the Zarr workflow. The factory auto-detects that this is a GeoTIFF collection and returns an NHGFStacTiffData instance.

source_time_period selects the STAC item whose datetime falls in the given range (here, the 2021 NLCD product).
source_var is the variable name used to label results.

user_data = NHGFStacData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf=huc12_basins,
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)

print(f"Returned type: {type(user_data).__name__}")
print(f"Class type:    {user_data.get_class_type()}")
print(f"Source CRS:    {user_data.source_ds.rio.crs}")

Returned type: NHGFStacTiffData
Class type:    NHGFStacTiffData
Source CRS:    PROJCS["AEA        WGS84",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Albers_Conic_Equal_Area"],PARAMETER["latitude_of_center",23],PARAMETER["longitude_of_center",-96],PARAMETER["standard_parallel_1",29.5],PARAMETER["standard_parallel_2",45.5],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]

Compute Categorical Zonal Statistics#

NLCD land cover values are integers representing land cover classes (e.g., 11 = Open Water, 21 = Developed Open Space, 41 = Deciduous Forest, etc.). We use ZonalGen with categorical=True to compute the proportion of each class within every HUC12 polygon.

zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="exactextract",
    zonal_writer="csv",
    out_path=".",
    file_prefix="drb_nlcd_2021",
)

stats = zonal_gen.calculate_zonal(categorical=True)
stats.head()

	11	21	22	23	24	31	41	42	43	52	71	81	82	90	95	250	count
huc12
020402060505	0.033158	0.085917	0.193784	0.122581	0.050119	0.018544	0.123280	0.023426	0.073461	0.010044	0.002161	0.009373	0.117233	0.114004	0.022915	0.0	72745
020401050502	0.023007	0.099705	0.065001	0.025624	0.006489	0.009425	0.637710	0.000275	0.005815	0.000620	0.007187	0.012210	0.007113	0.097954	0.001864	0.0	130884
020402030305	0.001496	0.065787	0.021263	0.007005	0.000548	0.000508	0.369402	0.000315	0.025328	0.001519	0.000303	0.185952	0.315260	0.005277	0.000038	0.0	158411
020402020507	0.220036	0.081773	0.194387	0.168000	0.140793	0.004056	0.038694	0.000054	0.001013	0.003698	0.002074	0.002793	0.016708	0.098271	0.027650	0.0	112057
020401050905	0.002738	0.072786	0.027082	0.009314	0.001413	0.000619	0.309156	0.000015	0.025494	0.004639	0.000472	0.340643	0.120848	0.083785	0.000996	0.0	67838

Identify the Dominant Land Cover Class#

For each HUC12 basin, find the NLCD class with the highest proportion and merge the result into the target GeoDataFrame for mapping.

# NLCD Land Cover class names
nlcd_classes = {
    "11": "Open Water",
    "12": "Perennial Ice/Snow",
    "21": "Developed, Open Space",
    "22": "Developed, Low Intensity",
    "23": "Developed, Medium Intensity",
    "24": "Developed, High Intensity",
    "31": "Barren Land",
    "41": "Deciduous Forest",
    "42": "Evergreen Forest",
    "43": "Mixed Forest",
    "52": "Shrub/Scrub",
    "71": "Grassland/Herbaceous",
    "81": "Pasture/Hay",
    "82": "Cultivated Crops",
    "90": "Woody Wetlands",
    "95": "Emergent Herbaceous Wetlands",
}

# NLCD official color palette
nlcd_colors = {
    "Open Water": "#466b9f",
    "Perennial Ice/Snow": "#d1def8",
    "Developed, Open Space": "#dec5c5",
    "Developed, Low Intensity": "#d99282",
    "Developed, Medium Intensity": "#eb0000",
    "Developed, High Intensity": "#ab0000",
    "Barren Land": "#b3ac9f",
    "Deciduous Forest": "#68ab5f",
    "Evergreen Forest": "#1c5f2c",
    "Mixed Forest": "#b5c58f",
    "Shrub/Scrub": "#ccb879",
    "Grassland/Herbaceous": "#dfdfc2",
    "Pasture/Hay": "#dbd83d",
    "Cultivated Crops": "#ab6c28",
    "Woody Wetlands": "#b8d9eb",
    "Emergent Herbaceous Wetlands": "#6c9fb8",
}

# Drop the 'count' column before finding the dominant class
class_cols = [c for c in stats.columns if c != "count"]
dominant = stats[class_cols].idxmax(axis=1).reset_index()
dominant.columns = ["huc12", "dominant_class"]

# Column names may be int or str — normalize to str before mapping
dominant["dominant_class"] = dominant["dominant_class"].astype(str).map(nlcd_classes)

gdf = huc12_basins.merge(dominant, on="huc12", how="left")

# Build a sorted color list matching the categories present in the data
categories = sorted(gdf["dominant_class"].dropna().unique())

gdf.head()

	geometry	tnmid	metasourceid	sourcedatadesc	sourceoriginator	sourcefeatureid	loaddate	gnis_id	areaacres	areasqkm	states	huc12	name	hutype	humod	tohuc	globalid	dominant_class
0	MULTIPOLYGON (((-75.02421 39.44397, -75.02353 ...	{001055CD-7B5E-4F96-ABB1-22923FD2B454}	NaN	None	None	None	2013-01-18T07:08:41Z	None	16178.10	65.47	NJ	020402060505	White Marsh Run-Maurice River	S	NM	020402060507	{A70A97FA-E29C-11E2-8094-0021280458E6}	Developed, Low Intensity
1	MULTIPOLYGON (((-74.62113 41.00565, -74.62139 ...	{009DA440-393A-46A0-BBA3-C38139823414}	{ED602145-9201-4827-9CE1-05D252484579}	None	None	None	2017-10-03T20:11:06Z	None	29107.93	117.80	NJ	020401050502	Lubbers Run-Musconetcong River	S	NM	020401050503	{B830B4F8-E29C-11E2-8094-0021280458E6}	Deciduous Forest
2	MULTIPOLYGON (((-75.75805 40.65996, -75.75783 ...	{00F73634-FA00-464F-93F0-F0BCCB464567}	{ED602145-9201-4827-9CE1-05D252484579}	None	None	None	2017-10-03T20:10:58Z	None	35229.84	142.57	PA	020402030305	Upper Maiden Creek	S	NM	020402030307	{BAF5E77A-E29C-11E2-8094-0021280458E6}	Deciduous Forest
3	MULTIPOLYGON (((-75.13279 39.88601, -75.13439 ...	{017B8A58-E959-4600-B212-8EB921F0F9F7}	{E9F5C988-2313-440E-A05E-C5871E2773A6}	None	None	None	2017-10-03T20:11:04Z	None	24920.90	100.85	NJ,PA	020402020507	Woodbury Creek-Delaware River	S	TF	020402020607	{B122D6B7-E29C-11E2-8094-0021280458E6}	Open Water
4	MULTIPOLYGON (((-74.92817 40.55256, -74.92833 ...	{01B4598B-0623-407F-9133-3040D99A8D1A}	{ED602145-9201-4827-9CE1-05D252484579}	None	None	None	2017-10-03T20:10:52Z	None	15086.80	61.05	NJ	020401050905	Lockatong Creek	S	NM	020401050908	{A6AB04BE-E29C-11E2-8094-0021280458E6}	Pasture/Hay

Visualize the Results#

Map the dominant NLCD land cover class for each HUC12 basin in the Delaware River Basin.

import matplotlib.pyplot as plt
from matplotlib.patches import Patch

fig, ax = plt.subplots(figsize=(10, 12))
plot_gdf = gdf.to_crs(4326)

# Color each polygon by its NLCD class
colors = plot_gdf["dominant_class"].map(nlcd_colors).fillna("#cccccc")
plot_gdf.plot(ax=ax, color=colors, alpha=0.8, edgecolor="gray", linewidth=0.2)

# Build legend from classes present in the data
legend_handles = [
    Patch(facecolor=nlcd_colors[c], edgecolor="gray", label=c)
    for c in categories
]
ax.legend(
    handles=legend_handles,
    loc="lower left",
    fontsize=8,
    title="Land Cover Class",
)
ax.set_title("Dominant NLCD 2021 Land Cover Class\nDelaware River Basin HUC12s")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.tight_layout()

../../_images/ec94973f5b0d1d584697771888b4d5a65f6f9892ec7e83048e9f454f59f0d397.png