User Data Classes

User Data Classes#

The user data classes form the foundation of the gdptools package, providing standardized interfaces for accessing and processing geospatial data from diverse sources. These classes abstract the complexity of different data formats, coordinate systems, and access methods, presenting a unified API for spatial analysis workflows.

gdptools provides specialized data classes for common geospatial data sources:

  • UserData: Abstract base class defining the common interface

  • ClimRCatData: Climate-R catalog integration with automatic metadata handling

  • UserCatData: User-provided xarray datasets with flexible configuration

  • NHGFStacData: NHGF STAC catalog factory that auto-detects Zarr vs. GeoTIFF collections

  • UserTiffData: GeoTIFF and raster data processing with zonal statistics

All classes implement comprehensive data validation, coordinate reference system handling, and spatial/temporal subsetting capabilities.


Key Features#

Unified Interface#

  • Consistent API: All data classes implement the same abstract methods

  • Automatic validation: CRS validation, coordinate name checking, and data compatibility

  • Error handling: Comprehensive error messages with suggested fixes

Data Source Support#

  • Catalog integration: Seamless access to Climate-R and NHGF STAC catalogs

  • File formats: NetCDF, Zarr, GeoTIFF, and other xarray-compatible formats

  • Web services: OPeNDAP, THREDDS, and other network data sources

Spatial Operations#

  • Automatic reprojection: Handles CRS transformations between source and target data

  • Intersection validation: Ensures spatial overlap between datasets and geometries

  • Efficient subsetting: Spatially subset data to minimize memory usage and processing time
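The idea behind intersection validation can be illustrated with a bounding-box test. This is a minimal sketch and not gdptools' own implementation: the BBox type, the bboxes_overlap function, and the extents are hypothetical, and both boxes are assumed to be expressed in the same CRS already.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box (minx, miny, maxx, maxy) in one shared CRS."""

    minx: float
    miny: float
    maxx: float
    maxy: float


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """Return True when the two boxes share any area or touch along an edge."""
    return (
        a.minx <= b.maxx
        and b.minx <= a.maxx
        and a.miny <= b.maxy
        and b.miny <= a.maxy
    )


# Hypothetical extents in degrees (EPSG:4326 assumed for all three).
grid_extent = BBox(-125.0, 24.0, -66.0, 50.0)       # source grid
basin_extent = BBox(-106.5, 38.5, -105.0, 40.0)     # target polygons inside the grid
offshore_extent = BBox(-160.0, 18.0, -154.0, 23.0)  # target polygons outside the grid

print(bboxes_overlap(grid_extent, basin_extent))
print(bboxes_overlap(grid_extent, offshore_extent))
```

A check like this is cheap to run before any data is read, which is why it pairs naturally with spatial subsetting.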

Temporal Handling#

  • Flexible time formats: Support for various datetime formats and time coordinate names

  • Period subsetting: Automatic temporal slicing based on user-specified ranges

  • Calendar handling: Proper handling of different calendar systems


Abstract Base Class (UserData)#

The UserData class defines the common interface that all data source classes must implement. It establishes the contract for data preparation, spatial subsetting, and aggregation operations.

class UserData[source]#

Bases: ABC

Abstract base class for standardizing geospatial data inputs.

This class defines the common interface that all data source classes must implement. It ensures consistent data preparation, spatial subsetting, and aggregation capabilities across different data sources (catalogs, files, web services).

The class enforces a standardized workflow for data handling:

  1. Data loading and validation

  2. Coordinate reference system handling

  3. Spatial and temporal subsetting

  4. Data preparation for weight generation and aggregation

All subclasses must implement the abstract methods to provide source-specific functionality while maintaining interface consistency.

Notes

This is an abstract base class and cannot be instantiated directly. Use one of the concrete subclasses (ClimRCatData, UserCatData, NHGFStacData, UserTiffData) instead.
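The following sketch shows how Python's abc module enforces this kind of contract. SourceInterface and InMemorySource are hypothetical stand-ins with only two of the methods documented here; they are not gdptools classes.

```python
from abc import ABC, abstractmethod


class SourceInterface(ABC):
    """Hypothetical stand-in for an abstract data-source interface."""

    @abstractmethod
    def get_vars(self) -> list[str]:
        """Return the variable names available for processing."""

    @abstractmethod
    def get_feature_id(self) -> str:
        """Return the identifier column of the target geometries."""


class InMemorySource(SourceInterface):
    """Concrete subclass that implements every abstract method."""

    def __init__(self, variables: list[str], feature_id: str) -> None:
        self._variables = list(variables)
        self._feature_id = feature_id

    def get_vars(self) -> list[str]:
        return list(self._variables)

    def get_feature_id(self) -> str:
        return self._feature_id


try:
    SourceInterface()  # abstract: Python refuses to instantiate it
except TypeError as err:
    print(f"TypeError: {err}")

source = InMemorySource(["aet", "pet"], "huc12")
print(source.get_vars(), source.get_feature_id())
```

A subclass that leaves any abstract method unimplemented fails in the same way at instantiation time, so an incomplete data source is caught before any processing starts.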

abstract __init__()[source]#

Initialize the data source.

This method must be implemented by subclasses to handle source-specific initialization requirements.

abstract get_target_crs()[source]#

Get the coordinate reference system of the target geometries.

Returns:

The coordinate reference system of the target vector data.

Return type:

CRS

abstract get_source_subset(key)[source]#

Get a spatial and temporal subset of the source data.

Parameters:

key (str) – Variable name or identifier for the data to subset.

Returns:

A spatially and temporally subsetted DataArray for the specified variable.

Return type:

DataArray

abstract prep_wght_data()[source]#

Prepare data for weight generation calculations.

Returns:

A WeightData instance containing the necessary data for calculating spatial intersection weights between source and target geometries.

Return type:

WeightData

abstract prep_interp_data(key, poly_id)[source]#

Prepare data for interpolation operations.

Parameters:
  • key (str) – Variable name or identifier for the data to prepare.

  • poly_id (Union[str, int]) – Identifier for the target polygon geometry.

Returns:

An AggData instance configured for interpolation operations.

Return type:

AggData

abstract prep_agg_data(key)[source]#

Prepare data for aggregation operations.

Parameters:

key (str) – Variable name or identifier for the data to prepare.

Returns:

An AggData instance configured for aggregation operations.

Return type:

AggData

abstract get_vars()[source]#

Get the list of available variables in the data source.

Returns:

List of variable names available for processing.

Return type:

list[str]

abstract get_feature_id()[source]#

Get the identifier column name for target geometries.

Returns:

The column name used as the unique identifier for target geometries.

Return type:

str

abstract get_class_type()[source]#

Get the type identifier for this data source class.

Returns:

A string identifier for the data source type (e.g., “ClimRCatData”).

Return type:

str
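Because every data class exposes the same method names, downstream code can be written once against the interface. The sketch below assumes only the method names documented above; SupportsAggregation, prepare_all, and FakeSource are hypothetical, and FakeSource returns a placeholder string where a real class would return an AggData instance.

```python
from typing import Any, Protocol


class SupportsAggregation(Protocol):
    """Subset of the UserData method names listed above."""

    def get_class_type(self) -> str: ...
    def get_feature_id(self) -> str: ...
    def get_vars(self) -> list[str]: ...
    def prep_agg_data(self, key: str) -> Any: ...


def prepare_all(source: SupportsAggregation) -> dict[str, Any]:
    """Call prep_agg_data once per variable and collect the results by name."""
    return {var: source.prep_agg_data(var) for var in source.get_vars()}


class FakeSource:
    """Hypothetical test double; returns a placeholder instead of an AggData."""

    def get_class_type(self) -> str:
        return "FakeSource"

    def get_feature_id(self) -> str:
        return "huc12"

    def get_vars(self) -> list[str]:
        return ["tmmn", "tmmx"]

    def prep_agg_data(self, key: str) -> str:
        return f"prepared:{key}"


prepared = prepare_all(FakeSource())
print(prepared)
```

The same prepare_all function would accept any object with these four methods, whichever concrete data class it comes from.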


Climate-R Catalog (ClimRCatData)#

The ClimRCatData class provides seamless integration with the Climate-R catalog system, offering access to a wide range of climate datasets with automatic metadata handling.

class ClimRCatData(*, source_cat_dict, target_gdf, target_id, source_time_period, **kwargs)[source]#

Bases: UserData

Interface for Climate-R catalog datasets with automatic metadata handling.

This class provides seamless integration with the Climate-R catalog system, developed by Mike Johnson and available on GitHub at mikejohnson51/climateR-catalogs. It automatically handles metadata extraction, coordinate system detection, and spatial/temporal subsetting for catalog-based datasets.

The class accepts Climate-R catalog dictionaries and automatically configures data access parameters, coordinate names, and temporal ranges based on the catalog metadata.

source_cat_dict#

Dictionary mapping variable names to Climate-R catalog metadata.

target_gdf#

GeoDataFrame containing target geometries for spatial operations.

target_id#

Column name for unique identifiers in target geometries.

source_time_period#

Processed time period for temporal subsetting.

Examples

Basic usage with Climate-R catalog:

import pandas as pd
from gdptools.data.user_data import ClimRCatData

# Load Climate-R catalog
cat_url = "https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet"
cat = pd.read_parquet(cat_url)

# Create catalog dictionary for TerraClimate variables
cat_vars = ["aet", "pet", "PDSI"]
cat_params = [
    cat.query("id == 'terraclim' & variable == @var").to_dict("records")[0]
    for var in cat_vars
]
source_cat_dict = dict(zip(cat_vars, cat_params))

# Initialize data source
data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)

# Access data
variables = data.get_vars()
subset = data.get_source_subset("aet")

Multiple variables from GridMET:

# GridMET temperature and precipitation
gm_vars = ["tmmn", "tmmx", "pr"]
gm_params = [
    cat.query("id == 'gridmet' & variable == @var").to_dict("records")[0]
    for var in gm_vars
]
source_cat_dict = dict(zip(gm_vars, gm_params))

data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf=target_polygons,
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)

__init__(*, source_cat_dict, target_gdf, target_id, source_time_period, **kwargs)[source]#

Initialize ClimRCatData for Climate-R catalog integration.

Sets up data access for Climate-R catalog datasets with automatic metadata handling, coordinate system detection, and spatial/temporal validation.

Parameters:
  • source_cat_dict (dict[str, dict[str, Any]]) – Dictionary mapping variable names to Climate-R catalog metadata dictionaries. Each entry should contain the complete catalog record for a variable, including URL, coordinate names, CRS, and temporal information.

  • target_gdf (str | Path | GeoDataFrame) – GeoDataFrame containing target geometries, or path to a shapefile/GeoPackage that can be read by geopandas.

  • target_id (str) – Column name in target_gdf to use as unique identifier for geometries in weight calculations and aggregations.

  • source_time_period (list[str | Timestamp | datetime | None]) – Two-element list defining the temporal range for data subsetting. Format: [“YYYY-MM-DD”, “YYYY-MM-DD”] or [“YYYY-MM-DD HH:MM:SS”, “YYYY-MM-DD HH:MM:SS”].

Raises:
  • KeyError – If target_id is not found in target_gdf columns.

  • ValueError – If source_cat_dict is empty or contains invalid entries.

  • TypeError – If catalog entries are missing required metadata fields.

Notes

The catalog dictionary structure should match the Climate-R catalog format with fields like ‘URL’, ‘X_name’, ‘Y_name’, ‘T_name’, ‘crs’, ‘toptobottom’, etc. Invalid or incomplete entries will raise errors during initialization.
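A catalog record can be checked for the fields named in this note before it is passed to the class. This is a minimal sketch: missing_catalog_fields and the sample record are hypothetical, and treating exactly these six fields as required is an assumption, since the note lists them only as examples.

```python
from typing import Any, Mapping

# Field names quoted in the note above. The real catalog has more columns, and
# which ones are strictly required is an assumption of this sketch.
EXPECTED_FIELDS = ("URL", "X_name", "Y_name", "T_name", "crs", "toptobottom")


def missing_catalog_fields(record: Mapping[str, Any]) -> list[str]:
    """Return the expected field names that are absent or None in a record."""
    return [name for name in EXPECTED_FIELDS if record.get(name) is None]


# Hypothetical, deliberately incomplete record for illustration.
record = {
    "URL": "https://example.com/data.nc",
    "X_name": "lon",
    "Y_name": "lat",
    "crs": "EPSG:4326",
}
print(missing_catalog_fields(record))
```

Records produced by to_dict("records") may carry NaN rather than None for empty cells; this sketch does not detect that case.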

Examples

Initialize with TerraClimate data:

import pandas as pd

cat_url = (
    "https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet"
)
cat = pd.read_parquet(cat_url)

# Create catalog entry for actual evapotranspiration
aet_params = (
    cat.query("id == 'terraclim' & variable == 'aet'").to_dict("records")[0]
)
source_cat_dict = {"aet": aet_params}

data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"],
)

get_target_crs()[source]#

Get the coordinate reference system of the target geometries.

Returns:

The coordinate reference system of the target vector data.

Return type:

CRS

get_source_subset(key)[source]#

Get a spatial and temporal subset of the source data.

This method uses the metadata from the Climate-R catalog to retrieve and subset the data for a specific variable.

Parameters:

key (str) – The variable name to subset from the catalog.

Returns:

A subsetted xarray DataArray.

Return type:

xr.DataArray

prep_interp_data(key, poly_id)[source]#

Prepare an AggData instance from ClimRCatData for interpolation.

Parameters:
  • key (str) – Name of the xarray gridded data variable.

  • poly_id (Union[str, int]) – ID number of the geodataframe geometry to clip the gridded data to.

Returns:

An instance of the AggData class, ready for interpolation.

Return type:

AggData

prep_agg_data(key)[source]#

Prepare ClimRCatData data for aggregation methods.

Parameters:

key (str) – The variable name to prepare for aggregation.

Returns:

An AggData instance ready for aggregation.

Return type:

AggData

prep_wght_data()[source]#

Prepare data for weight generation calculations.

Returns:

Data required for calculating spatial intersection weights between source and target geometries.

Return type:

WeightData

get_feature_id()[source]#

Return target_id.

get_vars()[source]#

Return the keys of source_cat_dict, which serve as the variable names.

get_class_type()[source]#

Get the type identifier for this data source class.

Examples (ClimRCatData)#

Basic Climate-R Usage#

import pandas as pd
from gdptools.data.user_data import ClimRCatData

# Load Climate-R catalog
cat_url = "https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet"
cat = pd.read_parquet(cat_url)

# Query for GridMET temperature data
temp_params = cat.query("id == 'gridmet' & variable == 'tmmn'").to_dict("records")[0]
source_cat_dict = {"tmmn": temp_params}

# Initialize data source
data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)

# Access data
variables = data.get_vars()
subset = data.get_source_subset("tmmn")

Multiple Variables#

# gridMET climate variables
gridmet_vars = ["tmmx", "tmmn", "pr"]
gridmet_params = [
    cat.query("id == 'gridmet' & variable == @var").to_dict("records")[0]
    for var in gridmet_vars
]
source_cat_dict = dict(zip(gridmet_vars, gridmet_params))

data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf=watersheds_gdf,
    target_id="watershed_id",
    source_time_period=["2015-01-01", "2021-12-31"]
)

# Process all variables
for var in data.get_vars():
    agg_data = data.prep_agg_data(var)
    print(f"Prepared {var} for aggregation")

Custom Datasets (UserCatData)#

The UserCatData class handles user-provided xarray datasets with full control over coordinate names, variable selection, and coordinate reference systems.

class UserCatData(*, source_ds, source_crs, source_x_coord, source_y_coord, source_t_coord, source_var, target_gdf, target_crs, target_id, source_time_period)[source]#

Bases: UserData

Handler for user-provided xarray datasets with custom configuration.

This class provides a flexible interface for working with user-supplied gridded datasets that are not available through catalogs. It handles xarray datasets from local files, URLs, or in-memory objects, with full control over coordinate names, variable selection, and coordinate reference systems.

The class performs comprehensive validation of input parameters and data compatibility, ensuring that coordinate names exist, variables are available, and coordinate reference systems are valid.

source_ds#

The source xarray Dataset containing gridded data.

target_gdf#

GeoDataFrame containing target geometries.

target_id#

Column name for unique identifiers in target geometries.

source_time_period#

Processed time period for temporal subsetting.

Examples

Basic usage with local NetCDF file:

import xarray as xr
from gdptools.data.user_data import UserCatData

# Load custom dataset
ds = xr.open_dataset("climate_data.nc")

data = UserCatData(
    source_ds=ds,
    source_crs=4326,
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["temperature", "precipitation"],
    target_gdf="watersheds.shp",
    target_crs=4326,
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)

Using URL data source:

# Remote dataset
data = UserCatData(
    source_ds="https://example.com/climate.nc",
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var="temperature",
    target_gdf=polygons_gdf,
    target_crs=3857,
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)

Multiple variables with different CRS:

# Projected dataset
data = UserCatData(
    source_ds="projected_data.nc",
    source_crs=3857,  # Web Mercator
    source_x_coord="x",
    source_y_coord="y",
    source_t_coord="time",
    source_var=["var1", "var2", "var3"],
    target_gdf="regions.gpkg",
    target_crs=4326,
    target_id="region_id",
    source_time_period=["2020-06-01", "2020-08-31"]
)

__init__(*, source_ds, source_crs, source_x_coord, source_y_coord, source_t_coord, source_var, target_gdf, target_crs, target_id, source_time_period)[source]#

Initialize UserCatData for custom gridded datasets.

Sets up data access for user-provided xarray datasets with comprehensive validation of coordinates, variables, and coordinate reference systems.

Parameters:
  • source_ds (str | Dataset) – Source dataset as xarray Dataset, file path, or URL. Can be any data source readable by xarray.open_dataset().

  • source_crs (str | int | CRS) – Coordinate reference system for the source dataset. Can be EPSG code, proj4 string, WKT, or pyproj CRS object.

  • source_x_coord (str) – Name of the x-coordinate dimension in source_ds. Must exist in dataset coordinates.

  • source_y_coord (str) – Name of the y-coordinate dimension in source_ds. Must exist in dataset coordinates.

  • source_t_coord (str) – Name of the time coordinate dimension in source_ds. Must exist in dataset coordinates.

  • source_var (str | list[str]) – Variable name(s) to use for processing. Can be a single string or list of strings. All variables must exist in source_ds.

  • target_gdf (str | Path | GeoDataFrame) – Target geometries as GeoDataFrame or file path. Can be any format readable by geopandas.read_file().

  • target_crs (str | int | CRS) – Coordinate reference system for target geometries. Can be EPSG code, proj4 string, WKT, or pyproj CRS object.

  • target_id (str) – Column name in target_gdf to use as unique identifier. Must exist in target_gdf columns.

  • source_time_period (list[str | Timestamp | datetime | None]) – Two-element list defining temporal range. Format: [“YYYY-MM-DD”, “YYYY-MM-DD”] or with time stamps.

Raises:
  • KeyError – If target_id is not in target_gdf columns, or if coordinate names or variables are not found in the dataset.

  • ValueError – If source_crs or target_crs are invalid CRS specifications.

  • FileNotFoundError – If source_ds or target_gdf file paths don’t exist.

Note

This class performs extensive validation upon initialization, including checking for the existence of specified coordinates and variables, validating CRS definitions, and ensuring the target ID exists in the geometries.
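The name checks described in this note can be approximated before constructing the class. The sketch below is not the library's validation code: find_missing_names is a hypothetical helper that works on plain lists of names, so it runs without xarray.

```python
from __future__ import annotations

from typing import Iterable


def find_missing_names(
    available: Iterable[str], requested: str | Iterable[str]
) -> list[str]:
    """Return the requested names that are not present in `available`, in order."""
    names = [requested] if isinstance(requested, str) else list(requested)
    known = set(available)
    return [name for name in names if name not in known]


# Hypothetical names; with xarray these would come from ds.coords and ds.data_vars.
coord_names = ["time", "lat", "lon"]
var_names = ["temp", "precip"]

print(find_missing_names(coord_names, ["lon", "lat", "time"]))
print(find_missing_names(var_names, "temperature"))
```

Accepting either a single string or a list mirrors the source_var parameter, which is documented as str | list[str].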

Examples

Initialize with local NetCDF file:

import xarray as xr
data = UserCatData(
    source_ds="temperature_data.nc",
    source_crs=4326,
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["temperature"],
    target_gdf="watersheds.shp",
    target_crs=4326,
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)

Initialize with in-memory dataset:

ds = xr.open_dataset("climate.nc")
data = UserCatData(
    source_ds=ds,
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var=["temp", "precip"],
    target_gdf=polygons_gdf,
    target_crs=3857,
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)

classmethod __repr__()[source]#

Print class name.

get_target_crs()[source]#

Return the coordinate reference system (CRS) of the target geometries.

Returns:

The CRS associated with the target geometries.

Return type:

str | int | CRS

get_source_subset(key)[source]#

Get a spatial and temporal subset of the source dataset.

This method applies the pre-calculated spatial and temporal subset dictionary to the source dataset for the given variable.

Parameters:

key (str) – Name of the xarray gridded data variable.

Returns:

A subsetted xarray DataArray of the original source gridded data.

Return type:

xr.DataArray

get_feature_id()[source]#

Return target_id.

get_vars()[source]#

Return the list of variable names selected from the source dataset.

get_class_type()[source]#

Get the type identifier for this data source class.

prep_interp_data(key, poly_id)[source]#

Prepare an AggData instance from UserCatData for interpolation.

Parameters:
  • key (str) – Name of the xarray gridded data variable.

  • poly_id (Union[str, int]) – ID number of the geodataframe geometry to clip the gridded data to.

Returns:

An instance of the AggData class, ready for interpolation.

Return type:

AggData

prep_agg_data(key)[source]#

Prepare data for aggregation operations.

This method subsets the source dataset based on the pre-calculated spatial and temporal bounds and prepares an AggData object.

Parameters:

key (str) – The variable name to prepare for aggregation.

Returns:

An AggData instance ready for aggregation.

Return type:

AggData

prep_wght_data()[source]#

Prepare data for weight generation calculations.

This method subsets the source dataset and generates grid cell polygons required for calculating spatial intersection weights.

Returns:

Data required for weight generation.

Return type:

WeightData

Examples (UserCatData)#

Local NetCDF File#

import xarray as xr
from gdptools.data.user_data import UserCatData

# Load local dataset
ds = xr.open_dataset("climate_data.nc")

data = UserCatData(
    source_ds=ds,
    source_crs=4326,
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["temperature", "precipitation"],
    target_gdf="regions.shp",
    target_crs=4326,
    target_id="region_id",
    source_time_period=["2020-01-01", "2020-12-31"]
)

# Prepare for aggregation
for var in data.get_vars():
    agg_data = data.prep_agg_data(var)

Remote Dataset#

# Access remote OPeNDAP dataset
data = UserCatData(
    source_ds="https://example.com/thredds/dodsC/dataset.nc",
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var="temperature",
    target_gdf=polygons_gdf,
    target_crs=3857,  # Web Mercator
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)

Projected Coordinates#

# Work with projected coordinate system
data = UserCatData(
    source_ds="projected_data.nc",
    source_crs=3857,  # Web Mercator
    source_x_coord="x",
    source_y_coord="y",
    source_t_coord="time",
    source_var=["var1", "var2"],
    target_gdf="regions.gpkg",
    target_crs=4326,
    target_id="region_id",
    source_time_period=["2020-06-01", "2020-08-31"]
)

NHGF STAC Catalog (NHGFStacData)#

The NHGF STAC catalog hosts datasets in two storage formats: Zarr (multi-dimensional time-series, e.g. CONUS404) and GeoTIFF (single raster per time step, e.g. NLCD land cover).

NHGFStacData is a factory class that auto-detects which format a STAC collection uses and returns the appropriate concrete class:

Collection format    Returned class      Example datasets
-----------------    ----------------    ------------------------
Zarr                 NHGFStacZarrData    CONUS404, GridMET, PRISM
GeoTIFF              NHGFStacTiffData    NLCD land cover

Most users should use the factory (NHGFStacData) and let auto-detection handle the rest. The concrete classes can also be imported directly when you know the format in advance.

class NHGFStacData(*, source_collection, source_var, target_gdf, target_id, source_time_period=None, asset_type=None, band=1, bname='band', **kwargs)[source]#

Bases: object

Factory for NHGF STAC catalog datasets.

Auto-detects whether the STAC collection is Zarr-backed or GeoTIFF-backed and returns the appropriate concrete class (NHGFStacZarrData or NHGFStacTiffData).

This preserves backward compatibility: existing code using NHGFStacData(source_collection=zarr_collection, ...) continues to work and receives an NHGFStacZarrData instance.

Examples

Zarr collection (auto-detected):

from gdptools import NHGFStacData
from gdptools.helpers import get_stac_collection

collection = get_stac_collection("conus404_daily")
data = NHGFStacData(
    source_collection=collection,
    source_var=["PWAT"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-01-31"],
)
# data is an NHGFStacZarrData instance

GeoTIFF collection (auto-detected):

collection = get_stac_collection("nlcd-LndCov")
data = NHGFStacData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)
# data is an NHGFStacTiffData instance

static __new__(cls, *, source_collection, source_var, target_gdf, target_id, source_time_period=None, asset_type=None, band=1, bname='band', **kwargs)[source]#

Create an NHGF STAC data source with auto-detected format.

Parameters:
  • source_collection – STAC collection object.

  • source_var (str | list[str]) – Variable name(s) to process.

  • target_gdf (str | Path | GeoDataFrame) – Target geometries.

  • target_id (str) – Column name for unique identifiers.

  • source_time_period (list[str | Timestamp | datetime | None] | None) – Time period [start, end]. Required for Zarr; optional for GeoTIFF (selects which item to load).

  • asset_type (str | None) – Optional override: "zarr" or "tiff". If None, auto-detected from the collection.

  • band (int) – Band number for GeoTIFF (1-indexed). Ignored for Zarr.

  • bname (str) – Band dimension name for GeoTIFF. Ignored for Zarr.

  • **kwargs – Additional keyword arguments for backward compatibility with deprecated parameter names.

Returns:

NHGFStacZarrData or NHGFStacTiffData instance.

Return type:

NHGFStacZarrData | NHGFStacTiffData
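The factory behaviour of __new__ can be sketched with stand-in classes. Everything here is hypothetical: SourceFactory, ZarrSource, and TiffSource are not gdptools classes, the collection is a plain dict rather than a STAC object, and the detection rule (presence of a "zarr-s3-osn" asset key) is an assumption chosen for illustration.

```python
from __future__ import annotations

from typing import Any


class ZarrSource:
    """Stand-in for the Zarr-backed concrete class."""

    def __init__(self, collection: dict[str, Any]) -> None:
        self.collection = collection


class TiffSource:
    """Stand-in for the GeoTIFF-backed concrete class."""

    def __init__(self, collection: dict[str, Any], band: int = 1) -> None:
        self.collection = collection
        self.band = band


class SourceFactory:
    """Calling SourceFactory(...) returns a ZarrSource or a TiffSource."""

    def __new__(
        cls,
        *,
        collection: dict[str, Any],
        asset_type: str | None = None,
        band: int = 1,
    ):
        if asset_type is None:
            # Assumed detection rule: a Zarr asset key marks a Zarr collection.
            has_zarr = "zarr-s3-osn" in collection.get("assets", {})
            asset_type = "zarr" if has_zarr else "tiff"
        if asset_type == "zarr":
            return ZarrSource(collection)
        if asset_type == "tiff":
            return TiffSource(collection, band=band)
        raise ValueError(f"asset_type must be 'zarr' or 'tiff', got {asset_type!r}")


zarr_like = {"assets": {"zarr-s3-osn": {}}}
tiff_like = {"assets": {"image": {}}}

print(type(SourceFactory(collection=zarr_like)).__name__)
print(type(SourceFactory(collection=tiff_like)).__name__)
print(type(SourceFactory(collection=tiff_like, asset_type="zarr")).__name__)
```

Because __new__ returns an object that is not an instance of the factory class, Python does not call the factory's __init__ afterwards; the concrete class must be fully constructed inside __new__, as it is here.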

Examples (NHGFStacData)#

Zarr collection (auto-detected)#

from gdptools import NHGFStacData
from gdptools.helpers import get_stac_collection

collection = get_stac_collection("conus404_daily")
data = NHGFStacData(
    source_collection=collection,
    source_var=["PWAT"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-01-31"],
)
# type(data) is NHGFStacZarrData

GeoTIFF collection (auto-detected)#

collection = get_stac_collection("nlcd-LndCov")
data = NHGFStacData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)
# type(data) is NHGFStacTiffData

Note

source_time_period is required for Zarr collections (continuous time-series must be temporally subset). For GeoTIFF collections it is optional; when provided, it selects which STAC item (year) to load.
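The required-versus-optional rule can be expressed as a small pre-check. This is a sketch under stated assumptions, not the library's own logic: parse_time_period is a hypothetical helper, it accepts only ISO-style strings, and the check that start precedes end is an added assumption.

```python
from __future__ import annotations

from datetime import datetime


def parse_time_period(
    period: list[str] | None, *, required: bool
) -> tuple[datetime, datetime] | None:
    """Parse a [start, end] pair of ISO-style strings, or return None if omitted."""
    if period is None:
        if required:
            raise ValueError("source_time_period is required for this collection")
        return None
    if len(period) != 2:
        raise ValueError("source_time_period must have exactly two elements")
    start, end = (datetime.fromisoformat(value) for value in period)
    if start > end:
        raise ValueError("start must not be later than end")
    return start, end


# Zarr-style call: the period is mandatory.
print(parse_time_period(["2020-01-01", "2020-01-31"], required=True))

# GeoTIFF-style call: the period may be left out.
print(parse_time_period(None, required=False))
```

datetime.fromisoformat accepts both "YYYY-MM-DD" and "YYYY-MM-DD HH:MM:SS", which covers the two string formats shown in the parameter descriptions.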


Zarr Collections (NHGFStacZarrData)#

NHGFStacZarrData provides access to Zarr-backed datasets in the NHGF STAC catalog. These are multi-dimensional datasets stored as cloud-optimized Zarr archives on S3, with automatic spatiotemporal filtering and metadata handling.

class NHGFStacZarrData(*, source_collection, source_var, target_gdf, target_id, source_time_period)[source]#

Bases: _NHGFStacBase

Interface for Zarr-backed NHGF STAC catalog datasets.

This class provides access to Zarr datasets through the NHGF STAC catalog, with automatic spatiotemporal filtering and metadata handling.

source_collection#

STAC collection identifier for the dataset.

source_var#

Variable name(s) to access from the collection.

target_gdf#

GeoDataFrame containing target geometries.

target_id#

Column name for unique identifiers in target geometries.

source_time_period#

Time period for temporal filtering.

Examples

from gdptools import NHGFStacZarrData
from gdptools.helpers import get_stac_collection

collection = get_stac_collection("conus404_daily")
data = NHGFStacZarrData(
    source_collection=collection,
    source_var=["PWAT"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-01-31"],
)

Initialize NHGFStacZarrData for Zarr-backed STAC collections.

Parameters:
  • source_collection – STAC collection object with a zarr-s3-osn asset.

  • source_var (Union[str, list[str]]) – Variable name(s) to aggregate.

  • target_gdf (Union[str, Path, gpd.GeoDataFrame]) – Target geometries.

  • target_id (str) – Column name in target_gdf containing unique identifiers.

  • source_time_period (list[str]) – Two-element list [start, end] defining the temporal subset ('YYYY-MM-DD' or with time component).

Raises:

KeyError – If target_id is not in target_gdf columns.

classmethod __repr__()[source]#

Print class name.

get_source_subset(key)[source]#

Get a subset of the STAC data source for a specific variable.

Parameters:

key (str) – Name of the variable to subset.

Returns:

The subsetted DataArray for the requested variable.

Return type:

xr.DataArray

get_class_type()[source]#

Return the type of the data class.

Returns "NHGFStacData" for backward compatibility.

prep_interp_data(key, poly_id)[source]#

Prepare AggData from NHGFStacZarrData for interpolation.

Parameters:
  • key (str) – Name of the xarray gridded data variable.

  • poly_id (Union[str, int]) – ID number of the geodataframe geometry to clip the gridded data to.

Returns:

An instance of the AggData class, ready for interpolation.

Return type:

AggData

prep_agg_data(key)[source]#

Prepare AggData from NHGFStacZarrData for aggregation.

prep_wght_data()[source]#

Prepare and return WeightData for weight generation.


GeoTIFF Collections (NHGFStacTiffData)#

NHGFStacTiffData provides access to GeoTIFF-backed datasets in the NHGF STAC catalog (e.g., NLCD land cover). Each STAC item contains one or more GeoTIFF assets representing a single time step. The class handles remote S3 access, band selection, and spatial subsetting.

class NHGFStacTiffData(*, source_collection, source_var, target_gdf, target_id, source_time_period=None, band=1, bname='band')[source]#

Bases: _NHGFStacBase

Interface for GeoTIFF-backed NHGF STAC catalog datasets (e.g., NLCD).

This class provides access to GeoTIFF datasets in the NHGF STAC catalog, where assets are stored as individual GeoTIFF files per time step (one STAC item per year). It handles remote S3 access, band selection, and spatial subsetting.

Currently supports single-item (single year) access. Multi-item stacking along a time axis is planned for a future release.

source_collection#

STAC collection for the dataset.

source_var#

Variable name(s) to access.

target_gdf#

GeoDataFrame containing target geometries.

target_id#

Column name for unique identifiers in target geometries.

band#

Selected band number (1-indexed).

Examples

from gdptools import NHGFStacTiffData
from gdptools.helpers import get_stac_collection

collection = get_stac_collection("nlcd-LndCov")
data = NHGFStacTiffData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)

Initialize NHGFStacTiffData for GeoTIFF-backed STAC collections.

Parameters:
  • source_collection – STAC collection whose items have GeoTIFF assets.

  • source_var (Union[str, list[str]]) – Variable name(s) for the raster data.

  • target_gdf (Union[str, Path, gpd.GeoDataFrame]) – Target geometries.

  • target_id (str) – Column name in target_gdf containing unique identifiers.

  • source_time_period (list[str | Timestamp | datetime | None] | None) – Optional two-element list [start, end] to select which STAC item to load by datetime. If None, the first item is used.

  • band (int) – Band number to select (1-indexed). Defaults to 1.

  • bname (str) – Name of the band dimension. Defaults to "band".

Raises:
  • KeyError – If target_id is not in target_gdf columns.

  • ValueError – If no STAC items found or no matching item for the time period.
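The item-selection rule documented above (use the first item when no period is given, otherwise pick the item whose datetime falls in the period) can be sketched as follows. This is an illustration under stated assumptions, not the gdptools implementation: `select_item`, the tuple-shaped items, and the item ids are all hypothetical.

```python
from datetime import datetime
from typing import Optional

# Hypothetical stand-ins for STAC items: (item id, item datetime).
ITEMS = [
    ("item-2019", datetime(2019, 1, 1)),
    ("item-2021", datetime(2021, 1, 1)),
]


def select_item(items, period: Optional[list[str]] = None) -> str:
    """Return the id of the first item whose datetime lies in [start, end]."""
    if not items:
        raise ValueError("No STAC items found")
    if period is None:
        return items[0][0]  # documented default: the first item is used
    start, end = (datetime.fromisoformat(p) for p in period)
    for item_id, when in items:
        if start <= when <= end:
            return item_id
    raise ValueError(f"No item matches time period {period}")


print(select_item(ITEMS, ["2021-01-01", "2021-12-31"]))  # item-2021
print(select_item(ITEMS))  # item-2019
```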

classmethod __repr__()[source]#

Print class name.

get_source_subset(key)[source]#

Get a spatial subset of the source raster data.

Parameters:

key (str) – Variable name (used for interface consistency).

Returns:

Spatially subsetted DataArray.

Return type:

xr.DataArray

get_class_type()[source]#

Return the type of the data class.

prep_interp_data(key, poly_id)[source]#

Prep AggData from NHGFStacTiffData for interpolation.

Parameters:
  • key (str) – Name of the variable.

  • poly_id (Union[str, int]) – ID of the target geometry.

Returns:

An instance ready for interpolation.

Return type:

AggData

prep_agg_data(key)[source]#

Prep AggData from NHGFStacTiffData for aggregation.

Parameters:

key (str) – Name of the variable.

Returns:

An instance ready for aggregation.

Return type:

AggData

prep_wght_data()[source]#

Prepare weight data.

Not implemented for TIFF-based STAC data. Zonal statistics via ZonalGen or WeightedZonalGen is the primary analysis path.


GeoTIFF Data (UserTiffData)#

The UserTiffData class specializes in handling GeoTIFF and raster data for zonal statistics and spatial analysis operations.

class UserTiffData(source_ds, source_crs, source_x_coord, source_y_coord, target_gdf, target_id, bname='band', band=1, source_var='tiff')[source]#

Bases: UserData

Handler for GeoTIFF and other raster data sources.

This class is optimized for working with raster data sources such as GeoTIFF files, providing specialized functionality for zonal statistics and spatial analysis operations. It handles single and multi-band rasters with automatic band selection and coordinate system validation.

The class is particularly useful for processing elevation data, land cover classifications, and other raster datasets that require zonal statistics calculations over vector geometries.

source_ds#

The source raster data as xarray DataArray or Dataset.

target_gdf#

GeoDataFrame containing target geometries for zonal operations.

target_id#

Column name for unique identifiers in target geometries.

band#

Selected band number for multi-band rasters.

source_var#

Variable name assigned to the raster data.

Examples

Basic elevation processing:

from gdptools.data.user_data import UserTiffData

# Process elevation data
data = UserTiffData(
    source_ds="elevation.tif",
    source_crs=4326,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf="watersheds.shp",
    target_id="huc12"
)

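# Prepare for zonal statistics. prep_wght_data() is not implemented for
# UserTiffData; use prep_agg_data with the variable name ("tiff" by default).
agg_data = data.prep_agg_data("tiff")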

Multi-band raster processing:

# Select specific band from multi-band raster
data = UserTiffData(
    source_ds="landcover.tif",
    source_crs=3857,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf=polygons_gdf,
    target_id="poly_id",
    band=3,  # Select band 3
    source_var="landcover_class"
)

In-memory raster data:

import rioxarray as rxr
raster = rxr.open_rasterio("slope.tif")
data = UserTiffData(
    source_ds=raster,
    source_crs=raster.rio.crs,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf="regions.gpkg",
    target_id="region_id"
)

Notes

This class handles band selection and coordinate system validation for raster data. It is intended for zonal statistics workflows with ZonalGen and WeightedZonalGen.

Initialize UserTiffData for raster data processing.

Parameters:
  • source_ds (str | DataArray | Dataset) – Raster data source as a file path, xarray DataArray, or Dataset.

  • source_crs (str | int | CRS) – Coordinate reference system of the raster data.

  • source_x_coord (str) – Name of the x-coordinate dimension in the raster.

  • source_y_coord (str) – Name of the y-coordinate dimension in the raster.

  • target_gdf (str | Path | GeoDataFrame) – Target geometries as a GeoDataFrame or file path.

  • target_id (str) – Column name in target_gdf for unique identifiers.

  • bname (str) – Name of the band dimension in multi-band rasters. Defaults to “band”.

  • band (int) – Band number to select from a multi-band raster (1-indexed). Defaults to 1.

  • source_var (str) – A name to assign to the raster data variable. Defaults to “tiff”.

Raises:
  • FileNotFoundError – If source_ds file path does not exist.

  • KeyError – If target_id is not found in target_gdf columns.

  • ValueError – If source_crs is invalid or the band number is out of range.
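Since band is 1-indexed while Python sequences are 0-indexed, an off-by-one mistake is easy to make when checking results by hand. The sketch below shows the index shift and the out-of-range check on a plain nested list; `select_band` is a hypothetical helper for illustration and does not reflect how gdptools selects bands internally.

```python
def select_band(bands: list[list[list[float]]], band: int = 1) -> list[list[float]]:
    """Return one 2-D band from a [band][row][col] nested list (1-indexed)."""
    if not 1 <= band <= len(bands):
        raise ValueError(f"band must be in 1..{len(bands)}, got {band}")
    return bands[band - 1]  # shift to Python's 0-based indexing


raster = [
    [[1.0, 2.0], [3.0, 4.0]],  # band 1
    [[5.0, 6.0], [7.0, 8.0]],  # band 2
]
print(select_band(raster, band=2))  # [[5.0, 6.0], [7.0, 8.0]]
```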

get_target_crs()[source]#

Get the coordinate reference system of the target geometries.

Returns:

The coordinate reference system of the target vector data.

Return type:

CRS

get_source_subset(key)[source]#

Get a spatial subset of the source raster data.

This method subsets the source raster based on the buffered bounding box of the target geometries. The key argument is not used for this class but is required for interface consistency.

Parameters:

key (str) – A placeholder argument for interface consistency.

Returns:

A spatially subsetted xarray DataArray.

Return type:

xr.DataArray
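To illustrate what subsetting by a buffered bounding box means, the sketch below expands the target bounds and keeps only the grid cells whose centers fall inside. The helper names, the toy grid, and the buffer of one cell width are assumptions made for this example; the buffer gdptools actually applies is not stated on this page.

```python
def buffered_bounds(bounds, buffer):
    """Expand (minx, miny, maxx, maxy) outward by `buffer` on every side."""
    minx, miny, maxx, maxy = bounds
    return (minx - buffer, miny - buffer, maxx + buffer, maxy + buffer)


def subset_indices(coords, lo, hi):
    """Indices of 1-D cell-center coordinates that fall within [lo, hi]."""
    return [i for i, c in enumerate(coords) if lo <= c <= hi]


x = [0.5, 1.5, 2.5, 3.5, 4.5]  # cell-center x coordinates (cell width 1.0)
y = [4.5, 3.5, 2.5, 1.5, 0.5]  # cell-center y coordinates, descending
target_bounds = (1.8, 1.8, 2.9, 2.9)

minx, miny, maxx, maxy = buffered_bounds(target_bounds, buffer=1.0)
cols = subset_indices(x, minx, maxx)
rows = subset_indices(y, miny, maxy)
print(cols, rows)  # [1, 2, 3] [1, 2, 3]
```

Without the buffer only the single cell centered at (2.5, 2.5) would be kept, even though the neighbouring cells partly overlap the target bounds.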

get_vars()[source]#

Get the list of available variables.

For UserTiffData, this is typically a single variable name assigned during initialization.

Returns:

A list containing the single variable name.

Return type:

list[str]

get_feature_id()[source]#

Get the identifier column name for target geometries.

prep_wght_data()[source]#

Prepare data for weight generation.

Notes

This method is not yet implemented for UserTiffData. Zonal statistics for rasters are handled by prep_agg_data.

get_class_type()[source]#

Get the type identifier for this data source class.

prep_interp_data(key, poly_id)[source]#

Prepare data for interpolation operations.

This method subsets the source raster data to the bounding box of a specific target geometry and prepares an AggData object for interpolation.

Parameters:
  • key (str) – The variable name to prepare for interpolation.

  • poly_id (int) – The identifier of the target geometry to use for subsetting.

Returns:

An instance ready for interpolation.

Return type:

AggData

prep_agg_data(key)[source]#

Prepare data for aggregation or zonal statistics.

This method subsets the source raster data to the buffered bounding box of the target geometries and prepares an AggData object.

Parameters:

key (str) – The variable name to prepare for aggregation.

Returns:

An instance ready for aggregation.

Return type:

AggData

Raises:

ValueError – If subsetting the raster results in an empty dataset, which can indicate a CRS mismatch or no spatial overlap.
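An empty subset usually means the two bounding boxes do not overlap, and the most common cause is comparing coordinates in different CRSs. The sketch below shows a plain bounding-box intersection test; `bboxes_intersect` is a hypothetical helper and the coordinate values are illustrative, not taken from a real dataset.

```python
def bboxes_intersect(a, b) -> bool:
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    a_minx, a_miny, a_maxx, a_maxy = a
    b_minx, b_miny, b_maxx, b_maxy = b
    return (
        a_minx <= b_maxx and b_minx <= a_maxx
        and a_miny <= b_maxy and b_miny <= a_maxy
    )


source_lonlat = (-125.0, 24.0, -66.0, 50.0)    # raster bounds in degrees
target_lonlat = (-105.5, 39.0, -104.5, 40.0)   # target bounds in degrees
# Roughly the same target box expressed in projected metres (illustrative).
target_metres = (-11744000.0, 4721000.0, -11633000.0, 4866000.0)

print(bboxes_intersect(source_lonlat, target_lonlat))  # True
print(bboxes_intersect(source_lonlat, target_metres))  # False
```

The second call fails only because the units differ, which is why both sets of bounds should be expressed in the same CRS before they are compared.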

Usage Examples#

Single Band Raster#

from gdptools.data.user_data import UserTiffData
from gdptools.zonal_gen import ZonalGen, WeightedZonalGen

# Process elevation data
data = UserTiffData(
    source_ds="elevation.tif",
    source_crs=4326,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf="watersheds.shp",
    target_id="huc12",
    bname="band",   # name of the band dimension (if present)
    band=1,          # 1-indexed band selection for multi-band rasters
)

# Unweighted zonal statistics (continuous)
zonal = ZonalGen(
    user_data=data,
    zonal_engine="serial",
    zonal_writer="csv",
    out_path="./out",
    file_prefix="elev_zonal",
)
zonal_stats = zonal.calculate_zonal(categorical=False)

# Area-weighted zonal statistics (recommended for geographic CRS)
weighted = WeightedZonalGen(
    user_data=data,
    weight_gen_crs=6931,  # Equal-area CRS recommended
    zonal_engine="parallel",
    zonal_writer="csv",
    out_path="./out",
    file_prefix="elev_weighted",
    jobs=4,  # start modestly; each worker loads the source raster
)
weighted_stats = weighted.calculate_weighted_zonal(categorical=False)
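The recommendation to use area weighting with a geographic CRS follows from the fact that cells of equal angular size cover less ground toward the poles. The sketch below estimates cell area on a sphere with the standard formula R² · Δλ · (sin φ_north − sin φ_south); it is a spherical approximation for illustration only and is unrelated to how gdptools computes weights.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def cell_area_km2(lat_south: float, lat_north: float, dlon_deg: float) -> float:
    """Area of a latitude/longitude cell on a sphere, in square kilometres."""
    dlon = math.radians(dlon_deg)
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return EARTH_RADIUS_KM**2 * dlon * band


equator = cell_area_km2(0.0, 1.0, 1.0)     # 1° x 1° cell at the equator
high_lat = cell_area_km2(60.0, 61.0, 1.0)  # 1° x 1° cell at 60°N
print(f"{equator:.0f} km² vs {high_lat:.0f} km²")
print(f"ratio: {high_lat / equator:.2f}")
```

A 1° cell at 60°N covers roughly half the area of one at the equator, so an unweighted mean over a zone that spans a wide latitude range over-represents the high-latitude cells.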

Best Practices#

Data Validation#

  • Always validate inputs: Check that coordinate names exist in datasets

  • Verify CRS compatibility: Ensure source and target CRS are properly specified

  • Test with subsets: Validate workflows with small spatial/temporal subsets first

Performance Optimization#

  • Spatial subsetting: Use target geometry bounds to minimize data loading

  • Temporal filtering: Specify precise time ranges to reduce memory usage

  • Coordinate systems: Use a projection suited to your study area; for area-weighted statistics on data in a geographic CRS, generate weights in an equal-area CRS

  • Engine selection: Choose computational engines based on dataset size and available resources

  • Memory management: Estimate the size of the subset before loading it, and start with a modest number of parallel jobs because each worker loads the source raster


Troubleshooting#

Common Issues#

KeyError: Coordinate not found#

# Check available coordinates and variables
import xarray as xr

ds = xr.open_dataset("data.nc")
print(f"Available coordinates: {list(ds.coords)}")
print(f"Available variables: {list(ds.data_vars)}")

ValueError: Invalid CRS#

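# Verify CRS specification
from pyproj import CRS
from pyproj.exceptions import CRSError

your_crs = "EPSG:4326"  # replace with the value you pass as source_crs
try:
    crs = CRS.from_user_input(your_crs)
    print(f"Valid CRS: {crs}")
except CRSError as e:
    print(f"Invalid CRS: {e}")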

No spatial intersection#

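# Check geometry bounds. Compare them only after both are in the same CRS.
import rioxarray  # noqa: F401  (registers the .rio accessor on xarray objects)

print(f"Target bounds: {target_gdf.total_bounds}")  # (minx, miny, maxx, maxy)
print(f"Source bounds: {source_ds.rio.bounds()}")   # (minx, miny, maxx, maxy)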

Note

All data classes automatically handle coordinate reference system validation and spatial intersection checking. If spatial overlap is not detected, the classes will raise informative errors with suggestions for resolution.

Warning

When working with large datasets, be mindful of memory usage. Consider using spatial and temporal subsetting to reduce data volume before processing.
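A quick back-of-the-envelope estimate shows how much subsetting helps before any data is loaded. The sketch below multiplies the array dimensions by the bytes per value; the grid sizes are hypothetical, and float32 storage (4 bytes per value) is an assumption.

```python
def estimate_gib(n_time: int, n_y: int, n_x: int, bytes_per_value: int = 4) -> float:
    """Approximate in-memory size of a dense (time, y, x) array in GiB."""
    return n_time * n_y * n_x * bytes_per_value / 2**30


# Hypothetical grid: one year of daily data on a 1000 x 1300 grid.
full = estimate_gib(365, 1000, 1300)
# Same variable after subsetting to one month and a 200 x 250 window.
subset = estimate_gib(31, 200, 250)

print(f"full: {full:.2f} GiB, subset: {subset:.4f} GiB")
# full: 1.77 GiB, subset: 0.0058 GiB
```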


See Also#