User Data Classes#
The user data classes form the foundation of the gdptools package, providing standardized interfaces for accessing and processing geospatial data from diverse sources. These classes abstract the complexity of different data formats, coordinate systems, and access methods, presenting a unified API for spatial analysis workflows.
gdptools provides specialized data classes for common geospatial data sources:
UserData: Abstract base class defining the common interface
ClimRCatData: Climate-R catalog integration with automatic metadata handling
UserCatData: User-provided xarray datasets with flexible configuration
NHGFStacData: NHGF STAC catalog factory that auto-detects Zarr vs GeoTIFF
NHGFStacZarrData: Zarr-backed STAC collections (e.g., CONUS404)
NHGFStacTiffData: GeoTIFF-backed STAC collections (e.g., NLCD)
UserTiffData: GeoTIFF and raster data processing with zonal statistics
All classes implement comprehensive data validation, coordinate reference system handling, and spatial/temporal subsetting capabilities.
Key Features#
Unified Interface#
Consistent API: All data classes implement the same abstract methods
Automatic validation: CRS validation, coordinate name checking, and data compatibility
Error handling: Comprehensive error messages with suggested fixes
Data Source Support#
Catalog integration: Seamless access to Climate-R and NHGF STAC catalogs
File formats: NetCDF, Zarr, GeoTIFF, and other xarray-compatible formats
Web services: OPeNDAP, THREDDS, and other network data sources
Spatial Operations#
Automatic reprojection: Handles CRS transformations between source and target data
Intersection validation: Ensures spatial overlap between datasets and geometries
Efficient subsetting: Spatially subset data to minimize memory usage and processing time
Temporal Handling#
Flexible time formats: Support for various datetime formats and time coordinate names
Period subsetting: Automatic temporal slicing based on user-specified ranges
Calendar handling: Proper handling of different calendar systems
Abstract Base Class (UserData)#
The UserData class defines the common interface that all data source classes must implement. It establishes the contract for data preparation, spatial subsetting, and aggregation operations.
- class UserData[source]#
Bases: ABC
Abstract base class for standardizing geospatial data inputs.
This class defines the common interface that all data source classes must implement. It ensures consistent data preparation, spatial subsetting, and aggregation capabilities across different data sources (catalogs, files, web services).
The class enforces a standardized workflow for data handling:
1. Data loading and validation
2. Coordinate reference system handling
3. Spatial and temporal subsetting
4. Data preparation for weight generation and aggregation
All subclasses must implement the abstract methods to provide source-specific functionality while maintaining interface consistency.
Notes
This is an abstract base class and cannot be instantiated directly. Use one of the concrete subclasses (ClimRCatData, UserCatData, NHGFStacData, UserTiffData) instead.
- abstract __init__()[source]#
Initialize the data source.
This method must be implemented by subclasses to handle source-specific initialization requirements.
- abstract get_target_crs()[source]#
Get the coordinate reference system of the target geometries.
- Returns:
The coordinate reference system of the target vector data.
- Return type:
CRS
- abstract prep_wght_data()[source]#
Prepare data for weight generation calculations.
- Returns:
A WeightData instance containing the necessary data for calculating spatial intersection weights between source and target geometries.
- Return type:
WeightData
- abstract prep_interp_data(key, poly_id)[source]#
Prepare data for interpolation operations.
- Parameters:
key (str) – Variable name or identifier for the data to prepare.
poly_id (Union[str, int]) – Identifier for the target polygon geometry.
- Returns:
An AggData instance configured for interpolation operations.
- Return type:
AggData
- abstract prep_agg_data(key)[source]#
Prepare data for aggregation operations.
- Parameters:
key (str) – Variable name or identifier for the data to prepare.
- Returns:
An AggData instance configured for aggregation operations.
- Return type:
AggData
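The contract described above can be pictured with a small stand-alone sketch. It does not use gdptools: `SketchUserData`, `SketchFileData`, and `try_instantiate_base` are hypothetical names, and the two abstract methods only mirror the shape of the interface. The point is that Python refuses to instantiate a class that still has unimplemented abstract methods.

```python
from abc import ABC, abstractmethod


class SketchUserData(ABC):
    """Toy stand-in for an abstract data-source interface (not gdptools code)."""

    @abstractmethod
    def get_target_crs(self) -> str:
        """Return the CRS of the target geometries."""

    @abstractmethod
    def prep_agg_data(self, key: str) -> dict:
        """Return what an aggregation step would need for one variable."""


class SketchFileData(SketchUserData):
    """Concrete subclass that implements every abstract method."""

    def __init__(self, target_crs: str, variables: list) -> None:
        self.target_crs = target_crs
        self.variables = list(variables)

    def get_target_crs(self) -> str:
        return self.target_crs

    def prep_agg_data(self, key: str) -> dict:
        if key not in self.variables:
            raise KeyError(f"unknown variable: {key}")
        return {"variable": key, "crs": self.target_crs}


def try_instantiate_base() -> str:
    """Attempt to build the abstract class and report the error type."""
    try:
        SketchUserData()
    except TypeError as err:
        return type(err).__name__
    return "no error"


data = SketchFileData("EPSG:4326", ["aet"])
print(try_instantiate_base())
print(data.prep_agg_data("aet"))
```

In the same way, code that works against the abstract interface can accept any concrete data class without knowing which source it wraps.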
Climate-R Catalog (ClimRCatData)#
The ClimRCatData class provides seamless integration with the Climate-R catalog system, offering access to a wide range of climate datasets with automatic metadata handling.
- class ClimRCatData(*, source_cat_dict, target_gdf, target_id, source_time_period, **kwargs)[source]#
Bases: UserData
Interface for Climate-R catalog datasets with automatic metadata handling.
This class provides seamless integration with the Climate-R catalog system, developed by Mike Johnson and available at mikejohnson51/climateR-catalogs. It automatically handles metadata extraction, coordinate system detection, and spatial/temporal subsetting for catalog-based datasets.
The class accepts Climate-R catalog dictionaries and automatically configures data access parameters, coordinate names, and temporal ranges based on the catalog metadata.
- source_cat_dict#
Dictionary mapping variable names to Climate-R catalog metadata.
- target_gdf#
GeoDataFrame containing target geometries for spatial operations.
- target_id#
Column name for unique identifiers in target geometries.
- source_time_period#
Processed time period for temporal subsetting.
Examples
Basic usage with Climate-R catalog:
import pandas as pd
from gdptools.data.user_data import ClimRCatData
# Load Climate-R catalog
cat_url = "https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet"
cat = pd.read_parquet(cat_url)
# Create catalog dictionary for TerraClimate variables
cat_vars = ["aet", "pet", "PDSI"]
cat_params = [
    cat.query("id == 'terraclim' & variable == @var").to_dict("records")[0]
    for var in cat_vars
]
source_cat_dict = dict(zip(cat_vars, cat_params))
# Initialize data source
data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)
# Access data
variables = data.get_vars()
subset = data.get_source_subset("aet")
Multiple variables from GridMET:
# GridMET temperature and precipitation
gm_vars = ["tmmn", "tmmx", "pr"]
gm_params = [
    cat.query("id == 'gridmet' & variable == @var").to_dict("records")[0]
    for var in gm_vars
]
source_cat_dict = dict(zip(gm_vars, gm_params))
data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf=target_polygons,
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)
- __init__(*, source_cat_dict, target_gdf, target_id, source_time_period, **kwargs)[source]#
Initialize ClimRCatData for Climate-R catalog integration.
Sets up data access for Climate-R catalog datasets with automatic metadata handling, coordinate system detection, and spatial/temporal validation.
- Parameters:
source_cat_dict (dict[str, dict[str, Any]]) – Dictionary mapping variable names to Climate-R catalog metadata dictionaries. Each entry should contain the complete catalog record for a variable, including URL, coordinate names, CRS, and temporal information.
target_gdf (str | Path | GeoDataFrame) – GeoDataFrame containing target geometries, or path to a shapefile/GeoPackage that can be read by geopandas.
target_id (str) – Column name in target_gdf to use as unique identifier for geometries in weight calculations and aggregations.
source_time_period (list[str | Timestamp | datetime | None]) – Two-element list defining the temporal range for data subsetting. Format: [“YYYY-MM-DD”, “YYYY-MM-DD”] or [“YYYY-MM-DD HH:MM:SS”, “YYYY-MM-DD HH:MM:SS”].
- Raises:
KeyError – If target_id is not found in target_gdf columns.
ValueError – If source_cat_dict is empty or contains invalid entries.
TypeError – If catalog entries are missing required metadata fields.
Notes
The catalog dictionary structure should match the Climate-R catalog format with fields like ‘URL’, ‘X_name’, ‘Y_name’, ‘T_name’, ‘crs’, ‘toptobottom’, etc. Invalid or incomplete entries will raise errors during initialization.
Examples
Initialize with TerraClimate data:
import pandas as pd
cat_url = (
    "https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet"
)
cat = pd.read_parquet(cat_url)
# Create catalog entry for actual evapotranspiration
aet_params = (
    cat.query("id == 'terraclim' & variable == 'aet'").to_dict("records")[0]
)
source_cat_dict = {"aet": aet_params}
data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"],
)
- get_target_crs()[source]#
Get the coordinate reference system of the target geometries.
- Returns:
The coordinate reference system of the target vector data.
- Return type:
CRS
- get_source_subset(key)[source]#
Get a spatial and temporal subset of the source data.
This method uses the metadata from the Climate-R catalog to retrieve and subset the data for a specific variable.
- Parameters:
key (str) – The variable name to subset from the catalog.
- Returns:
A subsetted xarray DataArray.
- Return type:
xr.DataArray
- prep_interp_data(key, poly_id)[source]#
Prep AggData from ClimRCatData.
- Parameters:
key (str) – Name of the xarray gridded data variable.
poly_id (Union[str, int]) – ID number of the geodataframe geometry to clip the gridded data to.
- Returns:
An instance of the AggData class, ready for interpolation.
- Return type:
AggData
- prep_agg_data(key)[source]#
Prepare ClimRCatData data for aggregation methods.
- Parameters:
key (str) – The variable name to prepare for aggregation.
- Returns:
An AggData instance ready for aggregation.
- Return type:
AggData
Examples (ClimRCatData)#
Basic Climate-R Usage#
import pandas as pd
from gdptools.data.user_data import ClimRCatData
# Load Climate-R catalog
cat_url = "https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet"
cat = pd.read_parquet(cat_url)
# Query for GridMET temperature data
temp_params = cat.query("id == 'gridmet' & variable == 'tmmn'").to_dict("records")[0]
source_cat_dict = {"tmmn": temp_params}
# Initialize data source
data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)
# Access data
variables = data.get_vars()
subset = data.get_source_subset("tmmn")
Multiple Variables#
# gridMET climate variables
gridmet_vars = ["tmmx", "tmmn", "pr"]
gridmet_params = [
    cat.query("id == 'gridmet' & variable == @var").to_dict("records")[0]
    for var in gridmet_vars
]
source_cat_dict = dict(zip(gridmet_vars, gridmet_params))
data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf=watersheds_gdf,
    target_id="watershed_id",
    source_time_period=["2015-01-01", "2021-12-31"]
)
# Process all variables
for var in data.get_vars():
    agg_data = data.prep_agg_data(var)
    print(f"Prepared {var} for aggregation")
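The way the examples above assemble source_cat_dict can be tried without network access. The following is a minimal sketch using hand-written stand-in records in place of the real catalog: the record values, the dataset id "demo", and the helpers `build_source_cat_dict` and `missing_fields` are all hypothetical, and the list of required fields is only the set named in the Notes for ClimRCatData, not a confirmed complete list.

```python
# Field names mentioned in the ClimRCatData notes; the real required set may differ.
REQUIRED_FIELDS = ("URL", "X_name", "Y_name", "T_name", "crs", "toptobottom")

# Stand-in catalog rows with placeholder values (not real catalog data).
records = [
    {
        "id": "demo", "variable": "tmmn", "URL": "https://example.com/tmmn.nc",
        "X_name": "lon", "Y_name": "lat", "T_name": "day",
        "crs": "EPSG:4326", "toptobottom": True,
    },
    {
        "id": "demo", "variable": "tmmx", "URL": "https://example.com/tmmx.nc",
        "X_name": "lon", "Y_name": "lat", "T_name": "day",
        "crs": "EPSG:4326", "toptobottom": True,
    },
]


def build_source_cat_dict(rows, dataset_id, variables):
    """Map each requested variable name to its catalog record."""
    by_var = {r["variable"]: r for r in rows if r["id"] == dataset_id}
    missing = [v for v in variables if v not in by_var]
    if missing:
        raise ValueError(f"variables not in catalog: {missing}")
    return {v: by_var[v] for v in variables}


def missing_fields(entry):
    """List the expected metadata fields that an entry lacks."""
    return [f for f in REQUIRED_FIELDS if f not in entry]


source_cat_dict = build_source_cat_dict(records, "demo", ["tmmx", "tmmn"])
for name, entry in source_cat_dict.items():
    print(name, "missing:", missing_fields(entry))
```

Checking entries like this before constructing ClimRCatData may make incomplete catalog records easier to spot than an error raised during initialization.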
Custom Datasets (UserCatData)#
The UserCatData class handles user-provided xarray datasets with full control over coordinate names, variable selection, and coordinate reference systems.
- class UserCatData(*, source_ds, source_crs, source_x_coord, source_y_coord, source_t_coord, source_var, target_gdf, target_crs, target_id, source_time_period)[source]#
Bases: UserData
Handler for user-provided xarray datasets with custom configuration.
This class provides a flexible interface for working with user-supplied gridded datasets that are not available through catalogs. It handles xarray datasets from local files, URLs, or in-memory objects, with full control over coordinate names, variable selection, and coordinate reference systems.
The class performs comprehensive validation of input parameters and data compatibility, ensuring that coordinate names exist, variables are available, and coordinate reference systems are valid.
- source_ds#
The source xarray Dataset containing gridded data.
- target_gdf#
GeoDataFrame containing target geometries.
- target_id#
Column name for unique identifiers in target geometries.
- source_time_period#
Processed time period for temporal subsetting.
Examples
Basic usage with local NetCDF file:
import xarray as xr
from gdptools.data.user_data import UserCatData
# Load custom dataset
ds = xr.open_dataset("climate_data.nc")
data = UserCatData(
    source_ds=ds,
    source_crs=4326,
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["temperature", "precipitation"],
    target_gdf="watersheds.shp",
    target_crs=4326,
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)
Using URL data source:
# Remote dataset
data = UserCatData(
    source_ds="https://example.com/climate.nc",
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var="temperature",
    target_gdf=polygons_gdf,
    target_crs=3857,
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)
Multiple variables with different CRS:
# Projected dataset
data = UserCatData(
    source_ds="projected_data.nc",
    source_crs=3857,  # Web Mercator
    source_x_coord="x",
    source_y_coord="y",
    source_t_coord="time",
    source_var=["var1", "var2", "var3"],
    target_gdf="regions.gpkg",
    target_crs=4326,
    target_id="region_id",
    source_time_period=["2020-06-01", "2020-08-31"]
)
- __init__(*, source_ds, source_crs, source_x_coord, source_y_coord, source_t_coord, source_var, target_gdf, target_crs, target_id, source_time_period)[source]#
Initialize UserCatData for custom gridded datasets.
Sets up data access for user-provided xarray datasets with comprehensive validation of coordinates, variables, and coordinate reference systems.
- Parameters:
source_ds (str | Dataset) – Source dataset as xarray Dataset, file path, or URL. Can be any data source readable by xarray.open_dataset().
source_crs (str | int | CRS) – Coordinate reference system for the source dataset. Can be EPSG code, proj4 string, WKT, or pyproj CRS object.
source_x_coord (str) – Name of the x-coordinate dimension in source_ds. Must exist in dataset coordinates.
source_y_coord (str) – Name of the y-coordinate dimension in source_ds. Must exist in dataset coordinates.
source_t_coord (str) – Name of the time coordinate dimension in source_ds. Must exist in dataset coordinates.
source_var (str | list[str]) – Variable name(s) to use for processing. Can be a single string or list of strings. All variables must exist in source_ds.
target_gdf (str | Path | GeoDataFrame) – Target geometries as GeoDataFrame or file path. Can be any format readable by geopandas.read_file().
target_crs (str | int | CRS) – Coordinate reference system for target geometries. Can be EPSG code, proj4 string, WKT, or pyproj CRS object.
target_id (str) – Column name in target_gdf to use as unique identifier. Must exist in target_gdf columns.
source_time_period (list[str | Timestamp | datetime | None]) – Two-element list defining temporal range. Format: [“YYYY-MM-DD”, “YYYY-MM-DD”] or with time stamps.
- Raises:
KeyError – If target_id is not in target_gdf columns, or if coordinate names or variables are not found in the dataset.
ValueError – If source_crs or target_crs are invalid CRS specifications.
FileNotFoundError – If source_ds or target_gdf file paths don’t exist.
Note
This class performs extensive validation upon initialization, including checking for the existence of specified coordinates and variables, validating CRS definitions, and ensuring the target ID exists in the geometries.
Examples
Initialize with local NetCDF file:
import xarray as xr
data = UserCatData(
    source_ds="temperature_data.nc",
    source_crs=4326,
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["temperature"],
    target_gdf="watersheds.shp",
    target_crs=4326,
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)
Initialize with in-memory dataset:
ds = xr.open_dataset("climate.nc")
data = UserCatData(
    source_ds=ds,
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var=["temp", "precip"],
    target_gdf=polygons_gdf,
    target_crs=3857,
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)
- get_target_crs()[source]#
Return the coordinate reference system (CRS) of the target geometries.
This is the CRS supplied as target_crs, not the CRS of the source dataset.
- get_source_subset(key)[source]#
Get a spatial and temporal subset of the source dataset.
This method applies the pre-calculated spatial and temporal subset dictionary to the source dataset for the given variable.
- Parameters:
key (str) – Name of the xarray gridded data variable.
- Returns:
A subsetted xarray DataArray of the original source gridded data.
- Return type:
xr.DataArray
- prep_interp_data(key, poly_id)[source]#
Prep AggData from UserCatData.
- Parameters:
key (str) – Name of the xarray gridded data variable.
poly_id (Union[str, int]) – ID number of the geodataframe geometry to clip the gridded data to.
- Returns:
An instance of the AggData class, ready for interpolation.
- Return type:
AggData
- prep_agg_data(key)[source]#
Prepare data for aggregation operations.
This method subsets the source dataset based on the pre-calculated spatial and temporal bounds and prepares an AggData object.
- Parameters:
key (str) – The variable name to prepare for aggregation.
- Returns:
An AggData instance ready for aggregation.
- Return type:
AggData
Examples (UserCatData)#
Local NetCDF File#
import xarray as xr
from gdptools.data.user_data import UserCatData
# Load local dataset
ds = xr.open_dataset("climate_data.nc")
data = UserCatData(
    source_ds=ds,
    source_crs=4326,
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["temperature", "precipitation"],
    target_gdf="regions.shp",
    target_crs=4326,
    target_id="region_id",
    source_time_period=["2020-01-01", "2020-12-31"]
)
# Prepare for aggregation
for var in data.get_vars():
    agg_data = data.prep_agg_data(var)
Remote Dataset#
# Access remote OPeNDAP dataset
data = UserCatData(
    source_ds="https://example.com/thredds/dodsC/dataset.nc",
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var="temperature",
    target_gdf=polygons_gdf,
    target_crs=3857,  # Web Mercator
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)
Projected Coordinates#
# Work with projected coordinate system
data = UserCatData(
    source_ds="projected_data.nc",
    source_crs=3857,  # Web Mercator
    source_x_coord="x",
    source_y_coord="y",
    source_t_coord="time",
    source_var=["var1", "var2"],
    target_gdf="regions.gpkg",
    target_crs=4326,
    target_id="region_id",
    source_time_period=["2020-06-01", "2020-08-31"]
)
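The kind of input checking described for UserCatData (coordinate names and variables must exist, source_var may be a string or a list, the time period has two elements) can be illustrated with plain Python. This is a sketch under stated assumptions, not the gdptools implementation: `normalize_vars`, `parse_time_period`, and `check_names` are hypothetical helpers, plain sets stand in for an xarray dataset, and only ISO-formatted strings are handled for the time period.

```python
from datetime import datetime


def normalize_vars(source_var):
    """Accept a single name or a list of names and always return a list."""
    return [source_var] if isinstance(source_var, str) else list(source_var)


def parse_time_period(period):
    """Parse a two-element [start, end] list of ISO date strings."""
    if len(period) != 2:
        raise ValueError("source_time_period must have exactly two elements")
    start, end = (datetime.fromisoformat(p) for p in period)
    if start > end:
        raise ValueError("start must not be after end")
    return start, end


def check_names(available_coords, available_vars, coords, variables):
    """Raise KeyError naming every coordinate or variable that is absent."""
    missing = [c for c in coords if c not in available_coords]
    missing += [v for v in variables if v not in available_vars]
    if missing:
        raise KeyError(f"not found in dataset: {missing}")


# Stand-ins for what a dataset would expose.
dataset_coords = {"lon", "lat", "time"}
dataset_vars = {"temperature", "precipitation"}

variables = normalize_vars("temperature")
check_names(dataset_coords, dataset_vars, ["lon", "lat", "time"], variables)
start, end = parse_time_period(["2020-06-01", "2020-08-31 12:00:00"])
print(variables, start, end)
```

Running such checks up front is one way to surface a misspelled coordinate name before any remote data is opened.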
NHGF STAC Catalog (NHGFStacData)#
The NHGF STAC catalog hosts datasets in two storage formats: Zarr (multi-dimensional time-series, e.g. CONUS404) and GeoTIFF (single raster per time step, e.g. NLCD land cover).
NHGFStacData is a factory class that auto-detects which format a STAC collection uses
and returns the appropriate concrete class:
| Collection format | Returned class | Example datasets |
|---|---|---|
| Zarr | NHGFStacZarrData | CONUS404, GridMET, PRISM |
| GeoTIFF | NHGFStacTiffData | NLCD land cover |
Most users should use the factory (NHGFStacData) and let auto-detection handle the rest.
The concrete classes can also be imported directly when you know the format in advance.
- class NHGFStacData(*, source_collection, source_var, target_gdf, target_id, source_time_period=None, asset_type=None, band=1, bname='band', **kwargs)[source]#
Bases: object
Factory for NHGF STAC catalog datasets.
Auto-detects whether the STAC collection is Zarr-backed or GeoTIFF-backed and returns the appropriate concrete class (NHGFStacZarrData or NHGFStacTiffData).
This preserves backward compatibility: existing code using NHGFStacData(source_collection=zarr_collection, ...) continues to work and receives an NHGFStacZarrData instance.
Examples
Zarr collection (auto-detected):
from gdptools import NHGFStacData
from gdptools.helpers import get_stac_collection
collection = get_stac_collection("conus404_daily")
data = NHGFStacData(
    source_collection=collection,
    source_var=["PWAT"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-01-31"],
)
# data is an NHGFStacZarrData instance
GeoTIFF collection (auto-detected):
collection = get_stac_collection("nlcd-LndCov")
data = NHGFStacData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)
# data is an NHGFStacTiffData instance
- static __new__(cls, *, source_collection, source_var, target_gdf, target_id, source_time_period=None, asset_type=None, band=1, bname='band', **kwargs)[source]#
Create an NHGF STAC data source with auto-detected format.
- Parameters:
source_collection – STAC collection object.
target_gdf (str | Path | GeoDataFrame) – Target geometries.
target_id (str) – Column name for unique identifiers.
source_time_period (list[str | Timestamp | datetime | None] | None) – Time period [start, end]. Required for Zarr; optional for GeoTIFF (selects which item to load).
asset_type (str | None) – Optional override: "zarr" or "tiff". If None, auto-detected from the collection.
band (int) – Band number for GeoTIFF (1-indexed). Ignored for Zarr.
bname (str) – Band dimension name for GeoTIFF. Ignored for Zarr.
**kwargs – Additional keyword arguments for backward compatibility with deprecated parameter names.
- Returns:
NHGFStacZarrData or NHGFStacTiffData instance.
- Return type:
NHGFStacZarrData | NHGFStacTiffData
Examples (NHGFStacData)#
Zarr collection (auto-detected)#
from gdptools import NHGFStacData
from gdptools.helpers import get_stac_collection
collection = get_stac_collection("conus404_daily")
data = NHGFStacData(
    source_collection=collection,
    source_var=["PWAT"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-01-31"],
)
# type(data) is NHGFStacZarrData
GeoTIFF collection (auto-detected)#
collection = get_stac_collection("nlcd-LndCov")
data = NHGFStacData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)
# type(data) is NHGFStacTiffData
Note
source_time_period is required for Zarr collections (continuous time-series must be
temporally subset). For GeoTIFF collections it is optional — when provided, it selects
which STAC item (year) to load.
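The factory behavior described in this section, where calling one class returns an instance of another, relies on Python's `__new__`. The sketch below shows the general pattern only and is not the gdptools source: `SketchFactory`, `SketchZarr`, `SketchTiff`, and `detect_asset_type` are hypothetical, and detecting Zarr by the presence of a `zarr-s3-osn` asset key (with everything else treated as GeoTIFF) is an assumption made for illustration.

```python
ZARR_ASSET_KEY = "zarr-s3-osn"  # asset name mentioned in this documentation


class SketchZarr:
    """Stand-in for a Zarr-backed source; a time period is mandatory."""

    def __init__(self, *, asset_keys, time_period):
        if time_period is None:
            raise ValueError("time_period is required for Zarr collections")
        self.asset_keys = asset_keys
        self.time_period = time_period


class SketchTiff:
    """Stand-in for a GeoTIFF-backed source; the time period is optional."""

    def __init__(self, *, asset_keys, time_period=None, band=1):
        self.asset_keys = asset_keys
        self.time_period = time_period
        self.band = band


def detect_asset_type(asset_keys):
    """Assumed rule: a Zarr asset key means Zarr, anything else means GeoTIFF."""
    return "zarr" if ZARR_ASSET_KEY in asset_keys else "tiff"


class SketchFactory:
    """Calling this class returns one of the concrete classes above."""

    def __new__(cls, *, asset_keys, time_period=None, asset_type=None, band=1):
        kind = asset_type or detect_asset_type(asset_keys)
        if kind == "zarr":
            return SketchZarr(asset_keys=asset_keys, time_period=time_period)
        if kind == "tiff":
            return SketchTiff(asset_keys=asset_keys, time_period=time_period, band=band)
        raise ValueError(f"unknown asset_type: {kind!r}")


zarr_data = SketchFactory(
    asset_keys=["zarr-s3-osn"], time_period=["2020-01-01", "2020-01-31"]
)
tiff_data = SketchFactory(asset_keys=["LndCov"])
print(type(zarr_data).__name__, type(tiff_data).__name__)
```

Because `__new__` returns an object that is not an instance of the factory class, Python skips the factory's own `__init__`, so the concrete class is fully responsible for initialization.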
Zarr Collections (NHGFStacZarrData)#
NHGFStacZarrData provides access to Zarr-backed datasets in the NHGF STAC catalog.
These are multi-dimensional datasets stored as cloud-optimized Zarr archives on S3, with
automatic spatiotemporal filtering and metadata handling.
- class NHGFStacZarrData(*, source_collection, source_var, target_gdf, target_id, source_time_period)[source]#
Bases: _NHGFStacBase
Interface for Zarr-backed NHGF STAC catalog datasets.
This class provides access to Zarr datasets through the NHGF STAC catalog, with automatic spatiotemporal filtering and metadata handling.
- source_collection#
STAC collection identifier for the dataset.
- source_var#
Variable name(s) to access from the collection.
- target_gdf#
GeoDataFrame containing target geometries.
- target_id#
Column name for unique identifiers in target geometries.
- source_time_period#
Time period for temporal filtering.
Examples
from gdptools import NHGFStacZarrData
from gdptools.helpers import get_stac_collection
collection = get_stac_collection("conus404_daily")
data = NHGFStacZarrData(
    source_collection=collection,
    source_var=["PWAT"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-01-31"],
)
- __init__(*, source_collection, source_var, target_gdf, target_id, source_time_period)[source]#
Initialize NHGFStacZarrData for Zarr-backed STAC collections.
- Parameters:
source_collection – STAC collection object with a zarr-s3-osn asset.
source_var (Union[str, list[str]]) – Variable name(s) to aggregate.
target_gdf (Union[str, Path, gpd.GeoDataFrame]) – Target geometries.
target_id (str) – Column name in target_gdf containing unique identifiers.
source_time_period (list[str]) – Two-element list [start, end] defining the temporal subset ('YYYY-MM-DD' or with time component).
- Raises:
KeyError – If target_id is not in target_gdf columns.
- get_source_subset(key)[source]#
Get a subset of the STAC data source for a specific variable.
- Parameters:
key (str) – Name of the variable to subset.
- Returns:
Subsetted dataarray.
- Return type:
xr.DataArray
- get_class_type()[source]#
Return the type of the data class.
Returns "NHGFStacData" for backward compatibility.
- prep_interp_data(key, poly_id)[source]#
Prep AggData from NHGFStacZarrData.
- Parameters:
key (str) – Name of the xarray gridded data variable.
poly_id (Union[str, int]) – ID number of the geodataframe geometry to clip the gridded data to.
- Returns:
An instance of the AggData class, ready for interpolation.
- Return type:
AggData
GeoTIFF Collections (NHGFStacTiffData)#
NHGFStacTiffData provides access to GeoTIFF-backed datasets in the NHGF STAC catalog
(e.g., NLCD land cover). Each STAC item contains one or more GeoTIFF assets representing
a single time step. The class handles remote S3 access, band selection, and spatial subsetting.
- class NHGFStacTiffData(*, source_collection, source_var, target_gdf, target_id, source_time_period=None, band=1, bname='band')[source]#
Bases: _NHGFStacBase
Interface for GeoTIFF-backed NHGF STAC catalog datasets (e.g., NLCD).
This class provides access to GeoTIFF datasets in the NHGF STAC catalog, where assets are stored as individual GeoTIFF files per time step (one STAC item per year). It handles remote S3 access, band selection, and spatial subsetting.
Currently supports single-item (single year) access. Multi-item stacking along a time axis is planned for a future release.
- source_collection#
STAC collection for the dataset.
- source_var#
Variable name(s) to access.
- target_gdf#
GeoDataFrame containing target geometries.
- target_id#
Column name for unique identifiers in target geometries.
- band#
Selected band number (1-indexed).
Examples
from gdptools import NHGFStacTiffData
from gdptools.helpers import get_stac_collection

collection = get_stac_collection("nlcd-LndCov")
data = NHGFStacTiffData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)
Initialize NHGFStacTiffData for GeoTIFF-backed STAC collections.
- Parameters:
source_collection – STAC collection whose items have GeoTIFF assets.
source_var (Union[str, list[str]]) – Variable name(s) for the raster data.
target_gdf (Union[str, Path, gpd.GeoDataFrame]) – Target geometries.
target_id (str) – Column name in target_gdf containing unique identifiers.
source_time_period (list[str | Timestamp | datetime | None] | None) – Optional two-element list [start, end] to select which STAC item to load by datetime. If None, the first item is used.
band (int) – Band number to select (1-indexed). Defaults to 1.
bname (str) – Name of the band dimension. Defaults to "band".
- Raises:
KeyError – If target_id is not in target_gdf columns.
ValueError – If no STAC items found or no matching item for the time period.
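The item-selection behaviour described above (first item when source_time_period is None, otherwise the item whose datetime falls in the range, and a ValueError when nothing matches) can be sketched as follows. Item and select_item are hypothetical stand-ins written for illustration; they are not the gdptools or pystac API, and this sketch handles string dates only.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    """Hypothetical stand-in for a STAC item with a single datetime."""

    id: str
    when: datetime


def select_item(items, time_period=None):
    """Pick one item, mirroring the documented selection rules."""
    if not items:
        raise ValueError("no STAC items found")
    if time_period is None:
        # No time filter: fall back to the first item.
        return items[0]
    start, end = (datetime.fromisoformat(t) for t in time_period)
    for item in items:
        if start <= item.when <= end:
            return item
    raise ValueError(f"no item between {start} and {end}")


items = [
    Item("nlcd-2019", datetime(2019, 1, 1)),
    Item("nlcd-2021", datetime(2021, 1, 1)),
]
print(select_item(items, ["2021-01-01", "2021-12-31"]).id)
```

Because only one item is returned, a range that spans several years still yields a single time step in this sketch, which matches the single-item limitation noted above.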
- get_source_subset(key)[source]#
Get a spatial subset of the source raster data.
- Parameters:
key (str) – Variable name (used for interface consistency).
- Returns:
Spatially subsetted DataArray.
- Return type:
xr.DataArray
- prep_interp_data(key, poly_id)[source]#
Prep AggData from NHGFStacTiffData for interpolation.
- Parameters:
key (str) – Name of the variable.
poly_id (Union[str, int]) – ID of the target geometry.
- Returns:
An instance ready for interpolation.
- Return type:
AggData
GeoTIFF Data (UserTiffData)#
The UserTiffData class specializes in handling GeoTIFF and raster data for zonal statistics and spatial analysis operations.
- class UserTiffData(source_ds, source_crs, source_x_coord, source_y_coord, target_gdf, target_id, bname='band', band=1, source_var='tiff')[source]#
Bases: UserData
Handler for GeoTIFF and other raster data sources.
This class is optimized for working with raster data sources such as GeoTIFF files, providing specialized functionality for zonal statistics and spatial analysis operations. It handles single and multi-band rasters with automatic band selection and coordinate system validation.
The class is particularly useful for processing elevation data, land cover classifications, and other raster datasets that require zonal statistics calculations over vector geometries.
- source_ds#
The source raster data as xarray DataArray or Dataset.
- target_gdf#
GeoDataFrame containing target geometries for zonal operations.
- target_id#
Column name for unique identifiers in target geometries.
- band#
Selected band number for multi-band rasters.
- source_var#
Variable name assigned to the raster data.
Examples
Basic elevation processing:
from gdptools.data.user_data import UserTiffData

# Process elevation data
data = UserTiffData(
    source_ds="elevation.tif",
    source_crs=4326,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf="watersheds.shp",
    target_id="huc12"
)

# Prepare for zonal statistics. prep_wght_data is not yet implemented for
# UserTiffData, so use prep_agg_data with the default source_var ("tiff").
agg_data = data.prep_agg_data(key="tiff")
Multi-band raster processing:
# Select specific band from multi-band raster
# (polygons_gdf is assumed to be an existing GeoDataFrame)
data = UserTiffData(
    source_ds="landcover.tif",
    source_crs=3857,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf=polygons_gdf,
    target_id="poly_id",
    band=3,  # Select band 3
    source_var="landcover_class"
)
In-memory raster data:
import rioxarray as rxr

raster = rxr.open_rasterio("slope.tif")
data = UserTiffData(
    source_ds=raster,
    source_crs=raster.rio.crs,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf="regions.gpkg",
    target_id="region_id"
)
Notes
This class automatically handles band selection and coordinate system validation for raster data. It’s optimized for zonal statistics workflows and integrates seamlessly with the ZonalGen classes.
Initialize UserTiffData for raster data processing.
- Parameters:
source_ds (str | DataArray | Dataset) – Raster data source as a file path, xarray DataArray, or Dataset.
source_crs (str | int | CRS) – Coordinate reference system of the raster data.
source_x_coord (str) – Name of the x-coordinate dimension in the raster.
source_y_coord (str) – Name of the y-coordinate dimension in the raster.
target_gdf (str | Path | GeoDataFrame) – Target geometries as a GeoDataFrame or file path.
target_id (str) – Column name in target_gdf for unique identifiers.
bname (str) – Name of the band dimension in multi-band rasters. Defaults to “band”.
band (int) – Band number to select from a multi-band raster (1-indexed). Defaults to 1.
source_var (str) – A name to assign to the raster data variable. Defaults to “tiff”.
- Raises:
FileNotFoundError – If source_ds file path does not exist.
KeyError – If target_id is not found in target_gdf columns.
ValueError – If source_crs is invalid or the band number is out of range.
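Because band is 1-indexed while Python sequences are 0-indexed, an off-by-one mistake is easy to make when inspecting a raster by hand. The sketch below shows the conversion with a range check; band_to_index is a hypothetical helper for illustration, and the way UserTiffData validates the band internally is not shown in this reference.

```python
def band_to_index(band, n_bands):
    """Convert a 1-indexed band number to a 0-indexed position.

    Hypothetical helper for illustration; not a gdptools function.
    """
    if not 1 <= band <= n_bands:
        raise ValueError(f"band {band} is out of range 1..{n_bands}")
    return band - 1


# A three-band raster represented as a plain list of band labels.
bands = ["red", "green", "blue"]
print(bands[band_to_index(3, len(bands))])
```

Passing band=0 or band=4 for this three-band example raises a ValueError, the same kind of error the class documents for an out-of-range band.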
- get_target_crs()[source]#
Get the coordinate reference system of the target geometries.
- Returns:
The coordinate reference system of the target vector data.
- Return type:
CRS
- get_source_subset(key)[source]#
Get a spatial subset of the source raster data.
This method subsets the source raster based on the buffered bounding box of the target geometries. The key argument is not used for this class but is required for interface consistency.
- Parameters:
key (str) – A placeholder argument for interface consistency.
- Returns:
A spatially subsetted xarray DataArray.
- Return type:
xr.DataArray
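The idea of subsetting by a buffered bounding box can be sketched with plain coordinate lists. This is only an illustration of the concept: buffered_bounds and within are hypothetical helpers, the buffer value is arbitrary, and the buffer size and selection method gdptools actually uses are not stated in this reference.

```python
def buffered_bounds(bounds, buffer):
    """Expand (minx, miny, maxx, maxy) by a buffer on every side."""
    minx, miny, maxx, maxy = bounds
    return (minx - buffer, miny - buffer, maxx + buffer, maxy + buffer)


def within(coords, lo, hi):
    """Keep coordinates inside [lo, hi], preserving their original order."""
    return [c for c in coords if lo <= c <= hi]


# Cell-center coordinates of a 10 x 10 grid; y is descending, as is
# common for north-up rasters.
x = [0.5 + i for i in range(10)]
y = [9.5 - i for i in range(10)]

# Bounding box of the target geometries (assumed, in the raster's CRS).
target_bounds = (2.2, 3.1, 4.8, 5.9)
minx, miny, maxx, maxy = buffered_bounds(target_bounds, buffer=1.0)

x_sub = within(x, minx, maxx)
y_sub = within(y, miny, maxy)
print(x_sub)
print(y_sub)
```

The buffer keeps the cells that straddle the edge of the target geometries, and filtering by value rather than by slice order works regardless of whether y ascends or descends.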
- get_vars()[source]#
Get the list of available variables.
For UserTiffData, this is typically a single variable name assigned during initialization.
- prep_wght_data()[source]#
Prepare data for weight generation.
Notes
This method is not yet implemented for UserTiffData. Zonal statistics for rasters are handled by prep_agg_data.
- prep_interp_data(key, poly_id)[source]#
Prepare data for interpolation operations.
This method subsets the source raster data to the bounding box of a specific target geometry and prepares an AggData object for interpolation.
- prep_agg_data(key)[source]#
Prepare data for aggregation or zonal statistics.
This method subsets the source raster data to the buffered bounding box of the target geometries and prepares an AggData object.
- Parameters:
key (str) – The variable name to prepare for aggregation.
- Returns:
An instance ready for aggregation.
- Return type:
AggData
- Raises:
ValueError – If subsetting the raster results in an empty dataset, which can indicate a CRS mismatch or no spatial overlap.
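When an empty subset is reported, a quick way to tell a CRS mismatch from a genuine lack of overlap is to compare the magnitude of the two sets of bounds. The sketch below is a rough heuristic only: looks_geographic and diagnose_empty_subset are hypothetical helpers, and small projected coordinates can also fall inside the degree range, so treat the result as a hint rather than a verdict.

```python
def looks_geographic(bounds):
    """Return True if (minx, miny, maxx, maxy) fits a lon/lat degree range.

    Longitudes up to 360 are allowed to cover 0-360 grids.
    """
    minx, miny, maxx, maxy = bounds
    return -180 <= minx <= maxx <= 360 and -90 <= miny <= maxy <= 90


def diagnose_empty_subset(source_bounds, target_bounds):
    """Give a hint about why a spatial subset came back empty."""
    if looks_geographic(source_bounds) != looks_geographic(target_bounds):
        return "possible CRS mismatch"
    return "units look consistent; check for no spatial overlap"


# Source in degrees, target in metres (for example a UTM zone).
print(diagnose_empty_subset(
    (-125.0, 24.0, -66.0, 50.0),
    (500000.0, 4000000.0, 600000.0, 4100000.0),
))
```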
Usage Examples#
Single Band Raster#
from gdptools.data.user_data import UserTiffData
from gdptools.zonal_gen import ZonalGen, WeightedZonalGen

# Process elevation data
data = UserTiffData(
    source_ds="elevation.tif",
    source_crs=4326,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf="watersheds.shp",
    target_id="huc12",
    bname="band",  # name of the band dimension (if present)
    band=1,  # 1-indexed band selection for multi-band rasters
)

# Unweighted zonal statistics (continuous)
zonal = ZonalGen(
    user_data=data,
    zonal_engine="serial",
    zonal_writer="csv",
    out_path="./out",
    file_prefix="elev_zonal",
)
zonal_stats = zonal.calculate_zonal(categorical=False)

# Area-weighted zonal statistics (recommended for geographic CRS)
weighted = WeightedZonalGen(
    user_data=data,
    weight_gen_crs=6931,  # Equal-area CRS recommended
    zonal_engine="parallel",
    zonal_writer="csv",
    out_path="./out",
    file_prefix="elev_weighted",
    jobs=4,  # start modestly; each worker loads the source raster
)
weighted_stats = weighted.calculate_weighted_zonal(categorical=False)
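The reason area weighting matters for a geographic CRS is that cells of equal size in degrees cover less ground at higher latitudes. The sketch below illustrates the effect with a cos(latitude) weight, which is a common approximation for regular lon/lat grids. It is not how WeightedZonalGen computes its weights (this reference only says an equal-area CRS is recommended), and the values and latitudes are made up for illustration.

```python
import math


def unweighted_mean(values):
    """Plain arithmetic mean, treating every cell as equal in area."""
    return sum(values) / len(values)


def area_weighted_mean(values, lats_deg):
    """Mean weighted by cos(latitude), an approximation of cell area."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Three cells with made-up values at increasing latitude.
values = [100.0, 200.0, 300.0]
lats = [0.0, 45.0, 60.0]

print(round(unweighted_mean(values), 1))
print(round(area_weighted_mean(values, lats), 1))
```

In this example the high-latitude cell has the largest value but the smallest area, so the weighted mean comes out lower than the plain mean.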
Best Practices#
Data Validation#
Always validate inputs: Check that coordinate names exist in datasets
Verify CRS compatibility: Ensure source and target CRS are properly specified
Test with subsets: Validate workflows with small spatial/temporal subsets first
Performance Optimization#
Spatial subsetting: Use target geometry bounds to minimize data loading
Temporal filtering: Specify precise time ranges to reduce memory usage
Coordinate systems: Use appropriate projections for your study area
Engine selection: Choose computational engines based on dataset size and available resources
Memory management: Monitor and optimize memory usage during processing
Troubleshooting#
Common Issues#
KeyError: Coordinate not found#
# Check available coordinates
import xarray as xr

ds = xr.open_dataset("data.nc")
print(f"Available coordinates: {list(ds.coords.keys())}")
print(f"Available variables: {list(ds.data_vars.keys())}")
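Once the available names are known, matching them against common spellings can save a round of trial and error. The sketch below does a case-insensitive lookup over a list of names; find_coord and the candidate lists are hypothetical and written for illustration, and gdptools itself expects the exact coordinate names to be passed in.

```python
def find_coord(available, candidates):
    """Return the first available name matching a candidate, ignoring case.

    Hypothetical helper for illustration; not a gdptools function.
    """
    lookup = {name.lower(): name for name in available}
    for cand in candidates:
        if cand.lower() in lookup:
            return lookup[cand.lower()]
    return None


# Names as they might appear in list(ds.coords.keys()).
available = ["time", "Lat", "lon"]

print(find_coord(available, ["x", "lon", "longitude"]))
print(find_coord(available, ["y", "lat", "latitude"]))
```

The returned value keeps the dataset's original spelling, which is what should be passed as the x or y coordinate name.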
ValueError: Invalid CRS#
# Verify CRS specification
from pyproj import CRS

try:
    crs = CRS.from_user_input(your_crs)
    print(f"Valid CRS: {crs}")
except Exception as e:
    print(f"Invalid CRS: {e}")
No spatial intersection#
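# Check geometry bounds (compare them only when both are in the same CRS)
import rioxarray  # registers the .rio accessor on xarray objects

print(f"Target bounds: {target_gdf.total_bounds}")
print(f"Source bounds: {source_ds.rio.bounds()}")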
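The two sets of bounds printed above are both ordered (minx, miny, maxx, maxy), so they can be compared directly once they are expressed in the same CRS. The following is a minimal sketch of that comparison; bboxes_overlap is a hypothetical helper, and the example bounds are made up.

```python
def bboxes_overlap(a, b):
    """Return True if two (minx, miny, maxx, maxy) boxes intersect.

    Both boxes must be in the same CRS for the result to be meaningful.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Made-up bounds in degrees.
target_bounds = (-105.5, 39.0, -104.5, 40.0)
source_bounds = (-125.0, 24.0, -66.0, 50.0)
far_away = (10.0, 40.0, 12.0, 42.0)

print(bboxes_overlap(target_bounds, source_bounds))
print(bboxes_overlap(target_bounds, far_away))
```

A False result with bounds of very different magnitude (degrees versus metres) usually points to a CRS problem rather than a true lack of overlap.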
Note
All data classes automatically handle coordinate reference system validation and spatial intersection checking. If spatial overlap is not detected, the classes will raise informative errors with suggestions for resolution.
Warning
When working with large datasets, be mindful of memory usage. Consider using spatial and temporal subsetting to reduce data volume before processing.
See Also#
Weight Generation Classes: Weight generation using user data classes
Aggregation Classes: Aggregation methods for processed data
Zonal Statistics Generation: Zonal statistics for raster data