# User Data Classes

(abstract-base-class-userdata)=

(climate-r-catalog-climrcatdata)=

(custom-datasets-usercatdata)=

(nhgf-stac-catalog-nhgfstacdata)=

(nhgf-stac-zarr-nhgfstaczarrdata)=

(nhgf-stac-tiff-nhgfstactiffdata)=

(geotiff-data-usertiffdata)=

The user data classes form the foundation of the `gdptools` package, providing standardized interfaces for accessing and processing geospatial data from diverse sources. These classes abstract the complexity of different data formats, coordinate systems, and access methods, presenting a unified API for spatial analysis workflows.

`gdptools` provides specialized data classes for common geospatial data sources:

- **[`UserData`](#abstract-base-class-userdata)**: Abstract base class defining the common interface
- **[`ClimRCatData`](#climate-r-catalog-climrcatdata)**: Climate-R catalog integration with automatic metadata handling
- **[`UserCatData`](#custom-datasets-usercatdata)**: User-provided xarray datasets with flexible configuration
- **[`NHGFStacData`](#nhgf-stac-catalog-nhgfstacdata)**: NHGF STAC catalog factory that auto-detects Zarr vs. GeoTIFF collections
  - **[`NHGFStacZarrData`](#nhgf-stac-zarr-nhgfstaczarrdata)**: Zarr-backed STAC collections (e.g., CONUS404)
  - **[`NHGFStacTiffData`](#nhgf-stac-tiff-nhgfstactiffdata)**: GeoTIFF-backed STAC collections (e.g., NLCD)
- **[`UserTiffData`](#geotiff-data-usertiffdata)**: GeoTIFF and raster data processing with zonal statistics

Every class validates its inputs, handles coordinate reference systems, and supports spatial and temporal subsetting.

```{contents}
:local:
:depth: 2
```

---

## Key Features

### Unified Interface

- **Consistent API**: All data classes implement the same abstract methods
- **Automatic validation**: CRS validation, coordinate name checking, and data compatibility
- **Error handling**: Comprehensive error messages with suggested fixes

### Data Source Support

- **Catalog integration**: Seamless access to Climate-R and NHGF STAC catalogs
- **File formats**: NetCDF, Zarr, GeoTIFF, and other xarray-compatible formats
- **Web services**: OPeNDAP, THREDDS, and other network data sources

### Spatial Operations

- **Automatic reprojection**: Handles CRS transformations between source and target data
- **Intersection validation**: Ensures spatial overlap between datasets and geometries
- **Efficient subsetting**: Spatially subset data to minimize memory usage and processing time

### Temporal Handling

- **Flexible time formats**: Support for various datetime formats and time coordinate names
- **Period subsetting**: Automatic temporal slicing based on user-specified ranges
- **Calendar handling**: Proper handling of different calendar systems

---

## Abstract Base Class (`UserData`)

The `UserData` class defines the common interface that all data source classes must implement. It establishes the contract for data preparation, spatial subsetting, and aggregation operations.
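To illustrate what "defining a common interface" means in practice, here is a minimal sketch of the abstract-base-class pattern using Python's `abc` module. It is not the `gdptools` source: `DataSourceSketch` and `InMemorySource` are hypothetical names, and only the method names `get_vars` and `get_source_subset` are borrowed from the examples on this page. The real signatures and return types are given in the API reference below.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DataSourceSketch(ABC):
    """Toy stand-in for an abstract data-source interface (not gdptools code)."""

    @abstractmethod
    def get_vars(self) -> list[str]:
        """Return the names of the variables this source exposes."""

    @abstractmethod
    def get_source_subset(self, key: str) -> dict:
        """Return the subset of source data for one variable."""


class InMemorySource(DataSourceSketch):
    """Hypothetical concrete subclass backed by a plain dictionary."""

    def __init__(self, data: dict[str, list[float]]) -> None:
        self._data = data

    def get_vars(self) -> list[str]:
        return list(self._data)

    def get_source_subset(self, key: str) -> dict:
        return {key: self._data[key]}


source = InMemorySource({"tmmn": [271.3, 272.1], "pr": [0.0, 4.2]})
for var in source.get_vars():
    print(var, source.get_source_subset(var))

# The abstract class itself cannot be instantiated, which is what
# forces every concrete data class to implement the full interface.
try:
    DataSourceSketch()
except TypeError as err:
    print(f"Cannot instantiate: {err}")
```

Code written against the abstract interface works with any subclass, which is why downstream tools can accept any of the data classes on this page.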

```{eval-rst}
.. autoclass:: gdptools.data.user_data.UserData
    :members:
    :special-members: __init__
    :show-inheritance:
```

---

## Climate-R Catalog (`ClimRCatData`)

The `ClimRCatData` class provides seamless integration with the Climate-R catalog system, offering access to a wide range of climate datasets with automatic metadata handling.

```{eval-rst}
.. autoclass:: gdptools.data.user_data.ClimRCatData
    :members:
    :special-members: __init__
    :show-inheritance:
```

### Examples (ClimRCatData)

#### Basic Climate-R Usage

```python
import pandas as pd
from gdptools.data.user_data import ClimRCatData

# Load Climate-R catalog
cat_url = "https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet"
cat = pd.read_parquet(cat_url)

# Query for GridMET temperature data
temp_params = cat.query("id == 'gridmet' & variable == 'tmmn'").to_dict("records")[0]
source_cat_dict = {"tmmn": temp_params}

# Initialize data source
data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-12-31"]
)

# Access data
variables = data.get_vars()
subset = data.get_source_subset("tmmn")
```

#### Multiple Variables

```python
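# Continues from the previous example: `cat` is the catalog DataFrame loaded there,
# and `watersheds_gdf` is a GeoDataFrame of target polygons you have already loaded.
# gridMET climate variables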
gridmet_vars = ["tmmx", "tmmn", "pr"]
gridmet_params = [
    cat.query("id == 'gridmet' & variable == @var").to_dict("records")[0]
    for var in gridmet_vars
]
source_cat_dict = dict(zip(gridmet_vars, gridmet_params))

data = ClimRCatData(
    source_cat_dict=source_cat_dict,
    target_gdf=watersheds_gdf,
    target_id="watershed_id",
    source_time_period=["2015-01-01", "2021-12-31"]
)

# Process all variables
for var in data.get_vars():
    agg_data = data.prep_agg_data(var)
    print(f"Prepared {var} for aggregation")
```

---

## Custom Datasets (`UserCatData`)

The `UserCatData` class handles user-provided xarray datasets with full control over coordinate names, variable selection, and coordinate reference systems.

```{eval-rst}
.. autoclass:: gdptools.data.user_data.UserCatData
    :members:
    :special-members: __init__
    :show-inheritance:
```

### Examples (UserCatData)

#### Local NetCDF File

```python
import xarray as xr
from gdptools.data.user_data import UserCatData

# Load local dataset
ds = xr.open_dataset("climate_data.nc")

data = UserCatData(
    source_ds=ds,
    source_crs=4326,
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["temperature", "precipitation"],
    target_gdf="regions.shp",
    target_crs=4326,
    target_id="region_id",
    source_time_period=["2020-01-01", "2020-12-31"]
)

# Prepare for aggregation
for var in data.get_vars():
    agg_data = data.prep_agg_data(var)
```

#### Remote Dataset

```python
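# Access a remote OPeNDAP dataset (the URL below is a placeholder).
# `polygons_gdf` is a GeoDataFrame of target polygons you have already loaded.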
data = UserCatData(
    source_ds="https://example.com/thredds/dodsC/dataset.nc",
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var="temperature",
    target_gdf=polygons_gdf,
    target_crs=3857,  # Web Mercator
    target_id="poly_id",
    source_time_period=["2019-01-01", "2021-12-31"]
)
```

#### Projected Coordinates

```python
# Work with projected coordinate system
data = UserCatData(
    source_ds="projected_data.nc",
    source_crs=3857,  # Web Mercator
    source_x_coord="x",
    source_y_coord="y",
    source_t_coord="time",
    source_var=["var1", "var2"],
    target_gdf="regions.gpkg",
    target_crs=4326,
    target_id="region_id",
    source_time_period=["2020-06-01", "2020-08-31"]
)
```

---

## NHGF STAC Catalog (`NHGFStacData`)

The NHGF STAC catalog hosts datasets in two storage formats: **Zarr** (multi-dimensional
time-series, e.g. CONUS404) and **GeoTIFF** (single raster per time step, e.g. NLCD land cover).

`NHGFStacData` is a **factory class** that auto-detects which format a STAC collection uses
and returns the appropriate concrete class:

| Collection format | Returned class     | Example datasets         |
| :---------------- | :----------------- | :----------------------- |
| Zarr              | `NHGFStacZarrData` | CONUS404, GridMET, PRISM |
| GeoTIFF           | `NHGFStacTiffData` | NLCD land cover          |

Most users should use the factory (`NHGFStacData`) and let auto-detection handle the rest.
The concrete classes can also be imported directly when you know the format in advance.
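The factory behaviour described above can be sketched with a class whose `__new__` returns an instance of a different class. This is an illustration of the pattern only, not the `gdptools` implementation: `StacSource`, `ZarrSource`, and `TiffSource` are hypothetical names, and the `"format"` key used for detection is an assumption made for the sketch. How `NHGFStacData` actually inspects a STAC collection is not described here.

```python
from __future__ import annotations


class ZarrSource:
    """Toy class for Zarr-backed collections (hypothetical)."""

    def __init__(self, collection: dict) -> None:
        self.collection = collection


class TiffSource:
    """Toy class for GeoTIFF-backed collections (hypothetical)."""

    def __init__(self, collection: dict) -> None:
        self.collection = collection


class StacSource:
    """Factory: calling StacSource(...) returns one of the classes above."""

    def __new__(cls, collection: dict):
        # ASSUMPTION: the storage format is read from a made-up "format" key.
        fmt = collection.get("format")
        if fmt == "zarr":
            return ZarrSource(collection)
        if fmt == "geotiff":
            return TiffSource(collection)
        raise ValueError(f"Unsupported collection format: {fmt!r}")


zarr_data = StacSource({"id": "example_zarr", "format": "zarr"})
tiff_data = StacSource({"id": "example_tiff", "format": "geotiff"})
print(type(zarr_data).__name__)  # ZarrSource
print(type(tiff_data).__name__)  # TiffSource
```

Because `__new__` returns an object that is not an instance of the factory class, Python does not call the factory's own `__init__`; the caller simply receives the concrete object.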

```{eval-rst}
.. autoclass:: gdptools.data.user_data.NHGFStacData
    :members:
    :special-members: __new__
    :show-inheritance:
```

### Examples (NHGFStacData)

#### Zarr collection (auto-detected)

```python
from gdptools import NHGFStacData
from gdptools.helpers import get_stac_collection

collection = get_stac_collection("conus404_daily")
data = NHGFStacData(
    source_collection=collection,
    source_var=["PWAT"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2020-01-01", "2020-01-31"],
)
# type(data) is NHGFStacZarrData
```

#### GeoTIFF collection (auto-detected)

```python
from gdptools import NHGFStacData
from gdptools.helpers import get_stac_collection

collection = get_stac_collection("nlcd-LndCov")
data = NHGFStacData(
    source_collection=collection,
    source_var=["LndCov"],
    target_gdf="watersheds.shp",
    target_id="huc12",
    source_time_period=["2021-01-01", "2021-12-31"],
)
# type(data) is NHGFStacTiffData
```

```{note}
`source_time_period` is **required** for Zarr collections, because a continuous time series
must be temporally subset. For GeoTIFF collections it is optional; when provided, it selects
which STAC item (year) to load.
```
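
Since `source_time_period` controls the temporal subset, it can be worth checking the pair of dates before constructing a data class. The helper below is a minimal sketch, assuming the period is given as two ISO `YYYY-MM-DD` strings as in the examples above. `check_time_period` is a hypothetical function, not part of `gdptools`, and the library may accept other date formats.

```python
from __future__ import annotations

from datetime import date


def check_time_period(period: list[str]) -> tuple[date, date]:
    """Parse a [start, end] pair of ISO dates and check their order."""
    if len(period) != 2:
        raise ValueError("Expected exactly two dates: [start, end]")
    # date.fromisoformat raises ValueError for malformed strings
    start, end = (date.fromisoformat(value) for value in period)
    if start > end:
        raise ValueError(f"Start {start} is after end {end}")
    return start, end


start, end = check_time_period(["2020-01-01", "2020-01-31"])
print(f"Period covers {(end - start).days + 1} days (inclusive)")
```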

---

### Zarr Collections (`NHGFStacZarrData`)

`NHGFStacZarrData` provides access to Zarr-backed datasets in the NHGF STAC catalog.
These are multi-dimensional datasets stored as cloud-optimized Zarr archives on S3, with
automatic spatiotemporal filtering and metadata handling.

```{eval-rst}
.. autoclass:: gdptools.data.user_data.NHGFStacZarrData
    :members:
    :special-members: __init__
    :show-inheritance:
```

---

### GeoTIFF Collections (`NHGFStacTiffData`)

`NHGFStacTiffData` provides access to GeoTIFF-backed datasets in the NHGF STAC catalog
(e.g., NLCD land cover). Each STAC item contains one or more GeoTIFF assets representing
a single time step. The class handles remote S3 access, band selection, and spatial subsetting.

```{eval-rst}
.. autoclass:: gdptools.data.user_data.NHGFStacTiffData
    :members:
    :special-members: __init__
    :show-inheritance:
```

---

## GeoTIFF Data (`UserTiffData`)

The `UserTiffData` class specializes in handling GeoTIFF and raster data for zonal statistics and spatial analysis operations.

```{eval-rst}
.. autoclass:: gdptools.data.user_data.UserTiffData
    :members:
    :special-members: __init__
    :show-inheritance:
```

### Examples (UserTiffData)

#### Single Band Raster

```python
from gdptools.data.user_data import UserTiffData
from gdptools.zonal_gen import ZonalGen, WeightedZonalGen

# Process elevation data
data = UserTiffData(
    source_ds="elevation.tif",
    source_crs=4326,
    source_x_coord="x",
    source_y_coord="y",
    target_gdf="watersheds.shp",
    target_id="huc12",
    bname="band",   # name of the band dimension (if present)
    band=1,          # 1-indexed band selection for multi-band rasters
)

# Unweighted zonal statistics (continuous)
zonal = ZonalGen(
    user_data=data,
    zonal_engine="serial",
    zonal_writer="csv",
    out_path="./out",
    file_prefix="elev_zonal",
)
zonal_stats = zonal.calculate_zonal(categorical=False)

# Area-weighted zonal statistics (recommended for geographic CRS)
weighted = WeightedZonalGen(
    user_data=data,
    weight_gen_crs=6931,  # Equal-area CRS recommended
    zonal_engine="parallel",
    zonal_writer="csv",
    out_path="./out",
    file_prefix="elev_weighted",
    jobs=4,  # start modestly; each worker loads the source raster
)
weighted_stats = weighted.calculate_weighted_zonal(categorical=False)
```

---

## Best Practices

### Data Validation

- **Always validate inputs**: Check that coordinate names exist in datasets
- **Verify CRS compatibility**: Ensure source and target CRS are properly specified
- **Test with subsets**: Validate workflows with small spatial/temporal subsets first
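
The first check in the list above can be scripted before a data class is constructed. This is a minimal sketch: `missing_names` is a hypothetical helper, not part of `gdptools`, and it works on plain name lists so that it does not depend on any particular dataset library.

```python
def missing_names(available, required):
    """Return the required names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Names as they might appear in a dataset (illustrative values)
dataset_coords = ["time", "lat", "lon"]

# Names you intend to pass as source_x_coord, source_y_coord, source_t_coord
requested = ["longitude", "latitude", "time"]

missing = missing_names(dataset_coords, requested)
if missing:
    print(f"Not found in dataset: {missing}")
```

With an xarray dataset `ds`, the available names would typically come from `list(ds.coords)` and `list(ds.data_vars)`.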

### Performance Optimization

- **Spatial subsetting**: Use target geometry bounds to minimize data loading
- **Temporal filtering**: Specify precise time ranges to reduce memory usage
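- **Coordinate systems**: Use a projection suited to your study area; an equal-area CRS is recommended when computing area weights
- **Engine selection**: Choose the engine to match dataset size and available resources; for parallel zonal statistics, start with a modest `jobs` value because each worker loads the source raster
- **Memory management**: Monitor memory during processing, and reduce the spatial or temporal extent if usage grows too large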

---

## Troubleshooting

### Common Issues

#### KeyError: Coordinate not found

```python
# Check available coordinates
import xarray as xr

ds = xr.open_dataset("data.nc")
print(f"Available coordinates: {list(ds.coords.keys())}")
print(f"Available variables: {list(ds.data_vars.keys())}")
```

#### ValueError: Invalid CRS

```python
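# Verify CRS specification
from pyproj import CRS
from pyproj.exceptions import CRSError

your_crs = "EPSG:4326"  # replace with the CRS value you pass to the data class
try:
    crs = CRS.from_user_input(your_crs)
    print(f"Valid CRS: {crs}")
except CRSError as e:
    print(f"Invalid CRS: {e}")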
```

#### No spatial intersection

```python
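# Check geometry bounds (compare them in the same CRS).
# `target_gdf` is your GeoDataFrame and `source_ds` your xarray dataset.
import rioxarray  # noqa: F401  (registers the .rio accessor on xarray objects)

print(f"Target bounds: {target_gdf.total_bounds}")
print(f"Source bounds: {source_ds.rio.bounds()}")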
```
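
Once both sets of bounds are known, a quick bounding-box test shows whether an intersection is possible at all. The sketch below assumes both bounds are `(minx, miny, maxx, maxy)` tuples expressed in the same CRS. `bounds_overlap` is a hypothetical helper, not part of `gdptools`, and the coordinate values are illustrative.

```python
def bounds_overlap(a, b):
    """Return True if two (minx, miny, maxx, maxy) boxes intersect or touch."""
    a_minx, a_miny, a_maxx, a_maxy = a
    b_minx, b_miny, b_maxx, b_maxy = b
    return (
        a_minx <= b_maxx
        and b_minx <= a_maxx
        and a_miny <= b_maxy
        and b_miny <= a_maxy
    )


source_bounds = (-125.0, 24.0, -66.0, 50.0)  # degrees, roughly CONUS
target_degrees = (-105.5, 39.5, -104.5, 40.5)  # degrees, same CRS as source
target_metres = (-11750000.0, 4800000.0, -11600000.0, 4900000.0)  # projected

print(bounds_overlap(target_degrees, source_bounds))  # True
print(bounds_overlap(target_metres, source_bounds))   # False: CRS mismatch
```

A `False` result with bounds of very different magnitudes, as in the second call, usually points to a CRS mismatch rather than a genuinely disjoint study area.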

```{note}
All data classes automatically handle coordinate reference system validation and spatial intersection checking. If spatial overlap is not detected, the classes will raise informative errors with suggestions for resolution.
```

```{warning}
When working with large datasets, be mindful of memory usage. Consider using spatial and temporal subsetting to reduce data volume before processing.
```

---

## See Also

- {doc}`weight_gen_classes`: Weight generation using user data classes
- {doc}`agg_gen_classes`: Aggregation methods for processed data
- {doc}`zonal_gen_classes`: Zonal statistics for raster data
