Aggregation Classes#

The aggregation module provides classes for performing area-weighted statistics on gridded data. These classes transform raw gridded datasets into meaningful statistics aggregated over polygon geometries or interpolated along polyline geometries.

gdptools offers two main aggregation approaches:

AggGen: Area-weighted aggregation for polygon geometries using precomputed or dynamically calculated weights
InterpGen: Point-based interpolation and statistics along polyline geometries

Both classes support multiple processing engines (serial, parallel, and Dask) and various output formats for flexible deployment in different computational environments.

Key Features #

Statistical Methods #

Basic statistics: mean, median, min, max, sum, count, standard deviation
Masked statistics: Versions that handle nodata values appropriately
Weighted calculations: Area-weighted statistics for accurate spatial aggregation
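To make the difference between plain and masked weighted statistics concrete, here is a minimal pure-Python sketch. It illustrates the general technique only and is not gdptools' internal implementation; the function names and the assumption that an unmasked statistic lets nodata (NaN) propagate are illustrative.

```python
import math


def weighted_mean(values, weights):
    """Area-weighted mean; a NaN in values propagates to the result."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def masked_weighted_mean(values, weights):
    """Area-weighted mean over valid cells only, with weights renormalized."""
    pairs = [(v, w) for v, w in zip(values, weights) if not math.isnan(v)]
    if not pairs:
        return math.nan  # no valid cells overlap the polygon
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)


# Three grid cells overlap one polygon; the third cell is nodata.
cell_values = [10.0, 20.0, math.nan]
cell_weights = [0.5, 0.25, 0.25]

plain = weighted_mean(cell_values, cell_weights)          # NaN
masked = masked_weighted_mean(cell_values, cell_weights)  # uses the two valid cells
```

In this sketch the masked variant divides by the sum of the surviving weights, so a polygon that is partly covered by nodata still gets a value.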

Processing Engines #

Serial: Sequential processing for smaller datasets or debugging
Parallel: Multi-core processing using joblib for improved performance
Dask: Distributed computing for large datasets or cluster environments
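The engines differ in how work is scheduled, not in what is computed. The sketch below shows that idea with the standard library's `concurrent.futures` as a stand-in for joblib (a deliberate substitution to keep the example dependency-free); `polygon_mean` and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def polygon_mean(cells):
    """Weighted mean for one polygon given (values, weights)."""
    values, weights = cells
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical per-polygon inputs: grid values and their area weights.
polygons = {
    "A": ([1.0, 3.0], [0.5, 0.5]),
    "B": ([2.0, 4.0, 6.0], [0.2, 0.3, 0.5]),
    "C": ([10.0], [1.0]),
}

# Serial: one polygon after another.
serial = {pid: polygon_mean(cells) for pid, cells in polygons.items()}

# Parallel: the same function mapped over a worker pool; map() preserves order.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = dict(zip(polygons, pool.map(polygon_mean, polygons.values())))
```

Because each polygon is independent, both paths should produce identical results, which makes the serial engine a convenient reference when debugging a parallel run.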

Output Formats #

CSV: Tabular data with statistics per polygon/time
Parquet: Efficient columnar storage for large datasets
NetCDF: CF-compliant format for scientific data interchange
JSON: Structured data for web applications and APIs
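As a rough picture of what "statistics per polygon/time" looks like in the tabular and structured formats, the following sketch serializes the same records to CSV and JSON with the standard library. The column names (`polygon_id`, `time`, `mean`) are assumptions for illustration and may not match the layout gdptools actually writes.

```python
import csv
import io
import json

# Hypothetical per-polygon, per-time statistics; column names are illustrative.
rows = [
    {"polygon_id": "basin_1", "time": "2020-01-01", "mean": 1.5},
    {"polygon_id": "basin_2", "time": "2020-01-01", "mean": 2.25},
]

# Tabular form: one row per polygon/time pair.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["polygon_id", "time", "mean"])
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# Structured form: the same records as a JSON array.
json_text = json.dumps(rows, indent=2)
```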

Grid-to-Polygon Aggregation (`AggGen`)#

The AggGen class performs area-weighted aggregation of gridded data over polygon geometries. It’s designed for climate data analysis, hydrological modeling, and other applications requiring spatially aggregated statistics.

class AggGen(user_data, stat_method, agg_engine, agg_writer, weights, out_path=None, file_prefix=None, append_date=False, precision=None, jobs=-1)[source]

Bases: object

Performs grid-to-polygon aggregation using area-weighted statistics.

This class provides functionality to aggregate gridded data over polygon geometries using various statistical methods and processing engines.

Parameters:

user_data (UserData) – Input data for aggregation (e.g., UserCatData).
stat_method (Literal['masked_mean', 'mean', 'masked_std', 'std', 'masked_median', 'median', 'masked_count', 'count', 'masked_sum', 'sum', 'masked_min', 'min', 'masked_max', 'max']) – Statistical method to apply for aggregation.
agg_engine (Literal['serial', 'parallel', 'dask']) – Aggregation engine to use for processing.
agg_writer (Literal['none', 'csv', 'parquet', 'netcdf', 'json']) – Output writer format for results.
weights (str | DataFrame) – Path to CSV file or DataFrame containing area weights.
out_path (str | None) – Directory path for output files. Required if agg_writer is not ‘none’.
file_prefix (str | None) – Prefix for output file names. Required if agg_writer is not ‘none’.
append_date (bool) – Whether to append current date to output file names.
precision (int | None) – Number of decimal places for output data rounding.
jobs (int | None) – Number of processors for parallel or dask engines. -1 uses all available.

agg_data: Dictionary mapping variable names to AggData instances after processing.

Raises:

ValueError – If agg_writer is not ‘none’ but out_path or file_prefix is missing.
TypeError – If stat_method, agg_engine, or agg_writer is invalid.

Examples

Basic aggregation with CSV output:

>>> agg = AggGen(
...     user_data=my_data,
...     stat_method="mean",
...     agg_engine="serial",
...     agg_writer="csv",
...     weights="weights.csv",
...     out_path="/output",
...     file_prefix="results"
... )
>>> gdf, dataset = agg.calculate_agg()

Parallel processing with NetCDF output:

>>> agg = AggGen(
...     user_data=my_data,
...     stat_method="masked_mean",
...     agg_engine="parallel",
...     agg_writer="netcdf",
...     weights=weights_df,
...     out_path="/output",
...     file_prefix="climate_data",
...     jobs=4
... )
>>> gdf, dataset = agg.calculate_agg()

__init__(user_data, stat_method, agg_engine, agg_writer, weights, out_path=None, file_prefix=None, append_date=False, precision=None, jobs=-1)[source]

Initialize the AggGen class with configuration parameters.

Sets up the aggregation system by configuring the statistical method, processing engine, and output writer based on the provided parameters.

Parameters:

user_data (UserData) – Input data container with source data and target geometries.
stat_method (Literal['masked_mean', 'mean', 'masked_std', 'std', 'masked_median', 'median', 'masked_count', 'count', 'masked_sum', 'sum', 'masked_min', 'min', 'masked_max', 'max']) – Statistical method for aggregation (e.g., ‘mean’, ‘masked_mean’).
agg_engine (Literal['serial', 'parallel', 'dask']) – Processing engine (‘serial’, ‘parallel’, or ‘dask’). (‘dask’ is deprecated, removal in 0.4.0)
agg_writer (Literal['none', 'csv', 'parquet', 'netcdf', 'json']) – Output format (‘none’, ‘csv’, ‘parquet’, ‘netcdf’, ‘json’).
weights (str | DataFrame) – Path to weights CSV file or DataFrame with precomputed weights.
out_path (str | None) – Output directory path. Required if agg_writer is not ‘none’.
file_prefix (str | None) – Prefix for output file names. Required if agg_writer is not ‘none’.
append_date (bool) – If True, append current date to output filenames.
precision (int | None) – Number of decimal places for rounding output values.
jobs (int | None) – Number of processors for parallel processing. -1 uses all available.

Raises:

ValueError – If agg_writer is not ‘none’ but out_path or file_prefix is missing.
TypeError – If stat_method, agg_engine, or agg_writer is invalid.
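The documented error rules can be mirrored in a small pre-flight check, which is handy for validating configuration before any data is loaded. This is a sketch under stated assumptions: `check_agg_config` is a hypothetical helper, not part of gdptools, and it only reproduces the two rules listed under Raises.

```python
STAT_METHODS = {
    "masked_mean", "mean", "masked_std", "std", "masked_median", "median",
    "masked_count", "count", "masked_sum", "sum", "masked_min", "min",
    "masked_max", "max",
}
ENGINES = {"serial", "parallel", "dask"}
WRITERS = {"none", "csv", "parquet", "netcdf", "json"}


def check_agg_config(stat_method, agg_engine, agg_writer,
                     out_path=None, file_prefix=None):
    """Hypothetical helper mirroring the documented AggGen error rules."""
    checks = (
        ("stat_method", stat_method, STAT_METHODS),
        ("agg_engine", agg_engine, ENGINES),
        ("agg_writer", agg_writer, WRITERS),
    )
    for name, value, allowed in checks:
        if value not in allowed:
            # Documented as TypeError for an invalid option.
            raise TypeError(f"{name}={value!r} is not one of {sorted(allowed)}")
    if agg_writer != "none" and (out_path is None or file_prefix is None):
        # Documented as ValueError when a writer is set without a destination.
        raise ValueError(
            "out_path and file_prefix are required when agg_writer is not 'none'"
        )


# Passes: no writer, so no output location is needed.
check_agg_config("mean", "serial", "none")
```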

calculate_agg()[source]

Calculate area-weighted aggregations for target polygons.

Performs the complete aggregation workflow: interpolates source gridded data to target polygons, computes the specified statistic, and optionally writes results to the specified output format.

Returns:

A tuple containing:

A GeoDataFrame with target polygons and computed statistics.
An xarray Dataset with aggregated values in CF-compliant format.

Return type:

tuple[GeoDataFrame, Dataset]

Raises:

TypeError – If writer or engine configuration is invalid.
ValueError – If output path or file prefix is missing when writing is enabled.

Examples

>>> agg = AggGen(user_data, "mean", "serial", "none", weights_df)
>>> gdf, dataset = agg.calculate_agg()
>>> print(f"Processed {len(gdf)} polygons")
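To show how a precomputed weights table drives the aggregation step, here is a minimal sketch that applies weights to a tiny grid and groups the result by polygon. The table layout (polygon id, grid row, grid column, weight) is an assumption for illustration; the actual columns in a gdptools weights file may differ.

```python
from collections import defaultdict

# Hypothetical weights table: (polygon id, grid row, grid col, area weight).
weights = [
    ("basin_1", 0, 0, 0.6),
    ("basin_1", 0, 1, 0.4),
    ("basin_2", 1, 0, 0.25),
    ("basin_2", 1, 1, 0.75),
]

# A 2x2 grid of source values for a single time step.
grid = [
    [1.0, 2.0],
    [3.0, 5.0],
]

weighted_sum = defaultdict(float)
weight_total = defaultdict(float)
for poly_id, i, j, w in weights:
    weighted_sum[poly_id] += grid[i][j] * w
    weight_total[poly_id] += w

# Area-weighted mean per polygon.
result = {p: weighted_sum[p] / weight_total[p] for p in weighted_sum}
```

Because the weights depend only on the grid and polygon geometry, the same table can be reused for every time step and variable on that grid.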

property agg_data: dict[str, AggData]

Get the aggregation data collected during processing.

Returns:

A mapping from variable name to the corresponding AggData instance, which contains metadata and processed data for each variable.

Return type:

dict[str, AggData]

Notes

This property is populated only after calling calculate_agg().

Usage Examples #

Basic Aggregation#

# Basic aggregation with CSV output
from gdptools.agg_gen import AggGen

agg = AggGen(
    user_data=climate_data,
    stat_method="mean",
    agg_engine="serial",
    agg_writer="csv",
    weights=weights_df,
    out_path="./output",
    file_prefix="climate_stats"
)
gdf, dataset = agg.calculate_agg()

# Access aggregated data
print(f"Processed {len(gdf)} polygons")
print(f"Variables: {list(dataset.data_vars)}")

Parallel Processing with Advanced Options#

# Parallel processing with NetCDF output
agg = AggGen(
    user_data=climate_data,
    stat_method="masked_mean",
    agg_engine="parallel",
    agg_writer="netcdf",
    weights="weights.csv",
    out_path="./output",
    file_prefix="aggregated_climate",
    jobs=4,
    precision=2,
    append_date=True
)
gdf, dataset = agg.calculate_agg()

# The agg_data property contains detailed processing information
processing_info = agg.agg_data
print(f"Processed variables: {list(processing_info.keys())}")

Polyline Interpolation (`InterpGen`)#

The InterpGen class interpolates gridded data along polyline geometries at specified intervals and computes statistics. This is useful for analyzing data along rivers, roads, transects, or other linear features.
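The "specified intervals" idea can be sketched as walking a polyline and emitting a point every fixed distance. The code below is an illustration in plain Python, assuming coordinates are already in a projected CRS with metre units; `points_along` is a hypothetical helper, and how gdptools treats line endpoints is not specified here.

```python
import math


def points_along(line, spacing):
    """Return points every `spacing` units along a polyline of (x, y) vertices.

    The first vertex is always included; later points fall at multiples of
    `spacing` measured along the line.
    """
    points = [line[0]]
    next_at = spacing   # distance along the line of the next point to emit
    travelled = 0.0     # length of the segments already walked
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        while seg > 0 and next_at <= travelled + seg:
            t = (next_at - travelled) / seg  # fraction along this segment
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            next_at += spacing
        travelled += seg
    return points


# An L-shaped line 150 units long, sampled every 50 units.
line = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
pts = points_along(line, 50)
```

Each generated point would then be used as a location at which the gridded data is interpolated before statistics are computed per line.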

class InterpGen(user_data, *, pt_spacing=50, stat='all', interp_method='linear', mask_data=False, output_file=None, calc_crs=6931, method='serial', jobs=-1)[source]

Bases: object

Calculates grid statistics along polyline geometries.

This class provides functionality to interpolate gridded data along polyline geometries at specified point intervals and compute statistics.

Parameters:

user_data (UserData) – Input data container with source data and target polylines.
pt_spacing (float | int | None) – Spacing between interpolation points in meters. If None, uses default spacing based on line geometry.
stat (str) – Statistic to calculate (“all”, “mean”, “median”, “min”, “max”, “std”).
interp_method (str) – xarray interpolation method (“linear”, “nearest”, “cubic”).
mask_data (bool) – Whether to mask nodata values during interpolation.
output_file (str | None) – Path to CSV file for saving results. If None, no file is written.
calc_crs (str | int | CRS) – Coordinate reference system for interpolation calculations. Can be EPSG code, WKT string, or pyproj.CRS object.
method (Literal['serial', 'parallel', 'dask']) – Interpolation engine to use for processing.
jobs (int | None) – Number of processors for parallel or dask engines. -1 uses all available.

Raises:

ValueError – If the specified interpolation method is not supported.

Examples

Basic line interpolation:

>>> interp = InterpGen(
...     user_data=my_data,
...     pt_spacing=100,
...     stat="mean",
...     interp_method="linear"
... )
>>> stats = interp.calc_interp()

Parallel processing with custom CRS:

>>> interp = InterpGen(
...     user_data=my_data,
...     pt_spacing=50,
...     stat="all",
...     calc_crs=3857,
...     method="parallel",
...     jobs=4
... )
>>> stats, points = interp.calc_interp()

__init__(user_data, *, pt_spacing=50, stat='all', interp_method='linear', mask_data=False, output_file=None, calc_crs=6931, method='serial', jobs=-1)[source]

Initialize the InterpGen class with configuration parameters.

Sets up the interpolation system for calculating statistics along polyline geometries using the specified interpolation method and processing engine.

Parameters:

user_data (UserData) – Input data container with source gridded data and target polylines.
pt_spacing (float | int | None) – Distance between interpolation points in meters. Default is 50m.
stat (str) – Statistical method to apply (“all”, “mean”, “median”, “min”, “max”, “std”).
interp_method (str) – xarray interpolation method (“linear”, “nearest”, “cubic”).
mask_data (bool) – If True, mask nodata values during interpolation.
output_file (str | None) – Path to CSV file for saving results. If None, no file is written.
calc_crs (str | int | CRS) – Coordinate reference system for calculations. Default is EPSG:6931.
method (Literal['serial', 'parallel', 'dask']) – Processing engine (“serial”, “parallel”, “dask”). (‘dask’ is deprecated, removal in 0.4.0)
jobs (int | None) – Number of processors for parallel processing. -1 uses all available.

calc_interp()[source]

Run interpolation and statistical calculations along polylines.

Performs the complete interpolation workflow: generates points along polylines at specified intervals, interpolates gridded data to these points, and computes the requested statistics.

Returns:

Statistical results and interpolated points. The return type depends on the stat parameter:

If stat is ‘all’: tuple of (statistics DataFrame, points GeoDataFrame)
Otherwise: statistics DataFrame only

Return type:

tuple[DataFrame, GeoDataFrame] | DataFrame

Raises:

ValueError – If the specified interpolation method is not supported.

Examples

>>> interp = InterpGen(user_data, pt_spacing=100, stat="mean")
>>> stats = interp.calc_interp()
>>> print(f"Mean values: {stats['mean'].values}")

>>> interp = InterpGen(user_data, pt_spacing=50, stat="all")
>>> stats, points = interp.calc_interp()
>>> print(f"Generated {len(points)} interpolation points")
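Because the return shape of calc_interp() depends on stat, calling code that accepts any stat value may want to normalize the result first. The helper below is a hypothetical sketch, not part of gdptools; it relies only on the documented behaviour that a tuple is returned for ‘all’ and a single object otherwise.

```python
def split_interp_result(result):
    """Return (stats, points); points is None when only statistics came back."""
    if isinstance(result, tuple):
        stats, points = result
        return stats, points
    return result, None


# Stand-in objects are used here so the sketch runs without gdptools installed.
stats_only = {"mean": [1.0, 2.0]}
stats_and_points = ({"mean": [1.0, 2.0]}, ["pt_0", "pt_1"])

s1, p1 = split_interp_result(stats_only)
s2, p2 = split_interp_result(stats_and_points)
```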

Usage Examples #

Basic Line Interpolation#

# Basic line interpolation
from gdptools.agg_gen import InterpGen

interp = InterpGen(
    user_data=river_data,
    pt_spacing=100,  # 100-meter intervals
    stat="mean",
    interp_method="linear"
)
stats = interp.calc_interp()

# Display results
print(f"Mean values along line: {stats['mean'].values}")

Comprehensive Statistics with Custom Configuration#

# Comprehensive statistics with custom CRS
interp = InterpGen(
    user_data=river_data,
    pt_spacing=50,
    stat="all",  # Returns all statistics
    calc_crs=3857,  # Web Mercator
    method="parallel",
    jobs=2,
    output_file="river_stats.csv"
)
stats, points = interp.calc_interp()

# Access detailed results
print(f"Generated {len(points)} interpolation points")
print(f"Statistics available: {list(stats.columns)}")

Type Definitions #

The module provides several literal types for configuration options:

STATSMETHODS#

Available aggregation methods.

Options:

masked_mean: Masked mean of the data.
mean: Mean of the data.
masked_std: Masked standard deviation of the data.
std: Standard deviation of the data.
masked_median: Masked median of the data.
median: Median of the data.
masked_count: Masked count of the data.
count: Count of the data.
masked_sum: Masked sum of the data.
sum: Sum of the data.
masked_min: Masked minimum of the data.
min: Minimum of the data.
masked_max: Masked maximum of the data.
max: Maximum of the data.

alias of Literal[‘masked_mean’, ‘mean’, ‘masked_std’, ‘std’, ‘masked_median’, ‘median’, ‘masked_count’, ‘count’, ‘masked_sum’, ‘sum’, ‘masked_min’, ‘min’, ‘masked_max’, ‘max’]

AGGENGINES#

Available aggregation engines.

Options:

serial: Perform area-weighted aggregation sequentially.
parallel: Perform area-weighted aggregation in parallel.
dask: Perform area-weighted aggregation with Dask. Deprecated; will be removed in gdptools 0.4.0. Use ‘parallel’ or ‘serial’ instead.

alias of Literal[‘serial’, ‘parallel’, ‘dask’]

AGGWRITERS#

Available output writers.

Options:

none: Do not write output to file.
csv: Write output in CSV format.
parquet: Write output in Parquet format.
netcdf: Write output in NetCDF format.
json: Write output in JSON format.

alias of Literal[‘none’, ‘csv’, ‘parquet’, ‘netcdf’, ‘json’]

LINEITERPENGINES#

Available line interpolation engines.

Options:

serial: Perform interpolation sequentially.
parallel: Perform interpolation in parallel.
dask: Perform interpolation with Dask. Deprecated; will be removed in gdptools 0.4.0. Use ‘parallel’ or ‘serial’ instead.

alias of Literal[‘serial’, ‘parallel’, ‘dask’]
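Literal aliases like these can be inspected at runtime with `typing.get_args`, which is useful for building option menus or validating user input. The aliases below are re-declared locally so the sketch is self-contained; in real code they would be imported from gdptools, and the exact import path is not shown here.

```python
from typing import Literal, get_args

# Local re-declarations mirroring the documented aliases (illustrative only).
StatsMethods = Literal[
    "masked_mean", "mean", "masked_std", "std", "masked_median", "median",
    "masked_count", "count", "masked_sum", "sum", "masked_min", "min",
    "masked_max", "max",
]
AggEngines = Literal["serial", "parallel", "dask"]
AggWriters = Literal["none", "csv", "parquet", "netcdf", "json"]


def is_valid_option(value, literal_type):
    """Check a runtime string against the values of a Literal alias."""
    return value in get_args(literal_type)


available_writers = list(get_args(AggWriters))
```

A type checker enforces these aliases statically, while `get_args` gives the same list of values to code that receives strings from configuration files or command-line arguments.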

Best Practices #

Performance Considerations #

Use "serial" engine for datasets < 1GB and debugging
Use "parallel" engine for moderate datasets (1-10GB) on multi-core systems
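Use "dask" engine for large datasets (>10GB) or distributed computing; note that "dask" is deprecated and scheduled for removal in gdptools 0.4.0, so prefer "parallel" in new code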
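The size guidance above can be encoded as a small helper so that scripts choose an engine consistently. This is a hypothetical sketch: `suggest_engine` is not a gdptools function, the thresholds are the rough figures from this list, and because the Type Definitions section marks "dask" as deprecated, the helper only returns it when explicitly allowed.

```python
def suggest_engine(size_gb, allow_dask=False):
    """Pick an engine name from an approximate dataset size in gigabytes."""
    if size_gb < 1:
        return "serial"
    if size_gb <= 10 or not allow_dask:
        # "dask" is documented as deprecated, so fall back to "parallel".
        return "parallel"
    return "dask"


engine_small = suggest_engine(0.5)
engine_large = suggest_engine(50)
```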

Memory Management #

Set the jobs parameter according to available memory. If you request more workers than the machine has physical CPU cores, AggGen clamps the value and issues a warning so you know the engine throttled your request.
Use precision parameter to control output file sizes
Consider chunking large datasets before processing
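The clamping behaviour described for jobs can be pictured with the following sketch. It is an assumption-laden illustration rather than the library's code: `clamp_jobs` is a hypothetical helper, and `os.cpu_count()` reports logical cores, which can exceed the number of physical cores.

```python
import os
import warnings


def clamp_jobs(requested, available=None):
    """Resolve a jobs request against the number of available workers."""
    if available is None:
        available = os.cpu_count() or 1  # logical cores; may exceed physical
    if requested == -1:
        return available  # -1 means "use all available"
    if requested < 1:
        raise ValueError("jobs must be -1 or a positive integer")
    if requested > available:
        warnings.warn(
            f"jobs={requested} exceeds {available} available workers; "
            f"using {available} instead",
            RuntimeWarning,
            stacklevel=2,
        )
        return available
    return requested


jobs_ok = clamp_jobs(4, available=8)
```

When memory rather than CPU is the constraint, passing a smaller explicit value than the core count is usually the safer choice, since each worker holds its own share of the data.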

Output Format Selection #

CSV: Human-readable, good for small to medium datasets
Parquet: Efficient for large datasets, preserves data types
NetCDF: Standard for scientific data, CF-compliant
JSON: Structured data for web applications

Error Handling #

Validate input data and geometries before processing
Use masked statistics (masked_*) for datasets with nodata values
Test with small subsets before processing large datasets

Note

All aggregation classes automatically handle coordinate reference system (CRS) transformations and ensure proper alignment between source data and target geometries.

Warning

When using parallel or Dask engines, ensure sufficient memory is available. Large datasets may require chunking or distributed processing.
