Aggregation Classes#
The aggregation module provides classes for performing area-weighted statistics on gridded data. These classes transform raw gridded datasets into meaningful statistics aggregated over polygon geometries or interpolated along polyline geometries.
gdptools offers two main aggregation approaches:
AggGen: Area-weighted aggregation for polygon geometries using precomputed or dynamically calculated weights
InterpGen: Point-based interpolation and statistics along polyline geometries
Both classes support multiple processing engines (serial, parallel, and Dask) and various output formats for flexible deployment in different computational environments.
Key Features#
Statistical Methods#
Basic statistics: mean, median, min, max, sum, count, standard deviation
Masked statistics: Versions that handle nodata values appropriately
Weighted calculations: Area-weighted statistics for accurate spatial aggregation
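As a rough illustration of how the plain and masked variants differ, the sketch below computes an area-weighted mean over three grid cells in plain Python. The function names, the use of NaN as the nodata marker, and the choice to renormalize weights over valid cells are assumptions made for this example, not a description of gdptools internals.

```python
import math


def weighted_mean(values, weights):
    """Area-weighted mean; a NaN in `values` propagates to the result."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def masked_weighted_mean(values, weights):
    """Area-weighted mean over valid cells only, with weights renormalized."""
    valid = [(v, w) for v, w in zip(values, weights) if not math.isnan(v)]
    if not valid:
        return math.nan  # every cell was nodata
    total_weight = sum(w for _, w in valid)
    return sum(v * w for v, w in valid) / total_weight


# Three grid cells overlap one polygon; each weight is the fraction of the
# polygon's area covered by that cell. The third cell holds nodata (NaN).
cell_values = [10.0, 20.0, math.nan]
cell_weights = [0.5, 0.25, 0.25]

plain = weighted_mean(cell_values, cell_weights)           # NaN
masked = masked_weighted_mean(cell_values, cell_weights)   # (5 + 5) / 0.75
```

With one nodata cell, the plain mean is NaN while the masked mean is computed from the two valid cells only, which is the practical reason to prefer the `masked_*` methods on grids with gaps.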
Processing Engines#
Serial: Sequential processing for smaller datasets or debugging
Parallel: Multi-core processing using joblib for improved performance
Dask: Distributed computing for large datasets or cluster environments
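The engine choice can be reduced to a small rule of thumb. The helper below is a hypothetical sketch (the name `suggest_engine` is not part of gdptools) that encodes the size thresholds given under Best Practices on this page: serial below about 1 GB, parallel for 1 to 10 GB, and Dask beyond that.

```python
def suggest_engine(dataset_size_gb: float) -> str:
    """Return an engine name based on approximate dataset size in gigabytes."""
    if dataset_size_gb < 0:
        raise ValueError("dataset_size_gb must be non-negative")
    if dataset_size_gb < 1:
        return "serial"      # small data or debugging
    if dataset_size_gb <= 10:
        return "parallel"    # moderate data on a multi-core machine
    return "dask"            # large data or a cluster


engine = suggest_engine(4.5)
```

The returned string matches the values accepted by the `agg_engine` and `method` parameters, so it could be passed straight through. The thresholds are guidelines only; available memory and core count matter as much as raw size.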
Output Formats#
CSV: Tabular data with statistics per polygon/time
Parquet: Efficient columnar storage for large datasets
NetCDF: CF-compliant format for scientific data interchange
JSON: Structured data for web applications and APIs
Grid-to-Polygon Aggregation (AggGen)#
The AggGen class performs area-weighted aggregation of gridded data over polygon geometries. It’s designed for climate data analysis, hydrological modeling, and other applications requiring spatially aggregated statistics.
- class AggGen(user_data, stat_method, agg_engine, agg_writer, weights, out_path=None, file_prefix=None, append_date=False, precision=None, jobs=-1)
Bases: object
Performs grid-to-polygon aggregation using area-weighted statistics.
This class provides functionality to aggregate gridded data over polygon geometries using various statistical methods and processing engines.
- Parameters:
user_data (UserData) – Input data for aggregation (e.g., UserCatData).
stat_method (Literal['masked_mean', 'mean', 'masked_std', 'std', 'masked_median', 'median', 'masked_count', 'count', 'masked_sum', 'sum', 'masked_min', 'min', 'masked_max', 'max']) – Statistical method to apply for aggregation.
agg_engine (Literal['serial', 'parallel', 'dask']) – Aggregation engine to use for processing.
agg_writer (Literal['none', 'csv', 'parquet', 'netcdf', 'json']) – Output writer format for results.
weights (str | DataFrame) – Path to CSV file or DataFrame containing area weights.
out_path (str | None) – Directory path for output files. Required if agg_writer is not ‘none’.
file_prefix (str | None) – Prefix for output file names. Required if agg_writer is not ‘none’.
append_date (bool) – Whether to append current date to output file names.
precision (int | None) – Number of decimal places for output data rounding.
jobs (int | None) – Number of processors for parallel or dask engines. -1 uses all available.
- agg_data
Dictionary mapping variable names to AggData instances after processing.
- Raises:
ValueError – If agg_writer is not ‘none’ but out_path or file_prefix is missing.
TypeError – If stat_method, agg_engine, or agg_writer is invalid.
Examples
- Basic aggregation with CSV output:
>>> agg = AggGen(
...     user_data=my_data,
...     stat_method="mean",
...     agg_engine="serial",
...     agg_writer="csv",
...     weights="weights.csv",
...     out_path="/output",
...     file_prefix="results"
... )
>>> gdf, dataset = agg.calculate_agg()
- Parallel processing with NetCDF output:
>>> agg = AggGen(
...     user_data=my_data,
...     stat_method="masked_mean",
...     agg_engine="parallel",
...     agg_writer="netcdf",
...     weights=weights_df,
...     out_path="/output",
...     file_prefix="climate_data",
...     jobs=4
... )
>>> gdf, dataset = agg.calculate_agg()
- __init__(user_data, stat_method, agg_engine, agg_writer, weights, out_path=None, file_prefix=None, append_date=False, precision=None, jobs=-1)
Initialize the AggGen class with configuration parameters.
Sets up the aggregation system by configuring the statistical method, processing engine, and output writer based on the provided parameters.
- Parameters:
user_data (UserData) – Input data container with source data and target geometries.
stat_method (Literal['masked_mean', 'mean', 'masked_std', 'std', 'masked_median', 'median', 'masked_count', 'count', 'masked_sum', 'sum', 'masked_min', 'min', 'masked_max', 'max']) – Statistical method for aggregation (e.g., ‘mean’, ‘masked_mean’).
agg_engine (Literal['serial', 'parallel', 'dask']) – Processing engine (‘serial’, ‘parallel’, or ‘dask’).
agg_writer (Literal['none', 'csv', 'parquet', 'netcdf', 'json']) – Output format (‘none’, ‘csv’, ‘parquet’, ‘netcdf’, ‘json’).
weights (str | DataFrame) – Path to weights CSV file or DataFrame with precomputed weights.
out_path (str | None) – Output directory path. Required if agg_writer is not ‘none’.
file_prefix (str | None) – Prefix for output file names. Required if agg_writer is not ‘none’.
append_date (bool) – If True, append current date to output filenames.
precision (int | None) – Number of decimal places for rounding output values.
jobs (int | None) – Number of processors for parallel processing. -1 uses all available.
- Raises:
ValueError – If agg_writer is not ‘none’ but out_path or file_prefix is missing.
TypeError – If stat_method, agg_engine, or agg_writer is invalid.
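The `ValueError` rule above can be checked before constructing an `AggGen`. The following is a minimal sketch of such a pre-flight check; `check_output_settings` is a hypothetical helper written for this example, and the wording of its error message is an assumption rather than the message gdptools produces.

```python
from typing import Optional


def check_output_settings(
    agg_writer: str,
    out_path: Optional[str] = None,
    file_prefix: Optional[str] = None,
) -> None:
    """Hypothetical check mirroring the documented ValueError rule."""
    if agg_writer == "none":
        return  # nothing is written, so no output location is needed
    missing = [
        name
        for name, value in (("out_path", out_path), ("file_prefix", file_prefix))
        if not value
    ]
    if missing:
        raise ValueError(
            f"agg_writer={agg_writer!r} requires: {', '.join(missing)}"
        )


check_output_settings("none")                         # passes
check_output_settings("csv", "./output", "results")   # passes

try:
    check_output_settings("csv", out_path="./output")  # file_prefix missing
except ValueError as err:
    message = str(err)
```

Running the check up front surfaces a misconfigured writer immediately instead of after a long aggregation has been set up.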
- calculate_agg()
Calculate area-weighted aggregations for target polygons.
Performs the complete aggregation workflow: interpolates source gridded data to target polygons, computes the specified statistic, and optionally writes results to the specified output format.
- Returns:
A tuple containing:
A GeoDataFrame with target polygons and computed statistics.
An xarray Dataset with aggregated values in CF-compliant format.
- Return type:
tuple of (GeoDataFrame, xarray Dataset)
- Raises:
TypeError – If writer or engine configuration is invalid.
ValueError – If output path or file prefix is missing when writing is enabled.
Examples
>>> # agg_writer="none" skips file output, so out_path and file_prefix are not needed
>>> agg = AggGen(user_data, "mean", "serial", "none", weights_df)
>>> gdf, dataset = agg.calculate_agg()
>>> print(f"Processed {len(gdf)} polygons")
- property agg_data: dict[str, AggData]
Get the aggregation data collected during processing.
- Returns:
A mapping from variable name to the corresponding AggData instance, which contains metadata and processed data for each variable.
- Return type:
dict[str, AggData]
Notes
This property is populated only after calling calculate_agg().
AggGen Examples#
Basic Aggregation#
# Basic aggregation with CSV output
from gdptools.agg_gen import AggGen
agg = AggGen(
user_data=climate_data,
stat_method="mean",
agg_engine="serial",
agg_writer="csv",
weights=weights_df,
out_path="./output",
file_prefix="climate_stats"
)
gdf, dataset = agg.calculate_agg()
# Access aggregated data
print(f"Processed {len(gdf)} polygons")
print(f"Variables: {list(dataset.data_vars)}")
Parallel Processing with Advanced Options#
# Parallel processing with NetCDF output
agg = AggGen(
user_data=climate_data,
stat_method="masked_mean",
agg_engine="parallel",
agg_writer="netcdf",
weights="weights.csv",
out_path="./output",
file_prefix="aggregated_climate",
jobs=4,
precision=2,
append_date=True
)
gdf, dataset = agg.calculate_agg()
# The agg_data property contains detailed processing information
processing_info = agg.agg_data
print(f"Processed variables: {list(processing_info.keys())}")
Polyline Interpolation (InterpGen)#
The InterpGen class interpolates gridded data along polyline geometries at specified intervals and computes statistics. This is useful for analyzing data along rivers, roads, transects, or other linear features.
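To make the point-generation step concrete, here is a minimal sketch that places sample points at a fixed spacing along a polyline. It assumes the vertices are already in a projected CRS whose units match the spacing (for example meters), and `points_along_line` is a hypothetical helper for illustration; how InterpGen actually places its points may differ.

```python
import math


def points_along_line(vertices, spacing):
    """Return (x, y) points every `spacing` units along a polyline.

    The first vertex is always included; later points are placed by
    walking the segments and interpolating linearly within each one.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    points = [vertices[0]]
    next_dist = spacing   # distance along the line where the next point is due
    travelled = 0.0       # total length of the segments already walked
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        while seg_len > 0 and next_dist <= travelled + seg_len:
            t = (next_dist - travelled) / seg_len  # fraction along this segment
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            next_dist += spacing
        travelled += seg_len
    return points


# An L-shaped line 150 units long, sampled every 50 units.
line = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
pts = points_along_line(line, 50.0)
# pts == [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
```

Each generated point would then be used to sample the grid with the chosen interpolation method, and the per-line statistics are computed over those sampled values.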
- class InterpGen(user_data, *, pt_spacing=50, stat='all', interp_method='linear', mask_data=False, output_file=None, calc_crs=6931, method='serial', jobs=-1)
Bases: object
Calculates grid statistics along polyline geometries.
This class provides functionality to interpolate gridded data along polyline geometries at specified point intervals and compute statistics.
- Parameters:
user_data (UserData) – Input data container with source data and target polylines.
pt_spacing (float | int | None) – Spacing between interpolation points in meters. If None, uses default spacing based on line geometry.
stat (str) – Statistic to calculate (“all”, “mean”, “median”, “min”, “max”, “std”).
interp_method (str) – xarray interpolation method (“linear”, “nearest”, “cubic”).
mask_data (bool) – Whether to mask nodata values during interpolation.
output_file (str | None) – Path to CSV file for saving results. If None, no file is written.
calc_crs (str | int | CRS) – Coordinate reference system for interpolation calculations. Can be EPSG code, WKT string, or pyproj.CRS object.
method (Literal['serial', 'parallel', 'dask']) – Interpolation engine to use for processing.
jobs (int | None) – Number of processors for parallel or dask engines. -1 uses all available.
- Raises:
ValueError – If the specified interpolation method is not supported.
Examples
- Basic line interpolation:
>>> interp = InterpGen(
...     user_data=my_data,
...     pt_spacing=100,
...     stat="mean",
...     interp_method="linear"
... )
>>> stats, points = interp.calc_interp()
- Parallel processing with custom CRS:
>>> interp = InterpGen(
...     user_data=my_data,
...     pt_spacing=50,
...     stat="all",
...     calc_crs=3857,
...     method="parallel",
...     jobs=4
... )
>>> stats, points = interp.calc_interp()
- __init__(user_data, *, pt_spacing=50, stat='all', interp_method='linear', mask_data=False, output_file=None, calc_crs=6931, method='serial', jobs=-1)
Initialize the InterpGen class with configuration parameters.
Sets up the interpolation system for calculating statistics along polyline geometries using the specified interpolation method and processing engine.
- Parameters:
user_data (UserData) – Input data container with source gridded data and target polylines.
pt_spacing (float | int | None) – Distance between interpolation points in meters. Default is 50m.
stat (str) – Statistical method to apply (“all”, “mean”, “median”, “min”, “max”, “std”).
interp_method (str) – xarray interpolation method (“linear”, “nearest”, “cubic”).
mask_data (bool) – If True, mask nodata values during interpolation.
output_file (str | None) – Path to CSV file for saving results. If None, no file is written.
calc_crs (str | int | CRS) – Coordinate reference system for calculations. Default is EPSG:6931.
method (Literal['serial', 'parallel', 'dask']) – Processing engine (“serial”, “parallel”, “dask”).
jobs (int | None) – Number of processors for parallel processing. -1 uses all available.
- calc_interp()
Run interpolation and statistical calculations along polylines.
Performs the complete interpolation workflow: generates points along polylines at specified intervals, interpolates gridded data to these points, and computes the requested statistics.
- Returns:
Statistical results and interpolated points. The return type depends on the stat parameter:
- If stat is ‘all’: tuple of (statistics DataFrame, points GeoDataFrame)
- Otherwise: statistics DataFrame only
- Return type:
tuple of (DataFrame, GeoDataFrame), or DataFrame
- Raises:
ValueError – If the specified interpolation method is not supported.
Examples
>>> interp = InterpGen(user_data, pt_spacing=100, stat="mean")
>>> stats = interp.calc_interp()
>>> print(f"Mean values: {stats['mean'].values}")
>>> interp = InterpGen(user_data, pt_spacing=50, stat="all")
>>> stats, points = interp.calc_interp()
>>> print(f"Generated {len(points)} interpolation points")
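Because the documented return value is either a single object or a two-element tuple, calling code that accepts any `stat` setting may want to normalize the result. The helper below is a hypothetical sketch, not part of gdptools, and uses plain strings as stand-ins for the DataFrame and GeoDataFrame.

```python
def split_interp_result(result):
    """Return (stats, points); points is None when only statistics came back."""
    if isinstance(result, tuple):
        stats, points = result
        return stats, points
    return result, None


# Stand-in values in place of real DataFrame / GeoDataFrame objects.
stats_only = split_interp_result("stats_df")
stats_and_points = split_interp_result(("stats_df", "points_gdf"))
```

Downstream code can then always unpack two values and test `points is None` instead of branching on the `stat` argument.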
InterpGen Examples#
Basic Line Interpolation#
# Basic line interpolation
from gdptools.agg_gen import InterpGen
interp = InterpGen(
user_data=river_data,
pt_spacing=100, # 100-meter intervals
stat="mean",
interp_method="linear"
)
stats = interp.calc_interp()
# Display results
print(f"Mean values along line: {stats['mean'].values}")
Comprehensive Statistics with Custom Configuration#
# Comprehensive statistics with custom CRS
interp = InterpGen(
user_data=river_data,
pt_spacing=50,
stat="all", # Returns all statistics
calc_crs=3857, # Web Mercator
method="parallel",
jobs=2,
output_file="river_stats.csv"
)
stats, points = interp.calc_interp()
# Access detailed results
print(f"Generated {len(points)} interpolation points")
print(f"Statistics available: {list(stats.columns)}")
Type Definitions#
The module provides several literal types for configuration options:
- STATSMETHODS#
Available aggregation methods.
- Options:
masked_mean: Masked mean of the data.
mean: Mean of the data.
masked_std: Masked standard deviation of the data.
std: Standard deviation of the data.
masked_median: Masked median of the data.
median: Median of the data.
masked_count: Masked count of the data.
count: Count of the data.
masked_sum: Masked sum of the data.
sum: Sum of the data.
masked_min: Masked minimum of the data.
min: Minimum of the data.
masked_max: Masked maximum of the data.
max: Maximum of the data.
alias of Literal['masked_mean', 'mean', 'masked_std', 'std', 'masked_median', 'median', 'masked_count', 'count', 'masked_sum', 'sum', 'masked_min', 'min', 'masked_max', 'max']
- AGGENGINES#
Available aggregation engines.
- Options:
serial: Perform area-weighted aggregation sequentially.
parallel: Perform area-weighted aggregation in parallel.
dask: Perform area-weighted aggregation with Dask.
alias of Literal['serial', 'parallel', 'dask']
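Literal aliases like these are useful to static type checkers, and their allowed values can also be read back at runtime with `typing.get_args`. The sketch below defines a local stand-in alias (`AggEngine`, an assumed name) with the same members as AGGENGINES rather than importing it, and validates a user-supplied string against it.

```python
from typing import Literal, get_args

# Local stand-in with the same members as the AGGENGINES alias.
AggEngine = Literal["serial", "parallel", "dask"]


def check_engine(name: str) -> str:
    """Return `name` unchanged if it is a valid engine, else raise TypeError."""
    allowed = get_args(AggEngine)  # ('serial', 'parallel', 'dask')
    if name not in allowed:
        raise TypeError(f"engine must be one of {allowed}, got {name!r}")
    return name


engine = check_engine("dask")
```

Raising `TypeError` here mirrors the exception type the class documentation lists for an invalid `stat_method`, `agg_engine`, or `agg_writer`.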
Best Practices#
Performance Considerations#
Use "serial" engine for datasets < 1GB and debugging
Use "parallel" engine for moderate datasets (1-10GB) on multi-core systems
Use "dask" engine for large datasets (>10GB) or distributed computing
Memory Management#
Set the jobs parameter according to available memory. If you request more workers than the machine has physical CPU cores, AggGen clamps the value and issues a warning so you know the engine throttled your request.
Use the precision parameter to control output file sizes
Consider chunking large datasets before processing
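The clamping behavior described for `jobs` can be pictured with a short sketch. `resolve_jobs` is a hypothetical helper, not the gdptools implementation. Note one simplification: `os.cpu_count()` reports logical CPUs, whereas the text above refers to physical cores, so the `available` argument lets a caller pass a more accurate figure.

```python
import os
import warnings
from typing import Optional


def resolve_jobs(jobs: int, available: Optional[int] = None) -> int:
    """Turn a requested worker count into the number actually used."""
    if available is None:
        available = os.cpu_count() or 1  # logical CPUs; may exceed physical cores
    if jobs == -1:
        return available  # -1 means "use everything"
    if jobs < 1:
        raise ValueError("jobs must be -1 or a positive integer")
    if jobs > available:
        warnings.warn(
            f"Requested {jobs} workers but only {available} are available; "
            f"using {available}."
        )
        return available
    return jobs


workers = resolve_jobs(2, available=4)
```

Keeping the worker count at or below the core count matters for memory as well as speed, since each worker may hold its own slice of the dataset.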
Output Format Selection#
CSV: Human-readable, good for small to medium datasets
Parquet: Efficient for large datasets, preserves data types
NetCDF: Standard for scientific data, CF-compliant
JSON: Structured data for web applications
Error Handling#
Validate input data and geometries before processing
Use masked statistics (masked_*) for datasets with nodata values
Test with small subsets before processing large datasets
Note
All aggregation classes automatically handle coordinate reference system (CRS) transformations and ensure proper alignment between source data and target geometries.
Warning
When using parallel or Dask engines, ensure sufficient memory is available. Large datasets may require chunking or distributed processing.