# Aggregation Classes

(grid-to-polygon-aggregation-agggen)=

The aggregation module provides classes for computing area-weighted statistics on gridded data. These classes reduce raw gridded datasets to statistics aggregated over polygon geometries or interpolated along polyline geometries.

`gdptools` offers two main aggregation approaches:

- **[`AggGen`](#grid-to-polygon-aggregation-agggen)**: Area-weighted aggregation for polygon geometries using precomputed or dynamically calculated weights
- **[`InterpGen`](#polyline-interpolation-interpgen)**: Point-based interpolation and statistics along polyline geometries

Both classes support multiple processing engines (serial, parallel, and Dask) and various output formats for flexible deployment in different computational environments.

```{contents}
:local:
:depth: 2
```

---

## Key Features

### Statistical Methods

- **Basic statistics**: mean, median, min, max, sum, count, standard deviation
- **Masked statistics**: Versions that handle nodata values appropriately
- **Weighted calculations**: Area-weighted statistics for accurate spatial aggregation
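
The difference between a plain and a masked area-weighted statistic can be sketched with NumPy. This is an illustration, not the `gdptools` implementation: the helper names `weighted_mean` and `masked_weighted_mean` are hypothetical, and nodata is assumed to be encoded as `NaN`.

```python
import numpy as np


def weighted_mean(values, weights):
    """Area-weighted mean; a single NaN cell makes the result NaN."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(values * weights) / np.sum(weights))


def masked_weighted_mean(values, weights):
    """Area-weighted mean over valid cells only, renormalizing the weights."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(values)
    if not valid.any():
        return float("nan")
    return float(np.sum(values[valid] * weights[valid]) / np.sum(weights[valid]))


# Three grid cells overlap one polygon; the weights are the overlap areas.
cell_values = [10.0, 20.0, np.nan]
overlap_area = [2.0, 1.0, 1.0]

plain = weighted_mean(cell_values, overlap_area)
masked = masked_weighted_mean(cell_values, overlap_area)
print(plain, masked)
```

With one missing cell, the plain mean propagates `NaN`, while the masked version averages the two valid cells using their renormalized weights.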

### Processing Engines

- **Serial**: Sequential processing for smaller datasets or debugging
- **Parallel**: Multi-core processing using joblib for improved performance
- **Dask**: Distributed computing for large datasets or cluster environments

### Output Formats

- **CSV**: Tabular data with statistics per polygon/time
- **Parquet**: Efficient columnar storage for large datasets
- **NetCDF**: CF-compliant format for scientific data interchange
- **JSON**: Structured data for web applications and APIs

---

## Grid-to-Polygon Aggregation (`AggGen`)

The `AggGen` class performs area-weighted aggregation of gridded data over polygon geometries. It's designed for climate data analysis, hydrological modeling, and other applications requiring spatially aggregated statistics.

```{eval-rst}
.. autoclass:: gdptools.agg_gen.AggGen
    :members:
    :undoc-members:
    :show-inheritance:
    :special-members: __init__
    :no-index:
```

### AggGen Examples

#### Basic Aggregation

```python
# Basic aggregation with CSV output
from gdptools.agg_gen import AggGen

agg = AggGen(
    user_data=climate_data,
    stat_method="mean",
    agg_engine="serial",
    agg_writer="csv",
    weights=weights_df,
    out_path="./output",
    file_prefix="climate_stats"
)
gdf, dataset = agg.calculate_agg()

# Access aggregated data
print(f"Processed {len(gdf)} polygons")
print(f"Variables: {list(dataset.data_vars)}")
```

#### Parallel Processing with Advanced Options

```python
# Parallel processing with NetCDF output
agg = AggGen(
    user_data=climate_data,
    stat_method="masked_mean",
    agg_engine="parallel",
    agg_writer="netcdf",
    weights="weights.csv",
    out_path="./output",
    file_prefix="aggregated_climate",
    jobs=4,
    precision=2,
    append_date=True
)
gdf, dataset = agg.calculate_agg()

# The agg_data property contains detailed processing information
processing_info = agg.agg_data
print(f"Processed variables: {list(processing_info.keys())}")
```

---

(polyline-interpolation-interpgen)=

## Polyline Interpolation (`InterpGen`)

The `InterpGen` class interpolates gridded data along polyline geometries at specified intervals and computes statistics. This is useful for analyzing data along rivers, roads, transects, or other linear features.
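
The idea of sampling a polyline at a fixed spacing can be sketched with NumPy alone. This is a simplified illustration, not how `InterpGen` is implemented: `points_along_line` is a hypothetical helper, and the coordinates are assumed to be planar and in the same units as the spacing (for example, meters in a projected CRS).

```python
import numpy as np


def points_along_line(vertices, spacing):
    """Return (x, y) points placed every `spacing` units along a polyline."""
    xy = np.asarray(vertices, dtype=float)
    # Length of each segment and cumulative distance at each vertex
    seg_len = np.hypot(*np.diff(xy, axis=0).T)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])
    # Distances at which to sample, including the end point when it lines up
    dists = np.arange(0.0, cum[-1] + 1e-9, spacing)
    # Linear interpolation of x and y against distance along the line
    x = np.interp(dists, cum, xy[:, 0])
    y = np.interp(dists, cum, xy[:, 1])
    return np.column_stack([x, y])


# An L-shaped line 500 units long, sampled every 100 units
line = [(0.0, 0.0), (300.0, 0.0), (300.0, 200.0)]
pts = points_along_line(line, spacing=100.0)
print(pts)
```

Gridded values would then be sampled at each of these points and summarized per line.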

```{eval-rst}
.. autoclass:: gdptools.agg_gen.InterpGen
    :members:
    :undoc-members:
    :show-inheritance:
    :special-members: __init__
    :no-index:
```

### Usage Examples

#### Basic Line Interpolation

```python
# Basic line interpolation
from gdptools.agg_gen import InterpGen

interp = InterpGen(
    user_data=river_data,
    pt_spacing=100,  # 100-meter intervals
    stat="mean",
    interp_method="linear"
)
# calc_interp() returns the statistics table and the interpolation points
stats, points = interp.calc_interp()

# Display results
print(f"Mean values along line: {stats['mean'].values}")
```

#### Comprehensive Statistics with Custom Configuration

```python
# Comprehensive statistics with custom CRS
interp = InterpGen(
    user_data=river_data,
    pt_spacing=50,
    stat="all",  # Returns all statistics
    calc_crs=3857,  # Web Mercator
    method="parallel",
    jobs=2,
    output_file="river_stats.csv"
)
stats, points = interp.calc_interp()

# Access detailed results
print(f"Generated {len(points)} interpolation points")
print(f"Statistics available: {list(stats.columns)}")
```

---

## Type Definitions

The module defines several `Literal` type aliases for its configuration options:

```{eval-rst}
.. autodata:: gdptools.agg_gen.STATSMETHODS
    :annotation: = Literal["masked_mean", "mean", "masked_std", "std", ...]

.. autodata:: gdptools.agg_gen.AGGENGINES
    :annotation: = Literal["serial", "parallel", "dask"]

.. autodata:: gdptools.agg_gen.AGGWRITERS
    :annotation: = Literal["none", "csv", "parquet", "netcdf", "json"]

.. autodata:: gdptools.agg_gen.LINEITERPENGINES
    :annotation: = Literal["serial", "parallel", "dask"]
```

---

## Best Practices

### Performance Considerations

- Use the `"serial"` engine for datasets smaller than about 1 GB and for debugging
- Use the `"parallel"` engine for moderate datasets (roughly 1-10 GB) on multi-core systems
- Use the `"dask"` engine for large datasets (over about 10 GB) or distributed computing
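
These rules of thumb could be captured in a small helper. `choose_engine` is a hypothetical function, not part of `gdptools`, and the thresholds simply mirror the guidance above; the byte count might come from something like `xarray.Dataset.nbytes`.

```python
GB = 10**9  # decimal gigabyte, used here only as a rough yardstick


def choose_engine(dataset_bytes: int) -> str:
    """Pick an engine name from the approximate in-memory dataset size."""
    if dataset_bytes < 1 * GB:
        return "serial"
    if dataset_bytes <= 10 * GB:
        return "parallel"
    return "dask"


for size in (500 * 10**6, 5 * GB, 50 * GB):
    print(size, choose_engine(size))
```

The returned string can be passed as `agg_engine` when constructing `AggGen`. Treat the cut-offs as starting points and adjust them for your hardware.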

### Memory Management

- Set the `jobs` parameter according to available memory. If you request more
  workers than the machine has physical CPU cores, `AggGen` clamps the value and
  issues a warning so you know the engine throttled your request.
- Use the `precision` parameter to control output file sizes
- Consider chunking large datasets before processing
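
The clamping behavior described for `jobs` can be pictured with the following sketch. It is an assumption-laden illustration, not the library's code: `clamp_jobs` is a hypothetical helper, it does not model any sentinel values, and `os.cpu_count()` reports logical rather than physical cores, so it is only a stand-in.

```python
import os
import warnings
from typing import Optional


def clamp_jobs(requested: int, available: Optional[int] = None) -> int:
    """Limit a worker count to the number of available cores, warning if reduced."""
    if available is None:
        # Logical core count; used as an approximation in this sketch
        available = os.cpu_count() or 1
    if requested < 1:
        raise ValueError("requested must be a positive integer in this sketch")
    if requested > available:
        warnings.warn(
            f"Requested {requested} workers but only {available} cores are "
            f"available; using {available}.",
            stacklevel=2,
        )
        return available
    return requested


print(clamp_jobs(2, available=4))
```

Checking the worker count up front like this makes it easier to estimate peak memory, since each worker holds its own slice of the data.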

### Output Format Selection

- **CSV**: Human-readable, good for small to medium datasets
- **Parquet**: Efficient for large datasets, preserves data types
- **NetCDF**: Standard for scientific data, CF-compliant
- **JSON**: Structured data for web applications

### Error Handling

- Validate input data and geometries before processing
- Use masked statistics (`masked_*`) for datasets with nodata values
- Test with small subsets before processing large datasets
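
One way to act on the masked-statistics advice is to inspect a sample of the data before choosing `stat_method`. The helper below is a hypothetical sketch, not a `gdptools` function, and it assumes nodata values have already been decoded to `NaN`.

```python
import numpy as np


def pick_stat_method(values, base="mean"):
    """Return the masked variant of a statistic when the data contain NaN."""
    has_nodata = bool(np.isnan(np.asarray(values, dtype=float)).any())
    return f"masked_{base}" if has_nodata else base


print(pick_stat_method([1.0, 2.0, np.nan]))  # data with a gap
print(pick_stat_method([1.0, 2.0], base="std"))  # complete data
```

Running this on a small subset of the grid is a quick check before committing to a full aggregation run.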

```{note}
All aggregation classes automatically handle coordinate reference system (CRS) transformations and ensure proper alignment between source data and target geometries.
```

```{warning}
When using parallel or Dask engines, ensure sufficient memory is available. Large datasets may require chunking or distributed processing.
```
