# Weight Generation Classes

(grid-to-polygon-weightgen)=

(polygon-to-polygon-weightgenp2p)=

Weight generation is the process of calculating spatial intersection weights between gridded datasets and vector polygon geometries. These weights represent the proportional area of overlap and are fundamental for performing accurate area-weighted statistical aggregations in climate and environmental data analysis.
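
To make the idea of a proportional-overlap weight concrete, here is a minimal, library-free sketch. It assumes axis-aligned rectangles for both the grid cells and the target polygon (real polygons need a geometry library), and the names `overlap_area` and `cell_weights` and the sample values are hypothetical illustrations, not part of `gdptools`.

```python
# Illustrative only: rectangles are (xmin, ymin, xmax, ymax) in projected units.
Rect = tuple[float, float, float, float]


def overlap_area(a: Rect, b: Rect) -> float:
    """Return the area shared by two axis-aligned rectangles (0.0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def cell_weights(cells: dict[str, Rect], polygon: Rect) -> dict[str, float]:
    """Weight each overlapping cell by its share of the polygon's area."""
    polygon_area = (polygon[2] - polygon[0]) * (polygon[3] - polygon[1])
    weights = {}
    for cell_id, cell in cells.items():
        area = overlap_area(cell, polygon)
        if area > 0.0:
            weights[cell_id] = area / polygon_area
    return weights


# Two 1x1 grid cells side by side, and a polygon straddling both.
cells = {"cell_a": (0.0, 0.0, 1.0, 1.0), "cell_b": (1.0, 0.0, 2.0, 1.0)}
polygon = (0.5, 0.0, 1.75, 1.0)
values = {"cell_a": 10.0, "cell_b": 20.0}  # e.g., one gridded value per cell

weights = cell_weights(cells, polygon)
# Area-weighted mean of the gridded values over the polygon.
weighted_mean = sum(values[c] * w for c, w in weights.items())
```

Because the two cells together cover the whole polygon, the weights sum to 1 and the weighted sum is directly the area-weighted mean.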

`gdptools` provides two weight generation classes:

- **[`WeightGen`](#grid-to-polygon-weightgen)**: Grid-to-polygon intersection weight calculation with multiple processing engines
- **[`WeightGenP2P`](#polygon-to-polygon-weightgenp2p)**: Polygon-to-polygon intersection weight calculation for complex geometries

Both classes support three processing methods (`"serial"`, `"parallel"`, and `"dask"`), so large datasets can be processed across multiple cores or on a distributed cluster.

```{contents}
:local:
:depth: 2
```

---

## Key Features

### Spatial Intersection Methods

- **Grid-to-polygon**: Calculate weights between regular grids and irregular polygons
- **Polygon-to-polygon**: Calculate weights between two sets of polygon geometries
- **Accurate area calculations**: Proper handling of coordinate reference systems and projections

### Processing Engines

- **Serial**: Sequential processing for smaller datasets or debugging
- **Parallel**: Multi-core processing for improved performance with many polygons
- **Dask**: Distributed computing for very large datasets or cluster environments

### Output Options

- **CSV files**: Save weights to disk for reuse in multiple analyses
- **In-memory DataFrames**: Direct processing without file I/O
- **Intersection geometries**: Optional detailed spatial intersection information

---

## Grid-to-Polygon (`WeightGen`)

The `WeightGen` class calculates intersection weights between gridded datasets (e.g., NetCDF, Zarr) and vector polygons. This is the most common use case for climate and environmental data analysis.

```{eval-rst}
.. autoclass:: gdptools.weight_gen.WeightGen
    :members:
    :undoc-members:
    :show-inheritance:
    :special-members: __init__
    :no-index:
```

### WeightGen Examples

#### Basic Weight Calculation

```python
# Basic weight generation with serial processing
# `climate_data` is assumed to be a user-data object prepared beforehand (e.g., UserCatData)
from gdptools.weight_gen import WeightGen

weight_gen = WeightGen(
    user_data=climate_data,
    method="serial",
    weight_gen_crs=6931  # Equal-area projection
)
weights = weight_gen.calculate_weights()

# Display results
print(f"Generated {len(weights)} weight entries")
print(f"Target polygons: {weights['target_id'].nunique()}")
print(f"Grid cells involved: {len(weights[['i_index', 'j_index']].drop_duplicates())}")
```

#### Parallel Processing with File Output

```python
# Parallel processing with weights saved to file
weight_gen = WeightGen(
    user_data=climate_data,
    method="parallel",
    weight_gen_crs=6931,
    output_file="climate_weights.csv",
    jobs=4,
    verbose=True
)
weights = weight_gen.calculate_weights()

# Access grid cell information
grid_cells = weight_gen.grid_cells
print(f"Grid cells: {len(grid_cells)} cells")
```

#### Advanced: Intersection Geometries

```python
# Calculate detailed intersection geometries
weight_gen = WeightGen(
    user_data=climate_data,
    method="parallel",
    weight_gen_crs=6931,
    jobs=4
)
weights = weight_gen.calculate_weights(intersections=True)

# Access detailed intersection information
intersections = weight_gen.intersections
print(f"Intersection polygons: {len(intersections)}")
```

---

## Polygon-to-Polygon (`WeightGenP2P`)

The `WeightGenP2P` class calculates intersection weights between two sets of polygon geometries. This is essential for transferring data between different administrative boundaries, ecological regions, or other polygon-based spatial datasets.

```{eval-rst}
.. autoclass:: gdptools.weight_gen_p2p.WeightGenP2P
    :members:
    :undoc-members:
    :show-inheritance:
    :special-members: __init__
    :no-index:
```

### WeightGenP2P Examples

#### Basic Polygon-to-Polygon Weights

```python
# Calculate weights between two polygon datasets
# `watersheds` and `counties` are assumed to be GeoDataFrames loaded beforehand
from gdptools.weight_gen_p2p import WeightGenP2P

weight_gen = WeightGenP2P(
    target_poly=watersheds,
    target_poly_idx="watershed_id",
    source_poly=counties,
    source_poly_idx="county_id",
    method="serial",
    weight_gen_crs=5070  # Equal-area projection
)
weights = weight_gen.calculate_weights()

# Analyze results
print(f"Watershed-county intersections: {len(weights)}")
print(f"Weight statistics: min={weights['weight'].min():.4f}, max={weights['weight'].max():.4f}")

# Verify weights sum to 1 for each target (holds only where source polygons fully cover the target)
weight_sums = weights.groupby('target_id')['weight'].sum()
print(f"Weight sum range: {weight_sums.min():.4f} to {weight_sums.max():.4f}")
```
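
The weight-sum check above can be turned into a reusable coverage test. The sketch below is library-free and works on plain `(target_id, weight)` pairs; the function name `find_partial_coverage`, the tolerance, and the sample rows are assumptions for illustration. With a DataFrame, the same logic is a `groupby(...).sum()` followed by a tolerance comparison.

```python
import math
from collections import defaultdict


def find_partial_coverage(rows, tolerance=1e-6):
    """Return {target_id: weight_sum} for targets whose weights do not sum to 1."""
    sums = defaultdict(float)
    for target_id, weight in rows:
        sums[target_id] += weight
    return {
        target_id: total
        for target_id, total in sums.items()
        if not math.isclose(total, 1.0, abs_tol=tolerance)
    }


# Hypothetical weight rows: (target_id, weight)
rows = [
    ("ws_1", 0.25), ("ws_1", 0.75),  # fully covered by source polygons
    ("ws_2", 0.40), ("ws_2", 0.35),  # part of the target has no source coverage
]
partial = find_partial_coverage(rows)
```

Targets returned by this check are candidates for closer inspection: gaps in the source layer, slivers at dataset edges, or invalid geometries can all produce sums below 1.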

#### Advanced Configuration with Intersections

```python
# Parallel processing with intersection geometries
weight_gen = WeightGenP2P(
    target_poly=target_regions,
    target_poly_idx="region_id",
    source_poly=source_zones,
    source_poly_idx="zone_id",
    method="parallel",
    weight_gen_crs=5070,
    output_file="region_zone_weights.csv",
    intersections=True,
    jobs=6,
    verbose=True
)
weights = weight_gen.calculate_weights()

# Access detailed intersection geometries
intersections = weight_gen.intersections
print(f"Generated {len(intersections)} intersection polygons")
print(f"Intersection areas: {intersections.geometry.area.describe()}")
```

---

## Type Definitions

The modules provide literal types for processing method configuration:

```{eval-rst}
.. autodata:: gdptools.weight_gen.WEIGHT_GEN_METHODS
    :annotation: = Literal["serial", "parallel", "dask"]

.. autodata:: gdptools.weight_gen_p2p.WEIGHT_GEN_METHODS
    :annotation: = Literal["serial", "parallel", "dask"]
```
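
A literal type like this can also drive a runtime check before an expensive run starts. The sketch below defines a local stand-in, `Method`, that mirrors the documented annotation rather than importing it from `gdptools`; `validate_method` is a hypothetical helper, not part of the library.

```python
from typing import Literal, get_args

# Local stand-in mirroring the documented annotation; not imported from gdptools.
Method = Literal["serial", "parallel", "dask"]


def validate_method(method: str) -> str:
    """Return the method unchanged, or raise ValueError if it is not a documented option."""
    allowed = get_args(Method)
    if method not in allowed:
        raise ValueError(f"method must be one of {allowed}, got {method!r}")
    return method
```

Validating the string up front gives a clear error message for typos such as `"paralel"` instead of a failure deep inside a long-running job.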

---

## Best Practices

### Coordinate Reference System Selection

- **Use equal-area projections** for accurate area calculations (e.g., Albers Equal Area, Lambert Azimuthal)
- **Choose appropriate regional projections** for your study area
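- **Common choices**: EPSG:5070 (NAD83 / CONUS Albers) for the conterminous United States, EPSG:6931 (WGS 84 / NSIDC EASE-Grid 2.0 North, a Lambert azimuthal equal-area projection for the Northern Hemisphere)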
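
To see why an equal-area projection matters, the following sketch compares the true surface area of a 1° × 1° cell at the equator and at 60°N. It assumes a spherical Earth with a mean radius of 6371 km, which is an approximation; `cell_area_km2` is a hypothetical helper written for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def cell_area_km2(lat_south: float, lat_north: float, lon_width_deg: float) -> float:
    """Area of a latitude/longitude cell on a sphere, in square kilometres."""
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return EARTH_RADIUS_KM**2 * math.radians(lon_width_deg) * band


equator_cell = cell_area_km2(0.0, 1.0, 1.0)
high_lat_cell = cell_area_km2(60.0, 61.0, 1.0)
ratio = high_lat_cell / equator_cell  # roughly one half
```

Both cells span exactly one square degree, yet the high-latitude cell covers about half the ground area. Weights computed from areas measured in degrees would treat the two cells as equal, which is the distortion an equal-area CRS avoids.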

### Performance Optimization

- **Serial method**: Use for small datasets (< 5,000 polygons) or debugging
- **Parallel method**: Optimal for moderate datasets (5,000–50,000 polygons)
- **Dask method**: Use for very large datasets (> 50,000 polygons) or cluster computing

### Scaling to Nationwide Datasets

Nationwide basins, parcels, or census geometries often contain hundreds of thousands of polygons and
millions of grid cells. At this scale, keep memory in check by combining polygon-count heuristics with
out-of-core execution:

| Target polygon count             | Recommended engine | Guidance                                                                                                 |
| -------------------------------- | ------------------ | -------------------------------------------------------------------------------------------------------- |
| < 5,000                          | `"serial"`         | Fits in memory on laptops; easiest for debugging                                                         |
| 5,000 – 50,000                   | `"parallel"`       | Start with `jobs` at half your physical cores; more workers duplicate the source dataset in memory       |
| > 50,000 or statewide/nationwide | `"dask"`           | Start a distributed/single-machine Dask cluster, persist the source grid, and let workers process chunks |
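
The thresholds in the table can be encoded as a small helper so that scripts pick an engine consistently. This is a sketch of the heuristic only: `suggest_engine` and `starting_jobs` are hypothetical names, and the physical core count is passed in explicitly because `os.cpu_count()` reports logical cores.

```python
def suggest_engine(n_polygons: int) -> str:
    """Map a target polygon count to the engine recommended in the table."""
    if n_polygons < 5_000:
        return "serial"
    if n_polygons <= 50_000:
        return "parallel"
    return "dask"


def starting_jobs(physical_cores: int) -> int:
    """Start with half the physical cores, but never fewer than one worker."""
    return max(1, physical_cores // 2)
```

Treat the result as a starting point: grid resolution and available memory can justify moving to Dask earlier than the polygon count alone suggests.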

Additional tips:

- **Chunk target polygons**: The parallel and Dask engines already process polygons in batches internally.
  Manual chunking is optional when you want to checkpoint intermediate results, mix different `jobs` settings,
  or keep individual weight files per region. Split large GeoDataFrames into batches (for example, 2,500–10,000
  polygons per batch) and call `WeightGen.calculate_weights()` on each batch only if you need that extra
  control.
- **Heed runtime warnings**: When `WeightGen` detects more than 50,000 polygons with the serial/parallel
  engines, or when `jobs=-1` would duplicate large raster datasets across workers, it emits `RuntimeWarning`
  messages. Treat those warnings as a signal to switch engines or lower `jobs` before memory becomes a
  bottleneck.
- **Tune `jobs` carefully**: Every worker holds its own copy of the source grid. Setting `jobs=-1` (all CPU
  cores) often causes out-of-memory errors on nationwide rasters; increase workers only after monitoring
  real memory usage. If you request more workers than physical CPUs, `WeightGen` caps the value and
  emits a warning so you know the engine throttled it.
- **Persist gridded data once**: When using `NHGFStacData` or `UserCatData`, open the dataset before chunking
  and pass the same `user_data` object to every batch to prevent repeated reads.
- **Use Dask heuristics**: If both `len(user_data.target_gdf) > 50_000` and the grid resolution is ≤ 5 km,
  plan on scheduling with Dask; the serial/parallel engines are likely to exhaust available memory.
- **Avoid intersections unless necessary**: `intersections=True` multiplies memory use; only enable it when
  you explicitly need geometries for QA/QC.
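
Because every worker holds its own copy of the source grid, a rough upper bound on `jobs` follows from dividing a memory budget by the grid size. The helper below is a back-of-the-envelope sketch, not something `gdptools` provides; `max_jobs_for_memory`, the 50% headroom default, and the example sizes are all assumptions.

```python
def max_jobs_for_memory(
    grid_bytes: int,
    available_bytes: int,
    cpu_count: int,
    headroom: float = 0.5,
) -> int:
    """Largest worker count whose grid copies fit within a fraction of available memory."""
    budget = available_bytes * headroom  # leave room for geometries and results
    by_memory = int(budget // grid_bytes)
    return max(1, min(cpu_count, by_memory))


GIB = 2**30
# Hypothetical machine: 3 GiB source grid, 32 GiB free memory, 16 cores.
jobs = max_jobs_for_memory(grid_bytes=3 * GIB, available_bytes=32 * GIB, cpu_count=16)
```

In this example memory, not core count, is the binding limit, so the estimate lands well below `jobs=-1`. Confirm any such estimate against observed memory usage before scaling up.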

#### Optional chunking workflow example

```python
import geopandas as gpd
import pandas as pd
from gdptools import WeightGen

TARGET_BATCH_SIZE = 5000
target_gdf = gpd.read_file("conus_huc12.gpkg")

def iter_batches(gdf, batch_size):
    """Yield consecutive row slices of at most `batch_size` polygons."""
    for start in range(0, len(gdf), batch_size):
        yield gdf.iloc[start : start + batch_size]

# `user_data` is assumed to be a user-data object prepared earlier (e.g., UserCatData)
weight_tables = []
for batch in iter_batches(target_gdf, TARGET_BATCH_SIZE):
    user_data.target_gdf = batch  # reuse prepared UserData instance
    wg = WeightGen(user_data=user_data, method="parallel", jobs=4, weight_gen_crs=6931)  # tune jobs to memory
    weight_tables.append(wg.calculate_weights())

weights = pd.concat(weight_tables, ignore_index=True)
weights.to_parquet("weights_conus_huc12.parquet")
```

#### Dask cluster quick start

```python
from dask.distributed import Client
from gdptools import WeightGen

client = Client(n_workers=8, threads_per_worker=1, memory_limit="8GB")
weight_gen = WeightGen(
    user_data=user_data,
    method="dask",
    weight_gen_crs=6931,
    jobs=2,  # per-worker partitions; increase gradually to avoid duplicating large arrays
)
weights = weight_gen.calculate_weights()
```

Monitor the [Dask dashboard](http://localhost:8787) to confirm that each worker stays below its memory
limit. Increase the number of workers instead of threads per worker when topology calculations dominate.

### Memory Management

- **Set appropriate jobs parameter**: Balance between speed and memory usage
- **Save weights to files**: Avoid recalculating expensive weight operations
- **Use intersections=True judiciously**: Only when detailed geometry is needed

### Data Validation

- **Check CRS consistency**: Ensure source and target data have proper CRS information
- **Validate geometry**: Remove invalid polygons before weight calculation
- **Test with subsets**: Verify results with smaller datasets before full processing

### Weight File Management

- **Descriptive filenames**: Include dataset, CRS, and date information
- **Version control**: Track weight files with data provenance
- **Backup important weights**: Weight calculation can be expensive to repeat
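
One way to apply these naming and reuse guidelines is sketched below. The naming pattern, the helper names `weight_filename` and `needs_recalculation`, and the `"gridmet"`/`"huc12"` labels are illustrative assumptions rather than a `gdptools` convention.

```python
from datetime import date
from pathlib import Path


def weight_filename(dataset: str, target: str, crs_epsg: int, on: date) -> str:
    """Build a descriptive name containing dataset, target layer, CRS, and date."""
    return f"weights_{dataset}_{target}_epsg{crs_epsg}_{on.isoformat()}.csv"


def needs_recalculation(path: Path) -> bool:
    """Recalculate only when no non-empty weight file is already on disk."""
    return not (path.is_file() and path.stat().st_size > 0)


name = weight_filename("gridmet", "huc12", 6931, date(2024, 1, 15))
```

A script can call `needs_recalculation(Path(name))` before constructing a weight generator and skip the calculation when a usable file already exists.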

```{note}
Weight generation is typically a one-time operation for a given source-target pair. Save weights to files for reuse across multiple analyses to avoid expensive recalculation.
```

```{warning}
Always use equal-area coordinate reference systems for weight calculations. Using geographic coordinates (latitude/longitude) will result in inaccurate area calculations, especially at high latitudes.
```

```{tip}
For large datasets, consider using the Dask method with a properly configured Dask cluster for optimal performance. The parallel method is usually sufficient for most desktop computing scenarios.
```

---

## Common Use Cases

### Climate Data Analysis

- **Gridded climate data → Administrative boundaries**: Weather/climate statistics by county, state, or watershed
- **Model output → Ecological regions**: Climate model results aggregated to ecoregions
- **Reanalysis data → Custom polygons**: Historical climate data for user-defined areas

### Environmental Assessment

- **Satellite data → Land management units**: Environmental monitoring by management area
- **Pollution models → Population centers**: Exposure assessment for urban areas
- **Ecosystem services → Political boundaries**: Natural resource accounting by jurisdiction

### Hydrological Applications

- **Precipitation grids → Catchments**: Rainfall statistics by watershed
- **Evapotranspiration → Irrigation districts**: Water balance calculations
- **Streamflow → Administrative units**: Water resource management applications