Weight Generation Classes#

Weight generation is the process of calculating spatial intersection weights between gridded datasets and vector polygon geometries. These weights represent the proportional area of overlap and are fundamental for performing accurate area-weighted statistical aggregations in climate and environmental data analysis.
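To make the idea of a proportional-overlap weight concrete, here is a minimal pure-Python sketch. It is not gdptools code: it assumes a regular grid in projected units and a rectangular "polygon", and it normalises each overlap by the polygon's area, which is one common convention. The function names are hypothetical.

```python
def overlap_1d(a_min, a_max, b_min, b_max):
    """Length of the overlap between intervals [a_min, a_max] and [b_min, b_max]."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def rectangle_grid_weights(poly, x_edges, y_edges):
    """Return {(i, j): weight} for a rectangular 'polygon' over a regular grid.

    poly is (xmin, ymin, xmax, ymax) in projected (equal-area) units.
    Each weight is the overlap area divided by the polygon area (an assumption).
    """
    xmin, ymin, xmax, ymax = poly
    poly_area = (xmax - xmin) * (ymax - ymin)
    weights = {}
    for i in range(len(x_edges) - 1):
        dx = overlap_1d(xmin, xmax, x_edges[i], x_edges[i + 1])
        for j in range(len(y_edges) - 1):
            dy = overlap_1d(ymin, ymax, y_edges[j], y_edges[j + 1])
            if dx > 0 and dy > 0:
                weights[(i, j)] = dx * dy / poly_area
    return weights


# A 2 x 2 grid of unit cells and a polygon that straddles all four cells.
x_edges = [0.0, 1.0, 2.0]
y_edges = [0.0, 1.0, 2.0]
weights = rectangle_grid_weights((0.5, 0.5, 2.0, 1.5), x_edges, y_edges)

# Area-weighted mean of a gridded variable using those weights.
values = {(0, 0): 10.0, (1, 0): 20.0, (0, 1): 30.0, (1, 1): 40.0}
weighted_mean = sum(values[cell] * w for cell, w in weights.items())
```

Because the polygon lies entirely inside the grid, the weights sum to 1 and the weighted sum is already the area-weighted mean. Real polygons are irregular, which is why the classes below rely on geometric intersection rather than interval arithmetic.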

gdptools provides comprehensive weight generation capabilities:

WeightGen: Grid-to-polygon intersection weight calculation with multiple processing engines
WeightGenP2P: Polygon-to-polygon intersection weight calculation for complex geometries

Both classes support various processing methods and can handle large datasets efficiently through parallel and distributed computing options.

Key Features #

Spatial Intersection Methods #

Grid-to-polygon: Calculate weights between regular grids and irregular polygons
Polygon-to-polygon: Calculate weights between two sets of polygon geometries
Accurate area calculations: Proper handling of coordinate reference systems and projections

Processing Engines #

Serial: Sequential processing for smaller datasets or debugging
Parallel: Multi-core processing for improved performance with many polygons
Dask: Distributed computing for very large datasets or cluster environments

Output Options #

CSV files: Save weights to disk for reuse in multiple analyses
In-memory DataFrames: Direct processing without file I/O
Intersection geometries: Optional detailed spatial intersection information

Grid-to-Polygon (`WeightGen`)#

The WeightGen class calculates intersection weights between gridded datasets (e.g., NetCDF, Zarr) and vector polygons. This is the most common use case for climate and environmental data analysis.

class WeightGen(*, user_data, method, weight_gen_crs, output_file=None, jobs=-1, verbose=False)

Bases: object

Calculates grid-to-polygon intersection weights for area-weighted aggregation.

This class computes spatial intersection weights between gridded datasets and vector polygon geometries. The weights represent the proportional area of overlap and are essential for accurate area-weighted statistical aggregations.

Parameters:

user_data (UserData) – Input data container with source grid and target polygons.
method (str) – Processing method for weight calculation.
weight_gen_crs (str | int | CRS) – Coordinate reference system for weight calculations. Accepts EPSG codes, WKT strings, or pyproj CRS objects.
output_file (str | None) – Path to save weights CSV file. If None, weights are not saved.
jobs (int | None) – Number of processors for parallel or dask methods. -1 uses all available.
verbose (bool | None) – Whether to print detailed processing information.

Attributes:

grid_cells – GeoDataFrame of source grid cells after processing.
intersections – GeoDataFrame of polygon intersections (if calculated).

Raises:

TypeError – If method is not one of the supported processing methods.

Examples

Basic serial weight calculation:

>>> weight_gen = WeightGen(
...     user_data=my_data,
...     method="serial",
...     weight_gen_crs=6931
... )
>>> weights = weight_gen.calculate_weights()

Parallel processing with file output:

>>> weight_gen = WeightGen(
...     user_data=my_data,
...     method="parallel",
...     weight_gen_crs=6931,
...     output_file="weights.csv",
...     jobs=4,
...     verbose=True
... )
>>> weights = weight_gen.calculate_weights()

__init__(*, user_data, method, weight_gen_crs, output_file=None, jobs=-1, verbose=False)

Initialize the WeightGen class with configuration parameters.

Sets up the weight generation system by configuring the processing method, coordinate reference system, and output options.

Parameters:

user_data (UserData) – Input data container with source grid and target polygons. Must be an instance of UserData subclass (UserCatData, NHGFStacData, etc.).
method (str) – Processing method for weight calculation (‘serial’, ‘parallel’, ‘dask’). (‘dask’ is deprecated; removal in 0.4.0.)
weight_gen_crs (str | int | CRS) – Coordinate reference system for calculations. Accepts EPSG codes, WKT strings, or pyproj CRS objects.
output_file (str | None) – Path to save weights as CSV file. If None, no file is saved.
jobs (int | None) – Number of processors for parallel/dask methods. -1 uses all available.
verbose (bool | None) – If True, prints detailed processing information during execution.

Raises:

TypeError – If method is not one of the supported processing methods.

calculate_weights(intersections=False)

Calculate spatial intersection weights between grid and polygons.

Computes area-weighted intersection weights for each target polygon with source grid cells. The weights represent the proportional area of overlap and are used for accurate area-weighted aggregations.

Parameters:

intersections (bool) – If True, calculate and store detailed intersection geometries between target and source polygons. This provides additional spatial information but increases memory usage.

Returns:

The calculated weights with columns such as: target_id, i_index, j_index, and weight (0.0-1.0).

Return type:

pandas.DataFrame

Notes

Processing time depends on the number of polygons and grid resolution. Use intersections=True only when detailed geometric information is needed.

Examples

>>> weights = weight_gen.calculate_weights()
>>> print(f"Calculated {len(weights)} weight entries")

>>> # With intersection details
>>> weights = weight_gen.calculate_weights(intersections=True)
>>> intersections = weight_gen.intersections

property grid_cells: GeoDataFrame

Get the source grid cells as a GeoDataFrame.

Returns the grid cells used in weight calculations. These represent the source grid geometry after processing and CRS transformation.

Returns:

Source grid cells with spatial geometry and grid indices (i_index, j_index).

Return type:

geopandas.GeoDataFrame

Notes

This property is only populated after calling calculate_weights(). If accessed before weight calculation, a message will be printed.

property intersections: GeoDataFrame

Get the polygon intersection geometries as a GeoDataFrame.

Returns the detailed intersection geometries between target polygons and source grid cells. This provides the actual spatial overlap areas used in weight calculations.

Returns:

Intersection geometries with target/source identifiers and calculated areas.

Return type:

geopandas.GeoDataFrame

Notes

This property is only populated after calling calculate_weights(intersections=True). If accessed otherwise, a message will be printed.

WeightGen Examples #

Basic Weight Calculation#

# Basic weight generation with serial processing
from gdptools.weight_gen import WeightGen

weight_gen = WeightGen(
    user_data=climate_data,
    method="serial",
    weight_gen_crs=6931  # Equal-area projection
)
weights = weight_gen.calculate_weights()

# Display results
print(f"Generated {len(weights)} weight entries")
print(f"Target polygons: {weights['target_id'].nunique()}")
print(f"Grid cells involved: {len(weights[['i_index', 'j_index']].drop_duplicates())}")

Parallel Processing with File Output#

# Parallel processing with weights saved to file
weight_gen = WeightGen(
    user_data=climate_data,
    method="parallel",
    weight_gen_crs=6931,
    output_file="climate_weights.csv",
    jobs=4,
    verbose=True
)
weights = weight_gen.calculate_weights()

# Access grid cell information
grid_cells = weight_gen.grid_cells
print(f"Grid cells: {len(grid_cells)} cells")

Advanced: Intersection Geometries#

# Calculate detailed intersection geometries
weight_gen = WeightGen(
    user_data=climate_data,
    method="parallel",
    weight_gen_crs=6931,
    jobs=4
)
weights = weight_gen.calculate_weights(intersections=True)

# Access detailed intersection information
intersections = weight_gen.intersections
print(f"Intersection polygons: {len(intersections)}")
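Once a weight table exists, applying it is a grouped weighted average. The sketch below is illustrative only and does not call gdptools: the rows mimic the column names listed for calculate_weights() (target_id, i_index, j_index, weight), the grid values are made up, and indexing the grid as grid[i][j] is an assumption about axis order.

```python
from collections import defaultdict

# Hypothetical weight rows using the column names listed for calculate_weights().
weight_rows = [
    {"target_id": "A", "i_index": 0, "j_index": 0, "weight": 0.25},
    {"target_id": "A", "i_index": 0, "j_index": 1, "weight": 0.75},
    {"target_id": "B", "i_index": 1, "j_index": 1, "weight": 1.0},
]

# Hypothetical gridded values for one time step, indexed as grid[i][j].
grid = [
    [2.0, 6.0],
    [4.0, 8.0],
]


def area_weighted_means(rows, grid):
    """Return {target_id: weighted mean}, normalising by the weights present."""
    numer = defaultdict(float)
    denom = defaultdict(float)
    for row in rows:
        value = grid[row["i_index"]][row["j_index"]]
        numer[row["target_id"]] += row["weight"] * value
        denom[row["target_id"]] += row["weight"]
    return {tid: numer[tid] / denom[tid] for tid in numer}


means = area_weighted_means(weight_rows, grid)
```

Dividing by the sum of the weights for each target keeps the result a true mean even when a polygon is only partly covered by the grid.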

Polygon-to-Polygon (`WeightGenP2P`)#

The WeightGenP2P class calculates intersection weights between two sets of polygon geometries. This is essential for transferring data between different administrative boundaries, ecological regions, or other polygon-based spatial datasets.

class WeightGenP2P(*, target_poly, target_poly_idx, source_poly, source_poly_idx, method, weight_gen_crs, output_file=None, jobs=-1, intersections=False, verbose=False)

Bases: object

Calculates polygon-to-polygon intersection weights for spatial data transfer.

This class computes spatial intersection weights between two sets of polygon geometries, enabling accurate transfer of data between different administrative boundaries, ecological regions, or other polygon-based spatial datasets.

Parameters:

target_poly (GeoDataFrame) – GeoDataFrame containing target polygons for weight calculation.
target_poly_idx (str) – Column name for unique identifiers of target polygons.
source_poly (GeoDataFrame) – GeoDataFrame containing source polygons for weight calculation.
source_poly_idx (str | list[str]) – Column name(s) for unique identifiers of source polygons.
method (Literal['serial', 'parallel', 'dask']) – Processing method for weight calculation.
weight_gen_crs (str | int | CRS) – Coordinate reference system for weight calculations. Accepts EPSG codes, WKT strings, or pyproj CRS objects.
output_file (str | None) – Path to save weights CSV file. If None, weights are not saved.
jobs (int | None) – Number of processors for parallel or dask methods. -1 uses all available.
intersections (bool) – Whether to calculate and store detailed intersection geometries.
verbose (bool) – Whether to print detailed processing information.

Attributes:

intersections – GeoDataFrame of polygon intersections (if calculated).

Raises:

TypeError – If method is not one of the supported processing methods.

Examples

Basic polygon-to-polygon weights:

>>> weight_gen = WeightGenP2P(
...     target_poly=watersheds,
...     target_poly_idx="watershed_id",
...     source_poly=counties,
...     source_poly_idx="county_id",
...     method="serial",
...     weight_gen_crs=5070
... )
>>> wght = weight_gen.calculate_weights()

Parallel processing with intersections:

>>> weight_gen = WeightGenP2P(
...     target_poly=regions,
...     target_poly_idx="region_id",
...     source_poly=zones,
...     source_poly_idx="zone_id",
...     method="parallel",
...     weight_gen_crs=5070,
...     intersections=True,
...     jobs=4
... )
>>> wght = weight_gen.calculate_weights()
>>> intersections_gdf = weight_gen.intersections

__init__(*, target_poly, target_poly_idx, source_poly, source_poly_idx, method, weight_gen_crs, output_file=None, jobs=-1, intersections=False, verbose=False)

Initialize the WeightGenP2P class with configuration parameters.

Sets up the polygon-to-polygon weight generation system by configuring the source and target geometries, processing method, and output options.

Parameters:

target_poly (GeoDataFrame) – GeoDataFrame containing target polygons. Must include the column specified in target_poly_idx and geometry column.
target_poly_idx (str) – Column name for target polygon unique identifiers.
source_poly (GeoDataFrame) – GeoDataFrame containing source polygons. Must include the column(s) specified in source_poly_idx and geometry column.
source_poly_idx (str | list[str]) – Column name(s) for source polygon unique identifiers. Can be a single column name or list of column names.
method (Literal['serial', 'parallel', 'dask']) – Processing method for weight calculation (‘serial’, ‘parallel’, ‘dask’). (‘dask’ is deprecated; removal in 0.4.0.)
weight_gen_crs (str | int | CRS) – Coordinate reference system for calculations. Accepts EPSG codes, WKT strings, or pyproj CRS objects.
output_file (str | None) – Path to save weights as CSV file. If None, no file is saved.
jobs (int | None) – Number of processors for parallel/dask methods. -1 uses half available.
intersections (bool) – If True, calculate and store detailed intersection geometries.
verbose (bool) – If True, prints detailed processing information during execution.

Raises:

TypeError – If method is not one of the supported processing methods.

Notes

Input polygons are automatically dissolved by their ID columns and sorted for consistent processing. Invalid geometries should be cleaned beforehand.

calculate_weights()

Calculate spatial intersection weights between polygon sets.

Computes area-weighted intersection weights between target and source polygons. The weights represent the proportional area contribution of each source polygon to each target polygon.

Returns:

A DataFrame containing the calculated weights with columns:

target_id: Identifier for the target polygon.
source_id: Identifier for the source polygon.
wght: Proportional area of the source polygon within the target (0.0-1.0).
source_id_area: Total area of the source polygon (for extensive variables).
target_id_area: Total area of the target polygon (for diagnostics).
area_weight: Area of the intersection.

Return type:

pandas.DataFrame

Notes

For spatially continuous source polygons without gaps or overlaps, the wght values for each target polygon should sum to 1.0.

Examples

>>> wght = weight_gen.calculate_weights()
>>> print(f"Calculated {len(wght)} weight entries")
>>> print(f"Weight range: {wght['wght'].min():.4f} to {wght['wght'].max():.4f}")

>>> # Verify weights sum to 1 for each target (if source is continuous)
>>> weight_sums = wght.groupby('target_id')["wght"].sum()
>>> print(f"Weight sum range: {weight_sums.min():.4f} to {weight_sums.max():.4f}")

property intersections: GeoDataFrame

Get the polygon intersection geometries as a GeoDataFrame.

Returns the detailed intersection geometries between target and source polygons. These represent the actual spatial overlap areas used in weight calculations.

Returns:

Intersection geometries with target and source identifiers, calculated areas, and intersection polygons.

Return type:

geopandas.GeoDataFrame

Notes

This property is only populated after calling calculate_weights() with intersections=True. If accessed otherwise, a message will be printed.

Examples

>>> weight_gen = WeightGenP2P(..., intersections=True)
>>> wght = weight_gen.calculate_weights()
>>> intersections = weight_gen.intersections
>>> print(f"Intersection areas: {intersections.geometry.area.describe()}")

WeightGenP2P Examples #

Basic Polygon-to-Polygon Weights#

# Calculate weights between two polygon datasets
from gdptools.weight_gen_p2p import WeightGenP2P

weight_gen = WeightGenP2P(
    target_poly=watersheds,
    target_poly_idx="watershed_id",
    source_poly=counties,
    source_poly_idx="county_id",
    method="serial",
    weight_gen_crs=5070  # Equal-area projection
)
weights = weight_gen.calculate_weights()

# Analyze results
print(f"Watershed-county intersections: {len(weights)}")
print(f"Weight statistics: min={weights['wght'].min():.4f}, max={weights['wght'].max():.4f}")

# Verify weights sum to 1 for each target (if source is continuous)
weight_sums = weights.groupby('target_id')['wght'].sum()
print(f"Weight sum range: {weight_sums.min():.4f} to {weight_sums.max():.4f}")

Advanced Configuration with Intersections#

# Parallel processing with intersection geometries
weight_gen = WeightGenP2P(
    target_poly=target_regions,
    target_poly_idx="region_id",
    source_poly=source_zones,
    source_poly_idx="zone_id",
    method="parallel",
    weight_gen_crs=5070,
    output_file="region_zone_weights.csv",
    intersections=True,
    jobs=6,
    verbose=True
)
weights = weight_gen.calculate_weights()

# Access detailed intersection geometries
intersections = weight_gen.intersections
print(f"Generated {len(intersections)} intersection polygons")
print(f"Intersection areas: {intersections.geometry.area.describe()}")
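The columns returned by calculate_weights() suggest two different ways to move values from source to target polygons. The sketch below is an interpretation, not gdptools code: it assumes wght is the intersection area divided by the target area, uses made-up rows and source values, and treats densities (intensive) and totals (extensive) differently. The function names are hypothetical.

```python
# Hypothetical rows using the documented column names.
rows = [
    {"target_id": "W1", "source_id": "C1", "wght": 0.6,
     "source_id_area": 100.0, "target_id_area": 50.0, "area_weight": 30.0},
    {"target_id": "W1", "source_id": "C2", "wght": 0.4,
     "source_id_area": 80.0, "target_id_area": 50.0, "area_weight": 20.0},
]


def transfer_intensive(rows, source_values):
    """Weighted mean of an intensive variable (e.g. a density or a rate)."""
    totals, weight_sums = {}, {}
    for r in rows:
        t = r["target_id"]
        totals[t] = totals.get(t, 0.0) + r["wght"] * source_values[r["source_id"]]
        weight_sums[t] = weight_sums.get(t, 0.0) + r["wght"]
    return {t: totals[t] / weight_sums[t] for t in totals}


def transfer_extensive(rows, source_values):
    """Apportion an extensive variable (e.g. a count) by share of source area."""
    totals = {}
    for r in rows:
        share = r["area_weight"] / r["source_id_area"]
        t = r["target_id"]
        totals[t] = totals.get(t, 0.0) + share * source_values[r["source_id"]]
    return totals


density = transfer_intensive(rows, {"C1": 10.0, "C2": 20.0})
population = transfer_extensive(rows, {"C1": 1000.0, "C2": 400.0})
```

The extensive case assumes the quantity is spread uniformly within each source polygon, which is the usual simplification in areal interpolation.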

Type Definitions #

The modules provide literal types for processing method configuration:

WEIGHT_GEN_METHODS#

Available weight generation processing methods.

Options:

serial: Sequential processing through polygons one by one. Best for small datasets or debugging.
parallel: Multi-core processing with polygon chunks distributed across CPUs. Optimal for moderate datasets with many polygons.
dask: Distributed computing using the Dask framework. Ideal for large datasets or cluster environments. Deprecated; will be removed in gdptools 0.4.0. Use ‘parallel’ or ‘serial’.

Note

Choose the method based on your computational resources and dataset size. Serial is most reliable, parallel offers good speedup for many polygons, and dask excels with very large datasets or when a Dask client is available.

Examples

>>> method = "serial"  # For smaller datasets
>>> method = "parallel"  # For larger datasets
>>> method = "dask"  # For very large or distributed datasets  # deprecated

alias of Literal[‘serial’, ‘parallel’, ‘dask’]

WEIGHT_GEN_METHODS#

Available polygon-to-polygon weight generation processing methods.

Options:

serial: Sequential processing through polygon pairs one by one. Best for small datasets with few polygons or debugging.
parallel: Multi-core processing with polygon chunks distributed across CPUs. Optimal for moderate datasets with many polygon intersections.
dask: Distributed computing using the Dask framework. Ideal for very large datasets or cluster environments. Deprecated; will be removed in gdptools 0.4.0. Use ‘parallel’ or ‘serial’.

Notes

Choose the method based on your computational resources and dataset complexity. Serial is most reliable, parallel offers good speedup for many intersections, and dask excels with very large polygon datasets or distributed computing.

Examples

>>> method = "serial"  # For smaller datasets
>>> method = "parallel"  # For larger datasets
>>> method = "dask"  # For very large or distributed datasets  # deprecated

alias of Literal[‘serial’, ‘parallel’, ‘dask’]
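A Literal alias like the ones above can double as a runtime whitelist. The following sketch defines a local stand-in alias rather than importing anything from gdptools, and the check_method helper is hypothetical; it only illustrates how a TypeError for an unsupported method, as described in the class documentation, could be produced.

```python
from typing import Literal, get_args

# Local stand-in for the documented alias; not imported from gdptools.
Methods = Literal["serial", "parallel", "dask"]


def check_method(method: str) -> str:
    """Return method unchanged, or raise TypeError if it is not a listed option."""
    allowed = get_args(Methods)  # ("serial", "parallel", "dask")
    if method not in allowed:
        raise TypeError(f"method must be one of {allowed}, got {method!r}")
    return method
```

Calling check_method("parallel") returns the string unchanged, while an unlisted value such as "gpu" raises TypeError before any expensive work starts.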

Best Practices #

Coordinate Reference System Selection #

Use equal-area projections for accurate area calculations (e.g., Albers Equal Area, Lambert Azimuthal)
Choose appropriate regional projections for your study area
Common choices: EPSG:5070 (NAD83 / CONUS Albers), EPSG:6931 (WGS 84 / NSIDC EASE-Grid 2.0 North, a Lambert azimuthal equal-area projection)

Performance Optimization #

Serial method: Use for small datasets (<5000 polygons) or debugging
Parallel method: Optimal for moderate datasets (5000-50,000 polygons)
Dask method: Use for very large datasets (>50,000 polygons) or cluster computing. Note that ‘dask’ is deprecated and scheduled for removal in gdptools 0.4.0.
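The thresholds above can be turned into a small helper when the method is chosen programmatically. This is a sketch of the rule of thumb only; suggest_method is a hypothetical name, not part of gdptools, and the cut-offs are the approximate ones listed in this section.

```python
def suggest_method(n_polygons: int) -> str:
    """Map a target polygon count to a processing method using the listed thresholds."""
    if n_polygons < 0:
        raise ValueError("n_polygons must be non-negative")
    if n_polygons < 5_000:
        return "serial"
    if n_polygons <= 50_000:
        return "parallel"
    # The API reference marks "dask" as deprecated (removal planned for 0.4.0),
    # so callers may prefer "parallel" with manual batching here instead.
    return "dask"
```

Polygon count is only a proxy: grid resolution and available memory matter as well, so treat the result as a starting point.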

Scaling to Nationwide Datasets #

Nationwide basins, parcels, or census geometries often contain hundreds of thousands of polygons and millions of grid cells. At this scale, keep memory in check by combining polygon-count heuristics with out-of-core execution:

Target polygon count	Recommended engine	Guidance
< 5,000	`"serial"`	Fits in memory on laptops; easiest for debugging
5,000 – 50,000	`"parallel"`	Start with `jobs` at half your physical cores; more workers duplicate the source dataset in memory
> 50,000 or statewide/nationwide	`"dask"`	Start a distributed/single-machine Dask cluster, persist the source grid, and let workers process chunks

Additional tips:

Chunk target polygons: The parallel and Dask engines already process polygons in batches internally. Manual chunking is optional when you want to checkpoint intermediate results, mix different jobs settings, or keep individual weight files per region. Split large GeoDataFrames into batches (for example, 2,500–10,000 polygons per batch) and call WeightGen.calculate_weights() on each batch only if you need that extra control.
Heed runtime warnings: When WeightGen detects more than 50,000 polygons with the serial/parallel engines, or when jobs=-1 would duplicate large raster datasets across workers, it emits RuntimeWarning messages. Treat those warnings as a signal to switch engines or lower jobs before memory becomes a bottleneck.
Tune jobs carefully: Every worker holds its own copy of the source grid. Setting jobs=-1 (all CPU cores) often causes out-of-memory errors on nationwide rasters; increase workers only after monitoring real memory usage. If you request more workers than physical CPUs, WeightGen caps the value and emits a warning so you know the engine throttled it.
Persist gridded data once: When using NHGFStacData or UserCatData, open the dataset before chunking and pass the same user_data object to every batch to prevent repeated reads.
Use Dask heuristics: If both len(user_data.target_gdf) > 50_000 and the grid resolution is ≤ 5 km, plan on scheduling with Dask; the serial/parallel engines are likely to exhaust available memory.
Avoid intersections unless necessary: intersections=True multiplies memory use; only enable it when you explicitly need geometries for QA/QC.

Optional chunking workflow example#

import geopandas as gpd
import pandas as pd
from gdptools import WeightGen

TARGET_BATCH_SIZE = 5000
target_gdf = gpd.read_file("conus_huc12.gpkg")

def iter_batches(gdf, batch_size):
    """Yield consecutive slices of gdf with at most batch_size rows."""
    for start in range(0, len(gdf), batch_size):
        yield gdf.iloc[start : start + batch_size]

# user_data is assumed to be a UserData instance (e.g., UserCatData) prepared earlier.
weight_tables = []
for batch in iter_batches(target_gdf, TARGET_BATCH_SIZE):
    user_data.target_gdf = batch  # reuse prepared UserData instance
    wg = WeightGen(user_data=user_data, method="parallel", jobs=4, weight_gen_crs=6931)  # tune jobs to memory
    weight_tables.append(wg.calculate_weights())

# Combine the per-batch tables and write them out once.
weights = pd.concat(weight_tables, ignore_index=True)
weights.to_parquet("weights_conus_huc12.parquet")

Dask cluster quick start#

from dask.distributed import Client
from gdptools import WeightGen

client = Client(n_workers=8, threads_per_worker=1, memory_limit="8GB")
weight_gen = WeightGen(
    user_data=user_data,
    method="dask",
    weight_gen_crs=6931,
    jobs=2,  # per-worker partitions; increase gradually to avoid duplicating large arrays
)
weights = weight_gen.calculate_weights()

Monitor the Dask dashboard to ensure workers stay below the available memory. Increase the number of workers instead of threads per worker when topology calculations dominate.

Memory Management #

Set appropriate jobs parameter: Balance between speed and memory usage
Save weights to files: Avoid recalculating expensive weight operations
Use intersections=True judiciously: Only when detailed geometry is needed
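One way to balance speed against memory is to cap the worker count by both CPU count and the number of source-grid copies that fit in a memory budget. The helper below is a hypothetical sketch, not a gdptools function; it assumes each worker holds one full copy of the source grid and uses half the logical cores reported by os.cpu_count() as the starting point.

```python
import os


def suggest_jobs(grid_bytes, memory_budget_bytes, cpu_count=None):
    """Suggest a worker count, assuming each worker holds one copy of the source grid."""
    if grid_bytes <= 0:
        raise ValueError("grid_bytes must be positive")
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    by_cpu = max(1, cpus // 2)  # start at half the available cores
    by_memory = max(1, memory_budget_bytes // grid_bytes)  # copies that fit in the budget
    return int(min(by_cpu, by_memory))


GIB = 1024 ** 3
jobs_small_grid = suggest_jobs(2 * GIB, 16 * GIB, cpu_count=16)
jobs_large_grid = suggest_jobs(4 * GIB, 8 * GIB, cpu_count=16)
```

The estimate ignores intermediate geometries and intersection results, so actual memory use should still be monitored before raising jobs.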

Data Validation #

Check CRS consistency: Ensure source and target data have proper CRS information
Validate geometry: Remove invalid polygons before weight calculation
Test with subsets: Verify results with smaller datasets before full processing

Weight File Management #

Descriptive filenames: Include dataset, CRS, and date information
Version control: Track weight files with data provenance
Backup important weights: Weight calculation can be expensive to repeat
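A descriptive filename can be generated rather than typed by hand. The helper below is a hypothetical convention, not something gdptools prescribes; the dataset and target labels in the usage line are placeholders.

```python
from datetime import date


def weight_filename(dataset, target, epsg, created=None):
    """Build a weights filename that records dataset, target, CRS, and date."""
    created = created or date.today()
    return f"weights_{dataset}_{target}_epsg{epsg}_{created:%Y%m%d}.csv"


# Placeholder labels; pass the result as output_file when constructing a weight generator.
example_name = weight_filename("gridmet", "huc12", 5070, date(2024, 1, 15))
```

Encoding the EPSG code in the name makes it harder to reuse weights that were computed in a different projection by mistake.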

Note

Weight generation is typically a one-time operation for a given source-target pair. Save weights to files for reuse across multiple analyses to avoid expensive recalculation.

Warning

Always use equal-area coordinate reference systems for weight calculations. Using geographic coordinates (latitude/longitude) will result in inaccurate area calculations, especially at high latitudes.
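The size of the error is easy to see with a spherical approximation: a cell that spans one degree in each direction covers much less ground near the poles than at the equator. The sketch below uses the standard formula for the area of a latitude/longitude cell on a sphere and assumes a mean Earth radius of 6371 km; it is independent of gdptools.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def cell_area_km2(lat_south, lat_north, lon_width_deg):
    """Area of a latitude/longitude cell on a sphere, in square kilometres."""
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return EARTH_RADIUS_KM ** 2 * math.radians(lon_width_deg) * band


equator_cell = cell_area_km2(0.0, 1.0, 1.0)
high_lat_cell = cell_area_km2(60.0, 61.0, 1.0)
ratio = high_lat_cell / equator_cell  # roughly one half
```

Treating degrees as planar units would give both cells the same "area" of one square degree, so weights computed that way over-represent high-latitude cells relative to their true footprint.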

Tip

For large datasets, the Dask method with a properly configured Dask cluster can improve performance, but note that ‘dask’ is deprecated and scheduled for removal in gdptools 0.4.0. The parallel method is usually sufficient for most desktop computing scenarios.

Common Use Cases #

Climate Data Analysis #

Gridded climate data → Administrative boundaries: Weather/climate statistics by county, state, or watershed
Model output → Ecological regions: Climate model results aggregated to ecoregions
Reanalysis data → Custom polygons: Historical climate data for user-defined areas

Environmental Assessment #

Satellite data → Land management units: Environmental monitoring by management area
Pollution models → Population centers: Exposure assessment for urban areas
Ecosystem services → Political boundaries: Natural resource accounting by jurisdiction

Hydrological Applications #

Precipitation grids → Catchments: Rainfall statistics by watershed
Evapotranspiration → Irrigation districts: Water balance calculations
Streamflow → Administrative units: Water resource management applications
