Weight Generation Classes#
Weight generation is the process of calculating spatial intersection weights between gridded datasets and vector polygon geometries. These weights represent the proportional area of overlap and are fundamental for performing accurate area-weighted statistical aggregations in climate and environmental data analysis.
gdptools provides comprehensive weight generation capabilities:
WeightGen: Grid-to-polygon intersection weight calculation with multiple processing engines
WeightGenP2P: Polygon-to-polygon intersection weight calculation for complex geometries
Both classes support various processing methods and can handle large datasets efficiently through parallel and distributed computing options.
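To make the idea of an intersection weight concrete, here is a minimal pure-Python sketch. It is not gdptools code: it assumes axis-aligned rectangles in an equal-area plane, and the helper names (`overlap_area`, `cell_weights`) and all values are hypothetical. Each weight is the overlap area divided by the target area, and the area-weighted mean is the weight-scaled sum of the cell values.

```python
# Illustrative only: rectangles are (xmin, ymin, xmax, ymax) in an equal-area plane.

def overlap_area(a, b):
    """Area shared by two axis-aligned rectangles (0.0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def cell_weights(target, cells):
    """Map each cell index to overlap area / target area, skipping zero overlaps."""
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    weights = {}
    for idx, cell in cells.items():
        area = overlap_area(target, cell)
        if area > 0.0:
            weights[idx] = area / target_area
    return weights


# A 2 x 2 grid of unit cells keyed by (i_index, j_index), with one value per cell.
cells = {(i, j): (i, j, i + 1, j + 1) for i in range(2) for j in range(2)}
values = {(0, 0): 10.0, (1, 0): 20.0, (0, 1): 30.0, (1, 1): 40.0}

# Target polygon: a 1 x 1 square that straddles all four cells unevenly.
target = (0.5, 0.75, 1.5, 1.75)

weights = cell_weights(target, cells)
weighted_mean = sum(values[idx] * w for idx, w in weights.items())
```

Real polygons need a geometry library for the intersection step, but the weight definition is the same.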
Key Features#
Spatial Intersection Methods#
Grid-to-polygon: Calculate weights between regular grids and irregular polygons
Polygon-to-polygon: Calculate weights between two sets of polygon geometries
Accurate area calculations: Proper handling of coordinate reference systems and projections
Processing Engines#
Serial: Sequential processing for smaller datasets or debugging
Parallel: Multi-core processing for improved performance with many polygons
Dask: Distributed computing for very large datasets or cluster environments
Output Options#
CSV files: Save weights to disk for reuse in multiple analyses
In-memory DataFrames: Direct processing without file I/O
Intersection geometries: Optional detailed spatial intersection information
Grid-to-Polygon (WeightGen)#
The WeightGen class calculates intersection weights between gridded datasets (e.g., NetCDF, Zarr) and vector polygons. This is the most common use case for climate and environmental data analysis.
- class WeightGen(*, user_data, method, weight_gen_crs, output_file=None, jobs=-1, verbose=False)[source]
Bases: object
Calculates grid-to-polygon intersection weights for area-weighted aggregation.
This class computes spatial intersection weights between gridded datasets and vector polygon geometries. The weights represent the proportional area of overlap and are essential for accurate area-weighted statistical aggregations.
- Parameters:
user_data (UserData) – Input data container with source grid and target polygons.
method (str) – Processing method for weight calculation.
weight_gen_crs (str | int | CRS) – Coordinate reference system for weight calculations. Accepts EPSG codes, WKT strings, or pyproj CRS objects.
output_file (str | None) – Path to save weights CSV file. If None, weights are not saved.
jobs (int | None) – Number of processors for parallel or dask methods. -1 uses all available.
verbose (bool | None) – Whether to print detailed processing information.
- grid_cells
GeoDataFrame of source grid cells after processing.
- intersections
GeoDataFrame of polygon intersections (if calculated).
- Raises:
TypeError – If method is not one of the supported processing methods.
Examples
- Basic serial weight calculation:
>>> weight_gen = WeightGen(
...     user_data=my_data,
...     method="serial",
...     weight_gen_crs=6931
... )
>>> weights = weight_gen.calculate_weights()
- Parallel processing with file output:
>>> weight_gen = WeightGen(
...     user_data=my_data,
...     method="parallel",
...     weight_gen_crs=6931,
...     output_file="weights.csv",
...     jobs=4,
...     verbose=True
... )
>>> weights = weight_gen.calculate_weights()
- __init__(*, user_data, method, weight_gen_crs, output_file=None, jobs=-1, verbose=False)[source]
Initialize the WeightGen class with configuration parameters.
Sets up the weight generation system by configuring the processing method, coordinate reference system, and output options.
- Parameters:
user_data (UserData) – Input data container with source grid and target polygons. Must be an instance of UserData subclass (UserCatData, NHGFStacData, etc.).
method (str) – Processing method for weight calculation (‘serial’, ‘parallel’, ‘dask’).
weight_gen_crs (str | int | CRS) – Coordinate reference system for calculations. Accepts EPSG codes, WKT strings, or pyproj CRS objects.
output_file (str | None) – Path to save weights as CSV file. If None, no file is saved.
jobs (int | None) – Number of processors for parallel/dask methods. -1 uses all available.
verbose (bool | None) – If True, prints detailed processing information during execution.
- Raises:
TypeError – If method is not one of the supported processing methods.
- calculate_weights(intersections=False)[source]
Calculate spatial intersection weights between grid and polygons.
Computes area-weighted intersection weights for each target polygon with source grid cells. The weights represent the proportional area of overlap and are used for accurate area-weighted aggregations.
- Parameters:
intersections (bool) – If True, calculate and store detailed intersection geometries between target and source polygons. This provides additional spatial information but increases memory usage.
- Returns:
The calculated weights with columns such as target_id, i_index, j_index, and weight (0.0-1.0).
- Return type:
pd.DataFrame
Notes
Processing time depends on the number of polygons and grid resolution. Use intersections=True only when detailed geometric information is needed.
Examples
>>> weights = weight_gen.calculate_weights()
>>> print(f"Calculated {len(weights)} weight entries")

>>> # With intersection details
>>> weights = weight_gen.calculate_weights(intersections=True)
>>> intersections = weight_gen.intersections
- property grid_cells: GeoDataFrame
Get the source grid cells as a GeoDataFrame.
Returns the grid cells used in weight calculations. These represent the source grid geometry after processing and CRS transformation.
- Returns:
Source grid cells with spatial geometry and grid indices (i_index, j_index).
- Return type:
GeoDataFrame
Notes
This property is only populated after calling calculate_weights(). If accessed before weight calculation, a message will be printed.
- property intersections: GeoDataFrame
Get the polygon intersection geometries as a GeoDataFrame.
Returns the detailed intersection geometries between target polygons and source grid cells. This provides the actual spatial overlap areas used in weight calculations.
- Returns:
Intersection geometries with target/source identifiers and calculated areas.
- Return type:
GeoDataFrame
Notes
This property is only populated after calling calculate_weights(intersections=True). If accessed otherwise, a message will be printed.
WeightGen Examples#
Basic Weight Calculation#
# Basic weight generation with serial processing
from gdptools.weight_gen import WeightGen
weight_gen = WeightGen(
    user_data=climate_data,
    method="serial",
    weight_gen_crs=6931  # Equal-area projection
)
weights = weight_gen.calculate_weights()
# Display results
print(f"Generated {len(weights)} weight entries")
print(f"Target polygons: {weights['target_id'].nunique()}")
print(f"Grid cells involved: {len(weights[['i_index', 'j_index']].drop_duplicates())}")
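Once a weights table exists, an area-weighted mean is a grouped weighted sum. The sketch below uses plain Python rather than pandas so that it stays dependency-free. The rows and grid values are made up, the column names follow the target_id, i_index, j_index, weight layout described for calculate_weights(), the grid[i_index][j_index] indexing order is an assumption, and `area_weighted_means` is a hypothetical helper rather than part of gdptools. It normalizes by the per-target sum of weights so that polygons only partly covered by the grid still yield a mean.

```python
from collections import defaultdict

# Hypothetical weights rows with the columns described for calculate_weights().
weight_rows = [
    {"target_id": "A", "i_index": 0, "j_index": 0, "weight": 0.25},
    {"target_id": "A", "i_index": 0, "j_index": 1, "weight": 0.75},
    {"target_id": "B", "i_index": 1, "j_index": 1, "weight": 0.5},
]

# Made-up gridded values, assumed to be indexed as grid[i_index][j_index].
grid = [
    [10.0, 20.0],
    [30.0, 40.0],
]


def area_weighted_means(rows, grid):
    """Return {target_id: sum(weight * value) / sum(weight)}."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for row in rows:
        value = grid[row["i_index"]][row["j_index"]]
        weighted_sum[row["target_id"]] += row["weight"] * value
        weight_total[row["target_id"]] += row["weight"]
    return {tid: weighted_sum[tid] / weight_total[tid] for tid in weighted_sum}


means = area_weighted_means(weight_rows, grid)
```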
Parallel Processing with File Output#
# Parallel processing with weights saved to file
weight_gen = WeightGen(
    user_data=climate_data,
    method="parallel",
    weight_gen_crs=6931,
    output_file="climate_weights.csv",
    jobs=4,
    verbose=True
)
weights = weight_gen.calculate_weights()
# Access grid cell information
grid_cells = weight_gen.grid_cells
print(f"Grid cells: {len(grid_cells)} cells")
Advanced: Intersection Geometries#
# Calculate detailed intersection geometries
weight_gen = WeightGen(
    user_data=climate_data,
    method="parallel",
    weight_gen_crs=6931,
    jobs=4
)
weights = weight_gen.calculate_weights(intersections=True)
# Access detailed intersection information
intersections = weight_gen.intersections
print(f"Intersection polygons: {len(intersections)}")
Polygon-to-Polygon (WeightGenP2P)#
The WeightGenP2P class calculates intersection weights between two sets of polygon geometries. This is essential for transferring data between different administrative boundaries, ecological regions, or other polygon-based spatial datasets.
- class WeightGenP2P(*, target_poly, target_poly_idx, source_poly, source_poly_idx, method, weight_gen_crs, output_file=None, jobs=-1, intersections=False, verbose=False)[source]
Bases: object
Calculates polygon-to-polygon intersection weights for spatial data transfer.
This class computes spatial intersection weights between two sets of polygon geometries, enabling accurate transfer of data between different administrative boundaries, ecological regions, or other polygon-based spatial datasets.
- Parameters:
target_poly (GeoDataFrame) – GeoDataFrame containing target polygons for weight calculation.
target_poly_idx (str) – Column name for unique identifiers of target polygons.
source_poly (GeoDataFrame) – GeoDataFrame containing source polygons for weight calculation.
source_poly_idx (str | list[str]) – Column name(s) for unique identifiers of source polygons.
method (Literal['serial', 'parallel', 'dask']) – Processing method for weight calculation.
weight_gen_crs (str | int | CRS) – Coordinate reference system for weight calculations. Accepts EPSG codes, WKT strings, or pyproj CRS objects.
output_file (str | None) – Path to save weights CSV file. If None, weights are not saved.
jobs (int | None) – Number of processors for parallel or dask methods. -1 uses all available.
intersections (bool) – Whether to calculate and store detailed intersection geometries.
verbose (bool) – Whether to print detailed processing information.
- intersections
GeoDataFrame of polygon intersections (if calculated).
- Raises:
TypeError – If method is not one of the supported processing methods.
Examples
- Basic polygon-to-polygon weights:
>>> weight_gen = WeightGenP2P(
...     target_poly=watersheds,
...     target_poly_idx="watershed_id",
...     source_poly=counties,
...     source_poly_idx="county_id",
...     method="serial",
...     weight_gen_crs=5070
... )
>>> wght = weight_gen.calculate_weights()
- Parallel processing with intersections:
>>> weight_gen = WeightGenP2P(
...     target_poly=regions,
...     target_poly_idx="region_id",
...     source_poly=zones,
...     source_poly_idx="zone_id",
...     method="parallel",
...     weight_gen_crs=5070,
...     intersections=True,
...     jobs=4
... )
>>> wght = weight_gen.calculate_weights()
>>> intersections_gdf = weight_gen.intersections
- __init__(*, target_poly, target_poly_idx, source_poly, source_poly_idx, method, weight_gen_crs, output_file=None, jobs=-1, intersections=False, verbose=False)[source]
Initialize the WeightGenP2P class with configuration parameters.
Sets up the polygon-to-polygon weight generation system by configuring the source and target geometries, processing method, and output options.
- Parameters:
target_poly (GeoDataFrame) – GeoDataFrame containing target polygons. Must include the column specified in target_poly_idx and geometry column.
target_poly_idx (str) – Column name for target polygon unique identifiers.
source_poly (GeoDataFrame) – GeoDataFrame containing source polygons. Must include the column(s) specified in source_poly_idx and geometry column.
source_poly_idx (str | list[str]) – Column name(s) for source polygon unique identifiers. Can be a single column name or list of column names.
method (Literal['serial', 'parallel', 'dask']) – Processing method for weight calculation (‘serial’, ‘parallel’, ‘dask’).
weight_gen_crs (str | int | CRS) – Coordinate reference system for calculations. Accepts EPSG codes, WKT strings, or pyproj CRS objects.
output_file (str | None) – Path to save weights as CSV file. If None, no file is saved.
jobs (int | None) – Number of processors for parallel/dask methods. -1 uses half available.
intersections (bool) – If True, calculate and store detailed intersection geometries.
verbose (bool) – If True, prints detailed processing information during execution.
- Raises:
TypeError – If method is not one of the supported processing methods.
Notes
Input polygons are automatically dissolved by their ID columns and sorted for consistent processing. Invalid geometries should be cleaned beforehand.
- calculate_weights()[source]
Calculate spatial intersection weights between polygon sets.
Computes area-weighted intersection weights between target and source polygons. The weights represent the proportional area contribution of each source polygon to each target polygon.
- Returns:
A DataFrame containing the calculated weights with columns:
target_id: Identifier for the target polygon.
source_id: Identifier for the source polygon.
wght: Proportional area of the source polygon within the target (0.0-1.0).
source_id_area: Total area of the source polygon (for extensive variables).
target_id_area: Total area of the target polygon (for diagnostics).
area_weight: Area of the intersection.
- Return type:
pd.DataFrame
Notes
For spatially continuous source polygons without gaps or overlaps, the wght values for each target polygon should sum to 1.0.
Examples
>>> wght = weight_gen.calculate_weights()
>>> print(f"Calculated {len(wght)} weight entries")
>>> print(f"Weight range: {wght['wght'].min():.4f} to {wght['wght'].max():.4f}")

>>> # Verify weights sum to 1 for each target (if source is continuous)
>>> weight_sums = wght.groupby('target_id')["wght"].sum()
>>> print(f"Weight sum range: {weight_sums.min():.4f} to {weight_sums.max():.4f}")
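The difference between wght and the area columns matters when values are transferred from source to target polygons. The following is a hedged sketch in plain Python with made-up numbers and hypothetical helper names. It assumes that wght equals the intersection area divided by the target area, which is what the sum-to-1.0 note implies. Under that assumption, intensive variables (such as a mean temperature) are combined with wght, while extensive variables (such as population) are apportioned by area_weight / source_id_area.

```python
# Made-up rows using the column names listed under Returns.
rows = [
    {"target_id": "T1", "source_id": "S1", "wght": 0.25,
     "source_id_area": 100.0, "target_id_area": 80.0, "area_weight": 20.0},
    {"target_id": "T1", "source_id": "S2", "wght": 0.75,
     "source_id_area": 120.0, "target_id_area": 80.0, "area_weight": 60.0},
]

temperature = {"S1": 10.0, "S2": 14.0}    # intensive: averages, not totals
population = {"S1": 1000.0, "S2": 600.0}  # extensive: totals that scale with area


def transfer_intensive(rows, source_values):
    """Area-weighted mean per target: sum(wght * value); assumes wght sums to 1."""
    out = {}
    for r in rows:
        contribution = r["wght"] * source_values[r["source_id"]]
        out[r["target_id"]] = out.get(r["target_id"], 0.0) + contribution
    return out


def transfer_extensive(rows, source_values):
    """Apportioned total per target: sum(value * area_weight / source_id_area)."""
    out = {}
    for r in rows:
        share = r["area_weight"] / r["source_id_area"]
        contribution = share * source_values[r["source_id"]]
        out[r["target_id"]] = out.get(r["target_id"], 0.0) + contribution
    return out


mean_temperature = transfer_intensive(rows, temperature)
total_population = transfer_extensive(rows, population)
```

The extensive transfer also assumes the quantity is spread evenly over each source polygon; where that does not hold, the apportioned totals are only approximate.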
- property intersections: GeoDataFrame
Get the polygon intersection geometries as a GeoDataFrame.
Returns the detailed intersection geometries between target and source polygons. These represent the actual spatial overlap areas used in weight calculations.
- Returns:
A geopandas GeoDataFrame containing intersection geometries with target and source identifiers, calculated areas, and intersection polygons.
Notes
This property is only populated after calling calculate_weights() with intersections=True. If accessed otherwise, a message will be printed.
Examples
>>> weight_gen = WeightGenP2P(..., intersections=True)
>>> wght = weight_gen.calculate_weights()
>>> intersections = weight_gen.intersections
>>> print(f"Intersection areas: {intersections.geometry.area.describe()}")
WeightGenP2P Examples#
Basic Polygon-to-Polygon Weights#
# Calculate weights between two polygon datasets
from gdptools.weight_gen_p2p import WeightGenP2P
weight_gen = WeightGenP2P(
    target_poly=watersheds,
    target_poly_idx="watershed_id",
    source_poly=counties,
    source_poly_idx="county_id",
    method="serial",
    weight_gen_crs=5070  # Equal-area projection
)
weights = weight_gen.calculate_weights()
# Analyze results
print(f"Watershed-county intersections: {len(weights)}")
print(f"Weight statistics: min={weights['wght'].min():.4f}, max={weights['wght'].max():.4f}")
# Verify weights sum to 1 for each target (if source is continuous)
weight_sums = weights.groupby('target_id')['wght'].sum()
print(f"Weight sum range: {weight_sums.min():.4f} to {weight_sums.max():.4f}")
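A sum check like the one above can be turned into a simple QA step. The following sketch is plain Python with made-up rows; `find_incomplete_targets` and the tolerance value are illustrative assumptions, not gdptools features. It returns the targets whose wght values do not sum to 1.0, which usually points to gaps or overlaps in the source polygons.

```python
import math
from collections import defaultdict


def find_incomplete_targets(rows, tol=1e-6):
    """Return {target_id: weight_sum} for targets whose weights do not sum to 1."""
    sums = defaultdict(float)
    for row in rows:
        sums[row["target_id"]] += row["wght"]
    return {
        tid: total
        for tid, total in sums.items()
        if not math.isclose(total, 1.0, abs_tol=tol)
    }


# Made-up weights: T1 is fully covered by the source polygons, T2 is not.
rows = [
    {"target_id": "T1", "source_id": "S1", "wght": 0.4},
    {"target_id": "T1", "source_id": "S2", "wght": 0.6},
    {"target_id": "T2", "source_id": "S2", "wght": 0.7},
]

incomplete = find_incomplete_targets(rows)
```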
Advanced Configuration with Intersections#
# Parallel processing with intersection geometries
weight_gen = WeightGenP2P(
    target_poly=target_regions,
    target_poly_idx="region_id",
    source_poly=source_zones,
    source_poly_idx="zone_id",
    method="parallel",
    weight_gen_crs=5070,
    output_file="region_zone_weights.csv",
    intersections=True,
    jobs=6,
    verbose=True
)
weights = weight_gen.calculate_weights()
# Access detailed intersection geometries
intersections = weight_gen.intersections
print(f"Generated {len(intersections)} intersection polygons")
print(f"Intersection areas: {intersections.geometry.area.describe()}")
Type Definitions#
The modules provide literal types for processing method configuration:
- WEIGHT_GEN_METHODS#
Available weight generation processing methods.
- Options:
- serial: Sequential processing through polygons one by one.
Best for small datasets or debugging.
- parallel: Multi-core processing with polygon chunks distributed across CPUs.
Optimal for moderate datasets with many polygons.
- dask: Distributed computing using Dask framework.
Ideal for large datasets or cluster environments.
Note
Choose the method based on your computational resources and dataset size. Serial is most reliable, parallel offers good speedup for many polygons, and dask excels with very large datasets or when a Dask client is available.
Examples
>>> method = "serial"  # For smaller datasets
>>> method = "parallel"  # For larger datasets
>>> method = "dask"  # For very large or distributed datasets
alias of Literal['serial', 'parallel', 'dask']
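Because the alias is a typing.Literal, its options can be read at runtime with typing.get_args. The sketch below is one assumed way to validate a user-supplied method before constructing a weight generator. The alias is redefined locally so the example is self-contained, `check_method` is a hypothetical helper, and the TypeError only mirrors the exception the classes are documented to raise.

```python
from typing import Literal, get_args

# Mirrors the documented alias; redefined here so the sketch is self-contained.
WEIGHT_GEN_METHODS = Literal["serial", "parallel", "dask"]


def check_method(method: str) -> str:
    """Return method unchanged if it is a documented option, else raise TypeError."""
    allowed = get_args(WEIGHT_GEN_METHODS)
    if method not in allowed:
        raise TypeError(f"method must be one of {allowed}, got {method!r}")
    return method


check_method("parallel")  # passes silently

try:
    check_method("threads")  # not a documented option
except TypeError as exc:
    error_message = str(exc)
```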
- WEIGHT_GEN_METHODS#
Available polygon-to-polygon weight generation processing methods.
- Options:
- serial: Sequential processing through polygon pairs one by one.
Best for small datasets with few polygons or debugging.
- parallel: Multi-core processing with polygon chunks distributed across CPUs.
Optimal for moderate datasets with many polygon intersections.
- dask: Distributed computing using Dask framework.
Ideal for very large datasets or cluster environments.
Notes
Choose the method based on your computational resources and dataset complexity. Serial is most reliable, parallel offers good speedup for many intersections, and dask excels with very large polygon datasets or distributed computing.
Examples
>>> method = "serial"  # For smaller datasets
>>> method = "parallel"  # For larger datasets
>>> method = "dask"  # For very large or distributed datasets
alias of Literal['serial', 'parallel', 'dask']
Best Practices#
Coordinate Reference System Selection#
Use equal-area projections for accurate area calculations (e.g., Albers Equal Area, Lambert Azimuthal)
Choose appropriate regional projections for your study area
Common choices: EPSG:5070 (NAD83 / CONUS Albers), EPSG:6931 (WGS 84 / NSIDC EASE-Grid 2.0 North, a Lambert azimuthal equal-area projection)
Performance Optimization#
Serial method: Use for small datasets (<5000 polygons) or debugging
Parallel method: Optimal for moderate datasets (5000-50,000 polygons)
Dask method: Use for very large datasets (>50,000 polygons) or cluster computing
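The thresholds above can be written as a small helper. This is only a sketch of the rule of thumb: `suggest_method` is a hypothetical function, not part of gdptools, and the cut-offs are starting points rather than hard limits.

```python
def suggest_method(n_polygons: int) -> str:
    """Map a target polygon count to the engine suggested by the rule of thumb."""
    if n_polygons < 5_000:
        return "serial"
    if n_polygons <= 50_000:
        return "parallel"
    return "dask"


for count in (1_200, 20_000, 250_000):
    print(count, suggest_method(count))
```

Grid resolution and available memory also affect the choice, so treat the result as a first guess.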
Scaling to Nationwide Datasets#
Nationwide basins, parcels, or census geometries often contain hundreds of thousands of polygons and millions of grid cells. At this scale, keep memory in check by combining polygon-count heuristics with out-of-core execution:
| Target polygon count | Recommended engine | Guidance |
|---|---|---|
| < 5,000 | serial | Fits in memory on laptops; easiest for debugging |
| 5,000 – 50,000 | parallel | Start with |
| > 50,000 or statewide/nationwide | dask | Start a distributed/single-machine Dask cluster, persist the source grid, and let workers process chunks |
Additional tips:
Chunk target polygons: The parallel and Dask engines already process polygons in batches internally. Manual chunking is optional when you want to checkpoint intermediate results, mix different jobs settings, or keep individual weight files per region. Split large GeoDataFrames into batches (for example, 2,500–10,000 polygons per batch) and call WeightGen.calculate_weights() on each batch only if you need that extra control.
Heed runtime warnings: When WeightGen detects more than 50,000 polygons with the serial/parallel engines, or when jobs=-1 would duplicate large raster datasets across workers, it raises RuntimeWarning messages. Treat those warnings as a signal to switch engines or lower jobs before memory becomes a bottleneck.
Tune jobs carefully: Every worker holds its own copy of the source grid. Setting jobs=-1 (all CPU cores) often causes out-of-memory errors on nationwide rasters; increase workers only after monitoring real memory usage. If you request more workers than physical CPUs, WeightGen now caps the value and emits a warning so you know the engine throttled it.
Persist gridded data once: When using NHGFStacData or UserCatData, open the dataset before chunking and pass the same user_data object to every batch to prevent repeated reads.
Use Dask heuristics: If both len(user_data.target_gdf) > 50_000 and the grid resolution is ≤ 5 km, plan on scheduling with Dask; the serial/parallel engines are likely to exhaust memory.
Avoid intersections unless necessary: intersections=True multiplies memory use; only enable it when you explicitly need geometries for QA/QC.
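Because every worker holds its own copy of the source grid, a rough upper bound on jobs follows from simple arithmetic. The helpers below are a back-of-the-envelope sketch, not part of gdptools: `grid_bytes`, `max_jobs_for_memory`, the 50% headroom factor, and the example sizes are all assumptions to adjust after measuring real memory use.

```python
def grid_bytes(n_time: int, n_y: int, n_x: int, bytes_per_value: int = 4) -> int:
    """Approximate in-memory size of a dense (time, y, x) array."""
    return n_time * n_y * n_x * bytes_per_value


def max_jobs_for_memory(grid_size_bytes: int, available_bytes: int,
                        cpu_count: int, headroom: float = 0.5) -> int:
    """Largest worker count whose grid copies fit in a fraction of available memory."""
    budget = available_bytes * headroom
    by_memory = int(budget // grid_size_bytes)
    # Never return fewer than one worker or more workers than CPUs.
    return max(1, min(cpu_count, by_memory))


# Example: one year of daily float32 data on a 3000 x 7000 grid, 64 GiB RAM, 16 cores.
size = grid_bytes(n_time=365, n_y=3000, n_x=7000)
jobs = max_jobs_for_memory(size, available_bytes=64 * 1024**3, cpu_count=16)
```

In this made-up case a single copy of the array already uses most of the memory budget, so the estimate falls back to one worker even though 16 cores are available.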
Optional chunking workflow example#
import geopandas as gpd
import pandas as pd
from gdptools import WeightGen

TARGET_BATCH_SIZE = 5000
target_gdf = gpd.read_file("conus_huc12.gpkg")

def iter_batches(gdf, batch_size):
    # Yield consecutive slices of at most batch_size polygons
    for start in range(0, len(gdf), batch_size):
        yield gdf.iloc[start : start + batch_size]

weight_tables = []
for batch in iter_batches(target_gdf, TARGET_BATCH_SIZE):
    user_data.target_gdf = batch  # reuse prepared UserData instance
    wg = WeightGen(user_data=user_data, method="parallel", jobs=4, weight_gen_crs=6931)  # tune jobs to memory
    weight_tables.append(wg.calculate_weights())

weights = pd.concat(weight_tables, ignore_index=True)
weights.to_parquet("weights_conus_huc12.parquet")
Dask cluster quick start#
from dask.distributed import Client
from gdptools import WeightGen
client = Client(n_workers=8, threads_per_worker=1, memory_limit="8GB")
weight_gen = WeightGen(
    user_data=user_data,
    method="dask",
    weight_gen_crs=6931,
    jobs=2,  # per-worker partitions; increase gradually to avoid duplicating large arrays
)
weights = weight_gen.calculate_weights()
Monitor the Dask dashboard to ensure workers stay below the available memory. Increase the number of workers instead of threads per worker when topology calculations dominate.
Memory Management#
Set appropriate jobs parameter: Balance between speed and memory usage
Save weights to files: Avoid recalculating expensive weight operations
Use intersections=True judiciously: Only when detailed geometry is needed
Data Validation#
Check CRS consistency: Ensure source and target data have proper CRS information
Validate geometry: Remove invalid polygons before weight calculation
Test with subsets: Verify results with smaller datasets before full processing
Weight File Management#
Descriptive filenames: Include dataset, CRS, and date information
Version control: Track weight files with data provenance
Backup important weights: Weight calculation can be expensive to repeat
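One way to keep filenames descriptive is to build them from the dataset name, target name, CRS code, and creation date. The naming scheme and the `weight_filename` helper below are illustrative assumptions, not a gdptools convention, and the example arguments are arbitrary.

```python
from datetime import date


def weight_filename(dataset: str, target: str, crs_epsg: int, created: date) -> str:
    """Build 'weights_<dataset>_to_<target>_epsg<code>_<YYYYMMDD>.csv'."""

    def slug(text: str) -> str:
        # Lowercase, replace anything that is not a letter or digit, join words with '-'.
        cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
        return "-".join(cleaned.split())

    return f"weights_{slug(dataset)}_to_{slug(target)}_epsg{crs_epsg}_{created:%Y%m%d}.csv"


name = weight_filename("gridMET", "HUC12 CONUS", 5070, date(2024, 1, 15))
```

The resulting string can be passed as output_file when constructing WeightGen or WeightGenP2P.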
Note
Weight generation is typically a one-time operation for a given source-target pair. Save weights to files for reuse across multiple analyses to avoid expensive recalculation.
Warning
Always use equal-area coordinate reference systems for weight calculations. Using geographic coordinates (latitude/longitude) will result in inaccurate area calculations, especially at high latitudes.
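The size of the distortion is easy to check with spherical geometry. The sketch below is a standalone calculation, not gdptools code. It treats the Earth as a sphere of radius 6371 km (an approximation) and compares two cells that both measure 1° × 1° in geographic coordinates; `cell_area_km2` is a hypothetical helper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical approximation


def cell_area_km2(lat_south: float, lat_north: float, lon_width_deg: float) -> float:
    """Area of a lat/lon cell on a sphere: R^2 * dlon * (sin(lat_n) - sin(lat_s))."""
    dlon = math.radians(lon_width_deg)
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return EARTH_RADIUS_KM**2 * dlon * band


equator_cell = cell_area_km2(0.0, 1.0, 1.0)     # cell touching the equator
high_lat_cell = cell_area_km2(60.0, 61.0, 1.0)  # cell at 60 degrees north
ratio = high_lat_cell / equator_cell
```

Under these assumptions the cell at 60° N covers roughly half the area of the equatorial cell, so weights computed from degree-based areas would overstate its contribution by about a factor of two.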
Tip
For large datasets, consider using the Dask method with a properly configured Dask cluster for optimal performance. The parallel method is usually sufficient for most desktop computing scenarios.
Common Use Cases#
Climate Data Analysis#
Gridded climate data → Administrative boundaries: Weather/climate statistics by county, state, or watershed
Model output → Ecological regions: Climate model results aggregated to ecoregions
Reanalysis data → Custom polygons: Historical climate data for user-defined areas
Environmental Assessment#
Satellite data → Land management units: Environmental monitoring by management area
Pollution models → Population centers: Exposure assessment for urban areas
Ecosystem services → Political boundaries: Natural resource accounting by jurisdiction
Hydrological Applications#
Precipitation grids → Catchments: Rainfall statistics by watershed
Evapotranspiration → Irrigation districts: Water balance calculations
Streamflow → Administrative units: Water resource management applications