Zonal Statistics Generation

Zonal Statistics Generation#

The gdptools.zonal_gen module provides powerful tools for computing zonal statistics on raster data. This module supports both standard and area-weighted zonal statistics calculations, with multiple computational engines for scalability and performance optimization.

Features#

Multiple Computational Engines: Choose between serial, parallel, Dask, and exactextract engines based on your data size and computational resources
High-Performance Option: The exactextract engine provides optimized C++ performance with fractional pixel coverage
Area-Weighted Statistics: Calculate more accurate statistics by accounting for partial pixel coverage within zones
Flexible Output Formats: Export results to CSV with customizable precision and file naming
Categorical and Continuous Data Support: Handle both categorical (e.g., land cover) and continuous (e.g., temperature) raster data
Performance Monitoring: Built-in execution time tracking for performance optimization

Quick Start#

Standard Zonal Statistics#

from gdptools.zonal_gen import ZonalGen
from gdptools.data.user_data import UserData

# Initialize user data (preprocessed raster and vector data)
user_data = UserData(...)

# Create zonal statistics generator
zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="parallel",
    zonal_writer="csv",
    out_path="/path/to/output",
    file_prefix="zonal_stats",
    jobs=4
)

# Calculate zonal statistics
stats = zonal_gen.calculate_zonal()
print(stats.head())

Area-Weighted Zonal Statistics#

Tip

For WeightedZonalGen, use an equal-area CRS for accurate area weighting. The project’s recommended default is EPSG:6931 (EASE-Grid 2.0 Global Equal Area), which performs well globally. For regional work, consider a local equal-area CRS (e.g., EPSG:5070 for CONUS, EPSG:3035 for Europe).

from gdptools.zonal_gen import WeightedZonalGen
from pyproj import CRS

# Use an appropriate equal-area projection
albers_crs = CRS.from_proj4("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96")

# Create weighted zonal statistics generator
weighted_gen = WeightedZonalGen(
    user_data=user_data,
    weight_gen_crs=albers_crs,
    zonal_engine="dask",
    zonal_writer="csv",
    out_path="/path/to/output",
    file_prefix="weighted_stats",
    precision=3,
    jobs=8
)

# Calculate area-weighted zonal statistics
weighted_stats = weighted_gen.calculate_weighted_zonal()

Computational Engines#

Serial Engine#

Best for small datasets or when memory is limited:

zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="serial",
    zonal_writer="csv"
)

Parallel Engine#

Optimal for medium to large datasets on multi-core systems:

zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="parallel",
    zonal_writer="csv",
    jobs=4  # Use 4 CPU cores
)

Dask Engine#

Best for very large datasets or distributed computing:

zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="dask",
    zonal_writer="csv",
    jobs=8  # Use 8 workers
)

Exactextract Engine#

Best for high-performance zonal statistics with fractional pixel coverage:

zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="exactextract",
    zonal_writer="csv",
    out_path="/path/to/output",
    file_prefix="zonal_stats"
)

The exactextract engine uses the exactextract library, which provides high-performance zonal statistics through optimized C++ code. Key benefits include:

Fractional pixel coverage: Accounts for partial pixel overlap with polygon boundaries
Memory efficiency: Processes data without loading entire rasters into memory
Speed: Often significantly faster than pure Python implementations for large datasets

Note

The exactextract engine requires the exactextract Python package (pip install exactextract). It produces output in the same format as the serial/parallel/dask engines, but values may differ slightly due to the fractional coverage algorithm vs. whole-pixel counting.

Data Types and Statistics#

Continuous Data#

For continuous raster data (temperature, precipitation, elevation):

stats = zonal_gen.calculate_zonal(categorical=False)
# Returns: count, mean, std, min, 25%, 50%, 75%, max, sum

Categorical Data#

For categorical raster data (land cover, soil types):

stats = zonal_gen.calculate_zonal(categorical=True)
# Returns: fraction columns for each category + count

Best Practices#

Coordinate Reference Systems#

For Area-Weighted Statistics: Use equal-area projections appropriate for your study region
Geographic Coordinates: Can introduce area distortions; consider reprojecting to equal-area
Local Projections: Often provide the most accurate area calculations

# Good: Equal-area projection for continental US
albers_crs = CRS.from_proj4("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96")

# Acceptable: UTM zone for smaller regions
utm_crs = CRS.from_epsg(32618)  # UTM Zone 18N

# Caution: Geographic coordinates may introduce area distortions
geographic_crs = CRS.from_epsg(4326)

Performance Optimization#

Choose the Right Engine: Start with parallel for most use cases
Tune Job Count: Typically use number of CPU cores minus 1. If you request more workers than available CPUs, the zonal engines automatically cap the value and emit a warning so you can adjust expectations.
Memory Management: Monitor memory usage with large datasets
Data Preprocessing: Ensure data is properly aligned and preprocessed

Output Management#

# Organized output with timestamps
zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="parallel",
    zonal_writer="csv",
    out_path="/results/zonal_stats",
    file_prefix="temperature_zones",
    append_date=True,  # Creates: 2024_01_15_14_30_00_temperature_zones.csv
    precision=2  # Round to 2 decimal places
)

Error Handling#

Common Issues and Solutions#

FileNotFoundError: Output path doesn’t exist

from pathlib import Path
output_path = Path("/path/to/output")
output_path.mkdir(parents=True, exist_ok=True)

Memory Issues: Dataset too large for available memory

# Switch to Dask engine for out-of-core processing
zonal_gen = ZonalGen(
    user_data=user_data,
    zonal_engine="dask",
    zonal_writer="csv",
    jobs=4
)

CRS Mismatch: Ensure data alignment

# Verify CRS compatibility before processing
print(f"Raster CRS: {user_data.raster_crs}")
print(f"Vector CRS: {user_data.vector_crs}")
print(f"Weight CRS: {weight_gen_crs}")

Advanced Examples#

Multi-Temporal Analysis#

import pandas as pd
from pathlib import Path

# Process multiple time periods
results = []
for year in range(2010, 2021):
    user_data = UserData(raster_path=f"/data/temp_{year}.tif", ...)

    zonal_gen = ZonalGen(
        user_data=user_data,
        zonal_engine="parallel",
        zonal_writer="csv",
        out_path=f"/results/{year}",
        file_prefix=f"temp_stats_{year}",
        jobs=4
    )

    stats = zonal_gen.calculate_zonal()
    stats['year'] = year
    results.append(stats)

# Combine all years
all_stats = pd.concat(results, ignore_index=True)

Comparison of Standard vs. Weighted Statistics#

# Calculate both standard and weighted statistics
standard_stats = zonal_gen.calculate_zonal()
weighted_stats = weighted_gen.calculate_weighted_zonal()

# Compare results
comparison = pd.merge(
    standard_stats[['zone_id', 'mean']],
    weighted_stats[['zone_id', 'weighted_mean']],
    on='zone_id',
    suffixes=('_standard', '_weighted')
)

comparison['difference'] = (
    comparison['weighted_mean'] - comparison['mean']
)

Type Definitions#

ZONAL_ENGINES#

Literal type alias for the zonal computation engines.

Options:

serial: Perform zonal calculations in a serial manner.: Deprecated since version Use: exactextract instead.
parallel: Perform zonal calculations in parallel using multiple processes.: Deprecated since version Use: exactextract instead.
dask: Perform zonal calculations using Dask for distributed computing.: Deprecated since version Use: exactextract instead.
exactextract: High-performance raster-native zonal statistics using exactextract.: This is the recommended engine. It only supports non-weighted statistics currently; weighted support is planned (issue #94).

Computational engine options for zonal statistics calculation. alias of Literal[‘serial’, ‘parallel’, ‘dask’, ‘exactextract’]

ZONAL_WRITERS#

Literal type alias for the zonal statistics output formats.

Options:: csv: Write zonal statistics to a CSV file.

Output format options for zonal statistics results. alias of Literal[‘csv’]

alias of Literal[‘csv’]

API Reference#

Classes#

class ZonalGen(user_data, zonal_engine, zonal_writer, out_path=None, file_prefix=None, append_date=False, precision=None, jobs=1)[source]#

Bases: object

Calculate standard zonal statistics for raster data.

This class provides functionality to compute zonal statistics (mean, sum, count, etc.) for raster data within defined zones. It supports multiple computational engines for flexibility and performance optimization.

The class handles the entire workflow from data preparation to result output, including options for different computational backends and output formats.

Examples

Recommended usage with exactextract engine:

>>> from gdptools.zonal_gen import ZonalGen
>>> from gdptools.data.user_data import UserData
>>>
>>> # Create zonal statistics generator
>>> zonal_gen = ZonalGen(
...     user_data=user_data,
...     zonal_engine="exactextract",
...     zonal_writer="csv",
...     out_path="/path/to/output",
...     file_prefix="zonal_stats"
... )
>>>
>>> # Calculate zonal statistics
>>> stats = zonal_gen.calculate_zonal()

Using a deprecated engine (serial/parallel/dask emit a FutureWarning):

>>> zonal_gen = ZonalGen(
...     user_data=user_data,
...     zonal_engine="serial",  # deprecated, use "exactextract"
...     zonal_writer="csv",
...     out_path="/path/to/output",
...     file_prefix="zonal_stats",
...     append_date=True,
... )
>>> stats = zonal_gen.calculate_zonal()

agg#: The zonal engine instance used for calculations.

precision#: Number of decimal places for output statistics.

Initialize ZonalGen for calculating zonal statistics.

Parameters:

user_data (UserData) – An instance of UserData containing the preprocessed raster and vector data for zonal analysis.
zonal_engine (Literal['serial', 'parallel', 'dask', 'exactextract']) – The computational engine to use for zonal calculations. Options: “exactextract” (recommended), “serial”, “parallel”, or “dask”. The serial, parallel, and dask engines are deprecated and will be removed in a future version.
zonal_writer (Literal['csv']) – The output format for zonal statistics. Currently supports “csv”.
out_path (str | None) – Directory path where output files will be saved. If None, output will not be written to disk.
file_prefix (str | None) – Prefix for the output filename. If None, no prefix will be used.
append_date (bool) – Whether to append current timestamp to the filename. Creates filenames like “2024_01_15_14_30_00_prefix.csv”.
precision (int | None) – Number of decimal places for output statistics. If None, no rounding is applied.
jobs (int) – Number of parallel jobs for computation. Only used with “parallel” or “dask” engines.

Raises:

FileNotFoundError – If the specified output path does not exist.
TypeError – If an invalid zonal engine is specified.

Notes

The user_data should contain properly aligned raster and vector data with matching coordinate reference systems.

Standard zonal statistics calculator for raster data.

__init__(user_data, zonal_engine, zonal_writer, out_path=None, file_prefix=None, append_date=False, precision=None, jobs=1)[source]#

Initialize ZonalGen for calculating zonal statistics.

Parameters:

user_data (UserData) – An instance of UserData containing the preprocessed raster and vector data for zonal analysis.
zonal_engine (Literal['serial', 'parallel', 'dask', 'exactextract']) – The computational engine to use for zonal calculations. Options: “exactextract” (recommended), “serial”, “parallel”, or “dask”. The serial, parallel, and dask engines are deprecated and will be removed in a future version.
zonal_writer (Literal['csv']) – The output format for zonal statistics. Currently supports “csv”.
out_path (str | None) – Directory path where output files will be saved. If None, output will not be written to disk.
file_prefix (str | None) – Prefix for the output filename. If None, no prefix will be used.
append_date (bool) – Whether to append current timestamp to the filename. Creates filenames like “2024_01_15_14_30_00_prefix.csv”.
precision (int | None) – Number of decimal places for output statistics. If None, no rounding is applied.
jobs (int) – Number of parallel jobs for computation. Only used with “parallel” or “dask” engines.

Raises:

FileNotFoundError – If the specified output path does not exist.
TypeError – If an invalid zonal engine is specified.

Notes

The user_data should contain properly aligned raster and vector data with matching coordinate reference systems.

calculate_zonal(categorical=False)[source]#

Calculate zonal statistics for raster data.

Computes zonal statistics (mean, sum, count, etc.) for raster data within defined zones using the specified computational engine. Results are automatically saved to the configured output path if specified.

Parameters:: categorical (bool) – Whether to calculate categorical statistics (mode, unique counts) or continuous statistics (mean, sum, std, etc.). For categorical data like land cover classifications, set to True.
Returns:: Calculated zonal statistics with zones as rows and statistics as columns. The exact columns depend on the data type and whether categorical=True.
Return type:: pandas.DataFrame

Notes

The method automatically rounds results to the specified precision if set during initialization. Execution time is printed to stdout for performance monitoring.

Examples: Calculate continuous zonal statistics:

>>> stats = zonal_gen.calculate_zonal(categorical=False)
>>> print(stats.columns) 
Index(['zone_id', 'mean', 'sum', 'count', 'std', 'min', 'max'])

Calculate categorical zonal statistics:

>>> stats = zonal_gen.calculate_zonal(categorical=True)
>>> print(stats.columns) 
Index(['zone_id', 'mode', 'unique_count', 'majority_percent'])

class WeightedZonalGen(user_data, weight_gen_crs, zonal_engine, zonal_writer, out_path=None, file_prefix=None, append_date=False, precision=None, jobs=1)[source]#

Bases: object

Calculate area-weighted zonal statistics for raster data.

This class extends standard zonal statistics by incorporating area weights based on the intersection of raster cells with zone boundaries. This is particularly important when working with data in geographic coordinate systems or when raster cells are partially covered by zones.

The weighting is based on the specified CRS and accounts for the actual area of intersection between raster cells and zone polygons, providing more accurate statistics for geographic analyses.

Note

The serial, parallel, and dask engines are deprecated and will be removed in a future version. Weighted zonal statistics support via the exactextract engine is planned (issue #94).

Examples

Basic weighted zonal statistics (deprecated engines):

>>> from gdptools.zonal_gen import WeightedZonalGen
>>> from pyproj import CRS
>>>
>>> # Create weighted zonal statistics generator
>>> # Note: serial/parallel/dask engines are deprecated; exactextract
>>> # weighted support is planned (issue #94).
>>> weighted_gen = WeightedZonalGen(
...     user_data=user_data,
...     weight_gen_crs=CRS.from_epsg(4326),
...     zonal_engine="parallel",  # deprecated
...     zonal_writer="csv",
...     out_path="/path/to/output",
...     file_prefix="weighted_stats",
...     jobs=4
... )
>>>
>>> # Calculate weighted zonal statistics
>>> stats = weighted_gen.calculate_weighted_zonal()

Using with equal-area projection for accurate area calculations:

>>> # Use an equal-area projection for more accurate area weighting
>>> albers_crs = CRS.from_proj4(
...     "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96"
... )
>>> weighted_gen = WeightedZonalGen(
...     user_data=user_data,
...     weight_gen_crs=albers_crs,
...     zonal_engine="dask",  # deprecated
...     zonal_writer="csv",
...     out_path="/path/to/output",
...     precision=3
... )
>>> stats = weighted_gen.calculate_weighted_zonal()

agg#: The zonal engine instance used for calculations.

precision#: Number of decimal places for output statistics.

Initialize WeightedZonalGen for calculating area-weighted zonal statistics.

Parameters:

user_data (UserData) – An instance of UserData containing the preprocessed raster and vector data for zonal analysis.
weight_gen_crs (str | int | CRS) – The coordinate reference system for area weight calculations. Can be an EPSG code (int), proj4 string (str), or pyproj CRS object. Should be an appropriate equal-area projection for accurate area calculations.
zonal_engine (Literal['serial', 'parallel', 'dask', 'exactextract']) – The computational engine to use for zonal calculations. Options: “serial”, “parallel”, or “dask”. All three are deprecated and will be removed in a future version. Weighted zonal statistics support via exactextract is planned (issue #94).
zonal_writer (Literal['csv']) – The output format for zonal statistics. Currently supports “csv”.
out_path (str | None) – Directory path where output files will be saved. If None, output will not be written to disk.
file_prefix (str | None) – Prefix for the output filename. If None, no prefix will be used.
append_date (bool) – Whether to append current timestamp to the filename. Creates filenames like “2024_01_15_14_30_00_prefix.csv”.
precision (int | None) – Number of decimal places for output statistics. If None, no rounding is applied.
jobs (int) – Number of parallel jobs for computation. Only used with “parallel” or “dask” engines.

Raises:

FileNotFoundError – If the specified output path does not exist.
TypeError – If an invalid zonal engine is specified.

Notes

For most accurate area-weighted calculations, use an equal-area projection appropriate for your study region. Geographic coordinate systems (like EPSG:4326) can introduce area distortions.

Area-weighted zonal statistics calculator for raster data.

__init__(user_data, weight_gen_crs, zonal_engine, zonal_writer, out_path=None, file_prefix=None, append_date=False, precision=None, jobs=1)[source]#

Initialize WeightedZonalGen for calculating area-weighted zonal statistics.

Parameters:

user_data (UserData) – An instance of UserData containing the preprocessed raster and vector data for zonal analysis.
weight_gen_crs (str | int | CRS) – The coordinate reference system for area weight calculations. Can be an EPSG code (int), proj4 string (str), or pyproj CRS object. Should be an appropriate equal-area projection for accurate area calculations.
zonal_engine (Literal['serial', 'parallel', 'dask', 'exactextract']) – The computational engine to use for zonal calculations. Options: “serial”, “parallel”, or “dask”. All three are deprecated and will be removed in a future version. Weighted zonal statistics support via exactextract is planned (issue #94).
zonal_writer (Literal['csv']) – The output format for zonal statistics. Currently supports “csv”.
out_path (str | None) – Directory path where output files will be saved. If None, output will not be written to disk.
file_prefix (str | None) – Prefix for the output filename. If None, no prefix will be used.
append_date (bool) – Whether to append current timestamp to the filename. Creates filenames like “2024_01_15_14_30_00_prefix.csv”.
precision (int | None) – Number of decimal places for output statistics. If None, no rounding is applied.
jobs (int) – Number of parallel jobs for computation. Only used with “parallel” or “dask” engines.

Raises:

FileNotFoundError – If the specified output path does not exist.
TypeError – If an invalid zonal engine is specified.

Notes

For most accurate area-weighted calculations, use an equal-area projection appropriate for your study region. Geographic coordinate systems (like EPSG:4326) can introduce area distortions.

calculate_weighted_zonal(categorical=False)[source]#

Calculate area-weighted zonal statistics for raster data.

Computes zonal statistics weighted by the actual area of intersection between raster cells and zone boundaries. This provides more accurate statistics when raster cells are partially covered by zones or when working with geographic coordinate systems.

Parameters:: categorical (bool) – Whether to calculate categorical statistics (mode, unique counts) or continuous statistics (mean, sum, std, etc.). For categorical data like land cover classifications, set to True.
Returns:: A DataFrame containing the calculated area-weighted zonal statistics with zones as rows and statistics as columns. The exact columns depend on the data type and whether categorical=True.
Return type:: DataFrame

Notes

The method automatically rounds results to the specified precision if set during initialization. Execution time is printed to stdout for performance monitoring.

Examples

Calculate weighted continuous statistics:

>>> stats = weighted_gen.calculate_weighted_zonal(categorical=False)
>>> print(stats.columns) 
Index(['zone_id', 'weighted_mean', 'weighted_sum', 'total_area'])

Calculate weighted categorical statistics:

>>> stats = weighted_gen.calculate_weighted_zonal(categorical=True)
>>> print(stats.columns) 
Index(['zone_id', 'weighted_mode', 'area_weighted_majority'])

Zonal Statistics Generation

Contents

Zonal Statistics Generation#

Features#

Quick Start#

Standard Zonal Statistics#

Area-Weighted Zonal Statistics#

Computational Engines#

Serial Engine#

Parallel Engine#

Dask Engine#

Exactextract Engine#

Data Types and Statistics#

Continuous Data#

Categorical Data#

Best Practices#

Coordinate Reference Systems#

Performance Optimization#

Output Management#

Error Handling#

Common Issues and Solutions#

Advanced Examples#

Multi-Temporal Analysis#

Comparison of Standard vs. Weighted Statistics#

Type Definitions#

API Reference#

Classes#

See Also#