Zonal statistics of rasters#
import geopandas as gpd
import pandas as pd
import rioxarray as rxr
import xarray as xr
from gdptools import ZonalGen
from gdptools.data.user_data import UserTiffData
# Be sure to use rioxarray to read tiff
rds_slope = rxr.open_rasterio("../../../tests/data/rasters/slope/slope.tif")
rds_text = rxr.open_rasterio("../../../tests/data/rasters/TEXT_PRMS/TEXT_PRMS.tif")
# Use geopandas to read shape file,
# Best if shape file only has feature id, and geometry if possible.
gdf = gpd.read_file("../../../tests/data/Oahu.shp")
id_feature = "fid"
print(len(gdf.groupby(id_feature)))
# These params are used to fill out the TiffAttributes class
tx_name= 'x'
ty_name = 'y'
band = 'band'
crs = 26904
varname = "slope" # not currently used
categorical = False # is the data categorical or not, if the data are integers it should
# probably be categorical.
data = UserTiffData(
var=varname,
ds=rds_slope,
proj_ds=crs,
x_coord=tx_name,
y_coord=ty_name,
band=1,
bname=band,
f_feature=gdf,
id_feature=id_feature,
proj_feature=crs
)
zonal_gen = ZonalGen(
user_data=data,
zonal_engine="serial",
zonal_writer="csv",
out_path=".",
file_prefix="Oahu_slope_stats"
)
stats = zonal_gen.calculate_zonal(categorical=categorical)
print(stats)
# stats = zonal_gen.calculate_zonal(categorical=True)
# print(stats)
304
data prepped for zonal in 0.0087 seconds
converted tiff to points in 1.3728 seconds
- fixing 0 invalid polygons.
overlaps calculated in 1.4767 seconds
zonal stats calculated in 0.6617 seconds
[]
number of missing values: 0
fill missing values with nearest neighbors in 0.0040 seconds
Total time for serial zonal stats calculation 3.8094 seconds
count mean std min 25% 50% 75% max sum
fid
1104.0 20431.0 1.872155 3.757167 0.0 0.00 0.0 1.0 32.0 38250
1105.0 2388.0 26.350921 13.473957 5.0 14.00 24.0 37.0 61.0 62926
1106.0 3192.0 33.687030 14.188686 1.0 22.00 34.0 45.0 70.0 107529
1107.0 6600.0 23.704091 15.854788 0.0 10.00 21.0 37.0 66.0 156447
1108.0 11380.0 22.116169 14.062708 0.0 11.00 21.0 33.0 62.0 251682
... ... ... ... ... ... ... ... ... ...
1403.0 2753.0 1.722121 3.183146 0.0 0.00 1.0 1.0 23.0 4741
1404.0 99.0 1.363636 1.082840 0.0 0.00 2.0 2.0 3.0 135
1405.0 174.0 10.505747 7.044653 0.0 5.25 9.0 14.0 32.0 1828
1406.0 3089.0 10.931693 10.120911 0.0 4.00 7.0 15.0 59.0 33768
1407.0 321.0 1.679128 1.702893 0.0 1.00 1.0 2.0 9.0 539
[304 rows x 9 columns]
gdf.sort_values(by=id_feature, inplace=True)
stats.sort_values(by=id_feature, inplace=True)
gdf["slope_mean"] = stats["mean"].values
gdf
fid | rpu | cat_id | hru_segmen | geometry | slope_mean | |
---|---|---|---|---|---|---|
0 | 1104.0 | f | 1104 | 995.0 | POLYGON ((601200.000 2359860.000, 601180.267 2... | 1.872155 |
1 | 1105.0 | f | 1105 | 996.0 | POLYGON ((593760.000 2369060.000, 593675.200 2... | 26.350921 |
2 | 1106.0 | f | 1106 | 997.0 | POLYGON ((615055.367 2375324.633, 615130.633 2... | 33.687030 |
3 | 1107.0 | f | 1107 | 998.0 | POLYGON ((619150.000 2380290.000, 619096.633 2... | 23.704091 |
4 | 1108.0 | f | 1108 | 999.0 | POLYGON ((614640.000 2382950.000, 614628.833 2... | 22.116169 |
... | ... | ... | ... | ... | ... | ... |
299 | 1403.0 | f | 1403 | 1070.0 | POLYGON ((620198.262 2355012.754, 620153.300 2... | 1.722121 |
300 | 1404.0 | f | 1404 | 1029.0 | POLYGON ((579986.031 2380993.933, 580280.000 2... | 1.363636 |
301 | 1405.0 | f | 1405 | 1146.0 | POLYGON ((625200.000 2365570.000, 625207.872 2... | 10.505747 |
302 | 1406.0 | f | 1406 | 1146.0 | POLYGON ((625144.965 2365356.034, 625183.900 2... | 10.931693 |
303 | 1407.0 | f | 1407 | 1233.0 | POLYGON ((618131.067 2357417.700, 618105.367 2... | 1.679128 |
304 rows × 6 columns
gdf.plot(column="slope_mean")
<Axes: >
Parallel generator#
In this case the generator is slower because the domain is small and the overhead of generating the parallel processes takes more time than the actual calculation. However with a large domain the “parallel” zonal_engine is faster and has a smaller memory footprint.
zonal_gen2 = ZonalGen(
user_data=data,
zonal_engine="parallel",
zonal_writer="csv",
out_path=".",
file_prefix="Oahu_slope_stats_p"
)
stats = zonal_gen2.calculate_zonal(categorical=categorical)
print(stats)
data prepped for zonal in 0.0098 seconds
- fixing 0 invalid polygons.
<class 'pandas.core.frame.DataFrame'>
Parallel calculation of zonal stats in 9.906 seconds
Total time for serial zonal stats calculation 9.9332 seconds
count mean std min 25% 50% 75% max sum
fid
1104.0 20431.0 1.872155 3.757167 0.0 0.00 0.0 1.0 32.0 38250
1105.0 2388.0 26.350921 13.473957 5.0 14.00 24.0 37.0 61.0 62926
1106.0 3192.0 33.687030 14.188686 1.0 22.00 34.0 45.0 70.0 107529
1107.0 6600.0 23.704091 15.854788 0.0 10.00 21.0 37.0 66.0 156447
1108.0 11380.0 22.116169 14.062708 0.0 11.00 21.0 33.0 62.0 251682
... ... ... ... ... ... ... ... ... ...
1403.0 2753.0 1.722121 3.183146 0.0 0.00 1.0 1.0 23.0 4741
1404.0 99.0 1.363636 1.082840 0.0 0.00 2.0 2.0 3.0 135
1405.0 174.0 10.505747 7.044653 0.0 5.25 9.0 14.0 32.0 1828
1406.0 3089.0 10.931693 10.120911 0.0 4.00 7.0 15.0 59.0 33768
1407.0 321.0 1.679128 1.702893 0.0 1.00 1.0 2.0 9.0 539
[304 rows x 9 columns]
gdf.sort_values(by=id_feature, inplace=True)
stats.sort_values(by=id_feature, inplace=True)
gdf["slope_mean"] = stats["mean"].values
gdf
fid | rpu | cat_id | hru_segmen | geometry | slope_mean | |
---|---|---|---|---|---|---|
0 | 1104.0 | f | 1104 | 995.0 | POLYGON ((601200.000 2359860.000, 601180.267 2... | 1.872155 |
1 | 1105.0 | f | 1105 | 996.0 | POLYGON ((593760.000 2369060.000, 593675.200 2... | 26.350921 |
2 | 1106.0 | f | 1106 | 997.0 | POLYGON ((615055.367 2375324.633, 615130.633 2... | 33.687030 |
3 | 1107.0 | f | 1107 | 998.0 | POLYGON ((619150.000 2380290.000, 619096.633 2... | 23.704091 |
4 | 1108.0 | f | 1108 | 999.0 | POLYGON ((614640.000 2382950.000, 614628.833 2... | 22.116169 |
... | ... | ... | ... | ... | ... | ... |
299 | 1403.0 | f | 1403 | 1070.0 | POLYGON ((620198.262 2355012.754, 620153.300 2... | 1.722121 |
300 | 1404.0 | f | 1404 | 1029.0 | POLYGON ((579986.031 2380993.933, 580280.000 2... | 1.363636 |
301 | 1405.0 | f | 1405 | 1146.0 | POLYGON ((625200.000 2365570.000, 625207.872 2... | 10.505747 |
302 | 1406.0 | f | 1406 | 1146.0 | POLYGON ((625144.965 2365356.034, 625183.900 2... | 10.931693 |
303 | 1407.0 | f | 1407 | 1233.0 | POLYGON ((618131.067 2357417.700, 618105.367 2... | 1.679128 |
304 rows × 6 columns
gdf.plot(column="slope_mean")
<Axes: >
varname = "TEXT" # not currently used
categorical = True # is the data categorical or not, if the data are integers it should
# probably be categorical.
data = UserTiffData(
var=varname,
ds=rds_text,
proj_ds=crs,
x_coord=tx_name,
y_coord=ty_name,
band=1,
bname=band,
f_feature=gdf,
id_feature=id_feature,
proj_feature=crs
)
zonal_gen = ZonalGen(
user_data=data,
zonal_engine="serial",
zonal_writer="csv",
out_path=".",
file_prefix="Oahu_TEXT_stats"
)
stats = zonal_gen.calculate_zonal(categorical=categorical)
print(stats)
data prepped for zonal in 0.0052 seconds
converted tiff to points in 0.0084 seconds
- fixing 0 invalid polygons.
overlaps calculated in 0.0204 seconds
categorical zonal stats calculated in 0.2473 seconds
[[ 0 1 2 3 4 5 6]
[ 24 88 191 108 122 281 290]]
number of missing values: 7
fill missing values with nearest neighbors in 0.0088 seconds
Total time for serial zonal stats calculation 0.3143 seconds
count unique top freq
fid
1104.0 295 2 2 193
1105.0 34 2 2 30
1106.0 48 2 2 47
1107.0 94 2 3 76
1108.0 167 2 3 134
... ... ... ... ...
1298.0 43 2 3 35
1307.0 550 2 2 463
1377.0 103 2 2 87
1389.0 20 1 2 20
1401.0 41 2 2 25
[304 rows x 4 columns]
stats.sort_values(by=id_feature, inplace=True)
gdf["TEXT"] = stats["top"].values
gdf
fid | rpu | cat_id | hru_segmen | geometry | slope_mean | TEXT | |
---|---|---|---|---|---|---|---|
0 | 1104.0 | f | 1104 | 995.0 | POLYGON ((601200.000 2359860.000, 601180.267 2... | 1.872155 | 2 |
1 | 1105.0 | f | 1105 | 996.0 | POLYGON ((593760.000 2369060.000, 593675.200 2... | 26.350921 | 2 |
2 | 1106.0 | f | 1106 | 997.0 | POLYGON ((615055.367 2375324.633, 615130.633 2... | 33.687030 | 2 |
3 | 1107.0 | f | 1107 | 998.0 | POLYGON ((619150.000 2380290.000, 619096.633 2... | 23.704091 | 3 |
4 | 1108.0 | f | 1108 | 999.0 | POLYGON ((614640.000 2382950.000, 614628.833 2... | 22.116169 | 3 |
... | ... | ... | ... | ... | ... | ... | ... |
299 | 1403.0 | f | 1403 | 1070.0 | POLYGON ((620198.262 2355012.754, 620153.300 2... | 1.722121 | 2 |
300 | 1404.0 | f | 1404 | 1029.0 | POLYGON ((579986.031 2380993.933, 580280.000 2... | 1.363636 | 2 |
301 | 1405.0 | f | 1405 | 1146.0 | POLYGON ((625200.000 2365570.000, 625207.872 2... | 10.505747 | 2 |
302 | 1406.0 | f | 1406 | 1146.0 | POLYGON ((625144.965 2365356.034, 625183.900 2... | 10.931693 | 2 |
303 | 1407.0 | f | 1407 | 1233.0 | POLYGON ((618131.067 2357417.700, 618105.367 2... | 1.679128 | 2 |
304 rows × 7 columns
gdf.plot(column="TEXT")
<Axes: >
Parallel engine#
zonal_gen2 = ZonalGen(
user_data=data,
zonal_engine="parallel",
zonal_writer="csv",
out_path=".",
file_prefix="Oahu_TEXT_stats"
)
stats = zonal_gen2.calculate_zonal(categorical=categorical)
print(stats)
data prepped for zonal in 0.0071 seconds
- fixing 0 invalid polygons.
<class 'pandas.core.frame.DataFrame'>
Parallel calculation of zonal stats in 3.133 seconds
[[ 0 1 2 3 4 5 6]
[ 24 88 191 108 122 281 290]]
number of missing values: 7
fill missing values with nearest neighbors in 0.0085 seconds
Total time for serial zonal stats calculation 3.1641 seconds
count unique top freq
fid
1104.0 295 2 2 193
1105.0 34 2 2 30
1106.0 48 2 2 47
1107.0 94 2 3 76
1108.0 167 2 3 134
... ... ... ... ...
1298.0 43 2 3 35
1307.0 550 2 2 463
1377.0 103 2 2 87
1389.0 20 1 2 20
1401.0 41 2 2 25
[304 rows x 4 columns]
stats.sort_values(by=id_feature, inplace=True)
gdf["TEXT"] = stats["top"].values
gdf
fid | rpu | cat_id | hru_segmen | geometry | slope_mean | TEXT | |
---|---|---|---|---|---|---|---|
0 | 1104.0 | f | 1104 | 995.0 | POLYGON ((601200.000 2359860.000, 601180.267 2... | 1.872155 | 2 |
1 | 1105.0 | f | 1105 | 996.0 | POLYGON ((593760.000 2369060.000, 593675.200 2... | 26.350921 | 2 |
2 | 1106.0 | f | 1106 | 997.0 | POLYGON ((615055.367 2375324.633, 615130.633 2... | 33.687030 | 2 |
3 | 1107.0 | f | 1107 | 998.0 | POLYGON ((619150.000 2380290.000, 619096.633 2... | 23.704091 | 3 |
4 | 1108.0 | f | 1108 | 999.0 | POLYGON ((614640.000 2382950.000, 614628.833 2... | 22.116169 | 3 |
... | ... | ... | ... | ... | ... | ... | ... |
299 | 1403.0 | f | 1403 | 1070.0 | POLYGON ((620198.262 2355012.754, 620153.300 2... | 1.722121 | 2 |
300 | 1404.0 | f | 1404 | 1029.0 | POLYGON ((579986.031 2380993.933, 580280.000 2... | 1.363636 | 2 |
301 | 1405.0 | f | 1405 | 1146.0 | POLYGON ((625200.000 2365570.000, 625207.872 2... | 10.505747 | 2 |
302 | 1406.0 | f | 1406 | 1146.0 | POLYGON ((625144.965 2365356.034, 625183.900 2... | 10.931693 | 2 |
303 | 1407.0 | f | 1407 | 1233.0 | POLYGON ((618131.067 2357417.700, 618105.367 2... | 1.679128 | 2 |
304 rows × 7 columns
gdf.plot(column="TEXT")
<Axes: >