2 Utilities

This chapter provides a brief description of the utility functions used in this material. Most of these functions are available in the R package {egvtools}, which was created specifically for this work.

2.1 R package egvtools

{egvtools} provides a coherent set of wrappers and utilities that facilitate the reproducible and efficient creation of large-scale EGVs on real datasets. The package relies on robust building blocks — {terra}, {sf}, {sfarrow}, {exactextractr} and {whitebox} — and standardises input/output data, naming conventions and multi-scale zonal statistics, ensuring that the pipelines are repeatable across machines and projects.

The package was developed for the project ‘HiQBioDiv: High-resolution quantification of biodiversity for conservation and management’, which was funded by the Latvian Council of Science (Ref. No. VPP-VARAM-DABA-2024/1-0002), to simplify our work and to facilitate the reproduction of our results. Five of the functions are strictly for replication, while others are useful for a wider audience.

The package can be installed from GitHub with:

# install.packages("pak")
pak::pak("aavotins/egvtools")

or obtained as a Docker container with all the necessary system and software dependencies.

2.1.1 Reproduction-only functions

These functions help to recreate our working environment: the template files and their locations in the file tree.

These functions are:

download_raster_templates() - fetches template rasters from the Zenodo repository and places them in a user-specified location on disk or, by default, in the location we used. By default, the function links to version 2.0.0 of the dataset;
download_vector_templates() - fetches template vector grids/points from the Zenodo repository and places them in a user-specified location on disk or, by default, in the location we used. By default, the function links to version 1.0.1 of the dataset;
radius_function() - extracts summary statistics from raster layers using buffered polygon zones of multiple radii and rasterises them onto a common template grid. It is internally hard-coded to use the filenames (the first and second parts of the names produced by the tiling functions) used in this project. If these filenames are kept, the function can easily be used for other projects, regions, etc. The function can run sequentially, but parallel computing is much faster. If a fast swap disk is available, it needs only ca. 5 GiB of RAM per worker for the tasks in this project; without one, at least 20 GiB of RAM per worker must be assigned.
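
The idea behind multi-radius zonal statistics can be illustrated with a minimal sketch. It is not the implementation of radius_function(): it uses only {terra}, a toy raster and two made-up points, and the radii, object names and CRS are assumptions chosen for illustration. It also stops before the step that rasterises the results onto a template grid.

```r
library(terra)

# Toy raster standing in for an input layer: 20 x 20 cells of 100 m
r <- rast(nrows = 20, ncols = 20, xmin = 0, xmax = 2000,
          ymin = 0, ymax = 2000, crs = "EPSG:3059")
values(r) <- seq_len(ncell(r))

# Hypothetical analysis points (coordinates in metres)
pts <- vect(cbind(x = c(500, 1500), y = c(500, 1500)), crs = "EPSG:3059")

# Illustrative radii in metres (not the project defaults)
radii <- c(200, 400)

# For each radius: buffer the points, then summarise the raster per buffer
stats <- lapply(radii, function(rad) {
  zones <- buffer(pts, width = rad)
  res <- extract(r, zones, fun = mean, na.rm = TRUE)
  data.frame(point = res$ID, radius = rad, mean = res[[2]])
})
stats <- do.call(rbind, stats)
print(stats)
```

The result is one row per point and radius. In a real workflow the loop over radii (or over tiles) is the natural unit to distribute across parallel workers.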

2.1.2 General purpose functions

Each of these functions is a small workflow in itself; they can be combined into larger workflows and used well beyond Latvia.

tile_vector_grid() - tiles the template (vector) grid for chunked processing. The function is internally tied to our file naming convention; as long as that convention is maintained, it can be used to create a tiled grid from any {sfarrow} parquet grid file;
tiled_buffers() - precomputes buffered tiles for multiple radii around points. The function is internally tied to our file naming convention; as long as that convention is maintained, it can be used to create tiled polygons with buffers around points from any {sfarrow} parquet grid file. There are three buffering modes: dense (buffers the best-matching pts100*.parquet, preferring pts100_sauzeme.parquet, for each tile by radii_dense (default: 500, 1250, 3000 and 10000 m), ensuring that every analysis grid cell has the desired buffer; computationally heavy in the subsequent workflows), sparse (uses a file-to-radius mapping and is highly generalizable), and specified (the same as sparse, but with a single point file). In our workflows we used the sparse mode with the default mapping;
create_backgrounds() - a wrapper around terra::ifel() to build consistent background rasters. The function guards the coordinate reference system and how it is stored, as well as spatial cover, resolution and exact pixel matching. Creating layers with default background values once is faster than recreating them several times in the workflows that prepare EGVs;
polygon2input() - rasterises polygons to input layers. It handles only polygon data; other geometry types need to be buffered first. Polygon/multipolygon sf data are rasterized to a raster aligned to a template GeoTIFF. Rasterization targets a raster::RasterLayer built from the template (so grids normally match). Projection is optional (project_mode). Missing values are counted only over valid template cells. The user may optionally restrict the result with a raster mask (restrict_to) using numeric values or bracketed range strings (e.g., “(0,5]”, “[10,)”). Remaining NA cells can be filled by covering with a background raster (background_raster) or a constant (background_value). For large rasters, heavy steps (projection/mask/cover) can stream to disk via terra_todisk=TRUE.
input2egv() - normalizes/aligns a fine-resolution input raster to a (coarser) EGV template, optionally covers missing values and/or fills gaps (IDW via Whitebox), and writes the result to disk. Designed for large runs: fast gap counting (inside the template footprint only), optional filling, tuned GDAL write options, and controlled terra memory/temp behavior.
downscale2egv() - downscales coarse rasters to a template grid (CRS, resolution, extent), masks them to the template footprint, and optionally (1) fills NoData gaps using WhiteboxTools’ IDW-based fill_missing_data and (2) applies IDW smoothing to reduce blockiness from low-resolution inputs.
distance2egv() - computes Euclidean distance (in map units) from cells matching a set of class values in an input raster to all cells of an EGV template grid, then writes a Float32 GeoTIFF aligned to the template. Designed to work with rasters produced by polygon2input().
landscape_function() - computes a {landscapemetrics} metric (default “lsm_l_shdi”), optionally with extra lm_args, that yields one value per zone and per input layer. It runs tile-by-tile (by tile_field), writes per-tile rasters, merges them into the final per-layer GeoTIFF(s), then performs gap analysis (NA count within the template footprint and optional maximum gap width) and optional IDW gap filling via WhiteboxTools. It returns a compact data.frame with per-layer statistics and timing. The function can run sequentially, but parallel computing is much faster. If a fast swap disk is available, it needs only 3 GiB of RAM per worker for the tasks in this project; without one, at least 20 GiB of RAM per worker must be assigned.
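
Three of the building blocks described above (a background layer built with terra::ifel(), covering NA cells with that background, and a distance-to-class layer) can be sketched with plain {terra} calls. This is a toy illustration under stated assumptions, not the code inside create_backgrounds(), polygon2input() or distance2egv(): the template, the class code 2 and all object names are invented for the example, and nothing is written to disk.

```r
library(terra)

# Toy template: 10 x 10 cells of 100 m; the first two rows lie outside the footprint
template <- rast(nrows = 10, ncols = 10, xmin = 0, xmax = 1000,
                 ymin = 0, ymax = 1000, crs = "EPSG:3059")
v <- rep(1, ncell(template))
v[1:20] <- NA
values(template) <- v

# Background layer: 0 inside the template footprint, NA outside
background <- ifel(is.na(template), NA, 0)

# Toy class raster on the same grid: class 2 in a single cell, NA elsewhere
classes <- rast(template)
cv <- rep(NA_real_, ncell(classes))
cv[55] <- 2
values(classes) <- cv

# Cover the remaining NA cells with the background, then clip to the footprint
filled <- mask(cover(classes, background), template)

# Distance (in map units) from cells of the selected class to all other cells
target <- ifel(filled == 2, 1, NA)
dist_m <- mask(distance(target), template)
```

Because every layer is derived from the same template object, the outputs share extent, resolution and CRS, which is the property the package functions are designed to guarantee for real data.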

2.2 Other utility functions

Other handy, repeatedly used functions that are not included in {egvtools} are stored in the file egvs02.02_UtilityFunctions.R, located in Data/RScipts_final.

ensure_multipolygons() - a rather aggressive function that creates MULTIPOLYGON geometries from GEOMETRYCOLLECTION geometries.

if(!require(sf)) {install.packages("sf"); require(sf)}
if(!require(gdalUtilities)) {install.packages("gdalUtilities"); require(gdalUtilities)}

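# Round-trip the data through GeoPackage files so that GDAL's ogr2ogr can
# force every geometry to MULTIPOLYGON (nlt = "MULTIPOLYGON").
ensure_multipolygons <- function(X) {
  library(sf)
  library(gdalUtilities)
  
  tmp1 <- tempfile(fileext = ".gpkg")  # input file written from X
  tmp2 <- tempfile(fileext = ".gpkg")  # output file with converted geometries
  st_write(X, tmp1)
  ogr2ogr(tmp1, tmp2, f = "GPKG", nlt = "MULTIPOLYGON")
  Y <- st_read(tmp2)
  # Recombine the original attributes with the converted geometries
  st_sf(st_drop_geometry(X), geom = st_geometry(Y))
}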
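
For comparison, a gentler route that stays within {sf} is sketched below. It is a different technique from the ogr2ogr round trip above, not a drop-in replacement: st_collection_extract() keeps only the polygon parts of each collection, and a collection holding several polygons may come back as several rows. The toy object x is made up for the example.

```r
library(sf)

# Toy GEOMETRYCOLLECTION holding one polygon and one stray point
gc <- st_sfc(
  st_geometrycollection(list(
    st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0)))),
    st_point(c(5, 5))
  ))
)
x <- st_sf(id = 1, geometry = gc)

# Keep only the polygon parts, then promote them to MULTIPOLYGON
polys <- st_collection_extract(x, "POLYGON")
polys <- st_cast(polys, "MULTIPOLYGON")

print(st_geometry_type(polys))
```

Which route is preferable depends on the data: the ogr2ogr version lets GDAL decide how to coerce each geometry, whereas this version makes explicit that the non-polygon parts are dropped.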