Preface

Welcome! This book documents the geodata and processing workflows used to create ecogeographical variables (EGVs) for species distribution modelling in Latvia (2024).

This material presents the results of three University of Latvia projects deeply rooted in species distribution modelling and, more importantly, explains the workflow and decisions made to ensure their repeatability and reproducibility. These projects are:

  • The project “Preparation of a geospatial data layer covering existing protected areas for the implementation of the EU Biodiversity Strategy 2030” (No. 1-08/73/2023), funded by the Administration of the Latvian Environmental Protection Fund;

  • Scientific research service project commissioned by the Joint Stock Company “Latvijas valsts meži” (Latvian State Forests) “Improvement of the monitoring of the northern goshawk Accipiter gentilis and creation of a spatial model of habitat suitability” (Latvian State Forests document No. 5-5.5.1_000r_101_23_27_6);

  • State research program “Development of research specified in the Biodiversity Priority Action Program” project “High-resolution quantification of biodiversity for nature conservation and management: HiQBioDiv” (VPP-VARAM-DABA-2024/1-0002).

The material was developed in R using {bookdown}. The data processing and analysis described here were mainly performed in R, and one of the main reasons for creating this material was to pass on the information needed to reproduce the work using verified command lines. A desirable side effect is to promote openness and reproducibility in scientific practice and practical science.

About this material

This material is not:

  • an introduction to R or any other programming language. On the contrary, it will be most useful to those who already understand how to use command lines. However, it will also inform other users about the approaches used;

  • a tutorial on geoprocessing. This material summarizes the approaches that, at the time of its development, were known to the authors as the most effective (in terms of processing time, RAM and hard disk space, performance guarantees, and reliability), but they are certainly not the only ones possible;

  • a copy/paste-ready product. Although command lines are usually published for exactly this purpose, it is simply not possible here, because the work uses large amounts of data, some of which have restricted access. However, once the data are available and placed in accordance with the file structure of this project (available at root/Data or by forking the template repository), the command lines will be repeatable without changes and will produce the same results.
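Before running any of the command lines, it can help to confirm that the expected inputs are in place. The following is a minimal sketch in base R; the file names are hypothetical placeholders, not actual files of this project, and the paths are assumed to be relative to the project root.

```r
# Hypothetical list of inputs, given relative to the project root
required_inputs <- c("./Data/example_input_a.tif",
                     "./Data/example_input_b.gpkg")

# file.exists() returns one TRUE/FALSE value per path
available <- file.exists(required_inputs)

# Report anything missing before starting a long processing run
missing_inputs <- required_inputs[!available]
if (length(missing_inputs) > 0) {
  message("Missing inputs: ", paste(missing_inputs, collapse = ", "))
} else {
  message("All listed inputs are in place.")
}
```

A check of this kind only confirms that the files exist at the expected paths; it does not verify their content or version.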

This material has been prepared to provide a reproducible workflow, describing the decisions made and solutions implemented in the preparation of ecogeographical variables for species distribution (habitat suitability) modelling for biodiversity conservation planning.

For the most part, this material consists of:

  • explanatory text, which is recognizable as text;

  • command lines, which are hidden by default to make the text easier to read. The locations of the command lines can be identified by the “|> Code” visible on the left side of the page, just below this paragraph. Clicking on it will open the code area, where the text on a grey background is command lines, for example:

Code
object=function(arguments1,arguments2,
path="./path/file/tree/object.extension")
# comment

In the example above, the first line creates an object (“object”) that is the result of a function (“function()”). The function has three arguments (“arguments1”, “arguments2” and “path”) separated by commas (as with all function arguments in R). The third argument is the path in the file tree. It is on “a new line” but is a continuation of the function on the previous line, because the parentheses are not closed. Note the beginning “./”, which indicates a relative path - the location in the file tree is relative to the project location.

The last line of the example above is a comment: everything after “#” is a comment. Anything in a command line before “#” must be an executable function or object. A comment can contain anything and can be placed on the same line as an executable function (at the end of it).
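As a concrete counterpart to the schematic example, the following minimal sketch uses only base R. The file name and the table contents are made up for illustration, and a temporary directory stands in for the project location. It uses “<-”, the more common assignment operator in R, which here behaves the same as “=” in the schematic.

```r
# Hypothetical project location: a temporary directory, so the sketch runs anywhere
project_dir <- tempdir()
old_wd <- setwd(project_dir)
dir.create("./Data", showWarnings = FALSE)

# Write a small table to a relative path (illustrative file name)
write.csv(data.frame(id = 1:3, value = c(0.2, 0.5, 0.9)),
          file = "./Data/example.csv",
          row.names = FALSE)

# The call continues on a new line because the parentheses are not yet closed
example_table <- read.csv(file = "./Data/example.csv",
                          header = TRUE)

nrow(example_table) # a comment may follow executable code on the same line

setwd(old_wd) # return to the original working directory
```

Because the paths start with “./”, the same lines would work unchanged in any project location that contains a Data directory.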

Command lines are the most important part of this material for reproducibility. However, the person using them must ensure the availability of input data and maintain correct paths in the file tree.

In this material, code chunks are formatted as individual pieces to better pinpoint the commands used for the job described in the surrounding text. In a practical setting, however, creating the ecogeographical variables is much faster if the commands are combined in loops or another batch-processing setup. The command lines used in practice are available in the home repository of this material at Data/RScripts_final; they can be executed in alphanumeric order unless specified otherwise. We performed parts of the computation on the University of Latvia Institute of Numerical Modelling HPC cluster, using the same file tree as in this material. The shell scripts used to run the R commands are available in the home repository of this material at Data/hpc_io/Jobs_shell/2024/EGVs.
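Running a directory of scripts in alphanumeric order can be sketched with a simple loop. The example below is an illustration under stated assumptions, not the procedure used in the project: the directory and script names are hypothetical, and three toy scripts are created in a temporary directory so that the sketch is self-contained.

```r
# Hypothetical stand-in for a scripts directory; replace with the real location
scripts_dir <- file.path(tempdir(), "RScripts_demo")
dir.create(scripts_dir, showWarnings = FALSE)

# Three toy scripts with illustrative names; each appends its step to a log
run_log <- character(0)
writeLines('run_log <- c(run_log, "step 01")', file.path(scripts_dir, "01_prepare.R"))
writeLines('run_log <- c(run_log, "step 02")', file.path(scripts_dir, "02_process.R"))
writeLines('run_log <- c(run_log, "step 10")', file.path(scripts_dir, "10_export.R"))

# list.files() returns sorted names; sort() makes the intended order explicit
scripts <- sort(list.files(scripts_dir, pattern = "\\.R$", full.names = TRUE))

# Batch processing: one loop instead of running each script by hand
for (script in scripts) {
  source(script, local = TRUE)
}

run_log
```

The toy names use zero-padded numeric prefixes so that alphabetical and numerical order agree; without padding, a name starting with “10” would sort before one starting with “2”.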

Sometimes we refer to R packages in the text; their names are written in curly brackets, for example, {package}.

  • graphics - occasional diagrams that describe the workflow or data characteristics and maps;

  • links to other resources, especially to higher-level products and results created within the project, as well as any publicly available data. The results are intended for practical use.

Within reason, the material describes all data sets used and provides metadata related to ensuring reproducibility. Since not all data sets are freely available, they are not published as such, but in all cases information is provided on how they were obtained for the development of this project.