# ‘RISDM’: species distribution modelling from multiple data sources in R

## Abstract

Species distribution models (SDMs) are usually based on a single data type, such as presence-only (PO), presence-absence (PA) or abundance (AA). Results from SDMs that use a single source of data suffer from the biases and limitations inherent to that data type. For example, PO data contain sampling bias, and PA/AA data are often less expansive and sparser. Integrated SDMs (ISDMs) combine multiple data types and have recently emerged as a way to leverage the strengths and minimise the weaknesses of the different data types. They pose a common (distribution) model and separate observation models for each of the data types. The ‘RISDM’ package for the R environment (www.r-project.org) provides access to this modelling framework using functions for preparing, fitting, interpreting and diagnosing models. The functionality of the package is demonstrated here using synthetic data sets.

## Introduction

Species distribution models (SDMs) are a well-established and useful tool for ecologists and natural resource managers (Elith and Leathwick 2009, Guisan et al. 2013). They allow researchers to quantify and predict by leveraging relationships between species data and the environment, space and time (Miller 2010). SDMs are data-driven models and are hence limited by the data that they are built upon. As such, they should incorporate informative data wherever possible, even when these data come from different sources, have different measurements, and potentially contain different biases (see Barry and Elith 2006 for an extensive list). The challenge in integrating different data sources is statistical: how can the data's information be aligned so that the resulting inferences are coherent and carry appropriate measures of uncertainty (Fletcher et al. 2019, Miller et al. 2019, Isaac et al. 2020)?

Survey data are typically high-quality as their interpretation follows directly from the survey design. These survey data are often measured as presence–absence (PA) or abundance (AA) observations and are expensive to acquire. Hence, they are often spatially sparse and/or have patchy spatial coverage. In contrast, records of where individuals of a species have been seen, such as those in biodiversity atlases, are much more common. These records, called presence-only (PO) data, usually cover a broader geographical area than survey data, and so potentially contain a substantial amount of information about the species distribution. However, PO data can be biased as the species is more likely to be seen where people tend to be searching/reporting (sampling bias, Dennis and Thomas 2000, Reddy and Dávalos 2003, Hortal et al. 2008, Warton et al. 2013). Integrated SDMs (ISDMs) aim to combine these data sources and in doing so leverage the accuracy of survey (PA and AA) data and the volume of PO data, whilst simultaneously mitigating sources of bias (e.g. sampling bias).

The ‘RISDM’ package for the R environment (www.r-project.org) provides a framework in which analysts can model a species' distribution from multiple sources of data: PO, PA and AA. The intention in developing ‘RISDM’ is to make the ISDM framework accessible to a wide range of users, with a straightforward and familiar syntax. The ISDM in ‘RISDM’ extends the familiar generalised linear model (GLM) by allowing different probability models for each data source, but shared parameters for the species distribution (Fletcher et al. 2019, Miller et al. 2019, Isaac et al. 2020). Integrating data within a single model allows for straightforward propagation of uncertainty and a simplified statistical interpretation.

Alternative R packages to ‘RISDM’ are ‘spOccupancy’ (Doser et al. 2022), ‘PointedSDMs’ (Mostert and O'Hara 2022), and ‘ibis.iSDM’ (Jung 2023). ‘RISDM’ differs from ‘spOccupancy’ in the type of target data: ‘spOccupancy’ focuses entirely on occupancy data (a particular type of PA or AA data) and is therefore not appropriate for the types of analyses that ‘RISDM’ accommodates (and vice versa). ‘ibis.iSDM’ offers a wide array of models and options. The most complex of these is similar to the model in ‘RISDM’, but it accommodates only PO and PA data. ‘PointedSDMs’ is the closest in nature to ‘RISDM’ as it, like ‘RISDM’, implements a model based on that in Isaac et al. (2020) and used the code in Dambly et al. (2019) as an initial template. After forking from the Dambly et al. (2019) code, ‘RISDM’ has changed in many ways, both big and small, to reach its current form. ‘RISDM’ attempts to simplify the entire analysis process (from mesh creation to diagnostic plotting) but necessarily loses generality in doing so. ‘PointedSDMs’ is able to accommodate more complex models (e.g. modelling of marked point processes). The computational strategies implemented by the ‘RISDM’ and ‘PointedSDMs’ packages differ: the approximation used for the conditional likelihood for the PO data in ‘RISDM’ is based on the simple, yet robust, gridding method (section “The integrated species distribution model in ‘RISDM’”), whereas ‘PointedSDMs’ is based on the method of Simpson et al. (2016). The gridding method conveniently allows for specification of the amount of suitable habitat area within each grid cell, which is useful for species with strong habitat preferences (e.g. arboreal species).

In ‘RISDM’, functions are provided that aid each step in the full analysis pathway, from model set-up, through model fitting and model diagnostics, to prediction. ‘RISDM’ also allows flexible model specification with user-specified fixed effect models for each data source. We have found, from our early and ongoing applications, that the ‘RISDM’ package is effective, general, usable, and easy to learn for new users.

## The integrated species distribution model in ‘RISDM’

A brief overview of the model implemented in ‘RISDM’ is presented here, and readers are encouraged to read Fletcher et al. (2019) and Isaac et al. (2020) for a more detailed introduction. For visually-inspired readers, we have provided Fig. 1 as a parallel model description. The core idea of the model is that the patterns within each data type can be separated into two components: 1) a representation of the species' distribution, and 2) a description of variation that is not directly related to the species' distribution. The species distribution is shared between all the different data types, whereas the data-type-specific model is not shared beyond the bounds of a data type.

The species' distribution is described by an intensity (λ(**s**); the number of individuals per areal unit) at site **s**:

$$
\log(\lambda(\mathbf{s})) = \mathbf{x}(\mathbf{s})^{T}\boldsymbol{\beta} + u(\mathbf{s}) \tag{1}
$$

where **x**(**s**) is a vector of covariates, β a vector of parameters and *u*(**s**) is a Gaussian random effect with zero-mean that is correlated over space according to a Matérn covariance function. The presence of the spatial random effect will attempt to capture spatially-smooth deviations from the fixed effect model, and also to appropriately adjust the estimation process so that autocorrelation is accounted for. One notable argument for including a random effect is to account for missing, but important, spatially-smooth covariates.

PO data are typically affected by *sampling bias* (Dennis and Thomas 2000, Reddy and Dávalos 2003, Hortal et al. 2008), which arises from unmeasured human-induced variation in the sampling effort. Sampling bias can occur from a wide variety of processes, including how difficult it is to travel to a particular location (Dennis and Thomas 2000), conservation status of the location (Reddy and Dávalos 2003) and evolving recording procedures (changing species lists, Hortal et al. 2008). An ISDM attempts to model this bias by adding covariates into the model for the *PO expectation only*. These covariates are defined over the entire study region. In particular, the model for PO data is

$$
\log(\lambda_{1}(\mathbf{s})) = \log(A_{1}(\mathbf{s})) + \log(\lambda(\mathbf{s})) + \mathbf{x}_{1}(\mathbf{s})^{T}\boldsymbol{\beta}_{1} \tag{2}
$$

where λ_{1}(**s**) denotes the biased intensity underlying the PO data, log(λ(**s**)) is from Eq. 1, **x**_{1}(**s**) and β_{1} are the bias analogues of **x**(**s**) and β in Eq. 1, and *A*_{1}(**s**) is the amount of suitable habitat at location **s**. In Fig. 1, the bias terms (**x**_{1}(**s**)^{T}β_{1}) add only to the model for the PO data and are similar to those suggested in Warton et al. (2013). However, as there are multiple independent sources of information for the species' distribution, the ISDM can disaggregate the effect of a covariate on distribution and bias, and thus the additive assumption of environmental and observer bias variables made in Warton et al. (2013) is not necessary.

Variation in the survey (PA and AA) data that is not related to the species' distribution is termed *sampling artefacts* and can be included into the PA and AA data models by extending the distribution model for each data type:

$$
\begin{aligned}
\operatorname{cloglog}(\pi_{i}) &= \log(A_{2,i}) + \log(\lambda(\mathbf{s}_{i})) + \mathbf{w}_{2,i}^{T}\boldsymbol{\beta}_{2} \\
\log(\mu_{j}) &= \log(A_{3,j}) + \log(\lambda(\mathbf{s}_{j})) + \mathbf{w}_{3,j}^{T}\boldsymbol{\beta}_{3}
\end{aligned} \tag{3}
$$

In Eq. 3, model quantities π_{i} and μ_{j} are the probability of success for the *i*th PA datum and the intensity for the *j*th AA datum, respectively, cloglog is the complementary log-log link, and β_{2} and β_{3} are the artefact coefficients. Both π_{i} and μ_{j} are a function of the distribution model λ(**s**) at location **s**_{i} and **s**_{j} respectively. The covariates **w**_{2,i} and **w**_{3,j} are defined at the locations of the PA and AA data respectively, but are not required at other spatial locations. These terms are added only within each data type and affect only those data types (Fig. 1). Also added to the PA and AA data expressions are offsets that describe the amount of area sampled for each PA or AA observation (notated as log(*A*_{2,i}) and log(*A*_{3,j})). These terms are needed to ensure that the PA and AA data have a similar scale to the PO data and also to account for heterogeneity in measurement areas.

The model is completed with the specification of a link function and a likelihood for each data type (Fig. 1). The link functions and the likelihoods follow directly from the conditional Poisson process assumption (Fletcher et al. 2019, Isaac et al. 2020) and are outlined in Fig. 1 (Eq. 2–3). The ISDM requires approximation of the conditional likelihood for a Poisson point process to obtain posterior distributions. ‘RISDM’ employs strategies that are used elsewhere (e.g. Illian et al. 2012); the approximation to the point process is obtained by gridding: under the model, the number of points within each grid cell is conditionally a Poisson variable. Using this approximation, the amount of habitat area for the PO model (Eq. 2) is the area of habitat within each grid cell.
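The gridding approximation, and the way a Poisson process links intensity to a presence probability, can be illustrated with a small base-R sketch. The grid, the covariate, the coefficient values and the omission of the spatial random effect are all assumptions made for illustration; none of this is taken from the ‘RISDM’ code.

```r
set.seed(1)

# Hypothetical grid of 400 cells, each with 1 km^2 of suitable habitat
n.cells   <- 400
cell.area <- rep(1, n.cells)           # habitat area per grid cell
x         <- rnorm(n.cells)            # illustrative covariate
beta      <- 0.5                       # illustrative coefficient
lambda    <- exp(-1 + beta * x)        # intensity without u(s); -1 is arbitrary

# Gridded point process: the count in each cell is conditionally Poisson
po.count <- rpois(n.cells, lambda = cell.area * lambda)

# Under a Poisson process, a survey of area A2 records a presence with
# probability 1 - exp(-A2 * lambda)
A2      <- 0.1
pi.pa   <- 1 - exp(-A2 * lambda)
cloglog <- function(p) log(-log(1 - p))

# The complementary log-log of that probability is log(area) + log(intensity)
max.diff <- max(abs(cloglog(pi.pa) - (log(A2) + log(lambda))))
```

Here `max.diff` is zero up to floating-point error, which shows why a log-area offset places the PA data on the same scale as the gridded PO counts.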

Once the ISDM has been estimated (using *isdm()*) and checked (e.g. using *plot()* to show residuals), the final goal of prediction can be approached. In ‘RISDM’, this is performed using the function *predict()*, which draws and summarises samples from the model's posterior distribution: [λ(**s**)|data] (Fig. 1). Usually, the predictions are based on Eq. 1, so that they are free from the effects of sampling bias and sampling artefacts.

Inference for the model's parameters is based on Bayesian principles. In particular, ‘RISDM’ utilises the INLA posterior approximation and the ‘INLA’ package (Rue et al. 2009). ‘INLA’ provides a relatively computationally inexpensive method to obtain inferences from the Bayesian model. For the spatial random effect, ‘RISDM’ utilises the method of Lindgren et al. (2011), which further reduces the computational burden imposed by estimation of spatial random effects. It does so by providing a sparse representation of space, which enables larger models to be fitted more quickly but also necessitates the specification of a ‘mesh’ over which the spatial random effects are evaluated. ‘RISDM’ users who do not already have ‘INLA’ installed on their computer will need to install it by following the instructions at www.r-inla.org/download-install (accessed 2 November 2023).

## The ‘RISDM’ package

The ‘RISDM’ package is freely available from the first author's GitHub repository (https://github.com/Scott-Foster/RISDM). The version used to produce this document is 1.2.15. The latest version can be installed directly from that repository.
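A minimal installation sketch follows. It assumes that the `remotes` package is an acceptable route for installing from GitHub and that ‘INLA’ has already been installed using the instructions at www.r-inla.org/download-install; the repository path is the one given in the Data availability statement.

```r
# Assumes INLA is already installed (see www.r-inla.org/download-install)
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Install the development version from the first author's GitHub repository
remotes::install_github("Scott-Foster/RISDM")
```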

The ‘RISDM’ package exports the functions and methods listed in the Supporting information. All functions have a core set of mandatory arguments, with defaults provided for the others. Users may find the defaults adequate, but finer control can be taken by specifying any or all arguments. In the following section we illustrate the utility of the ‘RISDM’ package through a worked example.

### Synthetic example

We chose to analyse simulated sets of data (for PO, PA and AA) for illustration. The covariate data are real, and are taken on a 1 km grid over the Australian Capital Territory (ACT). The arbitrarily chosen covariates are presented in Supporting information and are the soil moisture at the root zone (SMRZ, Frost et al. 2018) and the average minimum monthly temperature (TEMP, Harwood et al. 2016). The sampling bias in the PO data follows the log of accessibility (lACC, Weiss et al. 2018; see also Supporting information). Accessibility was adjusted to be more appropriate to the Australian context, where population centres are often smaller and more dispersed. All of these covariates are part of the ‘RISDM’ package. The *simulateData.isdm()* function is used to construct these synthetic data.


Illustrative model constructs are displayed in the Supporting information. For real data, the following constructs would not be available: 1) the realised random effects, 2) the actual intensity surface that describes the distribution of the species, and 3) the biased intensity surface that includes the sampling effort bias. It is important to note that the intensity and the biased intensity are different, with the biased intensity being the product of the intensity and the observation process (a function of lACC in the simulation). Taken at face value, the biased intensity suggests that there are higher densities of the species in the north of the ACT, when there are not.

The coefficients for the SMRZ and TEMP variables are β = (0.5, −0.75)^{T}. No intercept is given, as *simulateData.isdm()* internally scales the intensity surface to the user-specified number of individuals across the study area. The bias model contains an intercept (−2), and the lACC coefficient is β_{1} = −0.75, indicating that far away places have less search effort. Using these values, the simulated data are presented in Supporting information.
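To see why no intercept needs to be supplied, the sketch below rescales an intensity surface so that it integrates to a chosen population size. This is only one way such a scaling could be done and is not the internal code of *simulateData.isdm()*; the covariate values, cell areas and `expected.N` are hypothetical, whereas the coefficient values are those stated above.

```r
set.seed(42)

# Hypothetical standardised covariates on 1000 cells of 1 km^2 each
n.cells   <- 1000
cell.area <- rep(1, n.cells)
SMRZ      <- rnorm(n.cells)
TEMP      <- rnorm(n.cells)
lACC      <- rnorm(n.cells)

# Distribution coefficients from the text; no intercept is supplied
beta     <- c(SMRZ = 0.5, TEMP = -0.75)
unscaled <- exp(beta[["SMRZ"]] * SMRZ + beta[["TEMP"]] * TEMP)

# Rescale so the surface integrates to the requested number of individuals
expected.N <- 5000                      # arbitrary illustrative value
scaling    <- expected.N / sum(unscaled * cell.area)
lambda     <- unscaled * scaling

# The scaling constant plays the role of the intercept on the log scale
implied.intercept <- log(scaling)

# Biased intensity: bias intercept of -2 and lACC coefficient of -0.75
lambda.biased <- lambda * exp(-2 - 0.75 * lACC)
```

The biased surface differs from `lambda` by a factor that depends only on lACC, which is the distinction between intensity and biased intensity drawn above.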

### An analysis mesh

The method of Lindgren et al. (2011) relies on creating a spatial mesh, which represents a reduced set of spatial nodes on which to perform the spatial calculations. The nodes must be specified and these choices can lead to changes in inference (Righetto et al. 2020, Dambly et al. 2023). Meshes with more nodes produce models with less bias and more accurate predictions (Righetto et al. 2020), whilst also reducing overly smooth spatial predictions (Dambly et al. 2023). The ‘INLA’ package (Rue et al. 2009) provides functions to generate meshes, but we think that a re-parameterisation removes a potential road-block for new or intermittent users. In the ‘RISDM’ function *makeMesh()*, the mesh creation task is primarily parameterised by the analyst's guess as to the likely effective range of the spatial random effect (the distance at which the expected correlation falls to a small value, approximately 0.1) and the number of nodes. For finer control, other arguments can also be given. If these are not supplied, then defaults are set using rules of thumb obtained largely from Bakka (2018) and Krainski et al. (2019).

A ‘good’ mesh is one that covers the region with triangles that are approximately equally sized and approximately equilateral (Krainski et al. 2019). Further, the edge lengths should be relatively small compared to the (posterior) range of spatial dependence; otherwise the model might erroneously report that there is no dependence in the data when there is. If the edge lengths are too small, however, then excessive computation may be needed. The first two conditions can be checked visually using a plot of the mesh (shown in Krainski et al. 2019) and also by looking at the distributions of triangle areas and angles. The ‘RISDM’ function *checkMesh()* produces these plots (see Supporting information for the ACT synthetic example, which we consider to be an adequate mesh, albeit quite coarse).
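The quantities behind such a check can be sketched in base R. The helper `triangleSummary()` and the two example triangles are hypothetical and do not use an ‘INLA’ mesh object; in practice the vertex coordinates would come from the mesh itself.

```r
# Internal angles (degrees), area and longest edge of a triangle,
# given a 3 x 2 matrix of vertex coordinates (hypothetical helper)
triangleSummary <- function(v) {
  a  <- sqrt(sum((v[2, ] - v[3, ])^2))   # side opposite vertex 1
  b  <- sqrt(sum((v[1, ] - v[3, ])^2))   # side opposite vertex 2
  cc <- sqrt(sum((v[1, ] - v[2, ])^2))   # side opposite vertex 3
  # Law of cosines for each internal angle
  angles <- acos(c((b^2 + cc^2 - a^2) / (2 * b * cc),
                   (a^2 + cc^2 - b^2) / (2 * a * cc),
                   (a^2 + b^2 - cc^2) / (2 * a * b))) * 180 / pi
  # Shoelace formula for the area
  area <- abs(v[1, 1] * (v[2, 2] - v[3, 2]) +
              v[2, 1] * (v[3, 2] - v[1, 2]) +
              v[3, 1] * (v[1, 2] - v[2, 2])) / 2
  list(angles = angles, area = area, max.edge = max(a, b, cc))
}

# A well-shaped triangle and a poorly shaped sliver
equilateral <- rbind(c(0, 0), c(1, 0), c(0.5, sqrt(3) / 2))
sliver      <- rbind(c(0, 0), c(1, 0), c(0.5, 0.05))

eq <- triangleSummary(equilateral)
sl <- triangleSummary(sliver)
```

All three angles of `eq` are 60 degrees, whereas `sl` has one very obtuse angle and a much smaller area for the same longest edge; a mesh with many triangles like `sl` would fail the first two conditions.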


### Posterior estimation

Once the mesh is created, the data and model can be combined to estimate parameter posteriors. For the species' distribution, we add SMRZ and TEMP as linear terms in the distributionFormula argument. All intercepts are data-type specific, and are included as part of the artefactFormulas or biasFormula arguments. There is no global intercept in the distribution model, as this would be collinear with the data-specific intercepts. The sampling bias formula, which describes the observation process for PO data, contains only log-accessibility as a covariate, to allow for probable search patterns. The simulated PA and AA data are from randomised surveys, and so the artefactFormulas list contains only an intercept term for each of PA and AA. The area sampled for each PA and AA observation is stored in the ‘transectArea’ column of the respective data set.
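The formula structure described above can be sketched with standard R formula objects. The object names mirror the argument names mentioned in the text, but the toy covariate values and the list element names `PA` and `AA` are assumptions, and the call to *isdm()* itself is deliberately not shown. The base-R function `model.matrix()` is used only to show which columns each formula generates.

```r
# Toy covariate values at three locations (illustrative only)
covar.df <- data.frame(SMRZ = c(0.2, -1.1, 0.7),
                       TEMP = c(1.3, 0.4, -0.6),
                       lACC = c(2.0, 3.5, 1.2))

# Distribution model: linear SMRZ and TEMP, no global intercept
distributionFormula <- ~ 0 + SMRZ + TEMP
# Sampling bias for PO data: intercept plus log-accessibility
biasFormula <- ~ 1 + lACC
# Sampling artefacts: intercept only for each survey data type
# (the list names PA and AA are assumed here)
artefactFormulas <- list(PA = ~ 1, AA = ~ 1)

X.dist <- model.matrix(distributionFormula, data = covar.df)
X.bias <- model.matrix(biasFormula, data = covar.df)
X.pa   <- model.matrix(artefactFormulas$PA, data = covar.df)

colnames(X.dist)   # "SMRZ" "TEMP"
colnames(X.bias)   # "(Intercept)" "lACC"
```

Because the distribution design matrix has no intercept column, each data type's own intercept is identifiable rather than collinear with a shared one.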

The model is completed with the specification of prior distributions for the parameters. By default, ‘RISDM’ standardises all covariates (after basis expansion), which can stabilise numerical issues and simplifies the specification of vague priors. For the synthetic data, we use the default priors for the fixed effects (zero mean, with a standard deviation of 1000 for intercepts and 10 for covariate effects). The function *isdm()* uses the complexity penalty priors (Simpson et al. 2017) for the hyper-parameters defining the distribution of the spatial random effect. In this example, we specify that the effective range has a probability of 0.1 of being less than 0.5 km; the distance was chosen based on the resolution of the raster. The prior for the spatial standard deviation is specified as having a probability of 0.1 of being greater than 2.
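The two probability statements translate into prior rate parameters as in the sketch below. It assumes the usual complexity penalty prior form for a two-dimensional Matérn field, as used by the ‘INLA’ function `inla.spde2.pcmatern()`: exponential on the standard deviation, and P(range < r) = exp(−rate × r^(−d/2)) on the range. Whether *isdm()* uses exactly this parameterisation internally is an assumption.

```r
# Prior statements from the text
range0 <- 0.5; p.range <- 0.1   # P(range < 0.5 km) = 0.1
sd0    <- 2;   p.sd    <- 0.1   # P(sd > 2) = 0.1
d      <- 2                     # spatial dimension

# The prior on the standard deviation is exponential; match its upper tail
rate.sd <- -log(p.sd) / sd0

# The prior on the range satisfies P(range < r) = exp(-rate.range * r^(-d/2))
rate.range <- -log(p.range) * range0^(d / 2)

# Confirm that the implied priors reproduce the stated probabilities
check.sd    <- pexp(sd0, rate = rate.sd, lower.tail = FALSE)
check.range <- exp(-rate.range * range0^(-d / 2))

# Implied prior medians, useful for judging how informative the priors are
median.sd    <- qexp(0.5, rate = rate.sd)
median.range <- (rate.range / log(2))^(2 / d)
```

Under these assumptions the prior median is roughly 0.6 for the standard deviation and roughly 1.7 km for the range, so the priors favour a smooth, low-variance field without excluding the values used in the simulation.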

The summary method for isdm objects aids interpretation of the model. For the model for the ACT synthetic data, the fixed effects suggest that the distribution of the species varies positively with SMRZ and negatively with TEMP. The credible intervals for these distribution parameters contain the simulated values (0.5 and −0.75). The lACC variable negatively affects observation bias, and the credible interval also contains the simulation value (−0.75). The spatial random effect has posterior estimates for standard deviation and range whose credible intervals cover the simulation values (0.5 and 5 km, respectively).


### Model diagnostics

Residual analysis provides an interpretation of the data through the lens of the model and can be used as an interpretative guide to model behaviour (Neter et al. 1996). Within ‘RISDM’, the model's fit to data from each data type is summarised using randomised quantile residuals (RQRs, Dunn and Smyth 1996). We note that RQRs should be normally distributed (Dunn and Smyth 1996) and should show no patterning through space or with covariates (Neter et al. 1996). Residuals for the model for the ACT synthetic data are given in the Supporting information. Note that the plot method calls the residuals method internally. For more detailed residual analysis, users should obtain the residuals from the residuals method directly.
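The construction of randomised quantile residuals for a count response can be sketched in base R as follows. The simulated means and counts are hypothetical, and the Poisson means are treated as known rather than estimated; this illustrates the Dunn and Smyth (1996) idea and is not the residuals method in ‘RISDM’.

```r
set.seed(7)

# Simulated counts and their (here, known) Poisson means
n  <- 500
mu <- exp(rnorm(n, mean = 0.5, sd = 0.5))
y  <- rpois(n, lambda = mu)

# For a discrete response, draw u uniformly between F(y - 1) and F(y),
# then map it to the normal scale
lower <- ppois(y - 1, lambda = mu)
upper <- ppois(y, lambda = mu)
u     <- runif(n, min = lower, max = upper)
rqr   <- qnorm(u)

# Under a correct model the residuals behave like standard normal draws
rqr.summary <- c(mean = mean(rqr), sd = sd(rqr))
```

The randomisation step removes the banding that discrete data would otherwise produce, which is why a normal quantile plot of `rqr` is a meaningful diagnostic even for presence-absence or low-count data.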


The residuals for all three data types show no obvious mean-variance relationship (Supporting information, centre panels) and they appear to be normally distributed (Supporting information, right panels). There does not appear to be spatial patterning for the PO residuals either (Supporting information, bottom left panel). It is not surprising that the residuals appear adequate for these data – they are synthetic data and the analysis model matches the simulation model. In real applications, to achieve a better model-data fit, users should try to follow the advice for simpler regression-type models (Neter et al. 1996). This includes plotting residuals against covariates to check functional forms, and plotting spatially to investigate spatial patterns.

### Posterior predictions

The approach to prediction in ‘RISDM’ is to sample from the approximate posterior distributions for the model's parameters and spatial random effects (as described in Krainski et al. 2019). Recall, from section “The integrated species distribution model in ‘RISDM’”, that there is a single model with a separate likelihood for each data type. This means that there is only one posterior for each parameter and so no combining of posteriors is required. We use 5000 posterior samples in the ACT synthetic example, but fewer could be used for testing to ease computation.

There are a number of choices to be made, including which intercept to use (either the AA, PA or PO intercept for the synthetic example) and which model components to include. The default method, used to produce Fig. 2, includes the spatial random effects but not the sampling bias or any sampling artefacts; these terms are not informative about the distribution, only about the way in which the species was observed. In Fig. 2, we present posterior predictions for the intensity of the point process (λ(**s**) in section “The integrated species distribution model in ‘RISDM’”). The posterior is summarised using its median and the lower and upper limits of its 95% credible interval. The maps of intensity predictions (Fig. 2) show that there are few areas where the model predicts confidently that the species is present, and that there are other locations where there is large uncertainty (wide intervals). Reassuringly, the patterns in the predicted maps match those used to simulate the data values themselves (compare Fig. 2 with Supporting information, centre bottom).
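The summarising step can be sketched in base R, given a matrix of posterior draws. The draws below are simulated stand-ins (rows are grid cells, columns are posterior samples) rather than output from *predict()*, and the matrix layout is an assumption.

```r
set.seed(3)

# Hypothetical posterior draws of log-intensity for 50 cells
n.cells    <- 50
n.samples  <- 5000
post.mean  <- rnorm(n.cells)
log.lambda <- matrix(rnorm(n.cells * n.samples, mean = post.mean, sd = 0.3),
                     nrow = n.cells, ncol = n.samples)
lambda.draws <- exp(log.lambda)

# Cell-wise posterior median and 95% interval limits
summ <- t(apply(lambda.draws, 1, quantile, probs = c(0.025, 0.5, 0.975)))
colnames(summ) <- c("lower", "median", "upper")

# Interval width as a simple, mappable measure of uncertainty
width <- summ[, "upper"] - summ[, "lower"]
```

Summarising on the intensity scale, after exponentiating each draw, keeps the median and interval limits coherent with one another; reducing `n.samples` speeds up testing at the cost of noisier interval limits.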


## Summary and discussion

Integrated species distribution models (ISDMs) are a recently emerged class of models that has considerable potential (Fletcher et al. 2019, Isaac et al. 2020). They use multiple sources of data and leverage the strengths of each type to mitigate potential and inherent bias. To facilitate fitting of ISDMs, we have introduced the R package ‘RISDM’, which was designed to provide a consistent and unified interface for a wide variety of data inputs and models whilst also providing a reasonably simple and intuitive argument structure. We rely heavily on the INLA (Rue et al. 2009) machinery to estimate the models and to predict from them.

It is our intention to expand the capabilities of ‘RISDM’ as needed. This includes, for example, accommodating additional data types (e.g. occupancy data, continuous measures, ordinal abundance and so on) and spatio-temporal random effects.

To cite ‘RISDM’ or acknowledge its use, cite this software note as follows, substituting the version of the application that you used for ‘version 1.0’:

Foster, S. et al. 2023. ‘RISDM': Species distribution modelling from multiple data sources in R. – Ecography 2023: e06964 (ver. 1.0).

### Acknowledgements

– We would like to thank Katherina Ng, Phil Tennant and Miles Keighley.

### Funding

– This work was part of The National Vertebrate Pests and Weeds Distribution project, which was funded by the Australian Government Department of Agriculture, Fisheries and Forestry's Established Pest Animals and Weeds Management Pipeline Program and Supporting Communities Manage Pests and Weeds Program.

### Author contributions

**Scott D. Foster:** Conceptualization (lead); Methodology (lead); Software (lead); Writing – original draft (lead); Writing – review and editing (lead). **David Peel:** Conceptualization (equal); Methodology (supporting); Writing – original draft (supporting); Writing – review and editing (supporting); Investigation (supporting); Software (supporting). **Geoffrey R. Hosack:** Conceptualisation (equal); Methodology (equal); Writing – original draft (supporting); Writing – review and editing (supporting). **Andrew Hoskins:** Conceptualisation (supporting); Methodology (supporting); Investigation (supporting); Writing – original draft (supporting); Writing – review and editing (supporting). **David J. Mitchell:** Data curation (equal); Project administration (equal); Writing – original draft (supporting); Writing – review and editing (supporting). **Kirstin Proft:** Data curation (equal); Project administration (equal); Writing – original draft (supporting); Writing – review and editing (supporting). **Wen-Hsi Yang:** Investigation (supporting); Software (supporting); Validation (supporting); Writing – review and editing (supporting). **David Uribe-Rivera:** Investigation (supporting); Software (supporting); Validation (supporting); Writing – review and editing (supporting). **Jens Froese:** Conceptualization (equal); Project administration (equal); Resources (equal); Supervision (equal); Writing – original draft (supporting); Writing – review and editing (supporting).

## Open Research

### Transparent peer review

The peer review history for this article is available at https://publons.com/publon/10.1111/ecog.06964.

### Data availability statement

The RISDM package contains all the base code used in this work and the data for the example. The version of the package used to create this document is https://zenodo.org/doi/10.5281/zenodo.10296063 (Yang and Foster 2023). The package will be updated into the future and future versions will be available from https://github.com/Scott-Foster/RISDM.