This vignette will demonstrate some of the basic functionality of the assignR package for inference of geographic origin from isotopic data, and show how these tools can be used to evaluate decisions made during the design of movement ecology projects. The vignette assumes you have the current release of assignR installed from the CRAN repository.
assignR assumes that the user will be calibrating an existing isoscape to represent tissue isotope values for their species of interest, not producing a new isoscape directly from tissue data. Isoscapes and known-origin tissue data can be input by the user directly or pulled from data objects included with the package. A range of data analysis workflows are supported, as introduced in the figure below and described in Ma et al., 2020.
Figure 1: Hypothetical workflows in assignR
A vignette demonstrating the full feature set of the package can be found here.
Goal - Demonstrate how assignR tools can be used to explore the sensitivity of your analysis results to different choices of training data.
Premise - Let’s assume that you are working on a project to assess the movement of Setophaga petechia, the American yellow warbler. Published known-origin data for this taxon exist and are included in the assignR database. But what if you wanted to collect and use additional data from a related species to add to this dataset or create a new training dataset specific to your project? Based on existing data, how reasonable would this taxonomic substitution be, and how much might it affect your results?
We’ll start by extracting the known-origin data for Setophaga petechia.
library(assignR)
td_pet = subOrigData(taxon = "Setophaga petechia", mask = naMap)
Are these samples representative of the species’ range? Do they sample the expected range of isotopic variability?
library(raster)
library(rgdal)
library(sp)
#prepare a precipitation H isoscape approximating breeding range
d2h_range = crop(d2h_world, naMap)
pet_range = readOGR("data_0.shp") #file obtained from source shown in plot below
pet_range = pet_range[pet_range$SEASONAL == 2,]
plot(d2h_range$mean)
plot(pet_range, border = "red", add = TRUE)
#The line above won't work unless you download the breeding range shapefile to your working directory
plot(td_pet$data, add = TRUE)
text(-165, 13, "BirdLife International and Handbook of the Birds of the World (2017).",
pos = 4, cex = 0.5, col = "red")
text(-165, 11, "The IUCN Red List of Threatened Species. Version 2021-1.", pos = 4,
cex = 0.5, col = "red")
text(-165, 9, "https://www.iucnredlist.org. Downloaded on 5-18-2021.", pos = 4, cex = 0.5,
col = "red")
Now we want to calibrate the precipitation isoscape using the training data.
d2h_pet = calRaster(td_pet, d2h_range)
##
##
## ---------------------------------------
## ------------------------------------------
## rescale function uses linear regression model,
## the summary of this model is:
## -------------------------------------------
## --------------------------------------
##
## Call:
## lm(formula = tissue.iso ~ isoscape.iso[, 1], weights = tissue.iso.wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -34.779 -7.267 -3.865 5.463 50.122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.79448 8.65936 -3.325 0.00234 **
## isoscape.iso[, 1] 0.58924 0.09197 6.407 4.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.47 on 30 degrees of freedom
## Multiple R-squared: 0.5778, Adjusted R-squared: 0.5637
## F-statistic: 41.05 on 1 and 30 DF, p-value: 4.495e-07
#for those new to RStudio and running this code, the function will produce a series
#of plots in your plot window...you can use the forward/back arrow buttons to view them
In our study we are considering using training data for a congeneric taxon, the American redstart, to supplement or in place of the yellow warbler data. We have access to a small number of these data, also:
td_rut = subOrigData(taxon = "Setophaga ruticilla", mask = naMap)
d2h_rut = calRaster(td_rut, d2h_range)
##
##
## ---------------------------------------
## ------------------------------------------
## rescale function uses linear regression model,
## the summary of this model is:
## -------------------------------------------
## --------------------------------------
##
## Call:
## lm(formula = tissue.iso ~ isoscape.iso[, 1], weights = tissue.iso.wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -37.209 -5.768 1.670 7.697 36.981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.0583 10.8103 -3.151 0.00419 **
## isoscape.iso[, 1] 0.5275 0.2069 2.550 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.94 on 25 degrees of freedom
## Multiple R-squared: 0.2064, Adjusted R-squared: 0.1747
## F-statistic: 6.502 on 1 and 25 DF, p-value: 0.01728
Notice that the S. ruticilla dataset has a somewhat different geographic distribution and its correlation with the precipitation isoscape values is weaker, but the regression relationship is not hugely different. We can make a direct comparison of the results:
plot(d2h_rut$isoscape.rescale$mean - d2h_pet$isoscape.rescale$mean, main = "Difference between isoscapes")
The next step in the assignR workflow is to evaluate the probability of one or more unknown samples having originated at any location shown on the tissue isoscape. Here we will use two hypothetical samples that are generated by adding a small random error to the tissue isoscape values.
unk_pts = data.frame(c(-80, -110), c(38, 54))
set.seed(1234) #ensures the same random numbers for each user
unk = extract(d2h_pet$isoscape.rescale$mean, unk_pts) + rnorm(2)
unk = data.frame("ID" = c("A", "B"), "d2H" = unk)
Now we’ll calculate the posterior probability map for each sample.
pd_pet = pdRaster(d2h_pet, unk)
We can compare these results with the probabilities that we would obtain using the alternative training data.
pd_rut = pdRaster(d2h_rut, unk)
pd_diff = pd_pet - pd_rut #calculates the difference between the probability maps
layout(matrix(c(1, 2), ncol = 2))
par(mar = c(4, 4, 1, 8))
plot(pd_diff[[1]], main = "A")
points(unk_pts[1,])
plot(pd_diff[[2]], main = "B")
points(unk_pts[2,])
Although probability maps like those just shown are very useful, workers often wish to ‘assign’ their unknown samples to a subregion of the study area. In assignR we can do this in one of two ways, illustrated by the two examples below.
#Assign each sample to the 20% of the study area that has the highest posterior probability values
pet_byArea = qtlRaster(pd_pet, 0.2)
#Assign each sample to an area which contains the highest probability grid-cells and includes 50% of total posterior probability
pet_byProb = qtlRaster(pd_pet, 0.5, thresholdType = "prob")
Again, we could compare with the analysis that used the congeneric training data.
rut_byArea = qtlRaster(pd_rut, 0.2, genplot = FALSE)
rut_byProb = qtlRaster(pd_rut, 0.5, thresholdType = "prob", genplot = FALSE)
layout(matrix(c(1, 2), ncol = 2))
par(mar = c(4,4,4,1))
plot(pet_byArea[[1]], col = c(rgb(0.05,0.05,0.05,0.1), rgb(0,0.6,0)),
main = "S. pet. by area", legend = FALSE)
points(unk_pts[1,], pch = 21, bg = "white")
plot(pet_byArea[[2]], col = c(rgb(0,0,0,0), rgb(0,0,1)), add = TRUE,
legend = FALSE)
points(unk_pts[2,], pch = 21, bg = "white")
plot(rut_byArea[[1]], col = c(rgb(0.05,0.05,0.05,0.1), rgb(0,0.6,0)),
main = "S. rut. by area", legend = FALSE)
points(unk_pts[1,], pch = 21, bg = "white")
plot(rut_byArea[[2]], col = c(rgb(0,0,0,0), rgb(0,0,1)), add = TRUE,
legend = FALSE)
points(unk_pts[2,], pch = 21, bg = "white")
layout(matrix(c(1, 2), ncol = 2))
par(mar = c(4,4,4,1))
plot(pet_byProb[[1]], col = c(rgb(0.05,0.05,0.05,0.1), rgb(0,0.6,0)),
main = "S. pet. by prob", legend = FALSE)
points(unk_pts[1,], pch = 21, bg = "white")
plot(pet_byProb[[2]], col = c(rgb(0,0,0,0), rgb(0,0,1)), add = TRUE,
legend = FALSE)
points(unk_pts[2,], pch = 21, bg = "white")
plot(rut_byProb[[1]], col = c(rgb(0.05,0.05,0.05,0.1), rgb(0,0.6,0)),
main = "S. rut. by prob", legend = FALSE)
points(unk_pts[1,], pch = 21, bg = "white")
plot(rut_byProb[[2]], col = c(rgb(0,0,0,0), rgb(0,0,1)), add = TRUE,
legend = FALSE)
points(unk_pts[2,], pch = 21, bg = "white")
Goal - Demonstrate the assignR QA tool, which supports quantification and comparison of the quality of isotope-based geographic assignments.
Premise - Before you embark on your Setophaga petechia project, you would like to figure out whether H or O isotope ratios are likely to be useful in your study. You have a specific goal for this work - to distinguish whether individuals were breeding north or south of the US/Canada border. Although you realize that you won’t be able to confidently answer this question for all individuals, for the method to be useful you need to be able to confidently exclude one country or the other (~50% of the study area) for many individuals. Other researchers have used H isotopes in previous work, are they likely to work for you? What about O isotopes, are they likely to work better?
The assignR function QA is designed to help address such questions by using known origin data to estimate the quality and strength of assignment results generated using a given baseline isoscape and training data set. It re-samples the known origin data iteratively to create multiple training and validation datasets, assigns the validation data, and evaluates the results relative to the known locations of origin.
QA_H = QA(td_pet, d2h_range, valiStation = 3, name = "H")
## Warning: CRS object has comment, which is lost in output
## known was reprojected
plot(QA_H)
The resulting plots offer several different summaries of the probability and assignment results, but most relevant for the question we’ve framed above are the first and third. The first plot summarizes the distribution of posterior probabilities over the analysis area…the more uneven the distribution (the more there are areas of very high and very low probability) the more geographically specific we expect our result to be. Looking at the plot, we would expect to be able to exclude 50% of the study area (0.5 on the y-axis) a little more than 90% of the time (0.9 on the x-axis). The third plot can be read in a similar way, but is based on the assignment results for the validation samples rather than the model-estimated probabilities. It confirms the probability-based estimate: we are actually able to ‘correctly’ exclude 50% of the area (0.5 on the x-axis) more than 90% of the time.
Is this good enough for your study? You have to decide.
Let’s compare this with the results for O isotope-based assignment.
td_pet_O = subOrigData(taxon = "Setophaga petechia", marker = "d18O",
ref_scale = "VSMOW_O", mask = naMap)
FALSE mask was reprojected
FALSE 11 samples are found from 7 sites
FALSE 11 samples from 7 sites in the transformed dataset
Notice we have many fewer data, which may impact our results. We will only hold back validation data from 1 station at a time (~10% of the data), similar to the proportion we used for the H analysis.
QA_O = QA(td_pet_O, d18o_world, mask = naMap, valiStation = 1, name = "O")
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Unknown based on WGS84 ellipsoid in Proj4
## definition
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Unknown based on WGS84 ellipsoid in Proj4
## definition
## Warning: CRS object has comment, which is lost in output
## known was reprojected
plot(QA_H, QA_O)
FALSE NULL
Using the same logic discussed above, these results suggest that isotopes of H provide the stronger geographic constraints and are more likely to be useful in our research than those of O. We’d have to look deeper to fully understand what we are seeing…does O actually carry less geographic information (there are studies that have suggested this is the case for some wildlife systems) or is the result a function of the lower number of known origin training data for O? We could directly assess the latter possibility by extracting H from only the sites/samples for which we have O data and re-running this analysis. Moreover, we could use similar thinking to determine the implications of other project design decisions (e.g., how does assignment quality change if we have fewer training data, or if their geographic distribution is more limited?).
Although assignR is designed primarily to help you analyze and make interpretations from your own study data, it can also be used to experiment with different study design decisions and assess their likely impact on your work before you launch your project. This can be done using known origin data from the database distributed with the package, published or pilot data you import yourself, or even synthetic data that you produce to represent expected or idealized properties of the data you expect for your study system. Here we’ve introduced two basic approaches - experimenting with the probability-of-origin or assignment process manually and qualitatively or quantitatively assessing the results obtained under different conditions, or using the QA tool to automate this process and produce and compare summary metrics.
As with any software, you will likely discover things that you wish you could do with assignR but are not currently supported. The package is being developed as open source software. You are always welcome to modify the underlying code to achieve a new goal that better suits your needs. If you do come up with an idea or modification that you think might be valuable to others, please either provide a suggestion using the “Issues” feature at our GitHub repository or make a pull request and contribute your new code to the project!