Last updated: 2022-01-20

Checks: 6 1

Knit directory: Barley1kGBS_Proj/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you'll want to first commit it to the Git repo. If you're still working on the analysis, you can ignore this warning. When you're finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it's best to always run the code in an empty environment.

The command set.seed(20210831) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b34d099. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/GenomicVarPartitioning_cache/

Untracked files:
    Untracked:  VennDiagram2022-01-20_09-51-27.log

Unstaged changes:
    Modified:   analysis/GenomeEnvAssociation.Rmd
    Modified:   analysis/PreprocessEnvData.Rmd
    Modified:   analysis/ResistanceSurface.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ResistanceSurface.Rmd) and HTML (public/ResistanceSurface.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 5e688f9 Che-Wei Chang 2022-01-18 Build site.
html b38dcf2 Che-Wei Chang 2022-01-18 Build site.
html 5f53139 Che-Wei Chang 2022-01-18 Build site.
Rmd 597a627 Che-Wei Chang 2021-11-30 update analyses. unPC, ResistanceGA and GO enrichment
html 597a627 Che-Wei Chang 2021-11-30 update analyses. unPC, ResistanceGA and GO enrichment

Introduction

In this analysis, we aim to answer that whether the resistance of geography can explain genetic differentiation better than geographical distance. To do so, we follow the framework of Peterman et al. (2014). This analysis can be conducted by ResistanceGA package (https://github.com/wpeterman/ResistanceGA).

About ResistanceGA

ResistanceGA is used to optimize the transformation between landscape features and resistance surfaces. In brief, it optimizes the parameters of the transformation equation with the genetic algorithm which takes genetic distances and resistance distances as response variables and independent variables respectively. After optimization, it takes transformed values and genetic distances to fit a linear model in order to evalute relative support of resistance surfaces.

Data preparation

Genetic distances between populations

To begin with, we need the genetic distances between accessions from pairs of collection sites.

Genetic distance based on shared allele proportion

Considering that the number of polymorphic loci can be very different among pairs of comparison, intead of \(F_{ST}\), we would use the 1 - proportion of shared alleles (\(D_{PS}\); Bowcock et al. 1994) to represent the genetic distances between samples from a pair of collection sites. Peterman et al. (2014) also preferred this approach for calculating genetic distances:

... Because Dps is a genetic measure with fewer assumptions than Fst, we focus on the results from Dps in the rest of the paper. ...

However, Bowcock et al. (1994) used this approach to measure genetic distances between individuals not populations.

Originally, \(D_{PS}\) is calculated based on the number of shared alleles summed over loci/(2 \(\times\) number of loci compared):

\(1-\frac{\Sigma N_{shared}}{2\times N_{total}}\),

where \(N_{shared}\) represents the number of shared alleles at each locus, and \(N_{total}\) is the total number of loci. We may define the shared allele proportion as the minina of allele frequency of two populations as in the function pairwise.propShared in PopGenReport (Gruber 2019). For example, given a biallelic locus with \(A_1\) allele frequency 0.9 in pop1 and 0.4 in pop2, the shared proportion of \(A_1\) at this locus would be 0.4.

Because two of 58 collection sites have only one accession, pairwise.propShared would run into errors of incompatible data structure.

To compute \(D_{PS}\) between populations, we define a function cal.pairwise.Dps which takes genind object as an input. The genind object saves two allele genotypes of a locus in two separate columns. This kind of structure is more convenient for our computation.

cal.pairwise.Dps <- function(gi){
  if(is.null(pop(gi))){stop("Please assign genotypes to corresponding populations.")}
  message("transforming data...")
  pops <- seppop(gi)
  g.list <- lapply(pops, FUN = function(x){
    apply(x@tab, 2, function(y){
      if(all(is.na(y))){
        return(NA)
      }else{
        # calculate the mean allele count by loci
        out <- mean(y, na.rm = TRUE)
        return(out)
      }
    }) # apply end
  }) # lapply end
  g.matrix <- matrix(unlist(g.list), ncol = length(g.list))
  colnames(g.matrix) <- names(g.list)
  #rownames(g.matrix) <- colnames(pops$Group1$tab)
  pairs.comb <- combn(1:ncol(g.matrix), 2)
  rm(pops,g.list)
  message("calculating distance...")
  out.matrix <- matrix(NA, ncol = ncol(g.matrix), nrow = ncol(g.matrix))
  rownames(out.matrix) <- colnames(out.matrix) <- colnames(g.matrix)
  for(i in 1:ncol(pairs.comb)){
    sel1 <- pairs.comb[1,i]
    sel2 <- pairs.comb[2,i]
    p <- g.matrix[,c(sel1, sel2)]
    # shared proportion based on the minima allele frequency -> see 
    min.af <- 
      apply(na.omit(p), 1, function(x){
        return(min(x))
    })
    d <- mean(min.af, na.rm = T)
    out.matrix[sel1, sel2] <- d
    out.matrix[sel2, sel1] <- d
  }
  diag(out.matrix) <- 1
  out.matrix <- 1 - out.matrix
  return(as.dist(out.matrix))
} # cal.pairwise.Dps end

Calculate genetic distance

source("./code/Rfunctions.R")

library(vcfR)
library(adegenet)
vcf <- read.vcfR("./data/GBS_244accessions_LDpruned_19601SNP.vcf.gz", verbose = F) # load VCF
grp.index <- sapply(strsplit(colnames(vcf@gt[,-1]), split = "_"), function(x){x[1]}) # get subpops

gind <- vcfR2genind(vcf) # convert VCF to genind
pop(gind) <- as.factor(grp.index)

Dps <- cal.pairwise.Dps(gi = gind)
saveRDS(Dps, "./output/Dps_wild_barley.RDS")

Geographic coordinate

To run ResistanceGA, we also need geographical coordinates.

env <- read.table("./data/RawEnv_244accessions.tsv", header = T)
Dps <- readRDS("./output/Dps_wild_barley.RDS")

# get the central coordinates of 58 groups (collection sites) -> x in the 1st col and y in the 2nd col
centroid <- 
  data.frame(
    Lon = tapply(env$Longitude, INDEX = as.factor(env$GeoGroup), mean),
    Lat = tapply(env$Latitude, INDEX = as.factor(env$GeoGroup), mean)
  )

# sorting the coordinates to match the genetic distance matrix
centroid <- centroid[match(colnames(as.matrix(Dps)) ,rownames(centroid)),]

write.table(centroid, file = "./output/Geocoord_pop.csv", sep = ",", 
            col.names = TRUE, row.names = TRUE)

Geographic data

To calculate resistance distances, we considered three landscape features:

  • elevation
  • slope
  • water body

Elevation

To produce a resistance surface of geography, we take the elevation data (.tif) from the SRTM database (https://srtm.csi.cgiar.org/).

  • ISR_SRTM_GeoElevation.tif

Slope

Slopes were calculate based on the elevation data by using terrain function in the raster.

Water body

The data of water body is from Global Surface Water(Pekel et al. 2016; doi:10.1038/nature20584).

water <- raster("./data/occurrence_30E_40N_v1_1.tif")
water <- crop(x = water, y = elev@extent) # match raster 
writeRaster(water, filename = "./output/water_body_SLevant.tif")

Run ResistanceGA

To evaluate geographical resistance, we would need Circuitscape(McRae et al.) to identify paths with the least resistance and calculate corresponding resistance distances. The Circuitscape package in Julia (https://docs.circuitscape.org/Circuitscape.jl/latest/) was published recently and it is used by ResistanceGA. Considering the considerable computational resources may required, we won't run ResistanceGA on a personal computer.

Installation of the required packages

## IMPORTANT NOTE:
# Because our server has only C++11, some errors happened when I tried to install "ResistanceGA". Specifically, the latest "ResistanceGA" (ver 4.2) requires the latest "GA" package (>= 3.2.1) that depends on the latest "Rcpp" requiring C++14. However, "Rcpp" and other R packages installed on our server are all depend on C++11. To avoid messing up everything, a less risky solution is to force "ResistanceGA" to accept an old version of "GA" (ver 3.2; published in 2019 Jan).
# This approach should not influence the function of "ResistanceGA" because it has been published in 2018 and we let it depend on "GA" published in 2019.
# To do so, I download .zip from the github of "ResistanceGA" and modify the `DESCRIPTION` file. Then, I build the R package locally with the command below:

#devtools::build(path = "~/barley/temp_ResistGA/", pkg = "~/barley/temp_ResistGA/ResistanceGA", vignettes = FALSE) -> make a local package that has a requirement of GA with only >=3.0.2

# After building the package, I upload `ResistanceGA_4.2.tar.gz` to our server and install it. You can find the command used to install the package below

## Install packages on our server
install.packages("devtools", lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")
https://cran.r-project.org/src/contrib/Archive/spdep/spdep_1.0-2.tar.gz
install.packages("https://cran.r-project.org/src/contrib/Archive/terra/terra_0.5-2.tar.gz", lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")
install.packages("https://cran.r-project.org/src/contrib/Archive/raster/raster_2.7-15.tar.gz", lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")
install.packages("https://cran.r-project.org/src/contrib/Archive/GA/GA_3.2.tar.gz", lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/") # install an old version -> we don't have CXX14 and I don't want to bother to install local g++... 
install.packages("https://cran.r-project.org/src/contrib/Archive/nloptr/nloptr_1.0.4.tar.gz", lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")
install.packages("https://cran.r-project.org/src/contrib/Archive/lme4/lme4_1.1-14.tar.gz", lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")
install.packages("https://cran.r-project.org/src/contrib/Archive/htmltools/htmltools_0.5.2.tar.gz", lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")
install.packages("sass", lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")
install.packages("https://cran.r-project.org/src/contrib/Archive/rgdal/rgdal_1.2-8.tar.gz",
                 lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")
install.packages("https://cran.r-project.org/src/contrib/Archive/JuliaCall/JuliaCall_0.17.1.tar.gz",repos = NULL,
                 lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")

.libPaths( c( "/work/workspaces/chewei/bin/R/R-3.5.2/library/" , .libPaths() ) )
library(withr)
library(devtools)
## install ResistanceGA that accepts a lower version of GA
install.packages("ResistanceGA_4.2.tar.gz", repos = NULL, type="source",
                 lib="/work/workspaces/chewei/bin/R/R-3.5.2/library/")

ResistanceGA is able to take multiple surfaces to form a composite resistance surface. We would take four geographic data (elevation, slope, and water body) as inputs to generate such a composite resistance surface and calculate resistance distances of geography accordingly. The codes below are all executed on our server.

Because the ResistanceGA with Circuitscape would run into error with unknown reasons on our server, we decided to use the communte distance method, which is an alternative method provided in ResistanceGA package.

  • Before running the analysis, we reduced the resolution of raster objects by using aggregate.
  • We selected Inverse Ricker and Inverse-Reverse Monomolecular transformation to convert elevation and slope to continuous resistance surfaces respectively.
  • We treated water body as categorical resistance surface
  • all_comb function was used to test all possible combination of given landscape features (elevation, slope and water body)
  • Resist.boot was used to perform bootstrap analyses.
# set working directory
setwd("/work/workspaces/chewei/Formal_analysis_B1K+IL+IPK/ResistGA/")

# path to local R library
.libPaths( c( "/work/workspaces/chewei/bin/R/R-3.5.2/library/" , .libPaths() ) )

# load packages
library(ResistanceGA)
library(raster)

gen.dist <- readRDS("Dps_wild_barley.RDS")
geo.coord <- read.csv("Geocoord_pop.csv", header = TRUE)
elev <- raster("ISR_SRTM_GeoElevation.tif")
slope <- raster("slope_SLevant.tif")
water <- raster("water_body_SLevant.tif")

# the extent of water body tif is slightly smaller than others, so we use `crop` to make all tif objects have the same extent
elev <- crop(x = elev, y = extent(water))
slope <- crop(x = slope, y = extent(water))

# because the resolution of water body tif is different from others', we use `projectRaster` to reduce its resolution.
water <- projectRaster(from = water, to = elev)
values(water)[values(water) > 0] <- 255 # force all water body become a single category


# to reduce the required compuational resources, we use `aggregate` to reduce the resolution of tif
f = 20
elev <- aggregate(elev, fact = f, fun = mean)
slope <- aggregate(slope, fact = f, fun = mean)
water <- aggregate(water, fact = f, fun = mean)
values(water)[values(water) > 255/2] <- 255
values(water)[values(water) <= 255/2] <- 0
values(water)[is.na(values(water))] <- NA

# combine three raster into one
r.stack <- stack(elev, slope, water)

dir.create("./all.comb") # create a folder to save outputs

GA.input <- GA.prep(ASCII.dir = r.stack,
                    Results.dir = "all.comb",
                    parallel = 48,
                    select.trans = list(8, # Inverse Ricker transformation
                                        1, # Inverse-Reverse Monomolecular transformation
                                        NA) # no transformation for categorical data
)

gdist.input <- gdist.prep(n.Pops = nrow(geo.coord),
                    response = as.vector(gen.dist),
                    samples = SpatialPoints(geo.coord)
)
MS.output <- all_comb(gdist.inputs = gdist.input,
                      GA.inputs = GA.input,
                      results.dir = "/work/workspaces/chewei/Formal_analysis_B1K+IL+IPK/ResistGA/all.comb/")


saveRDS(MS.output, "ResistGA_gdist_wildbarley_geo_output.RDS")
save(GA.input, gdist.input,file = "./ResistGA_wildbarley_geo_input.Rdata")

MS.output <-readRDS("ResistGA_gdist_wildbarley_geo_output.RDS")

resist.mat <- MS.output$all.cd
k <- MS.output$all.k$k

resp <- as.matrix(gen.dist)
colnames(resp) <- rownames(resp) <- NULL
library(plyr) # required for bootstrap function

# bootstrapping with model ranking based on R^2
R2.boot <- Resist.boot(mod.names = names(resist.mat),
                        dist.mat = resist.mat,
                        n.parameters = k,
                        sample.prop = 0.75,
                        iters = 1000,
                        rank.method = "R2",
                        obs = nrow(resp),
                        genetic.mat = resp)

write.csv(R2.boot, file = "ResistGA_gdist_wild_barley_geo_bootstrap_output.csv",
          row.names = F)

quit("no")

Trouble shooting

Here are some notes regarding the errors we confronted when running ResistanceGA. We hope the information below would be helpful for someone who is also interested in using ResistanceGA.

Problem with libstdc++.so.6

When running ResistanceGA with Circuitscape in Julia, an error happened:

/lib64/libstdc++.so.6: versionGLIBCXX_3.4.21' not found`

This error is likely due to that the functions of JuliaCall cannot find the needed libstdc++ version. A solution is to export R_LD_LIBRARY_PATH in bash (see https://github.com/Non-Contradiction/JuliaCall). For example:

export R_LD_LIBRARY_PATH=/work/workspaces/chewei/julia/julia-1.6.3/lib/julia

Problem with Circuitscape

If you install the Circuitscape.jl of version <5.8.3, you may run into an error:

MethodError: no method matching connected_components(::SimpleWeightedGraphs.SimpleWeightedGraph{Int32,Float64})

This is due to a bug in Circuitscape.jl generating when it moved from LightGraphs to Graphs. There are some Julia codes do not adapt to this change in the version <5.8.3.

To solve this issue, you need to install the latest version. It can be achieve by sepcifying the package version, like:

Pkg.add(Pkg.PackageSpec(;name="Circuitscape", version="5.9.1"))

and don't forget to test the package to ensure there is no error (run Pkg.test("Circuitscape")).

Apart from this, you may get errors if there is any required Julia package is not installed or not with an appropriate version. In this case, you need to respectively look at the error messages and install each package with Pkg.add() in Julia (don't forget to test packages with Pkg.test after installation).

When I ran my analysis, there was an error always happen whenever I tried to conduct parallel compuation:

Error in unserialize(socklist[[n]]) : error reading from connection

I used the example data from the ResistanceGA package to do some tests. It seems this error only happen if I try to run ResistanceGA with Circuitscape.jl in parallel. Therefore, I switch to using communte distance, which is another approach available in ResistanceGA.

Memory requirement for computation

When performing parallel computation, users may confront error messages as below:

Error in unserialize(socklist[[n]]) : error reading from connection Calls: all_comb ... recvOneData -> recvOneData.SOCKcluster -> unserialize

This error could result from the shortage of memory if input data (raster) have high resolution. If one want to carry out parallel computation, decreasing the resolution of raster is needed. Therefore, we used aggregate function of the raster package.

Results of ResistanceGA

The final output of ResistanceGA framework includes a summary table that displays statistics to evaluate model fit, such as AIC or R2.

I used all_comb function to let ResistanceGA carry out optimization for all combinaiton of input landscape features. To further compare the model fit, Resist.boot function was used to conduct bootstrapping with 1,000 iterations. I considered AIC and R2 as the criteria for ranking models.

In the tables, \(n\) is the iteration number that a model is suggested to have the best fit. We can find that a model with the resistance surface generated by combining elevation and slope is the best if considering R2 as a selection criterion. If AIC is used as a selection criterion, a model taking water bodies to calculate resistance distances is suggested to be the best model.

Bootstrap analysis based on R2

In the tables below, the leftmost column shows resistance surfaces used in linear mixed effect models. "avg.AIC" is the average Akaike information criterion. "avg.\(R_m^2\)" is the average marginal \(R^2\). "avg.LL" is the average log-likelihood. "n" and "Percent.top" is the number and percentage of iterations that a model is selected as the best-supported model. "k" is the number of parameters fitted in a model.

surface avg.AIC avg.R2m avg.LL n Percent.top k
Elev.Slope -5725.286 0.5083 2869.643 1000 100 7
Elev.Slope.Water -5734.975 0.4414 2876.487 0 0 9
Slope.Water -5736.288 0.3311 2874.144 0 0 6
Elev -5741.162 0.3159 2874.581 0 0 4
Water -5783.527 0.3042 2894.764 0 0 3
Elev.Water -5739.697 0.3010 2875.849 0 0 6
Distance -5753.420 0.2113 2878.710 0 0 2
Slope -5730.958 0.2074 2869.479 0 0 4

Bootstrap analysis based on AIC

surface avg.AIC avg.R2m avg.LL n Percent.top k
Water -5783.397 0.3055 2894.698 968 96.8 3
Distance -5752.230 0.2115 2878.115 9 0.9 2
Elev -5739.792 0.3169 2873.896 7 0.7 4
Elev.Water -5738.073 0.3014 2875.036 12 1.2 6
Elev.Slope.Water -5732.737 0.4412 2875.369 0 0.0 9
Slope.Water -5733.853 0.3317 2872.926 4 0.4 6
Slope -5729.806 0.2077 2868.903 0 0.0 4
Elev.Slope -5723.318 0.5080 2868.659 0 0.0 7

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.36      workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       whisker_0.4      magrittr_2.0.1   R6_2.5.1        
 [5] rlang_0.4.12     fastmap_1.1.0    fansi_0.5.0      highr_0.9       
 [9] stringr_1.4.0    tools_3.6.0      xfun_0.27        utf8_1.2.2      
[13] git2r_0.26.1     htmltools_0.5.2  ellipsis_0.3.2   rprojroot_1.3-2 
[17] yaml_2.2.1       digest_0.6.28    tibble_3.1.5     lifecycle_1.0.1 
[21] crayon_1.4.1     later_1.3.0      vctrs_0.3.8      promises_1.2.0.1
[25] fs_1.5.0         glue_1.4.2       evaluate_0.14    rmarkdown_2.3   
[29] stringi_1.7.5    compiler_3.6.0   pillar_1.6.4     backports_1.1.5 
[33] httpuv_1.6.3     pkgconfig_2.0.3