Last updated: 2022-01-20

Checks: 6 1

Knit directory: Barley1kGBS_Proj/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you'll want to first commit it to the Git repo. If you're still working on the analysis, you can ignore this warning. When you're finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it's best to always run the code in an empty environment.

The command set.seed(20210831) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b34d099. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/GenomicVarPartitioning_cache/
    Ignored:    analysis/PreprocessEnvData_cache/

Unstaged changes:
    Modified:   analysis/GenomeEnvAssociation.Rmd
    Modified:   analysis/PreprocessEnvData.Rmd
    Modified:   analysis/ResistanceSurface.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/GenomeEnvAssociation.Rmd) and HTML (public/GenomeEnvAssociation.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 5e688f9 Che-Wei Chang 2022-01-18 Build site.
html b38dcf2 Che-Wei Chang 2022-01-18 Build site.
html 5f53139 Che-Wei Chang 2022-01-18 Build site.
Rmd 597a627 Che-Wei Chang 2021-11-30 update analyses. unPC, ResistanceGA and GO enrichment
html 597a627 Che-Wei Chang 2021-11-30 update analyses. unPC, ResistanceGA and GO enrichment
html bc726c2 Che-Wei Chang 2021-10-22 Build site.
html 9429ae7 Che-Wei Chang 2021-10-22 Build site.
Rmd a0a9f40 Che-Wei Chang 2021-10-22 wflow_publish("./analysis/GenomeEnvAssociation.Rmd")
html 1202876 Che-Wei Chang 2021-10-22 Host with GitLab.
Rmd 7c6b41a Che-Wei Chang 2021-10-22 wflow_publish(list.files("./analysis", pattern = "*.Rmd", full.names = T))

Introduction

To identify genomic regions that potentially under environmental selection, we perform genome scan with genome-environment association (GEA) approaches and also an outlier method.

Here, we use three GEA approaches, including (1) simple RDA (2) partial RDA controlling population structure, and (3) LFMM (Caye et al. 2019). Two RDA approaches assumes that environmental selection gradients do not result from single environmental variables but a combination of environmental variables. In contrast, LFMM is a single-variate approach that consider one environmental variable at a time.

Genomic scan with RDA

To know more details of RDA approach, please check Lasky et al. (2012) and Forester et al. (2018). We use the statistical framework proposed by Capblancq et al. (2018) to carry out statistical tests for RDA.

Briefly, in the framework of Capblancq et al. (2018), we first select the most informative K axes of the RDA according to the plot of the explained variation with the "elbow" method. Subsequently, Mahalanobis distance is calculated for each SNP locus (adaptive SNPs are expected to diverge from the majority of SNPs, so they should have large Mahalanobis distances). To obtain p-values, Mahalanobis distances are divided by the genomic inflation factor to approximate a chi-squared distribution with K degree of freedom (see Luu et al. 2017). The genomic inflation factor is a constant defined as he median of the Mahalanobis distances divided by the median of the chi-square distribution with K degrees of freedom (Luu et al. 2017). Finally, statistical tests can be carried out based on the chi-squared distribution with K degree of freedom.

About XtX and BAYPASS

For the outlier method, we calculated XtX statistic (Günther and Coop 2013) that accounts for the covariance between populations by using BAYPASS (Gautier 2015). The significant threshold was determined with the 99.5% quantile of pseudo-observed data (Gautier 2015).

Load data for GEA

rm(list = ls())
source("./code/Rfunctions.R")

library(vcfR)
library(vegan)
 
# load genotypic data (244 accessions with 27,147 SNPs imputed by Beagle 5 with defaults)
geno <- read.vcfR("./data/GBS_244accessions_27147SNP_Beagle5Imp.vcf.gz", verbose = F)
Y <- t(vcf_to_nummatrix(geno))
colnames(Y) <- geno@fix[,"ID"]


CHR <- as.numeric(geno@fix[,"CHROM"])
POS <- as.numeric(geno@fix[,"POS"])

# load environmental data (244 accessions with 12 environmental variables)
env <- read.delim("./data/Env_data_244accessions_12Env_variables.tsv", header = T)
predictor <- as.data.frame(scale(env)) # standardize data

# load ancestry coefficients (K=4) estimated with ALStructure (https://doi.org/10.1534/genetics.119.302159)
alsQ <- t(readRDS("./data/Ancestry_coeff_244accessions_ALStructure_K4.RDS")$Q_hat)


# perform PCA
vcf <- read.vcfR("./data/GBS_244accessions_LDpruned_19601SNP.vcf.gz", verbose = F) # this vcf is LD prunned and unimputed, which is for population structure analysis.
Xmat <- vcf_to_nummatrix(vcf = vcf)
impX <- t(apply(Xmat,1,function(x){
  x[is.na(x)] <- mean(x, na.rm = T) 
  return(x)
}))

barley.pc <- prcomp(t(impX))

RDA approaches

Simple RDA

rda.snp.env <- rda(as.formula(paste("Y ~ ",paste(colnames(predictor), collapse = " + "), sep = "")), data = predictor)

Partial RDA controlling population structure

rda.snp.env.alsQ <- rda(Y ~ as.matrix(predictor) + Condition(as.matrix(alsQ)), scale = F)
PCs <- barley.pc$x[,1:3] # choose the first three PC axes to represent population structure
rda.snp.env.pc <- rda(Y ~ as.matrix(predictor) + Condition(as.matrix(PCs)), scale = F)

Selection of K

To determine the number of RDA axes for calculating Mahalanobis distance, we make a plot showing the proportion of explained variation of each axis. As there is no very conspicuous "elbow" in the plot, I choose K=4, which is the least K in the interval where "elbow" appears in the curve of the simple RDA.

library(ggplot2)
dat <- data.frame("K"=as.factor(rep(1:12, times = 3)),"Exp_prop" = c(summary(rda.snp.env)$concont[[1]][2,], summary(rda.snp.env.alsQ)$concont[[1]][2,], summary(rda.snp.env.pc)$concont[[1]][2,]
                                                                     ), "model" = rep(c("Simple_RDA", "RDA_Als", "RDA_PC"), each = 12))

ggplot(dat, aes(x=K, y=Exp_prop, group=model)) +
  geom_line(aes(linetype=model)) + 
  geom_point() +
  ylab("Explained proportion")

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
bc726c2 Che-Wei Chang 2021-10-22

Statistical test

To conduct statistical tests, we use the R function rdadapt (see github.com/Capblancq/RDA-genome-scan). This function returns both p-values and q-values.

# load libraries required for rdadapt
library(robust)
library(qvalue)

k = 4 # number of RDA axes to retain
rda.simple.pq <- rdadapt(rda.snp.env, K = k)
rda.partial.alsQ.pq <- rdadapt(rda.snp.env.alsQ, K = k)
rda.partial.pc.pq <- rdadapt(rda.snp.env.pc, K = k)

Differences between using ancestry coefficients and PCs

As we use two approaches to condition an RDA model on population structure (ancestry coefficients and PC scores), it is interesting to look at how different between the results of them.

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
bc726c2 Che-Wei Chang 2021-10-22

We further zoom-in to those SNPs with q-values < 0.05.

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
bc726c2 Che-Wei Chang 2021-10-22

Here we check how many SNPs are significant in a partial RDA model but not significant in another.

# the number of SNPs significant in the RDA conditioned on ancestry coefficients but not on PCs
sum(rda.partial.alsQ.pq$q.values <= 0.05 & rda.partial.pc.pq$q.values > 0.05)
[1] 7
# the number of SNPs significant in the RDA conditioned on PCs but not on ancestry coefficients
sum(rda.partial.alsQ.pq$q.values > 0.05 & rda.partial.pc.pq$q.values <= 0.05)
[1] 19

By comparing the results of two approaches, we confirmed there is only very little difference between conditioned on ancestry coefficients and PC scores. A possible explanation for this is that ALStructure is a method bridging PCA with the model-based population structure analysis (like STRUCTURE and ADMIXTURE). For the downstream analysis, we would only show the results of RDA conditioned on ancestry coefficients.

We take FDR = 0.05 as a significance level for statistical tests.

Simple RDA

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Partial RDA controlling population structure

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

LFMM

library(lfmm)
fit.lfmm <- lfmm_ridge(Y = Y, X = predictor, K = 4)

pv <- lfmm_test(Y = Y, X = predictor, lfmm = fit.lfmm, 
                calibrate = "gif")

pvalues <- pv$calibrated.pvalue 

# check if the distribution of p-value agree with the FDR assumption
hist(as.vector(pvalues), xlab = "Adjusted P-values", main = "Distribution of Adjusted P-values of LFMM (K = 4)")

Version Author Date
1202876 Che-Wei Chang 2021-10-22
# calculate q-value for each GEA test respectively
qvalues <- 
  apply(pvalues,2, function(x){
    qvalue(x)$qvalues
  })

rownames(qvalues) <- rownames(pvalues)
colnames(qvalues) <- colnames(pvalues)

Number of significant SNPs

For all environmental variables

There are 307 significant SNPs in total for all environmental variables.

sum(apply(qvalues, 1, function(x){any(x < 0.05)})) 
[1] 307

For each environmental variable

We further look at the number of significant SNPs for each environmental variable.

Env Number_of_SNPs
Latitude.Rain.Solar_rad 4
Aspect 0
Slope 0
Elevation.Temperature 108
Soil_bulk_density 0
Soil_water_capacity 1
Soil_carbon_content.0.15cm. 11
Soil_carbon_content.30cm. 0
Soil_pH 2
Soil_silt_content 0
StdDev_Temperature 179
CoefVar_Rain 12

Manhattan plots of LFMM

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30
bc726c2 Che-Wei Chang 2021-10-22

BAYPASS

Preparation of genotypic data

The data of allele counts was prepared with a customized R function vcf_to_baypass_v2. The B1K+ accessions were assigned to four populations based on their largest ancestry coefficients calculated by the ALStructure with the optimal K (K=4).

Run BAYPASS

The parameter setting followed a recent study using baypass (Leblois et al. 2017). The developer of the baypass, Dr. Mathieu Gautier, is the coauthor of this study (Leblois et al. 2017), so I assumed their setting for MCMC chains of baypassshould be more suitable than the default setting.

Specifically, we used the settings below for BAYPASS:

    1. -npilot 25: 25 short pilot runs
    1. -pilotlength 1000: 1,000 iterations for each pilot runs
    1. -burnin 100000: 100,000 iterations of burn-in
    1. -nval 2500: recording 2,500 post burn-in and thinned samples
    1. -thin 40: thinning by 40 (so the total post burn-in iterations would be 100,000 = 40 * 2,500)
  • run baypass with bash:
    g_baypass -npop 4 -gfile BayPass_genotyping_input_191B1K+53KS_in4pops_27147SNPs_maf0.05.alct -outprefix baypass_output_191B1K+53KS_in4pops -nthreads 16 -npilot 25 -pilotlength 1000 -burnin 100000 -thin 40 -nval 2500

Using POD (pseudo-observed data) to determine threshold

  • Using the following R codes to generate POD file
# create POD
source("/usr/local/bin/baypass_utils.R")
omega=as.matrix(read.table("baypass_output_191B1K+53KS_in4pops_mat_omega.out"))
pi.beta.coef = read.table("baypass_output_191B1K+53KS_in4pops_summary_beta_params.out",h=T)$Mean
wildbarley.data <- geno2YN("BayPass_genotyping_input_191B1K+53KS_in4pops_27147SNPs_maf0.05.alct")
simu.bta<-simulate.baypass(omega.mat=omega,nsnp=10000,sample.size = wildbarley.data$NN,
                           beta.pi = pi.beta.coef,pi.maf=0,suffix="wildbarley_pods")

# calculate XtX using POD
system("g_baypass -npop 4 -gfile G.wildbarley_pods -outprefix wildbarley_POD")

Manhattan plots of BAYPASS

# load BAYPASS result
XtX <- read.table("./data/baypass_output_191B1K+53KS_in4pops_summary_pi_xtx.out", stringsAsFactors = F, header = T)
POD <- read.table("./data/wildbarley_POD_summary_pi_xtx.out", header = T)
pod.thres <- quantile(POD$M_XtX, 0.995)
sum(XtX$M_XtX>pod.thres)
[1] 279

Version Author Date
5f53139 Che-Wei Chang 2022-01-18
597a627 Che-Wei Chang 2021-11-30

Significant SNPs identified by four methods

Below we summarized the number of significant SNPs with venn.diagram of the VennDiagram package.

qcut <- 0.05
rda.simple.sigtag <- rda.simple.pq[,2] < qcut
rda.ancescoeff.sigtag <- rda.partial.pq[,2] < qcut
lfmm.sigtag <- apply(lfmm.q,1, function(x){any(x < qcut)})
baypass.sigtag <- XtX$M_XtX > pod.thres

venn.input <- list("Simple RDA" = which(rda.simple.sigtag), "Partial RDA" = which(rda.ancescoeff.sigtag),
                   "LFMM" = which(lfmm.sigtag), "BAYPASS" = which(baypass.sigtag)
)

library(VennDiagram)
venn <- 
  venn.diagram(venn.input,fill = "white"
               , cex = 1.5,cat.fontfamily = "sans", lty =2,  imagetype = "png", 
               filename = NULL,
               cat.pos = c(0), resolution = 400, cat.cex = 2
  )


grid.draw(venn)

Version Author Date
597a627 Che-Wei Chang 2021-11-30

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] VennDiagram_1.6.20  futile.logger_1.4.3 lfmm_0.0           
 [4] scales_1.1.1        qvalue_2.16.0       robust_0.4-18.2    
 [7] fit.models_0.5-14   ggplot2_3.3.5       vegan_2.5-6        
[10] lattice_0.20-38     permute_0.9-5       vcfR_1.9.0         
[13] workflowr_1.6.2    

loaded via a namespace (and not attached):
 [1] viridisLite_0.4.0    splines_3.6.0        foreach_1.5.1       
 [4] assertthat_0.2.1     memuse_4.0-0         highr_0.9           
 [7] stats4_3.6.0         yaml_2.2.1           robustbase_0.93-5   
[10] pillar_1.6.4         backports_1.1.5      glue_1.4.2          
[13] RcppEigen_0.3.3.9.1  digest_0.6.28        promises_1.2.0.1    
[16] colorspace_2.0-2     htmltools_0.5.2      httpuv_1.6.3        
[19] Matrix_1.2-18        plyr_1.8.6           pcaPP_1.9-73        
[22] pkgconfig_2.0.3      purrr_0.3.4          mvtnorm_1.0-12      
[25] whisker_0.4          RSpectra_0.16-0      later_1.3.0         
[28] git2r_0.26.1         tibble_3.1.5         mgcv_1.8-28         
[31] generics_0.1.0       farver_2.1.0         ellipsis_0.3.2      
[34] withr_2.4.2          magrittr_2.0.1       crayon_1.4.1        
[37] evaluate_0.14        fs_1.5.0             fansi_0.5.0         
[40] nlme_3.1-139         MASS_7.3-51.1        tools_3.6.0         
[43] formatR_1.7          lifecycle_1.0.1      stringr_1.4.0       
[46] munsell_0.5.0        cluster_2.0.8        lambda.r_1.2.4      
[49] compiler_3.6.0       rlang_0.4.12         iterators_1.0.13    
[52] labeling_0.4.2       rmarkdown_2.3        gtable_0.3.0        
[55] codetools_0.2-16     DBI_1.1.1            reshape2_1.4.3      
[58] rrcov_1.5-2          R6_2.5.1             knitr_1.36          
[61] dplyr_1.0.7          pinfsc50_1.1.0       fastmap_1.1.0       
[64] utf8_1.2.2           rprojroot_1.3-2      futile.options_1.0.1
[67] ape_5.3              stringi_1.7.5        parallel_3.6.0      
[70] Rcpp_1.0.7           vctrs_0.3.8          DEoptimR_1.0-8      
[73] tidyselect_1.1.1     xfun_0.27