Last updated: 2022-01-20

Checks: 6 1

Knit directory: Barley1kGBS_Proj/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you'll want to first commit it to the Git repo. If you're still working on the analysis, you can ignore this warning. When you're finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it's best to always run the code in an empty environment.

Seed: set.seed(20210831)

The command set.seed(20210831) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: b34d099

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b34d099. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/GenomicVarPartitioning_cache/
    Ignored:    analysis/PreprocessEnvData_cache/

Unstaged changes:
    Modified:   analysis/GenomeEnvAssociation.Rmd
    Modified:   analysis/PreprocessEnvData.Rmd
    Modified:   analysis/ResistanceSurface.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/GenomeEnvAssociation.Rmd) and HTML (public/GenomeEnvAssociation.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	5e688f9	Che-Wei Chang	2022-01-18	Build site.
html	b38dcf2	Che-Wei Chang	2022-01-18	Build site.
html	5f53139	Che-Wei Chang	2022-01-18	Build site.
Rmd	597a627	Che-Wei Chang	2021-11-30	update analyses. unPC, ResistanceGA and GO enrichment
html	597a627	Che-Wei Chang	2021-11-30	update analyses. unPC, ResistanceGA and GO enrichment
html	bc726c2	Che-Wei Chang	2021-10-22	Build site.
html	9429ae7	Che-Wei Chang	2021-10-22	Build site.
Rmd	a0a9f40	Che-Wei Chang	2021-10-22	wflow_publish("./analysis/GenomeEnvAssociation.Rmd")
html	1202876	Che-Wei Chang	2021-10-22	Host with GitLab.
Rmd	7c6b41a	Che-Wei Chang	2021-10-22	wflow_publish(list.files("./analysis", pattern = "*.Rmd", full.names = T))

Introduction

To identify genomic regions that potentially under environmental selection, we perform genome scan with genome-environment association (GEA) approaches and also an outlier method.

Here, we use three GEA approaches, including (1) simple RDA (2) partial RDA controlling population structure, and (3) LFMM (Caye et al. 2019). Two RDA approaches assumes that environmental selection gradients do not result from single environmental variables but a combination of environmental variables. In contrast, LFMM is a single-variate approach that consider one environmental variable at a time.

Genomic scan with RDA

To know more details of RDA approach, please check Lasky et al. (2012) and Forester et al. (2018). We use the statistical framework proposed by Capblancq et al. (2018) to carry out statistical tests for RDA.

Briefly, in the framework of Capblancq et al. (2018), we first select the most informative K axes of the RDA according to the plot of the explained variation with the "elbow" method. Subsequently, Mahalanobis distance is calculated for each SNP locus (adaptive SNPs are expected to diverge from the majority of SNPs, so they should have large Mahalanobis distances). To obtain p-values, Mahalanobis distances are divided by the genomic inflation factor to approximate a chi-squared distribution with K degree of freedom (see Luu et al. 2017). The genomic inflation factor is a constant defined as he median of the Mahalanobis distances divided by the median of the chi-square distribution with K degrees of freedom (Luu et al. 2017). Finally, statistical tests can be carried out based on the chi-squared distribution with K degree of freedom.

About `XtX` and `BAYPASS`

For the outlier method, we calculated XtX statistic (Günther and Coop 2013) that accounts for the covariance between populations by using BAYPASS (Gautier 2015). The significant threshold was determined with the 99.5% quantile of pseudo-observed data (Gautier 2015).

Load data for GEA

rm(list = ls())
source("./code/Rfunctions.R")

library(vcfR)
library(vegan)
 
# load genotypic data (244 accessions with 27,147 SNPs imputed by Beagle 5 with defaults)
geno <- read.vcfR("./data/GBS_244accessions_27147SNP_Beagle5Imp.vcf.gz", verbose = F)
Y <- t(vcf_to_nummatrix(geno))
colnames(Y) <- geno@fix[,"ID"]


CHR <- as.numeric(geno@fix[,"CHROM"])
POS <- as.numeric(geno@fix[,"POS"])

# load environmental data (244 accessions with 12 environmental variables)
env <- read.delim("./data/Env_data_244accessions_12Env_variables.tsv", header = T)
predictor <- as.data.frame(scale(env)) # standardize data

# load ancestry coefficients (K=4) estimated with ALStructure (https://doi.org/10.1534/genetics.119.302159)
alsQ <- t(readRDS("./data/Ancestry_coeff_244accessions_ALStructure_K4.RDS")$Q_hat)


# perform PCA
vcf <- read.vcfR("./data/GBS_244accessions_LDpruned_19601SNP.vcf.gz", verbose = F) # this vcf is LD prunned and unimputed, which is for population structure analysis.
Xmat <- vcf_to_nummatrix(vcf = vcf)
impX <- t(apply(Xmat,1,function(x){
  x[is.na(x)] <- mean(x, na.rm = T) 
  return(x)
}))

barley.pc <- prcomp(t(impX))

RDA approaches

Simple RDA

rda.snp.env <- rda(as.formula(paste("Y ~ ",paste(colnames(predictor), collapse = " + "), sep = "")), data = predictor)

Partial RDA controlling population structure

rda.snp.env.alsQ <- rda(Y ~ as.matrix(predictor) + Condition(as.matrix(alsQ)), scale = F)

PCs <- barley.pc$x[,1:3] # choose the first three PC axes to represent population structure
rda.snp.env.pc <- rda(Y ~ as.matrix(predictor) + Condition(as.matrix(PCs)), scale = F)

Selection of K

To determine the number of RDA axes for calculating Mahalanobis distance, we make a plot showing the proportion of explained variation of each axis. As there is no very conspicuous "elbow" in the plot, I choose K=4, which is the least K in the interval where "elbow" appears in the curve of the simple RDA.

library(ggplot2)
dat <- data.frame("K"=as.factor(rep(1:12, times = 3)),"Exp_prop" = c(summary(rda.snp.env)$concont[[1]][2,], summary(rda.snp.env.alsQ)$concont[[1]][2,], summary(rda.snp.env.pc)$concont[[1]][2,]
                                                                     ), "model" = rep(c("Simple_RDA", "RDA_Als", "RDA_PC"), each = 12))

ggplot(dat, aes(x=K, y=Exp_prop, group=model)) +
  geom_line(aes(linetype=model)) + 
  geom_point() +
  ylab("Explained proportion")

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
bc726c2	Che-Wei Chang	2021-10-22

Statistical test

To conduct statistical tests, we use the R function rdadapt (see github.com/Capblancq/RDA-genome-scan). This function returns both p-values and q-values.

# load libraries required for rdadapt
library(robust)
library(qvalue)

k = 4 # number of RDA axes to retain
rda.simple.pq <- rdadapt(rda.snp.env, K = k)
rda.partial.alsQ.pq <- rdadapt(rda.snp.env.alsQ, K = k)
rda.partial.pc.pq <- rdadapt(rda.snp.env.pc, K = k)

Differences between using ancestry coefficients and PCs

As we use two approaches to condition an RDA model on population structure (ancestry coefficients and PC scores), it is interesting to look at how different between the results of them.

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
bc726c2	Che-Wei Chang	2021-10-22

We further zoom-in to those SNPs with q-values < 0.05.

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
bc726c2	Che-Wei Chang	2021-10-22

Here we check how many SNPs are significant in a partial RDA model but not significant in another.

# the number of SNPs significant in the RDA conditioned on ancestry coefficients but not on PCs
sum(rda.partial.alsQ.pq$q.values <= 0.05 & rda.partial.pc.pq$q.values > 0.05)

[1] 7

# the number of SNPs significant in the RDA conditioned on PCs but not on ancestry coefficients
sum(rda.partial.alsQ.pq$q.values > 0.05 & rda.partial.pc.pq$q.values <= 0.05)

[1] 19

By comparing the results of two approaches, we confirmed there is only very little difference between conditioned on ancestry coefficients and PC scores. A possible explanation for this is that ALStructure is a method bridging PCA with the model-based population structure analysis (like STRUCTURE and ADMIXTURE). For the downstream analysis, we would only show the results of RDA conditioned on ancestry coefficients.

We take FDR = 0.05 as a significance level for statistical tests.

Simple RDA

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Partial RDA controlling population structure

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

LFMM

library(lfmm)
fit.lfmm <- lfmm_ridge(Y = Y, X = predictor, K = 4)

pv <- lfmm_test(Y = Y, X = predictor, lfmm = fit.lfmm, 
                calibrate = "gif")

pvalues <- pv$calibrated.pvalue 

# check if the distribution of p-value agree with the FDR assumption
hist(as.vector(pvalues), xlab = "Adjusted P-values", main = "Distribution of Adjusted P-values of LFMM (K = 4)")

Version	Author	Date
1202876	Che-Wei Chang	2021-10-22

# calculate q-value for each GEA test respectively
qvalues <- 
  apply(pvalues,2, function(x){
    qvalue(x)$qvalues
  })

rownames(qvalues) <- rownames(pvalues)
colnames(qvalues) <- colnames(pvalues)

Number of significant SNPs

For all environmental variables

There are 307 significant SNPs in total for all environmental variables.

sum(apply(qvalues, 1, function(x){any(x < 0.05)}))

[1] 307

For each environmental variable

We further look at the number of significant SNPs for each environmental variable.

Env	Number_of_SNPs
Latitude.Rain.Solar_rad	4
Aspect	0
Slope	0
Elevation.Temperature	108
Soil_bulk_density	0
Soil_water_capacity	1
Soil_carbon_content.0.15cm.	11
Soil_carbon_content.30cm.	0
Soil_pH	2
Soil_silt_content	0
StdDev_Temperature	179
CoefVar_Rain	12

Manhattan plots of LFMM

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30
bc726c2	Che-Wei Chang	2021-10-22

BAYPASS

Preparation of genotypic data

The data of allele counts was prepared with a customized R function vcf_to_baypass_v2. The B1K+ accessions were assigned to four populations based on their largest ancestry coefficients calculated by the ALStructure with the optimal K (K=4).

Run `BAYPASS`

The parameter setting followed a recent study using baypass (Leblois et al. 2017). The developer of the baypass, Dr. Mathieu Gautier, is the coauthor of this study (Leblois et al. 2017), so I assumed their setting for MCMC chains of baypassshould be more suitable than the default setting.

Specifically, we used the settings below for BAYPASS:

1. -npilot 25: 25 short pilot runs
1. -pilotlength 1000: 1,000 iterations for each pilot runs
1. -burnin 100000: 100,000 iterations of burn-in
1. -nval 2500: recording 2,500 post burn-in and thinned samples
1. -thin 40: thinning by 40 (so the total post burn-in iterations would be 100,000 = 40 * 2,500)
run baypass with bash:
g_baypass -npop 4 -gfile BayPass_genotyping_input_191B1K+53KS_in4pops_27147SNPs_maf0.05.alct -outprefix baypass_output_191B1K+53KS_in4pops -nthreads 16 -npilot 25 -pilotlength 1000 -burnin 100000 -thin 40 -nval 2500

Using POD (pseudo-observed data) to determine threshold

Using the following R codes to generate POD file

# create POD
source("/usr/local/bin/baypass_utils.R")
omega=as.matrix(read.table("baypass_output_191B1K+53KS_in4pops_mat_omega.out"))
pi.beta.coef = read.table("baypass_output_191B1K+53KS_in4pops_summary_beta_params.out",h=T)$Mean
wildbarley.data <- geno2YN("BayPass_genotyping_input_191B1K+53KS_in4pops_27147SNPs_maf0.05.alct")
simu.bta<-simulate.baypass(omega.mat=omega,nsnp=10000,sample.size = wildbarley.data$NN,
                           beta.pi = pi.beta.coef,pi.maf=0,suffix="wildbarley_pods")

# calculate XtX using POD
system("g_baypass -npop 4 -gfile G.wildbarley_pods -outprefix wildbarley_POD")

Manhattan plots of `BAYPASS`

# load BAYPASS result
XtX <- read.table("./data/baypass_output_191B1K+53KS_in4pops_summary_pi_xtx.out", stringsAsFactors = F, header = T)
POD <- read.table("./data/wildbarley_POD_summary_pi_xtx.out", header = T)
pod.thres <- quantile(POD$M_XtX, 0.995)
sum(XtX$M_XtX>pod.thres)

[1] 279

Version	Author	Date
5f53139	Che-Wei Chang	2022-01-18
597a627	Che-Wei Chang	2021-11-30

Significant SNPs identified by four methods

Below we summarized the number of significant SNPs with venn.diagram of the VennDiagram package.

qcut <- 0.05
rda.simple.sigtag <- rda.simple.pq[,2] < qcut
rda.ancescoeff.sigtag <- rda.partial.pq[,2] < qcut
lfmm.sigtag <- apply(lfmm.q,1, function(x){any(x < qcut)})
baypass.sigtag <- XtX$M_XtX > pod.thres

venn.input <- list("Simple RDA" = which(rda.simple.sigtag), "Partial RDA" = which(rda.ancescoeff.sigtag),
                   "LFMM" = which(lfmm.sigtag), "BAYPASS" = which(baypass.sigtag)
)

library(VennDiagram)
venn <- 
  venn.diagram(venn.input,fill = "white"
               , cex = 1.5,cat.fontfamily = "sans", lty =2,  imagetype = "png", 
               filename = NULL,
               cat.pos = c(0), resolution = 400, cat.cex = 2
  )


grid.draw(venn)

Version	Author	Date
597a627	Che-Wei Chang	2021-11-30

sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] VennDiagram_1.6.20  futile.logger_1.4.3 lfmm_0.0           
 [4] scales_1.1.1        qvalue_2.16.0       robust_0.4-18.2    
 [7] fit.models_0.5-14   ggplot2_3.3.5       vegan_2.5-6        
[10] lattice_0.20-38     permute_0.9-5       vcfR_1.9.0         
[13] workflowr_1.6.2    

loaded via a namespace (and not attached):
 [1] viridisLite_0.4.0    splines_3.6.0        foreach_1.5.1       
 [4] assertthat_0.2.1     memuse_4.0-0         highr_0.9           
 [7] stats4_3.6.0         yaml_2.2.1           robustbase_0.93-5   
[10] pillar_1.6.4         backports_1.1.5      glue_1.4.2          
[13] RcppEigen_0.3.3.9.1  digest_0.6.28        promises_1.2.0.1    
[16] colorspace_2.0-2     htmltools_0.5.2      httpuv_1.6.3        
[19] Matrix_1.2-18        plyr_1.8.6           pcaPP_1.9-73        
[22] pkgconfig_2.0.3      purrr_0.3.4          mvtnorm_1.0-12      
[25] whisker_0.4          RSpectra_0.16-0      later_1.3.0         
[28] git2r_0.26.1         tibble_3.1.5         mgcv_1.8-28         
[31] generics_0.1.0       farver_2.1.0         ellipsis_0.3.2      
[34] withr_2.4.2          magrittr_2.0.1       crayon_1.4.1        
[37] evaluate_0.14        fs_1.5.0             fansi_0.5.0         
[40] nlme_3.1-139         MASS_7.3-51.1        tools_3.6.0         
[43] formatR_1.7          lifecycle_1.0.1      stringr_1.4.0       
[46] munsell_0.5.0        cluster_2.0.8        lambda.r_1.2.4      
[49] compiler_3.6.0       rlang_0.4.12         iterators_1.0.13    
[52] labeling_0.4.2       rmarkdown_2.3        gtable_0.3.0        
[55] codetools_0.2-16     DBI_1.1.1            reshape2_1.4.3      
[58] rrcov_1.5-2          R6_2.5.1             knitr_1.36          
[61] dplyr_1.0.7          pinfsc50_1.1.0       fastmap_1.1.0       
[64] utf8_1.2.2           rprojroot_1.3-2      futile.options_1.0.1
[67] ape_5.3              stringi_1.7.5        parallel_3.6.0      
[70] Rcpp_1.0.7           vctrs_0.3.8          DEoptimR_1.0-8      
[73] tidyselect_1.1.1     xfun_0.27

Genome Scans: XtX and Genome-Environment Association

Che-Wei Chang

2021-10-05

Introduction

Genomic scan with RDA

About `XtX` and `BAYPASS`

Load data for GEA

RDA approaches

Simple RDA

Partial RDA controlling population structure

Selection of K

Statistical test

Differences between using ancestry coefficients and PCs

Simple RDA

Partial RDA controlling population structure

LFMM

Number of significant SNPs

For all environmental variables

For each environmental variable

Manhattan plots of LFMM

BAYPASS

Preparation of genotypic data

Run `BAYPASS`

Using POD (pseudo-observed data) to determine threshold

Manhattan plots of `BAYPASS`

Significant SNPs identified by four methods

Genome Scans: XtX and Genome-Environment Association

Che-Wei Chang

2021-10-05

Introduction

Genomic scan with RDA

About XtX and BAYPASS

Load data for GEA

RDA approaches

Simple RDA

Partial RDA controlling population structure

Selection of K

Statistical test

Differences between using ancestry coefficients and PCs

Simple RDA

Partial RDA controlling population structure

LFMM

Number of significant SNPs

For all environmental variables

For each environmental variable

Manhattan plots of LFMM

BAYPASS

Preparation of genotypic data

Run BAYPASS

Using POD (pseudo-observed data) to determine threshold

Manhattan plots of BAYPASS

Significant SNPs identified by four methods

About `XtX` and `BAYPASS`

Run `BAYPASS`

Manhattan plots of `BAYPASS`