2017-03-22_Tutorials_PopulGenomics4.md

P/BIO381 Tutorials

Population Genomics 4: Population Structure with PCA and ADMIXTURE

March 27, 2017 (revised after class to correct typos)

Our next goal is to look for the presence of population structure in our sample of sea stars. Recall that these animals were all collected from the same general geographic area, and the dispersal ability of sea star gametes and juvelines is pretty impressive. So, we don't necessarily expect to find a lot of structure, but one nevers knows without checking...

We'll take 2 different approaches to test if there is any population structure present in our sample:

Principal Component Analysis (PCA) and related analyses on the SNPs to see if they group by sampling locality or disease status
The maximum likelihood ADMIXTURE program to cluster genotypes into K groups, in which we'll vary K from 1 - 10

Keep in mind, both analyses are naive with regard to the actual sampling locality of individuals, so they provide a relatively unbiased way of determining if there are actually >1 genetically distinct groups represented in the data.

PCA on SNP genotypes:

Principal Components Analysis (PCA) is a powerful multivariate technique to reduce the dimensionality of large SNP datasets into a few synthetic axes (PC's) that describe the major structure present in the data. We'll do this in R using the adegent package (adegenet manual available here).

Transfer your filtered vcf.gz file (SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf.gz) from the server to your local machine. You know the drill…use Fetch, WinScp, or scp at the command-line.
Also transfer the metadata on ssw locality and disease, found here:

/data/project_data/snps/reads2snps/ssw_healthloc.txt

Open R, paste the following into an R script, and work through it:

 
# Set your working directory to where you downloaded your results files:
setwd("~/github/PBIO381_srkeller_labnotebook/data/SNP_data/")
list.files() # Do you see your downloaded files there? If not, double check to make sure you've set your working directory to the right spot
# We'll need to install 2 packages to work with the SNP data:
install.packages("vcfR") # reads in vcf files and proides tools for file conversion 
install.packages("adegenet") # pop-genetics package with some handy routines, including PCA and other multivariate methods (DAPC)
# ...and load the libraries
library(vcfR)
library(adegenet)
#Read the vcf SNP data into R
vcf1 <- read.vcfR("SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf")
# The adegenet package uses a highly efficient way of storing large SNP datasets in R called a "genlight" object. The following function creates a genlight object from your vcf:
gl1 <- vcfR2genlight(vcf1)
print(gl1) # Looks good! Right # of SNPs and individuals!
# For info, try:
gl1$ind.names
gl1$loc.names[1:10]
gl1$chromosome[1:3]
# Notice there's nothing in the field that says "pop"? Let's fix that...
ssw_meta <- read.table("ssw_healthloc.txt", header=T) # read in the metadata
ssw_meta <- ssw_meta[order(ssw_meta$Individual),] # sort by Individual ID, just like the VCF file
# Confirm the ID's are ordered the same in gl1 and ssw_meta:
gl1$ind.names
ssw_meta$Individual
gl1$pop <- ssw_meta$Location # assign locality info
# THIS IS THE LINE OF CODE THAT WAS CAUSING US ISSUES IN CLASS! HERE, I'VE CORRECTED IT TO ASSIGN ALL FIELDS IN THE META-DATA FOR 'ssw_meta' AS A LIST OF VARIABLES IN 'gl1$other'. FROM HERE ON, THE CODE SHOULD WORK FINE. 
gl1$other <- as.list(ssw_meta) # assign disease status
# WE can explore the structure of our SNP data using the glPlot function, which gives us a sample x SNP view of the VCF file
glPlot(gl1, posi="bottomleft")
# Now, let's compute the PCA on the SNP genotypes and plot it:
pca1 <- glPca(gl1, nf=4, parallel=F) # nf = number of PC axes to retain (here, 4)
pca1 # prints summary
# Plot the individuals in SNP-PCA space, with locality labels:
plot(pca1$scores[,1], pca1$scores[,2], 
     cex=2, pch=20, col=gl1$pop, 
     xlab="Principal Component 1", 
     ylab="Principal Component 2", 
     main="PCA on SSW data (Freq missing=20%; 5317 SNPs)")
legend("topleft", 
       legend=unique(gl1$pop), 
       pch=20, 
       col=c("black", "red"))
# Perhaps we want to show disease status instead of locality:
plot(pca1$scores[,1], pca1$scores[,2], 
     cex=2, pch=20, col=as.factor(gl1$other$Trajectory), 
     xlab="Principal Component 1", 
     ylab="Principal Component 2", 
     main="PCA on SSW data (Freq missing=20%; 5317 SNPs)")
legend("topleft", 
       legend=unique(gl1$other$Trajectory), 
       pch=20, 
       col=as.factor(unique(gl1$other$Trajectory)))
# Which SNPs load most strongly on the 1st PC axis?
loadingplot(abs(pca1$loadings[,1]),
            threshold=quantile(abs(pca1$loadings), 0.999))
# Get their locus names
gl1$loc.names[which(abs(pca1$loadings)>quantile(abs(pca1$loadings), 0.999))]

If you have a-priori defined groups, another way to analyze SNP-PCA information is with a discriminant analysis. This is known as Discriminant Analysis of Principal Components (DAPC), and is a very useful means of finding the SNPs that most differentiate your samples for a variable of interest. Read more on this method here.

For our data, we might choose to perform DAPC based on a-priori disease status designations...

 
# Run the DAPC using disease status to group samples
disease.dapc <- dapc(gl1, pop=gl1$other$Trajectory, n.pca=8, n.da=3,
     var.loadings=T, pca.info=T, parallel=F)
# Scatterplot of results
scatter.dapc(disease.dapc, grp=gl1$other$Trajectory, legend=T)
# Plot the posterior assignment probabilities to each group
compoplot(disease.dapc)
# Which loci contribute the most to distinguishing Healthy vs. Sick individuals?
loadingplot(abs(disease.dapc$var.load), 
            lab.jitter=1, 
            threshold=quantile(abs(disease.dapc$var.load), probs=0.999))

ADMIXTURE analysis

A second way of estimating population structure besides PCA is to use genotypic clustering algorithms. These include the familiar program STRUCTURE, as well as many others that have sprung up like it. All share the common feature of using multi-locus genetic data to estimate:

(i) the number of clusters present, and
(ii) each individual's proportion of genetic ancestry in these clusters

With large population genomic datasets, STRUCTURE would take a prohibitively long time to run. Thus, analyzing thousands to millions of SNPs requires computationally efficient approaches to the clustering problem. A good option is the maximum-likelihood program ADMIXTURE by John Novembre's lab.

For reference, here is the source page for information on ADMIXTURE.

And as with any good software, there is also a well annotated manual available.

ADMIXTURE introduces a user-defined number of groups or clusters (known as K) and uses maximum likelihood to estimate allele frequencies in each cluster, and assign each individual ancestry (Q) to one or more of these clusters.

To run ADMIXTURE, we need to provide an input file and the requested level of K to investigate. Unfortunately, getting the data formatted properly for input is a bit of a pain.

The program PGDSpider is able to convert vcf files to .geno format, which ADMIXTURE can read. This requires 4 files:

the input data file in vcf format
a text file with sample IDs and the population designations
a settings file (.spid) that tells PGDSpider how to process the data
a bash script that runs the program with all the above settings specified

Lucky for you, I've already done this! But, here are the files for future reference, in case you need to do this yourself down the road. Make sure they're all in the same directory along with your vcf file before running.

 
/data/project_data/snps/reads2snps/SSW_tidal.pops
/data/project_data/snps/reads2snps/vcf2admixture_SSW.spid
/data/project_data/snps/reads2snps/vcf2geno.sh

The ready-to-go geno file is located on our server here:

 
/data/project_data/snps/reads2snps/SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf.geno

In the same path, you should also see a bash script:

 
/data/project_data/snps/reads2snps/ADMIX.sh

Use cp to copy the .geno and ADMIX.sh files to your home directory on the server, then cd there and confirm the files are present.

From within your home directory, open the ADMIX.sh script in vim. Let's walk through what each step is doing:

 
#!/bin/bash
# Run ADMIXTURE to determine the number of genetic clusters in the SNP data, 
# and the ancestry proportions of each individual
# Here's the utility of 'for loops'...
for K in {1..10}
do
admixture -C 0.000001 --cv ./SSW_all_biallelic.MAF0.02.Miss1.0.recode.vcf.geno $K \
| tee log${K}.out
done
# After the for loop finishes, you can use 'grep' to grab the values of the CV from each separate log file and append them into a new summary text file.
grep CV log*.out >chooseK.txt

When you're ready to go, exit vim to return to the command line, and execute the script.

 
$ bash ADMIX.sh

The cross-validation procedure in ADMIXTURE breaks the samples into 5 equally sized chunks. It then masks each chunk in turn, trains the model to estimate the allele frequencies and ancestry assignments on the unmasked data, and then attempts to predict the genotype values for the masked individuals.

If the model is good (and there's true structure in the data), then the best value of K is the one that will minimize the cross-validation (CV) error. This is shown in the example plot below (not our SSW data)

The CV values for our runs are stored in the output file "chooseK.txt"

Print the contents of this file to your screen:

 
$ cat chooseK.txt

What level of K is the CV the lowest?
What does this say about the presence of genetic structure in our SSW data?

We can check our estimates of individual ancestry and make admixture barplots in R.

Download the output Q files that ADMIXTURE generated to your laptop: Let's get the files corresponding to K=1-3 (*.1Q, *.2Q, *.3Q)
Simple plots can be made with this R script:

 
xxxxxxxxxx
setwd("~/github/PBIO381_srkeller_labnotebook/results")
# Import the ADMIXTURE Q matrices
K1Q <- read.table("SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf.1.Q")
K2Q <- read.table("SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf.2.Q")
K3Q <- read.table("SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf.3.Q")
# Get the SSW meta-data
ssw_meta <- read.table("~/github/PBIO381_srkeller_labnotebook/data/SNP_data/ssw_healthloc.txt", header=T)
  
# Set up the plotting conditions for a multi-panel plot (3 rows, 1 column)
par(mfrow=c(3,1))
# Make the barplots for K=1-3
barplot(t(as.matrix(K1Q)), 
        col=rainbow(2),
        names.arg=ssw_meta$Location, 
        cex.names=0.75, 
        xlab="Individual", ylab="Ancestry", 
        border=NA)
barplot(t(as.matrix(K2Q)), 
        col=rainbow(2),
        names.arg=ssw_meta$Location, 
        cex.names=0.75, 
        xlab="Individual", ylab="Ancestry", 
        border=NA)
barplot(t(as.matrix(K3Q)), 
        col=rainbow(3),
        names.arg=ssw_meta$Location, 
        cex.names=0.75, 
        xlab="Individual", ylab="Ancestry", 
        border=NA)