P/BIO381 Tutorials
As we've discussed in class, positive directional selection will increase the frequency of a beneficial SNP allele within a population. This often causes variation at nearby linked SNP loci to also increase in frequency along with the beneficial SNP due to LD — this is what we call a "selective sweep". There are several statistical "footprints" that selective sweeps are predicted to leave in SNP frequency data:
So far, we've been thinking about detecting sweeps within a single population. However, sweeps that are happening differently across popuations are expected to lead to an additional signal of high differentiation between groups (i.e., high Fst). This is the basis of our next analysis approach aimed at detecting evidence of selection operating differently between healthy and sick sea stars.
Fst measures the degree of allele frequency differences between 2 or more groups of individuals. In most cases, these groups represent different spatially distinct populations. However, population demographic history (drift, bottlenecks, range expansions) can have a profound effect on the distribution of Fst, and often leads to increases in the number of high Fst loci, leading to false positives in scans for selective sweeps. Therefore, we need to be careful to separate out loci with high Fst due to neutral demographic history from those with high Fst due to natural selection.
On such approach that we'll use here was recently published by Mike Whitlock and Katie Lotterhos, called 'OutFLANK'. The overall goal is to use the expected neutral distribution of Fst between n groups based on established population genetic theory to identify observed loci that exceed the upper neutral distribution — these are our candidates for local selection. However, the basic population genetic model used to generate Fst (the "Island Model") will fail under most realistic demographic scenarios that don't strictly follow the Island Model's assumptions of constant equal population sizes with equal and symmetric gene flow between groups.
OutFLANK's approach is to calculate the theoretical distribution of Fst that provides the best fit to just the neutral loci in our sample, given the (unknown) demographic history of our species. How does it do that? First, it trims out loci in the upper and lower tails of the distribution (these are most likely to be under selection). Then, it uses the remaining loci to fit a neutral distribution of Fst based on popgen theory. It then checks to see if this neutral distribution provides a good fit to the trimmed distribution. If it does, then it identifies outlier loci beyond the upper tail of the neutral distribution as candidates for local adaptation.
The manual for the OutFLANK R package can be found here.
For our analysis, our "populations" will consist of 3 groups of samples representing the disease status of sea stars: HH, HS, or SS (we'll exclude the ambiguous MM individuals here).
The data format to read into OutFLANK is a bit funny. It is very similar to the .geno file we generated for our ADMIXTURE analysis, where SNP genotypes are scored as 0,1,2 (missing =9). The difference is that OutFLANK expects this data matrix inverted, with individuals in rows and SNPs in columns. For us, this will be a 22 x 5317 matrix (22 because we're not including the MM's).
OK, let's get started:
# Set your working directory to where you downloaded your results files:
setwd("/Volumes/kellrlab/karl/PBIO_381/")
list.files() # Do you see your downloaded files there? If not, double check to make sure you've set your working directory to the right spot
install.packages("devtools")
library(devtools)
source("http://bioconductor.org/biocLite.R")
biocLite("qvalue")
install_github("whitlock/OutFLANK")
# ...and load the library. We'll also load the vcfR and adegenet libraries that we'll need later in the tutorial
library(OutFLANK)
library(vcfR)
library(adegenet)
ssw.geno_in <- read.fwf("SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf.geno", width=rep(1,24))
ssw.geno <- t(ssw.geno_in)
ssw_meta <- read.table("ssw_healthloc.txt", header=T) # Bring in the metadata
ssw_meta <- ssw_meta[order(ssw_meta$Individual),] # Reorder the metadata by Individual Number
ssw_meta$Trajectory[which(ssw_meta$Trajectory=="MM")] = NA # Remove the MM's from the file
Now we're ready to run OutFLANK. This consists of the following steps:
OF_SNPs <- MakeDiploidFSTMat(ssw.geno, locusNames=seq(1, 5317, by=1), popNames=ssw_meta$Trajectory) # Provide the transposed genotype file, locus names, and population names.
OF_out <- OutFLANK(FstDataFrame=OF_SNPs, LeftTrimFraction=0.05, RightTrimFraction=0.05, Hmin=0.1, NumberOfSamples=3, qthreshold=0.1)
OutFLANKResultsPlotter(OF_out, withOutliers=T, NoCorr=T, Hmin=0.1, binwidth=0.005, titletext="Scan for local selective sweeps among disease groups: HH, HS, and SS")
outliers <- which(OF_out$results$OutlierFlag=="TRUE") print(outliers)
vcf1 <- read.vcfR("SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf")
vcfann <- as.data.frame(getFIX(vcf1)) vcfann[outliers,]
Cool…it looks like we've got several promising candidate genes that show stronger than expected allele frequency divergence among our disease groups, consistent with selection imposed by SSW disease.
Coming up in our next coding session, we'll annotate our entire transcriptome using BLAST to public databases so we can assess what types of genes these are and what functions they have.
For now, we can do this manually by BLAST'ing a couple genes that we've identified as outliers:
Log back into the server and download the file that contains the full transcriptome reference sequences based on our RNASeq data:
/data/project_data/assembly/08-11-35-36_cl20_longest_orfs_gene.cds
Open this file in your favorite light-weight text editor and search for one of the Trinity transcript ID's discovered by OutFLANK
Copy the resulting nucleotide sequences, and go to the NCBI BLAST homepage: https://blast.ncbi.nlm.nih.gov/Blast.cgi
Since these sequences are protein-coding transcripts, let's use the BLASTn algorithm, which translates the nucleotide sequence into amino-acids, and searches the NCBI nr database for protein matches
What did you find? What type of protein and its function are among the best hits?
It might be really interesting to do this for the outlier SNPs that you identified among disease groups using DAPC. Think back to the loading plots we made in the PopGenomics4 tutorial:
# Use the "droplevels" command to eliminate the last 2 individuals with "MM" and ensure there are only 3 disease categories; call this a new dataframe 'ssw_meta2'
ssw_meta2 <- droplevels(ssw_meta[1:22,])
# Now drop the last 2 individuals that are MM from the genlight SNP data. Note, the vcf object has an extra column in front, so you have to go to column 23 to get the first 22 individuals.
gl1 <- vcfR2genlight(vcf1[,1:23])
gl1$other <- as.list(ssw_meta2)
# Run the DAPC using disease status to group samples
disease.dapc <- dapc(gl1, pop=gl1$other$Trajectory, n.pca=7, n.da=3,
var.loadings=T, pca.info=T, parallel=F)
# Scatterplot of results
scatter.dapc(disease.dapc, grp=gl1$other$Trajectory, legend=T)
# Plot loci that contribute the most to distinguishing Healthy vs. Sick individuals (upper 0.1% of loadings)
loadingplot(disease.dapc$var.load,
lab.jitter=1,
threshold=quantile(disease.dapc$var.load, probs=0.999))
# What are the Trinity transcript ID's for these DAPC outliers?
gl1$chromosome[which(disease.dapc$var.load>quantile(disease.dapc$var.load, 0.999))]
Do any of the DAPC outliers overlap with the selected SNPs identified by OutFLANK?
What type of genes do you get for DAPC outliers when you use BLASTn?