P/BIO381 Tutorials

Population Genomics 5: Testing for selective sweeps associated with disease status

April 3, 2017

As we've discussed in class, positive directional selection will increase the frequency of a beneficial SNP allele within a population. This often causes variation at nearby linked SNP loci to also increase in frequency along with the beneficial SNP due to LD — this is what we call a "selective sweep". There are several statistical "footprints" that selective sweeps are predicted to leave in SNP frequency data:

So far, we've been thinking about detecting sweeps within a single population. However, sweeps that are happening differently across popuations are expected to lead to an additional signal of high differentiation between groups (i.e., high Fst). This is the basis of our next analysis approach aimed at detecting evidence of selection operating differently between healthy and sick sea stars.


Challenges for finding Fst outliers:

Fst measures the degree of allele frequency differences between 2 or more groups of individuals. In most cases, these groups represent different spatially distinct populations. However, population demographic history (drift, bottlenecks, range expansions) can have a profound effect on the distribution of Fst, and often leads to increases in the number of high Fst loci, leading to false positives in scans for selective sweeps. Therefore, we need to be careful to separate out loci with high Fst due to neutral demographic history from those with high Fst due to natural selection.

On such approach that we'll use here was recently published by Mike Whitlock and Katie Lotterhos, called 'OutFLANK'. The overall goal is to use the expected neutral distribution of Fst between n groups based on established population genetic theory to identify observed loci that exceed the upper neutral distribution — these are our candidates for local selection. However, the basic population genetic model used to generate Fst (the "Island Model") will fail under most realistic demographic scenarios that don't strictly follow the Island Model's assumptions of constant equal population sizes with equal and symmetric gene flow between groups.

OutFLANK's approach is to calculate the theoretical distribution of Fst that provides the best fit to just the neutral loci in our sample, given the (unknown) demographic history of our species. How does it do that? First, it trims out loci in the upper and lower tails of the distribution (these are most likely to be under selection). Then, it uses the remaining loci to fit a neutral distribution of Fst based on popgen theory. It then checks to see if this neutral distribution provides a good fit to the trimmed distribution. If it does, then it identifies outlier loci beyond the upper tail of the neutral distribution as candidates for local adaptation.

The manual for the OutFLANK R package can be found here.

For our analysis, our "populations" will consist of 3 groups of samples representing the disease status of sea stars: HH, HS, or SS (we'll exclude the ambiguous MM individuals here).

The data format to read into OutFLANK is a bit funny. It is very similar to the .geno file we generated for our ADMIXTURE analysis, where SNP genotypes are scored as 0,1,2 (missing =9). The difference is that OutFLANK expects this data matrix inverted, with individuals in rows and SNPs in columns. For us, this will be a 22 x 5317 matrix (22 because we're not including the MM's).

OK, let's get started:

 
# Set your working directory to where you downloaded your results files:
setwd("/Volumes/kellrlab/karl/PBIO_381/")
list.files() # Do you see your downloaded files there? If not, double check to make sure you've set your working directory to the right spot
install.packages("devtools")
library(devtools)
source("http://bioconductor.org/biocLite.R")
biocLite("qvalue")
install_github("whitlock/OutFLANK")
# ...and load the library. We'll also load the vcfR and adegenet libraries that we'll need later in the tutorial
library(OutFLANK)
library(vcfR)
library(adegenet)
 
ssw.geno_in <- read.fwf("SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf.geno", width=rep(1,24))
ssw.geno <- t(ssw.geno_in)
ssw_meta <- read.table("ssw_healthloc.txt", header=T) # Bring in the metadata
ssw_meta <- ssw_meta[order(ssw_meta$Individual),] # Reorder the metadata by Individual Number
ssw_meta$Trajectory[which(ssw_meta$Trajectory=="MM")] = NA # Remove the MM's from the file
 
OF_SNPs <- MakeDiploidFSTMat(ssw.geno, locusNames=seq(1, 5317, by=1), popNames=ssw_meta$Trajectory) # Provide the transposed genotype file, locus names, and population names.
OF_out <- OutFLANK(FstDataFrame=OF_SNPs, LeftTrimFraction=0.05, RightTrimFraction=0.05, Hmin=0.1, NumberOfSamples=3, qthreshold=0.1)
OutFLANKResultsPlotter(OF_out, withOutliers=T, NoCorr=T, Hmin=0.1, binwidth=0.005, titletext="Scan for local selective sweeps among disease groups: HH, HS, and SS")
outliers <- which(OF_out$results$OutlierFlag=="TRUE")
print(outliers)
vcf1 <- read.vcfR("SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf")
vcfann <- as.data.frame(getFIX(vcf1))
vcfann[outliers,]