Where last we left off…...
We were filtering SNPs and looking at how SNPs may show signs of deviation from Hardy-Weinberg equilibrium, either as a heterozygote excess OR a deficit. Let's check this again, but now we're working with ALL the sequence libraries merged per individual. Yeah! Also, we're now going to use a compressed (gzipped) file to save some space.
Let's first re-apply our filters, and zip up the resulting filtered output file. Then we can take a look at Hardy-Weinberg Equilibrium (HWE).
PATH TO THE DATA:
/data/_project_data/snps/reads2snps/SSW_byind.txt.vcf.gz
VCFtools filtering strategy (same as last session):
xxxxxxxxxx
$ vcftools --gzvcf SSW_byind.txt.vcf.gz --min-alleles 2 --max-alleles 2 --maf 0.02 --max-missing 0.8 --recode --out ~/SSW_all_biallelic.MAF0.02.Miss0.8
$ gzip SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf
$ vcftools --gzvcf SSW_all_biallelic.MAF0.02.Miss0.8.recode.vcf --hardy
You can then bring the HWE output file (called "out.hwe") into R to take a look at which sites show deviation of observed from expected heterozygosity. Let's do this all together.
In the last session, we got familiar with working with SNP data in VCF files, and doing some basic filtering. Now that we have a filtered SNP dataset that has high-quality sites in it, let's look at some different measures of genetic diversity in our sea star population.
Keep in mind that the diversity of a population primarily reflects its effective population size (Ne). Ne is shaped by many different aspects of a species' life history and ecology (sex ratio, generation time, mating system, offspring number, and many more!) as well as the population's history (bottlenecks, population growth). As a result, looking at the diversity within populations (and comparing to other populations or species) is a critical step in understanding how ecology shapes genomes.
Now: Let's take a more in-depth look at the diversity hidden within these sea star data. There are many different ways to look at the diversity within populations using SNPs. Here are some that we'll think about for today:
We've been talking about pi in the last 2 papers, and seen that this relates to effective population size (Ne). Let's calculate pi first:
xxxxxxxxxx
$ vcftools --gzvcf filenamevcf.gz --site-pi --out SSW_pi
Many times we'll want to subset the total SNP dataset to analyze diversity in different groups. Say, compare allele frequencies in all the Healthy vs. Sick seastars. Or Intertidal vs. Subtidal. This is easy, you just need to create a separate text file containing which samples below to which groups so you can tell VCFtools how to split things up.
As an example, let's compare the SNP frequencies for all loci between Healthy and Sick animals. Perhaps there are some loci that contribute to a difference in pathogen susceptibility, which could be identified this way. Let's take a look.
First, you need to create text files containing the individual ID's for just Sick individuals. The following file has all the sample ID's in it that are part of your VCF file (SNPs=Y,N), along with Health Trajectory (HH,HS,SS,MM), and Location (INT,SUB):
/data/project_data/snps/reads2snps/ssw_healthloc.txt
Use this file to get just the healthy individual sample IDs.
How can we create these files in a clever, unix-y way? HINT: grep! ;)
Name it:
Create another for just the sick individuals:
We'll also want to remove all but the first column of data — the sample IDs. Here's a trick:
$ cut -f1 H_OneSampPerInd.txt >H_SampleIDs.txt
Do the same for Sick individuals.
Now that we have our individuals separated by disease status, we can call VCFtools to calculate allele frequencies separately for each population. This will require 2 separate calls to VCFtools.
Allele Frequencies between Healthy and Sick individuals:
xxxxxxxxxx
$ vcftools --vcf filename.vcf --freq2 --keep H_SampleIDs.txt --out H_AlleleFreqs
xxxxxxxxxx
$ vcftools --vcf filename.vcf --freq2 --keep S_SampleIDs.txt --out S_AlleleFreqs
Before we bring this into R for plotting, let's gather one more comparison between our groups.
Recall that Wright's Fst measures allele frequency variance between groups, but standardizes the estimate based on the mean frequencies within groups. This gives a complimentary view to just comparing the raw frequencies.
Fst between Healthy and Sick individuals:
xxxxxxxxxx
$ vcftools --vcf ~/filename.vcf --weir-fst-pop ~/H_SampleIDs.txt --weir-fst-pop ~/S_SampleIDs.txt --out HvS_OneSampPerInd
Now, we can import these datasets into R and make some plots to examine how the diversity varies in our dataset. We can get the data into R in one of 2 ways:
x[srkeller@pbio381 reads2snps]$ R
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
> fst <- read.table("HvS_OneSampPerInd.weir.fst",header=T)
> str(fst)
'data.frame': 442 obs. of 3 variables:
$ CHROM : Factor w/ 111 levels "TRINITY_DN35598_c0_g1_TRINITY_DN35598_c0_g1_i1_g.5802_m.5802",..: 65 65 100 100 100 100 100 100 88 88 ...
$ POS : int 4566 4665 978 1404 1722 3426 3729 3912 115 141 ...
$ WEIR_AND_COCKERHAM_FST: num 0.0305 0.0085 0.0305 -0.0188 0.0732 ...
OK, looks like the data read in to R OK. Let's make a quick histogram of Fst and save as a pdf in our home directory on the server (~/)
xxxxxxxxxx
> pdf("HvS_Fst.pdf")
> hist(fst$WEIR_AND_COCKERHAM_FST, breaks=20, col="red")
> dev.off()
null device
1
Let's also bring in the Allele Frequency results we got from VCFtools. Two interesting results to look for are:
Calculating allele frequency differences between Healthy vs. Sick individuals for each SNP
Calculate the Site Frequency Spectrum (SFS), which is simply a histogram of the allele frequencies across loci
When you're done, end your R session and return to the command line:
xxxxxxxxxx
> quit()
Save workspace image? [y/n/c]: n
[srkeller@pbio381 ~]$
Romiguier et al. report on some very intriguing associations between species life history traits, nucleotide diversity at synonymous sites (piS), and the ratio of piN/piS (where piN is nucleotide diversity at nonsynonymous site).
What do our sea star data have to say about this? Or more importantly: where do sea stars fall on the genomic diversity ~ life history continuum?
Estimating piS and piN on the entire dataset will take some time. Let's try and set this up at the end of the day and let it run. We'll use the piNpiS program from Gayral et al. (2013) to run this. We only need a single input file, which is a FASTA formatted sequence file that is output from reads2snps
xxxxxxxxxx
$ cd /data/project_data/snps/reads2snps
[srkeller@pbio381 reads2snps]$ /data/popgen/dNdSpiNpiS_1.0 -alignment_file=SSW_byind.txt.fas -ingroup=sp
While we wait for that to chug along, we can look at the output from the smaller VCF file run on just 1 sample library per individual. The output gives several files; here are the important ones:
Detailed results file with nucleotide diversity calculated for each gene:
/data/project_data/snps/reads2snps/SSW_bamlist.txt.out
Overall summary of results, including mean (and confidence intervals) of diversity when averaged across all expressed genes:
xxxxxxxxxx
/data/project_data/snps/reads2snps/SSW_bamlist.txt.sum
To compare the mean values across genes to Romiguier's data, we need to get their estimates and combine them with our estimates. I've downloaded Table S3 to our server and saved as a common separated (.csv) file:
xxxxxxxxxx
/data/project_data/snps/reads2snps/Romiguier_nature13685-s3.csv