When doing population genomics on large genome-wide or transcriptome-wide datasets, we generally want to work with files that contain just the polymoprhic sites and omit sites that are fixed. But there's also a lot of metadata from our assembly (remember those good 'ol sam files?) that will be important for analyzing the SNP data downstream.
These are things like:
As usual, the community has converged on a common standard to represent these large and sometimes complex SNP data files. It is known as the Variant Call Format, or VCF. Here's a link to the description of what each field in a VCF file means: VCF version 4.3 file definition
I called SNPs using the method implemented in Gayral et al. (2014) called 'reads2snp'. To get us started for today, I used just a single sample library from each individual (the first dated one). Here's the list:
03_5-08_S_2 07_5-08_S_1 08_5-08_H_0 09_5-08_H_0 10_5-08_H_0 14_5-08_S_2 15_5-08_H_0 19_5-11_H_0 20_5-08_H_0 22_5-08_S_1 23_5-17_S_2 24_5-08_H_0 26_5-08_S_2 27_5-08_H_0 28_5-08_S_1 29_5-08_S_2 31_6-12_H_0 32_6-12_H_0 33_6-12_H_0 34_6-12_H_0 35_6-12_H_0 36_6-12_S_1 37_6-12_H_0 38_6-12_H_0
Eventually, we'll work with all the reads from a given individual by combining across libraries for the same genotype prior to SNP calling. More on that later...
We'll be working with vcf files a lot as we conduct the population genomics section of the course. The first step in learning how to work with these files is to use a program called VCFTools for parsing your data file into just those samples and sites of interest, and to calculate diversity stats on these.
The manual page for VCFtools is an excellent resource! The latest version is here.
Now: cd to the directory /data/project_data/snps/reads2snps/
and do an ll using the wildcard *vcf…. you should see the following vcf files in the directory:
[srkeller@pbio381 reads2snps]$ ll *vcf -rw-r--r--. 1 srkeller users 17426 Mar 6 06:18 head_SSW_bamlist.txt.vcf -rw-r--r--. 1 srkeller users 1364923952 Mar 6 02:37 SSW_bamlist.txt.vcf [srkeller@pbio381 reads2snps]$
The file you want is SSW_bamlist.txt.vcf. The other file is just the header so you can look at the formatting of the file in VIM without having to open the big file.
We're going to use VCFtools to examine the effects of different filtering strategies on the number of SNPs we get and their quality. The first step is seeing if VCFtools likes our file format, and getting some basic info on the # of SNPs and samples.
xxxxxxxxxx
$ vcftools --vcf filename.vcf
This will return some basic info that should match of general expectations of sample size.
Did it detect the correct number of individuals?
How many SNPs do we have?
During SNP calling, reads2snp applied the following fairly stringent criteria when calling SNPs:
Any SNPs that didn't meet that criteria were flagged as unres (=unresolved) and set to missing data in the vcf file. Similarly, loci that show evidence of paralogy were flagged as para.
How could we quickly find out how many SNPs were flagged as unresolved?
What about the number affected by paralogy?
Now, let's try filtering out positions that are likely to be errors in the sequencing or genotyping process. For now, let's just identify how many SNPs would pass each filter without actually changing the datafile at all. Then, we can decide what combination of filters we may want to implement.
Biallelic vs. multi-allelic SNPs: Keep only sites with 2 alleles.
xxxxxxxxxx
$ vcftools --vcf filename.vcf --min-alleles 2 --max-alleles 2
Minor allele frequency (MAF): Gets rid of very rare SNPs (based on a user-defined threshold).
xxxxxxxxxx
$ vcftools --vcf filename.vcf --maf 0.02
Missing data across individuals: Get rid of sites where fewer than 80% of our samples have data.
xxxxxxxxxx
$ vcftools --vcf filename.vcf --max-missing 0.8
Now, it's time to combine filters instead of applying them one at a time. NOTE: VCFtools processes the filter requests in the order that you give it at the command-line. This is a key point, and means that if you apply the same filters in different orders, you will likely get different results!
xxxxxxxxxx
$ vcftools --vcf filename.vcf --min-alleles 2 --max-alleles 2 --maf 0.02 --max-missing 0.8 --recode --out ~/biallelic.MAF0.02.Miss0.8
Note that I re-directed the output file to my home directory. You should do the same!
VCFtools also can provide output in the form of many useful summary stats on a vcf file. Let's look at the observed and expected heterozygosity for each SNPs and test if any violate Hardy-Weinberg equilibrium expectations: (1=p^2 + 2pq + q^2). Use the quality-filtered file we generated above as input.
$ vcftools --vcf filtered_filename.vcf --hardy
You can then bring this into R to take a look at which sites show deviation from HWE...