2017-04-05_annotation

Functional annotation of de novo transcriptome assemblies

What are the genes (or proteins) that we are studying?! You are probably particularly interested in the genes that we have been identifying as differentially expressed or as FST outliers. Rather than using the web-based NCBI blast to look up genes one-by-one, we can annotate our assembly. This will help us evaluate the quality of the assembly and associate genes with protein functions for interesting downstream enrichment analyses.

When working with a non-model or “emerging model” organism, one needs to annotate predicted genes based on homology to other organisms. The annotation of de novo assemblies relies of alignment of genes (or predicted amino acid sequences) to currated databases. With 10,000s of transcripts, we cannot do a batch submission to the NCBI BLAST website. Instead we can install the BLAST+ programs on our server, download and format the NR and UniProt databases, and run BLAST on our server!

NR is a protein database compiled by the NCBI (National Center for Biotechnology Information) that includes non-identical sequences from GenBank CDS translations, Swiss-Prot, and other databases. It is exhaustive; and if downloading the pre-formatted files, includes the species-level taxonomy IDs. The database is broken up into numbered volumes as it is ever expanding. It can be downloaded from the ncbi ftp site ftp://ftp.ncbi.nlm.nih.gov/blast/db/.

UniProt is another protein database that is highly currated and combines the Swiss-Prot, TrEMBL and PIR-PSD databases. Each protein has a unique uniprot ID that is associated with a wealth of other information, including Gene Ontology GO categories. See the documentation for more information http://www.uniprot.org/help/uniref. The various databases can be downloaded from here using ftp or wget http://www.ebi.ac.uk/uniprot/database/download.html.

And here is where you can find the blast command line programs. Again, you DO NOT need to download or install these databases or programs on the server!

Once we have the databases downloaded and unpacked, they need to formatted or indexed to facilitate the BLAST. We downloaded the pre-formatted NR database, but for the uniref90 we needed to run the following command.

$ makeblastdb -in uniprot_sprot.pep -dbtype prot

Note: We only need to format a database (i.e., run the above command) once.

Why might we want to BLAST to different databases?

There are two ways we can BLAST our transcripts to a protein database, (1) We can translate it to amino acid sequence ahead of time and using blastp (blastp: protein query to protein subject).

(2) Use blastx to translate our nucleotide sequence in all 6 possible frames and align each (blastx: translated nucleotide query to protein subject).

Below is an example script that uses blastp to align to the uniprot database. Notice the inputs needed (-query, -db) and the options available (-evalue, -max_target_seqs). This and other example scripts can be found in the /data/scripts/ directory.

#!/bin/bash

# This single line using the blastp command below will compare your transcript fasta file
# (-query) to the already formatted uniref90 database (-db).
# You can enter 'blastp --help' for a list of the parameters.
# We choose the tab-delimited output format (6) and to only help the top hit (-max_target_seqs)
# and only if it has a minimum evalue of 0.00001.

blastp -query /data/project_data/assembly/08-11-35-36_cl20_longest_orfs_gene.cds \
       -db /data/project_data/assembly/database/uniref90/uniprot_sprot.pep \
       -out /data/project_data/assembly/blast/blastp_vs_uniprot.outfmt6 \
       -outfmt 6 \
       -evalue 1e-5 \
       -max_target_seqs 1

Making an annotation table

Once we have our blast results to uniprot and nr, we can assemble an annotation table using the merge function in R.

At the path below you can find a merged annotation table that combines the blast results from nr and uniprot and the associated GO and taxa IDs.

/data/project_data/enrichment/poch_uniprot_GO_nr.txt

Functional enrichment analysis

Functional enrichment analyses test for enrichment (high values) of a particular quantitative metric (e.g. FST or p-value for DGE) in classes of genes that serve specific functional roles (e.g. immune response to virus, amino acid metabolism, photoperiod sensory system, etc.). A great way to do them is to use the whole distribution of whatever metric of interest, rather than a cutoff. The Dixon et al, coral paper we read developed an approach to test for significant enrichment using the nonparametric Mann-Whitney U test.

We will do that with any part of our results (FST or p-value for DGE) today!

Move all the files from the directory below into a directory on your computer using scp, Winscp, Fetch, etc.

/data/project_data/enrichment/*

For your future reference, here's a link to the github page where Dr. Matz' makes the code for running this functional enrichment test publically available! You don't need to redownload anything - it's all in the directory above, including the databases and the annotation table that I made with the blast results.

(1) Let's work through the GO_MWU_class.R script together. Then...

(2) Come up with your own metric to test for functional enrichment (FST, etc).

NOTE: The program is very specific about file formatting. I suggest opening any results file in TextWrangler and checking that the save as is in unix format… see my note in the script.