Making a reference transcriptome and Mapping reads to the reference transcriptome

Recall that the general RNAseq data processing work flow is to:

  1. Clean and evaluate reads (.fastq)

  2. Make and evaluate a transcriptome assembly (.fasta)

  3. Map cleaned reads to the transcriptome assembly (makes .sam files)

  4. From these sequence alignment files, we can extract two types of information: (a) read counts - the number of reads that uniqely map to each “gene” and (b) single nucleotide polymorphisms between a sample and the reference.

  5. With these two types of data, we can go on to differential gene expression analyses and population genomics.

Settling on a high quality reference transcriptome is an iterative process that requires testing different assembly parameters and inputs and evaluating the quality several ways. A quick way for us to move forward with our data analyses for this course, however, is to predict open reading frames (ORFs - include start and stop codons) and keep only transcripts that are at least 100 amino acids long. To do this we can use the program TransDecoder.

Download TransDecoder to use to predict longest Open Reading Frames (ORFs)

wget https://github.com/TransDecoder/TransDecoder/archive/v3.0.1.zip
$ cd /data/project_data/assembly/
$ /data/popgen/TransDecoder-3.0.1/TransDecoder.LongOrfs -t Trinity.fasta 


-first extracting base frequencies, we'll need them later.
CMD: /data/popgen/TransDecoder-3.0.1/util/compute_base_probs.pl Trinity.fasta 0 > Trinity.fasta.transdecoder_dir/base_freqs.dat
CMD: touch Trinity.fasta.transdecoder_dir/base_freqs.dat.ok


- extracting ORFs from transcripts.
-total transcripts to examine: 73435
[73400/73435] = 99.95% done    

#################################
### Done preparing long ORFs.  ###
##################################

    Use file: Trinity.fasta.transdecoder_dir/longest_orfs.pep  for Pfam and/or BlastP searches to enable homology-based coding region identification.

    Then, run TransDecoder.Predict for your final coding region predictions.

Evaluate the “longest_orfs.cds” assembly after running Transdecoder.

$ /data/popgen/trinityrnaseq-Trinity-v2.3.2/util/TrinityStats.pl longest_orfs.cds


################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  5693
Total trinity transcripts:  8573
Percent GC: 49.08

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 1626
    Contig N20: 1032
    Contig N30: 780
    Contig N40: 612
    Contig N50: 498

    Median contig length: 390
    Average contig: 518.78
    Total assembled bases: 4447500


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 1623
    Contig N20: 1080
    Contig N30: 828
    Contig N40: 642
    Contig N50: 528

    Median contig length: 402
    Average contig: 534.41
    Total assembled bases: 3042405

We can also evaluate this assembly by using blastp to compare it to the uniprot_swissprot database.

wget https://github.com/Trinotate/Trinotate/releases

# Run the script to download the relevant databases.
/data/popgen/Trinotate-3.0.1/admin/Build_Trinotate_Boilerplate_SQLite_db.pl  Trinotate
#!/bin/bash/
cd /data/popgen/databases/
makeblastdb -in uniprot_sprot.pep -dbtype prot -out uniprot_sprot
blastp -query /data/project_data/assembly/Trinity.fasta.transdecoder_dir/longest_orfs.pep  -db /data/popgen/databases/uniprot_sprot  -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 10 > blastp.outfmt6
TransDecoder.Predict -t target_transcripts.fasta --retain_blastp_hits blastp.outfmt6

These transcriptome assembly processes are ongoing. But for now we will map to the 5,693 “genes” based on the longest ORFs.

Options for improving this assembly include: (1) using more reads from other individuals or trying a different individual, (2) changing the cleaning and assembly parameters. We can evaluate based on the percentage of genes that have good blastp hits and the percentage of single copy orthologs included in the reference (for example using the new program BUSCO.

Map reads from individual samples to reference transcriptome

  1. Navigate to the /data/scripts/ directory to find a script called bwaaln.sh that you can
  2. copy cp to your home directory ~/scritps/ and
  3. open with vim to edit.
  4. You need to enter your “left” reads file name (for those cleaned and paired).
  5. Step through the script to make sure you understand each command.

The script looks like this:

#!/bin/bash 
 
# To run from present directory and save output: ./bwaaln.sh > output.bwaaln.txt 

myLeft='38_6-24_S_5_R1.fq.gz_left_clean_paired.fq'
echo $myLeft

myRight=${myLeft/_R1.fq.gz_left/_R2.fq.gz_right} 
echo $myRight

myShort=`echo $myLeft | cut -c1-11`
echo $myShort

# bwa index /data/project_data/assembly/longest_orfs.cds  # This only needs to be done once on the reference

bwa aln /data/project_data/assembly/longest_orfs.cds /data/project_data/fastq/cleanreads/$myLeft > $myLeft".sai"
bwa aln /data/project_data/assembly/longest_orfs.cds /data/project_data/fastq/cleanreads/$myRight > $myRight".sai"
bwa sampe -r '@RG\tID:'"$myShort"'\tSM:'"$myShort"'\tPL:Illumina' \
        -P /data/project_data/assembly/longest_orfs.cds $myLeft".sai" $myRight".sai" \
        /data/project_data/fastq/cleanreads/$myLeft \
        /data/project_data/fastq/cleanreads/$myRight > $myShort"_bwaaln.sam"

This script could also be made into a loop to map reads of all files one after another to the transcriptome. We could also tidy up file names using mv or rename.

You’ve made a Sequence AlignMent (SAM) file!

A SAM file is a tab delimited text file that stores information about the alignment of reads in a FASTQ file to a reference genome or transcriptome. For each read in a FASTQ file, there’s a line in the SAM file that includes

  • the read, aka. query, name,

  • a FLAG (number with information about mapping success and orientation and whether the read is the left or right read),

  • the reference sequence name to which the read mapped

  • the leftmost position in the reference where the read mapped

  • the mapping quality (Phred-scaled)

  • a CIGAR string that gives alignment information (how many bases Match (M), where there’s an Insertion (I) or Deletion (D))

  • an ‘=’, mate position, inferred insert size (columns 7,8,9),

  • the query sequence and Phred-scaled quality from the FASTQ file (columns 10 and 11),

  • then Lots of good information in TAGS at the end, if the read mapped, including whether it is a unique read (XT:A:U), the number of best hits (X0:i:1), the number of suboptimal hits (X1:i:0).

The left (R1) and right (R2) reads alternate through the file. SAM files usually have a header section with general information where each line starts with the ‘@’ symbol. SAM and BAM files contain the same information; SAM is human readable and BAM is in binary code and therefore has a smaller file size.

Find the official Sequence AlignMent file documentation can be found here.

This BWA man page also discusses SAM alignment format and BWA specific optional fields.

Let’s check out our .sam files! Try head and tail.

tail -n 100 YOURFILENAME.sam > tail.sam
vim tail.sam

:set nowrap