Analyzing 16s AmipliconSeq Data

For these analyses, we will be using QIIME and phyloseq (an R package). QIIME will be used to make and filter the OTU table and phyloseq will be used to visualize the data and test hypotheses. There are many, many programs to analyze 16s data, these are only a few options!

Before we begin, you should make a directory to store files related to this section of the class. For example:

Making the OTU table

OTU: Operational Taxanomic Unit. OTUs are clusters of microorganisms grouped by sequence similarity of a specific taxonomic marker gene, in this case the 16s gene.

An OTU table is a table of read counts of each OTU in each sample. The following tutorial will show you how to use QIIME to culster sequence reads into OTUs and make the OTU table.

The files needed for these steps are all of the 16s ampliconseq files (of which there are 352 total but we will be using a subset) and a mapping file which holds all of the sample metadata.

Making the mapping file

The setup of your mapping file will be crucial to how easily you can perform analyses. Your mapping metadata will be one of the most important parts of running most commands. The mapping file can always be modified as long as the SampleIDs do not change. The mapping file is meant to give meaning to the SampleIDs of the count data. The OTU table only contains the number of counts each sample has per OTU but it does not contain any information regarding treatment group or study timepoint. The sample mapping file allows us to provide as many columns needed to describe each sample in different ways.

The mapping file always includes:

‘#SampleID’ A unique identifier meant for a particular sample. It is always prefixed by a #

BarcodeSequence: The unique barcode sequence assigned to the particlar sample during library preparation. QIIME has built in capability to demultiplex sequence data which requires the barcoding sequence and linker primer sequence. The barcode and linker primer sequences have already been removed from out data so these columns will be blank

LinkerPrimerSequence: The nucleotide sequence adjacent to each barcode, meant for use during sequencing.

Description: The final column in the mapping file that can correspond to any information about a sample.

I found making the mapping file to be a little finicky. For example, it doesn’t like underscores ’_‘so those have all been changed to dashes’-’. The SampleIDs in the mapping file need to match the file names of the sequence data exactly.

The mapping file is located here:

/data/project_data/16s/map.txt

Phenotype refers is whether the sea star was sick or healthy at the time that sample was taken. The phenotype may not be the same for all samples from 1 individual.

Pheno_num refers to the quantification of symptoms at the time the sample was take (0 refers to healthy, 1-4 are increasing levels of disease, and 5 is dead).

Final_phenotype refers to the end phenotype for that individual. The final phenotype is the sample for all samples from 1 individual.

Before we use the map, we must validate it to make sure it passes all of QIIME’s quality checks.

validate_mapping_file.py -m /data/project_data/16s/map.txt -o validate_map -p -b

The -p disables the primer check and -b means the data is not barcoded.

The output of this command is a directory which contains map.html which we will look at by moving it to our zoo account.

Pairing paired end reads and quality filtering

This is paired end data so we will first pair the R1 and R2 mates.

multiple_join_paired_ends.py -i /data/project_data/16s/data_files -o ~/16s_analysis/joined --read1_indicator _R1 --read2_indicator _R2

This takes about 5 minutes

The output of this command will be one directory per sample. In each directory there will be 3 files: fastqjoin.join.fastq, fastqjoin.un1.fastq, and fastqjoin.un2.fastq. fastqjoin.join.fastq contains the joined reads while the other two contain the unmatched reads.

Now we will quality filter reads and join all of the sequence data from the individual samples into one file. The name of this command is slightly confusing (‘multiple_split_libraries_fastq.py’). That is because in the QIIME pipeline, sequence data is generally demultiplexed and quality filtered at the sample time at this step.

We first need to make sure that the names of the directories in ~/16s_analysis/joined perfectly match the SampleIDs in our mapping file. remove-underscore.sh and remove-R1.sh will help with this.

cd ~/16s_analysis/joined
bash /data/project_data/16s/remove-underscore.sh
bash /data/project_data/16s/remove-R1.sh

multiple_split_libraries_fastq.py -i ~/16s_analysis/joined -o ~/16s_analysis/filtered -m sampleid_by_file --include_input_dir_path --remove_filepath_in_name  --mapping_indicator ~/16s_analysis/map.txt

This took about 10 minutes

-m is the demultiplexing method. With the sampleid_by_file option, each fastq file (and/or directory name) will be used to generate the -sample_ids value passed to

–include_input_dir_path Include the input directory name in the output sample id name. Useful in cases where the file names are repeated in input folders

–remove_filepath_in_name Disable inclusion of the input filename in the output sample id names. Must use -include_input_dir_path if this option is enabled

See here for default parameters for quality filtering: http://qiime.org/scripts/split_libraries_fastq.html If you want to change any of these parameters, you can make a .txt file of all the changes and include it in the command with the -p flag

The output of this command is a directory containing some files, one of which we want: seqs.fna. It’s really important that the information about sample ID is incorperated correctly into the seqs.fna file!

head the file to make sure you see the correct SampleIDs.

cd ~/16s_analysis/filtered

head seqs.fna

>02-5-05_0 M02780:138:000000000-ATTB1:1:1101:21367:1238 1:N:0:GTAGCGGA+CGTCTAAT orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
CCTACGGGCGGCTGCAGTGGGGAATCTTGCACAATGGAGGAAACTCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGTGAGGAGGAATGGTTAGTAGTTAATAATTGCTTGCTGTGACGTTACTCGCAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTGATCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCAGAGCTCAACTTGGGGACTGCATTTTAA

Also, run this command on a few samples and make sure the output ‘test’ file contains some sequences

extract_seqs_by_sample_id.py -i seqs.fna -o test -s 04-5-05

We can remove the test file when we’re done

rm test

Clustering OTUs and building the OTU table

This step takes a very long time so we’ll start the command on the subset of data we’ve been using and I will provide the full OTU table for all of the samples.

QIIME has 3 OTU picking stategies: Closed reference OTU picking (picks OTUs based on sequence similarity to a referece. Won’t identify any OTUs that aren’t in the reference database), de novo OTU picking (picks OTUs only on sequence similarity to other sequences in your data. Very computationally intensive), and a combination of the two open reference OTU picking. We will use the open reference OTU picking with the pick_open_reference_otus.py command. This first uses closed reference OTU picking and then de novo OTU picking for the sequences that didn’t fall into OTUS in the closed reference picking.

See here for more details about what is happening behind the scenes during this command: http://qiime.org/scripts/pick_open_reference_otus.html

cd ~/16s_analysis

pick_open_reference_otus.py -i ~/16s_analysis/filtered/seqs.fna -o ~/16s_analysis/otus  --parallel --jobs_to_start 16

##Once we see that no errors are occuring, we will kill the command and start using the pre made OTU table

Have a look at the files produces by the pick_open_reference_otus.py. In the interest of time, we won’t go over what all of these files are but the above links gives a great description of what they are and how they were made. We just want to make sure there are the correct number of samples in the biom file and get a sense of the number of reads per sample.

cd /data/project_data/16s/otu_table

ll

biom summarize-table -i /data/project_data/16s/otu_table/otu_table_mc2_w_tax_no_pynast_failures.biom

Filtering the OTU table

16s datasets generally contain a lot of noise from chimeric sequences and very low abundance OTUs.

Chimera filtering

Chimeric sequences are hybrid products between multiple parent sequences that can be falsely interpreted as novel organisms. The method for removing chimeras built into QIMME is ChimeraSlayer which is no longer recommended for use. Instead we will use USEARCH/VSEARCH. USEARCH is is a program that requires a pay-for licesne but VSEARCH is an open source free alternative so we’re going to use VSEARCH (website: https://github.com/torognes/vsearch).

We will follow this general pipeline:

use vchime_ref to remove potential chimeric sequences from my rep_set.fna
Remove OTUs matching those identified as chimeric from OTU table
Remove OTUs from rep_set_aligned and make phylogenetic tree

1. Identify chimeric OTUs

vsearch --uchime_ref /data/project_data/16s/otu_table/rep_set.fna --chimeras ~/16s_analysis/mc2_w_tax_no_pynast_failures_chimeras.fasta --db /usr/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta

2. Remove these chimeric OTUs from OTU table

filter_otus_from_otu_table.py -i /data/project_data/16s/otu_table/otu_table_mc2_w_tax_no_pynast_failures.biom -o otu_table_mc2_w_tax_no_pynast_failures_no_chimeras.biom -e ~/16s_analysis/mc2_w_tax_no_pynast_failures_chimeras.fasta

3. Remove these chimeric OTUs from rep_set_aligned and re-make the phylogenetic tree

filter_fasta.py -f /data/project_data/16s/otu_table/pynast_aligned_seqs/rep_set_aligned_pfiltered.fasta -o ~/16s_analysis/rep_set_aligned_pfiltered_no_chimeras.fasta -a ~/16s_analysis/mc2_w_tax_no_pynast_failures_chimeras.fasta -n

make_phylogeny.py -i ~/16s_analysis/rep_set_aligned_pfiltered_no_chimeras.fasta -o ~/16s_analysis/rep_set_no_chimeras.tre

This command takes a while

Frequency filtering

Now we will get rid of low frequency OTUs. We will get rid of any OTU with fewer than 50 total counts across all samples (–min_count 50) as well as any OTU that is in fewer than 25% of samples (–min_samples 44)

filter_otus_from_otu_table.py -i otu_table_mc2_w_tax_no_pynast_failures_no_chimeras.biom -o otu_table_mc2_w_tax_no_pynast_failures_no_chimeras_frequency_filtered.biom --min_count 50 --min_samples 44

##How many OTUs are left?
biom summarize-table -i otu_table_mc2_w_tax_no_pynast_failures_no_chimeras_frequency_filtered.biom

Core diversity analyses in QIIME

core_diversity_analyses.py -o core_diversity_filtered -i otu_table_mc2_w_tax_no_pynast_failures_no_chimeras_frequency_filtered.biom -m ~/Po/MiSeq/joined/map.txt -t rep_set_no_chimeras.tre -e 20000 -a 8

This command also takes a long time so we’ll just look at the output online: http://www.uvm.edu/~mlloyd/mc2_w_tax_no_pynast_failures_no_chimeras_frequency_filtered_core_diversity/

As we start to visualize the data and test hypotheses, it’s important to think about how the data is normalized. As with all sequencing data, there will be more reads for some samples than others due to technical aspects of the sequencing. Generally, the options for normalizing are to rarefy the data (randomly select the same number of reads from each sample), transform count numbers to proportional abundances, or normalize the count numbers by the total size of that sample’s library. In the above command, the ‘-e 20000’ indicates that the data will be rarefied to 20,000 counts per sample.

The appropriate normalization method might be different for different tests or analyses, so just be aware of which one is being used.

Alpha Diversity

Are there more OTUs observed in different groups? Species richness in the microbial community

Beta Diversity

Are there different OTUs observed in different groups?

Unweighted vs weighted: Quantitative measures (e.g. weighted unifrac) are ideally suited to revealing community differences that are due to changes in relative taxon abundance (e.g., when a particular set of taxa flourish because a limiting nutrient source becomes abundant). Qualitative measures (e.g. unweighted unifrac) are most informative when communities differ primarily by what can live in them (e.g., at high temperatures), in part because abundance information can obscure significant patterns of variation in which taxa are present.

The PCA plots show there may be a difference between sick/healthy individuals. How can we test that?

compare_categories.py http://qiime.org/scripts/compare_categories.html

To run this, you will need the distance matrix from the core_diversity_analysis. It is in /data/project_data/16s/core_diversity_analysis

compare_categories.py --method adonis -i /data/project_data/16s/core_diversity_analysis/weighted_unifrac_dm.txt -m ~/data/project_data/16s/map.txt -c Fianl_phenotype -o adonis_out.txt -n 999

Change out the –method and see if your results change.

And now we want to know which OTUs are different between groups. For this question, we need Phyloseq…