I am the teaching assistant, but I will follow the tutorials and log what I've done here.
Science should be reproducible and one of the best ways to achieve this is by logging research activities in a notebook. Because science/biology has increasingly become computational, it is easier to document computational projects in an electronic form, which can be shared online through Github.
This work is licensed under a Creative Commons Attribution 4.0 International License
.
Steve: It is a young field, trying to establish it's own identity
DATA driven: next gen sequencing revolutionizes biology
Where is the field headed
Molecular Ecology Journal(flagship journal representative o the field)
What types of questions are asked?
Methods?
De novo genome assembly; sequencing a DNA book from scratch!!
16 s metagenomic sequencing
Rad-seq/GBS for estiamting population structure and genetic diversity
Proccesses studied?
Goals of the course!
Background, what drove Melissa and Steve to ecological genomics?
Melissa read a cool paper that scales from analyzing a few loci to the whole genome.
One figure popped out at her, FST (developed by Sewell Wright) histogram. FST of 1= complete differentiation, FST of 0 = no diff. FST described as Alleles in space. From this histogram, Melissa was struck by how you can separate out neutral from selective ones.
Melissa has a data set with 96 sea stars and then the 16s microbiome. Would be cool to see if there is heritability in some bactera
Inspired by Yanis Antonivics (an OG)
At the time, just so stories: Adaptatationist programme
Yanis was on Steve's committee and Steve was interested in adaptation with respect to invasion biology because organisms need to respond to novel environments
Steve thinks about environment-phenotype-genetics triangle. Basically a path diagram that feeds back on each other.
Invasion history is tough because of demographic history
He decided to focus on trees; large population size, straddle huge environmental gradients so the opportunity for selection is high
First, showing how I structured the readings in the repo:
cd Teaching/2017_Ecological_Genomics/
For fun, list the items in your repo
ls 01_data Online_notebook.md 02_scripts README.md 03_results_output index.Rmd 04_tutorials index.html 2017_Ecological_Genomics.Rproj index.pdf ANBE_notebook.md papers
tree . ├── 01_data ├── 02_scripts │ └── RasterPCA_demo.Rmd ├── 03_results_output │ └── RasterPCA_demo.html ├── 04_tutorials ├── 2017_Ecological_Genomics.Rproj ├── ANBE_notebook.md ├── Online_notebook.md ├── README.md ├── index.Rmd ├── index.html ├── index.pdf └── papers └── 01_week └── 01_Day_Monday_2017-01-25 ├── Ellegren_2014.pdf ├── Lee_gould_stinchcombe_2014_AoB_PLANTS.pdf └── Rockman-2012-Evolution.pdf
So my plan is to place my papers in the "papers" directory, within a particular week, and then within a particular day number of the course that is annotated with the actual day and date.
Genome Sequencing and population genomics in non-model organisms ; Hans Ellegren
3 important achievements in bio over the past century: modern synthesis(evolutionary theory), mol bio, and "omics" era
Basically, because we can sequence everything we can collect, we can study non-model organisms like never before.
In fact, many genomes are sequenced
Bird example: Chickens were sequenced first, then zebra finch, then a bunch of others, highlighting the rapid ability to sequence genomes
Genomic information is nice because it can tell you how loci are arranged. Example, recombination can be better studied by knowing the arrangement of chromosomes.
Having genomic sequences allows for us to compare the repertoire of genes and the actual sequences themselves. (birth and death evolutionary procceses in homologues; tests for positive selection)
Identifying the genes underlying quantitative traits: a rationale for the QTP programme; Lee et al. 2014
QTN = quantitative trait nucleotide; associating single nucleotides with quantitative traits
Critical question: What is the molecular basis for adaptation? particularly in non-model organisms
Response to Rockman 2012 paper
Rockman's major criticisms: Effect of nucleotide on traits can be overestimated
Travisano and Shaw 2013 raise the criticism that you don't need QTN's to focus on the patterns and process of adaptation
5 reasons why we should care about QTNs
-Papers- info updates, and discussion papers; students need to sign up
Add definitions as you see it
Info update rubric
Reads: a length of sequenced DNA
* short = 50 bps; long = 100, 150, 300, 10,000-60,000 bps * could be single or pair-end ( 1 strand or both strands)
Good example was the human genome project:
Then, Illumina released Hi seq X Ten sequencer in 2014
Illumina sequencing:
Similar to sanger but fluorescence can be observed across a whole "field".
Step 1: Goal-oriented; which platform to use?
where genetic variation is
demographic history?
adaptive gentic variation
gene expression variation under common garden conditions?
Trade-offs in doing the actual sequencing
length of reads
how read are distributed along the whole genome
Actual workflow now
Extract DNA or RNA(this needs to be converted into cDNA)
Fragment sample (break it into smaller junks)
Ligate adapters on the ends
Add on sequencing adaptors
PCR amplification
DNA with different adapters
get loaded into 1 of 8 lanes in a flowcell
flowcell have oligos that match the sequencing adaptors; so DNA attaches to it! (P5 and P7)
Bridge amplification: bend over and amplify, back and forth, over and over
Incorporates labeled A T G or C, and then images are taken across the whole flowcell
Pac-bio's platform: Single molecule real time sequencing (SMRT)
1 week from today, do a blitz on the different library preps (4 different ones)
2 volunteers to discuss QTN programme.
Each student, 1 discussion leader and 1 info update!
Tasks
One of my duties is to populate the glossary on blackboard. (I'll do it here for redundancy too.)
Restructured course layout:
. ├── 01_data ├── 02_scripts │ └── RasterPCA_demo.Rmd ├── 03_results_output │ └── RasterPCA_demo.html ├── 04_tutorials ├── 2017_Ecological_Genomics.Rproj ├── ANBE_notebook.md ├── Online_notebook.md ├── README.md ├── index.Rmd ├── index.html ├── index.pdf └── papers └── 01_week ├── Ellegren_2014.pdf ├── Lee_gould_stinchcombe_2014_AoB_PLANTS.pdf └── Rockman-2012-Evolution.pdf 6 directories, 12 files
It better reflects how the reading is done throughout the week.
Announcements:
Info update ~ QTN
Debate-style discussion
BREAK
Group project discussion!
Q: What are QTN's (Quantitative Trait Nucleotide)?
Quantitative genetic theory of adaptive traits
QTN is the most reductionist you can get. The individual SNPs that contribute to the variation in a trait. Usually traits under selection and confer adaptation.
Traits!
1. flowering time 2. flower color 3. thermal tolerance 4. venom potency 5. defense or secondary compounds in plants 6. toxin tolerance 7. drought tolerance 8. altitude tolerance (hypoxia)
They are quantitative, and are continuous. They have a mean and variance. Not descrete phenotypes. Discrete phenotypes are mendelian (major effect loci) .
Modern synthesis (Fischer, Haldane, WRight). Connect alleles with trait distribution.
ex: allele:trait
AA = 1
Aa = 2
aa = 3
The difference between Aa and aa = average effects or alpha (1).
AA = q^2
Aa = 2pq
aa= p^2
Additive genetic variance (Va) = sum of alpha * pi and qi. pi and qi
Vp = phenotypic variance and the Va is only a fraction. So the heritability is Va/Vp.
What are the genes that explain this heritability? Alpha is the effect size of the QTN
Most populations are close to their adaptive peak. Imagine fitness as a fucntion of a trait, normal distribution. Max fitness = intermediate trait is local adaptation. Change in environment will change the fitness peak.
A mutation is random, and it can move the fitness of the population up or down. This is usally a small effect mutation. But if you have a large effect mutation, then changes are, you will move down, maintain, or up, but if it pushes too far, it'll go down. It is must more effecient for the population to evolve many small effect loci. How do we detect this?
3 main methods
QTL mapping (forward genetics)
GWAS (Genome-wide association studies) (forward genetics)
Let nature do the crosses, and sample a bunch of individuals
genotype them,
Nature has been doing a similar experiment as in the QTL mapping but there are many more parents
Model: Y = u + Bi x SNPi + Covariates
You get a manhattan plot, plotting -log(pvalue) against position
Selection Scans (reverse genetics)
don't know trait, just zoom in on parts of the genome that have history of selection
Sample a bunch of individuals, with a bunch of SNPs
Selective sweep, some rise in frequency and others are not
Over T generations, it could fix—DECREASING diversity
Additional notes: Mutations before selection are equal across loci and their effect sizes. So after selection, the frequency of effect sizes are concentrated on the small values(small effect sizes) and very little at the large effect size.
2 separate groups discussions first.
Dr. Brody leading Rockman paper:
3 main arguments
Speciation example: Doug Schemsky
They speciated because they got different pollination syndromes (changed from yellow to red). It acted like a mendelian trait. Is it important in knowning speciation? Is it common? Speciation can be neutral just by building up different genetic interactions that leads to reproductive isolation.
Whole group Discussion
Melissa's dataset
Seastar wasting disease: kills stuff, PATHOGEN UNKNOWN
Some species are resistant and some rae susceptible
it is a generalist
A couple days or even hours that a host goes from healthy to sick
Sickness comes in the form of losing arms, lose turgor pressure, they become droopy
Potentially Densovius is causal (maybe)
What factors affect the tipping point?
Sampled epidermal biopsy
Hypotheses
Theme:
What is the genetic difference related to susceptibility
What is the genetic basis for disease resistance?
How does the microbiome contribute to seastar wasting?
microbiome differ among disease/symptoms
Is there heritable genetic variation in the microbiome?
Course notes for next week:
put picture of all of our hypotheses on the screen
4 corners of the room, put interests and groups will assemble
give half of class to think through and other half to present ideas
Make students formulate questions, hypotheses, predictions, what samples
Technical questions: construct assembly together? separate? Give students a canonical reference (transcriptome). let students investigate if they want to separate them.
Melissa and Melanie will do the assembly on the side.
AN will be added as adminstrator access (permission changes and stuff); software installed,
Wednesday, hands on logging in on the server, Unix command lines
Look at wiki to see what we can add or subtract in terms how what commands to show
Set up new webpage for course:
. ├── 04_tutorials ├── 2017_Ecological_Genomics.Rproj ├── ANBE_notebook.html ├── ANBE_notebook.md ├── Class_members.Rmd ├── Class_members.html ├── Getting_started.Rmd ├── Getting_started.html ├── Instructors.Rmd ├── Instructors.html ├── Online_notebook.md ├── PBIO.BIO381_Spring2017_syllabus.pdf ├── README.md ├── Tutorials.Rmd ├── Tutorials.html ├── index.Rmd └── index.html 1 directory, 16 files
Melissa gave handoout for a general formula. This may help articulate and motivate and objectives of the eco genom project. this approach can be applied to your other research projects, papers, or grants!
Steve will be writing up projects on the board:
Immune-related gene expression
reverse pathology
looking specific classes of genes
a priori tests for resistance genes
Group(Name = Sherlock): Erin, Sam, Alex, Lauren, Dr. Brody, and
Interested in reverse pathology
What type of pathogens are causing immune response differences?
Viral specific, fungal specific, or bacterial specific
Focus on the transitions: HS, HH, SS
blocking based on time
what is the workflow?
Back up Q: comapre responses to another species.
Use random group of genes to use as housekeeping and control
make heat map or venn diagram of differential gene expression
Intertidal vs subtidal
Genetic differences in susceptibility
local adaptation?
gene expression differences
Group discussion(Name=): Lauren, Laura, Kattia, Dr. K
subtidal more susceptible than intertidal. What differences contribute to that?
Focus on GXP expression between the two groups: ID, genes that are more diff expressed and do a functional enrichment analysis (immune vs general stress response; tease apart handling and susceptibility)
Focus on community structure of micro biome between the two groups: subtotal = more diversity? Equals more resilient to pathogen? Specific taxa associated with disease?
Temporal Variation
chagnes in gene expression through time
changes in microbiome through time
temporal differences in H vs S
Group discussion (Name=):
change of gxp through time: compare HH vs HS (that made it to day 15)
What is causing sickness? and how organisms are affected by it?(this is focus)
stability of gxp for HH vs HS; House keeping genes?
Which ones are varying over time?
functional enrichment for varying or stable through time
Heritability of microbiome
Comparison within the intertidal group
genetic differences ( 3 gropus of individual)
delta in microbiome
Group discussion (Name=Rising starfish ):
Intertidal group only: control for the handling stress
Want to do is access genomic differences (SNP data) between individuals that stayed healthy. HH, vs Sick
Genetic basis with respect to susceptibilty
Look at micro biome: microbiome composition of Healthy vs Sick
Microbiome changes across each time point for all of the individuals
one specific group of microbes that are found in healthy that are lacking in the sick(indicator taxa)
Thinking about all time points(or we can start first day)
How many OTUs? > 100s
Rare taxa might drop out?
Finding diffs
As a group, write up 1 page proposal following the guideline above. Integrate feedback. Due next Monday. We will give additional and formal feedback in the form of writing.
Email, MS word
List all the exact sample names too
For wednesday:
First half- We're going to do a bioblitz of library prep types. Learn more about how libraries are made and what data they produced: WGS, RNA-seq, GBS(rad-seq), Amplicon-seq. No discussion, just 4 INFO UPDATES(20 minutes each).
Second half- logging into the server. Unix command line stuff. Bring your computer. Have software installed (Putty in Windows). File transfer client
Logging into the cluster:
ssh adnguyen@pbio381.uvm.edu
Going in to the root:
[adnguyen@pbio381 ~]$ cd / [adnguyen@pbio381 /]$ ls autorelabel boot dev home lib64 mnt opt root sbin sys users var bin data etc lib media nsr proc run srv tmp usr
Stuff is stored on the data/ directory. Let's look inside
[adnguyen@pbio381 /]$ cd data/ [adnguyen@pbio381 data]$ ls databases packages project_data temp_data example_data popgen scripts users
Notes: Stuff that needs to be installed, goes into the popgen folder
We need to figure out how to do this.
Info-update Blitz
Unix tutorial
Whole genome sequencing
applications
high power and high resolution for pop gen, effective pop size, genetic relatedness, inbreeding, admixture events, conservations(monitoring or control breeding)
new ones too! screen for adaptive potential, inbreeding depression, impacts of genetic variation, plastic responses
Needs $, computational expertise (cluster, command line, python/perl)
Limitations
Prior expectations
Reference genome?
Methods
Platforms:
Knowing your organism!
Genome size! (K-mer approach: unique element of DNA seq length)
Know repetitive content and error rates of sequencing
Degree of Genome duplication
More methods:
Labby stuff
high quality, avoid energetic active tissue (mitochondria may mess up depth)
avoid gut and skin
Quantity: 1 -6 ug(short)
Library prep
Computational stuff
RNA-seq
Advantages:
Limitations
Workflow
Considerations:
prot coding or regulatory non coding?
ref genome?
alt splicing?
tech?
population or treatment specific
Choice of tissue type
Biological replicates!?
Wet lab
Sequencing platforms
Computational
** Amplicon-seq
** (Hannah): 16s rRNA as example - IDs prokaryotes such as bacteria (18s r RNA, fungi, protozoa)
Methods
Library prep
sequencing
data analysis
Applications
Workflow
Applications:
Glossary
Amplicon-seq: targeted approach for analyzing genetic variation in a specifc genomic region
Amplicon: Targeted gene (region) to abe amplified via pca with specific primers
There is a continuum in how much sequence you can get (completeness of samples):
Trade off between sampling and completeness.
GBS falls between RNA-seq and Amplicon-Seq.
Good for population geneticists
Goals:
Genotyping-by-sequencing(GBS) or Restriction Assisted DNA -seqencing (RAD-seq)
Cut up with restriction enzymes,
Sequence with single ends
Cluster (Shared resource):
Logging into the server:
[adnguyen@pbio381 ~]$ pwd
/users/a/d/adnguyen
cp /data/project_data/ssw_samples.txt .
Get the tar and zip file here: https://github.com/trinityrnaseq/trinityrnaseq/releases;
Working directory: /data/popgen
Used "wget"
wget "https://github.com/trinityrnaseq/trinityrnaseq/archive/Trinity-v2.3.2.tar.gz"
gunzip Trinity-v2.3.2.tar.gz tar -xvf Trinity-v2.3.2.tar
cd'd into triniy directory and use "make"
make plugins
Check to see if it works
cd sample_data/test_Trinity_Assembly/ ./runMe.sh
Pre-overall workflow:
Processing Workflow:
Title: Divergent transcriptional responses to low temperature among populations of alpine and lowland species of New Zealand stick insects (Micrarchus).
Shared response to cold shock? Stick insects in new zealand. They all responded to cold shock differently through differential gene expression (~ 2000 unique genes).
Hypothesis: We hypothesize that species with poor dispersal ability are likely to have strong phylogeographic structure as a result of genetic drift and, possibly, local adaptation.
They're poor dispersers, so that you'd expect for them to be locally adapted, particularly to the environment.
Approach:
What is phred quality? Quality score on the propbability of the error. >30 chance it is "right".
What are unigenes? Trinity components containing clusters of ‘contigs’ representing splice variants of the same locus.
CD to path
cd project_data/fastq/
Files:
ls
07_5-08_S_1_R1.fq.gz 27_5-11_H_0_R2.fq.gz
07_5-08_S_1_R2.fq.gz 27_5-14_H_0_R1.fq.gz
07_5-11_S_4_R1.fq.gz 27_5-14_H_0_R2.fq.gz
07_5-11_S_4_R2.fq.gz 27_5-20_H_0_R1.fq.gz
08_5-11_S_1_R1.fq.gz 27_5-20_H_0_R2.fq.gz
08_5-11_S_1_R2.fq.gz 28_5-08_S_1_R1.fq.gz
08_5-14_S_1_R1.fq.gz 28_5-08_S_1_R2.fq.gz
08_5-14_S_1_R2.fq.gz 28_5-11_S_1_R1.fq.gz
09_5-08_H_0_R1.fq.gz 28_5-11_S_1_R2.fq.gz
09_5-08_H_0_R2.fq.gz 28_5-17_S_2_R1.fq.gz
09_5-14_S_2_R1.fq.gz 28_5-17_S_2_R2.fq.gz
09_5-14_S_2_R2.fq.gz 29_5-08_S_2_R1.fq.gz
10_5-08_H_0_R1.fq.gz 29_5-08_S_2_R2.fq.gz
10_5-08_H_0_R2.fq.gz 29_5-11_S_2_R1.fq.gz
10_5-11_H_0_R1.fq.gz 29_5-11_S_2_R2.fq.gz
10_5-11_H_0_R2.fq.gz 29_5-14_S_2_R1.fq.gz
10_5-20_S_2_R1.fq.gz 29_5-14_S_2_R2.fq.gz
10_5-20_S_2_R2.fq.gz 31_6-21_H_0_R1.fq.gz
15_5-17_S_3_R1.fq.gz 31_6-21_H_0_R2.fq.gz
15_5-17_S_3_R2.fq.gz 32_6-15_H_0_R1.fq.gz
19_5-11_H_0_R1.fq.gz 32_6-15_H_0_R2.fq.gz
19_5-11_H_0_R2.fq.gz 32_6-18_H_0_R1.fq.gz
19_5-17_H_0_R1.fq.gz 32_6-18_H_0_R2.fq.gz
19_5-17_H_0_R2.fq.gz 32_6-21_H_0_R1.fq.gz
19_5-20_S_5_R1.fq.gz 32_6-21_H_0_R2.fq.gz
19_5-20_S_5_R2.fq.gz 33_6-12_H_0_R1.fq.gz
20_5-14_H_0_R1.fq.gz 33_6-12_H_0_R2.fq.gz
20_5-14_H_0_R2.fq.gz 34_6-12_H_0_R1.fq.gz
22_5-08_S_1_R1.fq.gz 34_6-12_H_0_R2.fq.gz
22_5-08_S_1_R2.fq.gz 34_6-18_H_0_R1.fq.gz
22_5-11_S_1_R1.fq.gz 34_6-18_H_0_R2.fq.gz
22_5-11_S_1_R2.fq.gz 35_6-15_H_0_R1.fq.gz
23_5-17_S_2_R1.fq.gz 35_6-15_H_0_R2.fq.gz
23_5-17_S_2_R2.fq.gz 35_6-18_H_0_R1.fq.gz
24_5-08_H_0_R1.fq.gz 35_6-18_H_0_R2.fq.gz
24_5-08_H_0_R2.fq.gz 36_6-12_S_1_R1.fq.gz
24_5-14_H_0_R1.fq.gz 36_6-12_S_1_R2.fq.gz
24_5-14_H_0_R2.fq.gz 36_6-15_S_2_R1.fq.gz
24_5-17_H_0_R1.fq.gz 36_6-15_S_2_R2.fq.gz
24_5-17_H_0_R2.fq.gz 36_6-18_S_3_R1.fq.gz
24_5-20_H_0_R1.fq.gz 36_6-18_S_3_R2.fq.gz
24_5-20_H_0_R2.fq.gz 38_6-18_S_2_R1.fq.gz
26_5-08_S_2_R1.fq.gz 38_6-18_S_2_R2.fq.gz
26_5-08_S_2_R2.fq.gz 38_6-21_H_0_R1.fq.gz
26_5-11_S_3_R1.fq.gz 38_6-21_H_0_R2.fq.gz
26_5-11_S_3_R2.fq.gz 38_6-24_S_5_R1.fq.gz
27_5-08_H_0_R1.fq.gz 38_6-24_S_5_R2_fastqc.html
27_5-08_H_0_R2.fq.gz 38_6-24_S_5_R2_fastqc.zip
27_5-11_H_0_R1.fq.gz 38_6-24_S_5_R2.fq.gz
Each class member will do R1 and R2(left and right reads)
number relates to scale of symptoms
Samples I'm doing!
20_5-14_H_0_R1.fq.gz & 205-14_H_0_R2.fq.gz
look at our files:
zcat 20_5-14_H_0_R1.fq.gz | head
Output:
@J00160:63:HHHT2BBXX:1:1101:27823:1244 1:N:0:TCCGGAGA+AGGCTATA
GNGCGTTATTATATGGTTTTATCTTCATTTNTTAAATGAACTTGATCTTGAATTTTTTTTTTTTTTTTTTTGGGGGATCGGAAGAGCACACGTNTGAACTC
+
A#AFFAFJJJJJJJJJJJJJJJJJJJJJJJ#JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAAFFF-A<JFAJF<JJ-FJJJF#JF-J77A
@J00160:63:HHHT2BBXX:1:1101:28635:1244 1:N:0:TCCGGAGA+AGGCTATA
ANTGAGTAGAAGGAATCGGTCCACCATAAANAAGTGGAGGTTCCACATGGGCAAAGATGCCGGTACCATTCTTAACACTAGAAGAAGGAGCTTTTTCACTA
+
A#AFFJJJJJJJJJJJJJJJJJJJJJJJJJ#JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJJAFJJJJJF
@J00160:63:HHHT2BBXX:1:1101:29244:1244 1:N:0:TCCGGAGA+AGGCTATA
GAGGCACTCATACAGGTTACACAGCTGAGANTAATTTATATCATATACTATAATGCATAATACATGTAAGCATCTCTATTGCTACATTGCCTGGTTATACA
This is a fastq file
move into my directory and save the scripts into the scripts folder
cd scripts/
cp /data/scripts/trim_example.sh .
#!/bin/bash java -classpath /data/popgen/Trimmomatic-0.33/trimmomatic-0.33.jar org.usadellab.trimmomatic.TrimmomaticPE \ -threads 1 \ -phred33 \ /data/project_data/fastq/20_5-14_H_0_R1.fq.gz \ /data/project_data/fastq/20_5-14_H_0_R2.fq.gz \ /data/project_data/fastq/cleanreads/"20_5-14_H_0_R1_clean_paired.fa" \ /data/project_data/fastq/cleanreads/"20_5-14_H_0_R1_clean_unpaired.fa" \ /data/project_data/fastq/cleanreads/"20_5-14_H_0_R2_clean_paired.fa" \ /data/project_data/fastq/cleanreads/"20_5-14_H_0_R2_clean_unpaired.fa" \ ILLUMINACLIP:/data/popgen/Trimmomatic-0.33/adapters/TruSeq3-PE.fa:2:30:10 \ LEADING:28 \ TRAILING:28 \ SLIDINGWINDOW:6:28 \ HEADCROP:9 \ MINLEN:35 \
./trim_example.sh &
Output
Input Read Pairs: 13876156 Both Surviving: 11211956 (80.80%) Forward Only Surviving: 2046413 (14.75%) Reverse Only Surviving: 257037 (1.85%) Dropped: 360750 (2.60%) TrimmomaticPE: Completed successfully
Students may benefit from instructor led demonstration of implementing an electronic notebook.
Give more clear instructions for how journal club leaders should lead a paper.
Transcriptomics take 2:
Intro
studying relationship between organism and environment
adding genomics you can look at this at a finer scale
observe rapid responses to the environment (gene expression)
advantages: can study wild systems (non-models; or non-traditional)
find silent genes that get expressed under novel stimuli
variation between individuals, within and among populations,
novel transcripts without homologues in closely related organisms
Main methods:
Breif overview
Main questions
How much variation is there in gene expression and how is it structured? (288 studies)
Heritable variation
Epigenetics
Wild populations
Qst-Fst comparisons
eQTL
Macroevolution
How do environmental stimuli affect gene expression? ( 136)
abiotic stress
environmental heterogeneity
host-parasite interactions
selective abiotic and biotic interactions
molecular basis of response
drawbacks : need to flash freeze and the transcriptome is just a snapshot
time course analyses
how does gene expression affect phenotype? (15 studies)
alternative phenotypes
move from correlation to causation
future directions
combined microarrays and rna-seq
Database for proposed ecologicla variations
Problems:
Glossary:
Background: Desert drosophila that grows in arid conditions. It has 2 hosts. The two cactus differ in alkyloids.
Q: How does environment influence gene expression?
Learning goals (skills):
scripts
paths
moving through the directories
moving files
Practical:
Moved the trimmomatic files into this path:
/users/a/d/adnguyen/mydata/2017-02-08_cleanread
Ok now, run fastqc to check files
fastqc 20*
What do I get?
[adnguyen@pbio381 2017-02-08_cleanreads]$ ls 20_5-14_H_0_R1_clean_paired.fa 20_5-14_H_0_R1_clean_paired.fa_fastqc.html 20_5-14_H_0_R1_clean_paired.fa_fastqc.zip 20_5-14_H_0_R1_clean_unpaired.fa 20_5-14_H_0_R1_clean_unpaired.fa_fastqc.html 20_5-14_H_0_R1_clean_unpaired.fa_fastqc.zip 20_5-14_H_0_R2_clean_paired.fa 20_5-14_H_0_R2_clean_paired.fa_fastqc.html 20_5-14_H_0_R2_clean_paired.fa_fastqc.zip 20_5-14_H_0_R2_clean_unpaired.fa 20_5-14_H_0_R2_clean_unpaired.fa_fastqc.html 20_5-14_H_0_R2_clean_unpaired.fa_fastqc.zip
Now, I'll move it back to my computer: So in my home computer, this is my working directory.
/Users/andrewnguyen
Now we can move files from server to my home computer
scp adnguyen@pbio381.uvm.edu:~/mydata/2017-02-08_cleanreads/*.html .
This is what i get
andrewnguyen$ ls 20_5-14_H_0_R1_clean_paired.fa_fastqc.html 20_5-14_H_0_R1_clean_unpaired.fa_fastqc.html 20_5-14_H_0_R2_clean_paired.fa_fastqc.html 20_5-14_H_0_R2_clean_unpaired.fa_fastqc.html Applications Desktop Documents Downloads Dropbox Google Drive Library Movies Music Pictures Public Sites Teaching bower_components node_modules package.json zScience
When doing an assembly. You can concatenate reads and then run fastqc.
What can influence it?
instructor meeting:
There is a catch up day. We're ~ half a day behind.
Melissa make better quality assembly
Melissa (for Monday): finish cleaning all the samples and then make a good reference
Health and sick individuals; 50 million reads
BWA-mem; short mapping (by monday)
for transcriptomics, they can extract read counts ffrom their sam files (Wednesday). (this involves a python script)
reference:
Zhao X, Bergland AO, Behrman EL, Gregory BD, Petrov DA, Schmidt PS. 2016. Global Transcriptional Profiling of Diapause and Climatic Adaptation in Drosophila melanogaster. Mol Biol Evol 33:707–720.
Background + Objecives
Goal is to understand the mechanisms that allow organisms to adapt to distinct environments
Focus: Diapause phenotype
Diapause varies along cline. (90% in temperate, 30% nootropics)
Fundamental lief history trade-off: somatic maintenance and reproduction.
Candidates for diapause:
Prediction:
If the natural variation of diapause does play critical roles in adaptation to environmental heterogeneity, genes differentially regulated as a function of the diapause phenotype are likely under spatially and/or temporally varying selectively pressures, thus genetic polymorphisms on these genes or in vicinity of these genes are likely to show clinal and seasonal patterns as they may have distinct cis-regulatory functions that regulates gene expression in diapausing nondiapausing individuals
Basically: Variation in gene expression associated with diapause should be under selection. One way to visualize this is how these genes vary across space(latitude) and time(season)
Methods
Collected flies from 4 popoulations (northern) east coast us: sampled in october with 50 isofemale lines each orchard
Treatment: reared at 25 C 12L :12D; 11 C 9L:15D
dissected, scored oocytes as :
Measured gxp for heads and ovaries ( 2 tissue types)
Sequencing: 100bps short reads
Used EBSeq (empirical baysian approach) to call genes and isoforms for gene expression. Measured RPKM ( Reads Per Kilobase of transcript per Million mapped reads)
Analaysis: vitellogenesis and oogenesis excluded from analysis.
Location dependent expression: Looked at autocorrelation in the log2 fold changes along each chromosome (used moran's I as the metric of autocorrelation). This was compared to a randomly generated chromosome arm.
Test if DE genes are enriched in season/clinal sets
Fig 1. Shows log2 fold change (D over ND) vs RPKM. So higher values of log2 fold change = diapausing individuals having more gxp. This is a plot to simply show no relationship between how highlly expressed a gene is with expression differences.
thoughts
Bad figure. What happened to the populations? That is the more interesting result.
Table 1. Summary of gene expression level differences
Number of genes that are differentially expressed between D and ND for head or ovary.
Table 2. Isoform level differential expression
Isoforms diff expressed between D and ND for head or ovary.
Fig2. Venn diagram of overlap between DE of genes or isoforms for head and ovary.
Table 3. Enriched KEGG pathways.
Head has more pathways than ovaries
Table 4. Gene-level expression
Candidate genes were not differentially expressed. Why is this?
Table 5. Transcript level expression
Some of the cp, tim, and dp110 isoforms are differentially expressed.
Fig3. Density as a function of moran's I (autocorrelation) for each chromosome and each heard or ovary.
Table 6. Summary of overlapping genes DDE and clinal /seasonal genes
Fig 4. Enrichment of DE genes that are clinal are seasonal.
Only downregulated in the head were enriched under clinal and seasonal.
Take home
Glossary:
Background
Issues
under utiilization of biological replicates
requiring independent library preparations
doesn't include pooled samples (unless pooled samples that were replicated)
wide dynamic range can make it noisy
R exercise
Gerenal rules of thumb
more biological replicates more than increasing depth
sequence more than 10 reads of depth per transcript
Use at least 3 biological replicates per condition
conduct a pilot experiment
answer 2 questions:
workflow:
clean adn evaluate reads (fastq)
Make and evaluate transcriptome assembly (fasta)
Map cleaned reads to transcriptome assembly (.sam files)
Extract data:
Transdecoder predicts open reading frames.
cp /data/scripts/bwaaln.sh .
#!/bin/bash # To run from present directory and save output: ./bwaaln.sh > output.bwaaln.txt myLeft='20_5-14_H_0_R1_clean_paired.fa' echo $myLeft myRight=${myLeft/_R1.fq.gz_left/_R2.fq.gz_right} echo $myRight myShort=`echo $myLeft | cut -c1-11` echo $myShort # bwa index /data/project_data/assembly/longest_orfs.cds # This only needs to be done once on the reference bwa aln /data/project_data/assembly/longest_orfs.cds /data/project_data/fastq/cleanreads/$myLeft > $myLeft".sai" bwa aln /data/project_data/assembly/longest_orfs.cds /data/project_data/fastq/cleanreads/$myRight > $myRight".sai" bwa sampe -r '@RG\tID:'"$myShort"'\tSM:'"$myShort"'\tPL:Illumina' \ -P /data/project_data/assembly/longest_orfs.cds $myLeft".sai" $myRight".sai" \ /data/project_data/fastq/cleanreads/$myLeft \ /data/project_data/fastq/cleanreads/$myRight > $myShort"_bwaaln.sam"
bwa aln Usage: bwa aln [options] <prefix> <in.fq> Options: -n NUM max #diff (int) or missing prob under 0.02 err rate (float) [0.04] -o INT maximum number or fraction of gap opens [1] -e INT maximum number of gap extensions, -1 for disabling long gaps [-1] -i INT do not put an indel within INT bp towards the ends [5] -d INT maximum occurrences for extending a long deletion [10] -l INT seed length [32] -k INT maximum differences in the seed [2] -m INT maximum entries in the queue [2000000] -t INT number of threads [1] -M INT mismatch penalty [3] -O INT gap open penalty [11] -E INT gap extension penalty [4] -R INT stop searching when there are >INT equally best hits [30] -q INT quality threshold for read trimming down to 35bp [0] -f FILE file to write output to instead of stdout -B INT length of barcode -L log-scaled gap penalty for long deletions -N non-iterative mode: search for all n-difference hits (slooow) -I the input is in the Illumina 1.3+ FASTQ-like format -b the input read file is in the BAM format -0 use single-end reads only (effective with -b) -1 use the 1st read in a pair (effective with -b) -2 use the 2nd read in a pair (effective with -b) -Y filter Casava-filtered sequences
Run default parameters
Should end up with .sai
files!!!
output:
ls 20_5-14_H_0_bwaaln.sam bwaaln.sh 20_5-14_H_0_R1_clean_paired.fa.sai trim_example.sh 20_5-14_H_0_R2_clean_paired.fa.sai
head 20_5-14_H_0_bwaaln.sam @SQ SN:TRINITY_DN37_c0_g1::TRINITY_DN37_c0_g1_i1::g.1::m.1 LN:303 @SQ SN:TRINITY_DN120_c0_g2::TRINITY_DN120_c0_g2_i1::g.2::m.2 LN:381 @SQ SN:TRINITY_DN125_c0_g1::TRINITY_DN125_c0_g1_i1::g.3::m.3 LN:642 @SQ SN:TRINITY_DN125_c0_g2::TRINITY_DN125_c0_g2_i1::g.4::m.4 LN:528 @SQ SN:TRINITY_DN159_c0_g1::TRINITY_DN159_c0_g1_i1::g.5::m.5 LN:696 @SQ SN:TRINITY_DN159_c0_g1::TRINITY_DN159_c0_g1_i1::g.6::m.6 LN:396 @SQ SN:TRINITY_DN191_c0_g1::TRINITY_DN191_c0_g1_i1::g.7::m.7 LN:309 @SQ SN:TRINITY_DN192_c0_g1::TRINITY_DN192_c0_g1_i1::g.8::m.8 LN:318 @SQ SN:TRINITY_DN192_c0_g2::TRINITY_DN192_c0_g2_i1::g.9::m.9 LN:321 @SQ SN:TRINITY_DN293_c0_g1::TRINITY_DN293_c0_g1_i1::g.10::m.10 LN:315
vim tail_version_sam.txt
J00160:63:HHHT2BBXX:4:2228:14296:49089 77 * 0 0 * CTCTGCCCCGACGGCCGGGTATAGGCGGCACGCTCAGCGCCATCCATTTTCAGGGCTAGTTGATTCGGCAGGTGAGTTGTTACACACTCCT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJFFJJJJJJFFJJJJJJJJ7FJJJJJJJJJFJJF RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:14296:49089 141 * 0 0 * TCGGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATCAACTAGCCCTGAAAATGGATGGCGCTGAGCGTGCCGCCTATACCCGGC FJJJAJJJJJJJJFJJFFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ<FJJJJ-FJJJJJJJJJJAJJJJJJJFJJJJJAJJJJJF RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:15026:49089 77 * 0 0 * TTTTTCGTCACTACCTCCCCGTGTCGGGAGTGGGTAATTTGCGCGCCTGCTGCCTTCCTTGGATGTGGTAGCCGTTTCTCAAGCTCCCTC JJJJJJJJFJJJJFJAJJJJFAJJJAJJAJ7FJJ<AJJFJFAJFAFJ<<JFAJJFJJJJJAF<AJFFJ-FJFJJFJFAJJJJFJJJFJJF RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:15026:49089 141 * 0 0 * GGTTCGATTCCGGAGAGGGAGCTTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGACACGGGGAGG JJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:22008:49089 83 TRINITY_DN30310_c1_g10::TRINITY_DN30310_c1_g10_i1::g.7248::m.7248 168 60 91M = 70 -189 CCTCGCTCCCCGGGCGAAAGGGAATCGGGTCAATATTCCCGAACCCGGAGGCGGAGCCCTCCGTCTTCGTGGCGGTCCGAGCTGTAAAGCG JJFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:20_5-14_H_0 XT:A:U NM:i:SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:91 J00160:63:HHHT2BBXX:4:2228:22008:49089 163 TRINITY_DN30310_c1_g10::TRINITY_DN30310_c1_g10_i1::g.7248::m.7248 70 60 91M = 168 189 CCATGTGAACAGCAGTTGTACATGGGTCAGTCGATCCTAAGCCCCAGGGAAGTTCCGCTCCGAGCGGAGGCGGGCGCCCCTCTCCATGTGA FJJJJJJJJJJJJJJFFJJJJJJJJJFJJJJJJJJJJJFFJJJJJJJJJJFJFJJJJJJFFJJJJJJJJJJJJJJJJJJJFJJJJJJJFJA RG:Z:20_5-14_H_0 XT:A:U NM:i:SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:91 J00160:63:HHHT2BBXX:4:2228:24647:49089 77 * 0 0 * ACGGGCGATGTGTGCGCATTCTAGGGCTTTGAGTTGTTCATGGGCATTTTCTTTTGCTCATTACTGCTGAATCCTGTTTCAAATGGGGCTA JFJJJJFFJJJJJJJJJFJJJJJJJJJJFJJAJJJJJJJFJJJJJJJFJJJJJJJJJF<JJJJJJJFAFJJFFFJJJJJJJJJJJJJ<JJA RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:24647:49089 141 * 0 0 * GGATAAGTGAGCTACAATCATAAATATAAGAATAAAAATATGTATGAATAATGAACTGATAGCCCCATTTGAAACAGGATTCAGCAGTAAT AAJJJJ<JJJJFFJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJFFJJJJJJJJJJ<-FFFFFAFFFAFJJJJJJJJJJJFAJFJJFJJJJ RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:30168:49089 77 * 0 0 * TCTTGAAATCTGTGGGTTTCTCGTATAGTTCAATTACAACAGGTCCTGGTTTCAACTCGTCCATTTCCATGAAGGCAAAACACTTGGTGCT JJJFFJJJJJJJJJJJAJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJFJJJFFAFAJFJJJJFFFJJJJFFJJJJFJJJFJJ RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:30168:49089 141 * 0 0 * GAGTTTAAGCATTTCAAAGTGAAAAAGCGCACTATCAGCACCAAGTGTTTTGCCTTCATGGAAATGGACGAGTTG JJJJJJJJJJJJJFJJAJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAF RG:Z:20_5-14_H_0
Melissa code hack suggestion:
screen
bash bwaaln.sh
Drawbacks:
Ways to get around this: SNP data from expressed sequences.
AAATGC
AACTGC
Glossary:
SNPs- single nucleotide polymorphism
Indels - insertion deletions.
Fat: % of genetic material explaiend by differences among populations
Workflow:
Tissue: breadth of tissue, dev stages
pool and sequence, create sequencel libraries; ~30-100 million paired end long reads
process raw sequence data for quality and errors
digital normalization of sequence data
assemble your cleaned paired end reads
prune assembled transcripts; remove DNA contamination, non coding RNA and gene fragments
Use reference genome for assembly; or use COGs (conserved orthologue genomes) Assembly evaluation
SNP detection:
Validate SNPs
Practical Application
General approaches:
Using outliers
Non-outlier approach: test high Fst loci for other features associated with selection
The beginning for understanding how selection can act on populations!
check your SAM file
grep -c XT:A:U 20_5-14_H_0_bwaaln.sam 1,737,572
grep -c X0:i:1 20_5-14_H_0_bwaaln.sam 1747040
You can check a number of other elements, total number of reads, search for the various flags…
Count reads from sam file with python script
copy file script: cp /data/scripts/countxpression_PE.py .
Running script: python countxpression_PE.py 20 35 countstats.txt ../results/20_5-14_H_0_bwaaln.sam
ERROR:
Traceback (most recent call last): File "countxpression_PE.py", line 159, in <module> countxpression(file, sys.argv[1], sys.argv[2], 0, 'text', file[:-4]+'_counts.txt', sys.argv[3]) File "countxpression_PE.py", line 68, in countxpression contigs[cols[2]][1]+=1 #add a count to that contig for the 'top' hit for reads that optimally align to multiple contigs (all the other good hits are often not printed with current bwa settings) KeyError: 'TRINITY_DN30437_c4_g9::TRINITY_DN30437_c4_g9_i6::g.7687::m.7687'
the names are too long; need a regular expression to find and replace
sed -i 's/::/\_/g' 20_5-14_H_0_bwaaln.sam
The s
is search
the /::/
find the colon
\_
replace
g
= globally
output
head 20_5-14_H_0_bwaaln.sam @SQ SN:TRINITY_DN37_c0_g1_TRINITY_DN37_c0_g1_i1_g.1_m.1 LN:303 @SQ SN:TRINITY_DN120_c0_g2_TRINITY_DN120_c0_g2_i1_g.2_m.2 LN:381 @SQ SN:TRINITY_DN125_c0_g1_TRINITY_DN125_c0_g1_i1_g.3_m.3 LN:642 @SQ SN:TRINITY_DN125_c0_g2_TRINITY_DN125_c0_g2_i1_g.4_m.4 LN:528 @SQ SN:TRINITY_DN159_c0_g1_TRINITY_DN159_c0_g1_i1_g.5_m.5 LN:696 @SQ SN:TRINITY_DN159_c0_g1_TRINITY_DN159_c0_g1_i1_g.6_m.6 LN:396 @SQ SN:TRINITY_DN191_c0_g1_TRINITY_DN191_c0_g1_i1_g.7_m.7 LN:309 @SQ SN:TRINITY_DN192_c0_g1_TRINITY_DN192_c0_g1_i1_g.8_m.8 LN:318 @SQ SN:TRINITY_DN192_c0_g2_TRINITY_DN192_c0_g2_i1_g.9_m.9 LN:321 @SQ SN:TRINITY_DN293_c0_g1_TRINITY_DN293_c0_g1_i1_g.10_m.10 LN:315
tail 20_5-14_H_0_bwaaln.sam J00160:63:HHHT2BBXX:4:2228:14296:49089 77 * CTCTGCCCCGACGGCCGGGTATAGGCGGCACGCTCAGCGCCATCCATTTTCAGGGCTAGTTGATTCGGCAGGTGAGTTGTTACACACTCCT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJFFJJJJJJFFJJJJJJJJ7FJJJJJJJJJFJJF RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:14296:49089 141 * TCGGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATCAACTAGCCCTGAAAATGGATGGCGCTGAGCGTGCCGCCTATACCCGGC FJJJAJJJJJJJJFJJFFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ<FJJJJ-FJJJJJJJJJJAJJJJJJJFJJJJJAJJJJJF RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:15026:49089 77 * TTTTTCGTCACTACCTCCCCGTGTCGGGAGTGGGTAATTTGCGCGCCTGCTGCCTTCCTTGGATGTGGTAGCCGTTTCTCAAGCTCCCTC JJJJJJJJFJJJJFJAJJJJFAJJJAJJAJ7FJJ<AJJFJFAJFAFJ<<JFAJJFJJJJJAF<AJFFJ-FJFJJFJFAJJJJFJJJFJJF RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:15026:49089 141 * GGTTCGATTCCGGAGAGGGAGCTTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGACACGGGGAGG JJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:22008:49089 83 TRINITY_DN30310_c1_g10_TRINITY_DN30310_c1_g10_i1_g.7248_m.7248 168 60 91M = 70 -189CCTCGCTCCCCGGGCGAAAGGGAATCGGGTCAATATTCCCGAACCCGGAGGCGGAGCCCTCCGTCTTCGTGGCGGTCCGAGCTGTAAAGCG JJFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:20_5-14_H_0 XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:91 J00160:63:HHHT2BBXX:4:2228:22008:49089 163 TRINITY_DN30310_c1_g10_TRINITY_DN30310_c1_g10_i1_g.7248_m.7248 70 60 91M = 168 189 CCATGTGAACAGCAGTTGTACATGGGTCAGTCGATCCTAAGCCCCAGGGAAGTTCCGCTCCGAGCGGAGGCGGGCGCCCCTCTCCATGTGA FJJJJJJJJJJJJJJFFJJJJJJJJJFJJJJJJJJJJJFFJJJJJJJJJJFJFJJJJJJFFJJJJJJJJJJJJJJJJJJJFJJJJJJJFJA RG:Z:20_5-14_H_0 XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:91 J00160:63:HHHT2BBXX:4:2228:24647:49089 77 * ACGGGCGATGTGTGCGCATTCTAGGGCTTTGAGTTGTTCATGGGCATTTTCTTTTGCTCATTACTGCTGAATCCTGTTTCAAATGGGGCTA JFJJJJFFJJJJJJJJJFJJJJJJJJJJFJJAJJJJJJJFJJJJJJJFJJJJJJJJJF<JJJJJJJFAFJJFFFJJJJJJJJJJJJJ<JJA RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:24647:49089 141 * GGATAAGTGAGCTACAATCATAAATATAAGAATAAAAATATGTATGAATAATGAACTGATAGCCCCATTTGAAACAGGATTCAGCAGTAAT AAJJJJ<JJJJFFJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJFFJJJJJJJJJJ<-FFFFFAFFFAFJJJJJJJJJJJFAJFJJFJJJJ RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:30168:49089 77 * TCTTGAAATCTGTGGGTTTCTCGTATAGTTCAATTACAACAGGTCCTGGTTTCAACTCGTCCATTTCCATGAAGGCAAAACACTTGGTGCT JJJFFJJJJJJJJJJJAJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJFJJJFFAFAJFJJJJFFFJJJJFFJJJJFJJJFJJ RG:Z:20_5-14_H_0 J00160:63:HHHT2BBXX:4:2228:30168:49089 141 * GAGTTTAAGCATTTCAAAGTGAAAAAGCGCACTATCAGCACCAAGTGTTTTGCCTTCATGGAAATGGACGAGTTG JJJJJJJJJJJJJFJJAJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAF RG:Z:20_5-14_H_0
rerun python countxpression_PE.py 20 35 countstats.txt ../results/20_5-14_H_0_bwaaln.sam &
output;
Filename Total#Reads NumContigsMatched NumUnaligned NumAligned NumMultiAligned NumSingleAligned NumQualSingles Wrongformat PropQualAligned ../results/20_5-14_H_0_bwaaln.sam 22423912 1971 19988441 2435471 697899 1737572 1735825 0 0.077
Notes:
Glossary:
haplotypes: tightly linked genes
KOG:
scp adnguyen@pbio381.uvm.edu:/data/project_data/DGE/* .
2017-02-22_DEseq2_tutorial.Rmd countsdata_trim.txt
DESeq2_exploreSSW_trim.R countstatsummary.txt
cols_data_trim.txt explore_expression_data.R
Data and script files
glossary:
Coalescent-
Reticulation
purifying/background selection-
gene trees vs species-
recombination-
introgression-
incomplete lineage sorting. population size and time.
large population size and short time allows to undergo incomplete lineage sorting to be represented in a divergence event.
t/Ne
large t/Ne = resemble species tree
small t/Ne = Incomplete lineage sorting
SKeller: genomes are collections of evolutionary history (think of the problem as a sampling problem)
Mpespeni: Is there a saturation such athat when you sample, the more sample doesn't improve the species tree?
Edwards: 5-10 loci is generally "good enough" to resolve species.
Assumption: we can estimate these gene trees accurately. Rapid radiations might be tough to figure out what the gene trees actually are.
Do we want to estimate the trees? Other statistics? Edwards likes gene trees. Hard to estimate gene trees from a SNP. How granular do you want to analyze the data?
Kilpatrick: Radiations deep or recent in time?
Edwards: Can detect heterogeneity in deep rapid radiations in one of his papers. You can have short branch lengths that happened a long time ago. Can you tell? Yes, with the right type of data.
Recombination! Another kind of reticulation: Causes gene boundaries to become blurry.
Hudson's MS Check this program out for coalescent modeling for SNP data.
move new data
scp adnguyen@pbio381.uvm.edu:/data/project_data/DGE/* .
Deseq2
Your models
Cool way to sample a dataframe (I always forget)
dds <- dds[sample(nrow(dds), 1200), ]
dim(dds)
>[1] 1200 77
So we randomly sampled a subset (10%) so we can construct models and run it more quickly.
Sort dataset based on p-values
res <- res[order(res$padj),]
Glossary:
Paralog: Gene duplicate
pie: pairwise nucleotide diversity
SFS: Site frequency spectrum= histogram of allele frequencies
Ne = Effective population size
Skeller info update: Population genomics:
Questions:
What shapes population structure?
Diversity within and among populations
What is the effect of selection? : synonymous and nonsynonymous mutations
How does recombination promote diversity?
Pipeline for analysis
Sequencing Error (Illumina 1%) ; 1 in 100 there is a sequencing error
Calling homozygous (AA), heterozygous (AT), homozygous (TT)? It is a multinomial distribution
If genotype is AT heterozygous, predict A = T = 0.5
use multinomial dsitribution to say how likely it is a heterozygoute. Each genotypic class has a probability
genotype | probability (referred to as genotype likelihood or genotype prob) |
---|---|
AA | low |
AT | high |
TT | low |
Bayesian folks: Use those probabilities and propogate those uncertainties to further analyses. (Not common but field is going that way)
Another problem: paralogues; duplicated gene on separate genome or in synteny
= expected heterozygosity
Reference free population genomics study
Expectation:
Our expectation is that small-Ne species should show a lower pN, a lower pS, and a higher pN/pS ratio than large-Ne species. This is because genetic drift, which is enhanced in small populations, is expected to reduce the neutral and selected levels of genomic diversity, but to increase the relative probability of slightly deleterious, non-synonymous mutations (relatively to neutral, synonymous mutations) segregating at observable frequency.
Where do mutations happen? During meiosis. In a big population there is more meosis. Mice in the sewers of NYC compared to a black rhino populations. They are no where close!
Paralogue filtering: nothing to map to so it is hard to ID'ing a paralogue.
Figure 2: SFS differs in test method than standard method (samtools). New method can id more rare alleles. New method can call heterozygotes is better. Rare alleles occur more often in heterozygote individuals.
Mistake: pseudoreplication in calling SNPs because multiple individuals are represented under varying conditions. So we need to analyze and call SNPs at the individual level and not smaple level (93)
Approach to fix:
use vcftools:
$ vcftools
VCFtools (0.1.14)
© Adam Auton and Anthony Marcketta 2009
Process Variant Call Format files
For a list of options, please go to:
https://vcftools.github.io/man_latest.html
Alternatively, a man page is available, type:
man vcftools
Questions, comments, and suggestions should be emailed to:
vcftools-help@lists.sourceforge.net
move into directory with data:
cd /data/project_data/snps/reads2snps/
check out a portion of the vcf file that has called SNPs
head head_SSW_bamlist.txt.vcf ##fileformat=VCFv4.0 ##source=reads2snp ##phasing=unphased ##FILTER= <ID=multi,Description="At least one individual shows more than 2 alleles"> ##FILTER= <ID=unres,Description="Insufficient sequencing depth or uncertain genotyping in all individuals"> ##FILTER= <ID=para,Description="Suspicion of paralogy"> ##INFO= <ID=N,Number=1,Type=Integer,Description="Number of genotyped individuals"> ##FORMAT= <ID=GT,Number=1,Type=String,Description="Genotype"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 03_5-08_S_2 07_5-08_S_1 08_5-08_H_0 09_5-08_H_0 10_5-08_H_0 14_5-08_S_2 15_5-08_H_0 19_5-11_H_0 20_5-08_H_0 22_5-08_S_1 23_5-17_S_2 24_5-08_H_0 26_5-08_S_2 27_5-08_H_0 28_5-08_S_1 29_5-08_S_2 31_6-12_H_0 32_6-12_H_0 33_6-12_H_0 34_6-12_H_0 35_6-12_H_0 36_6-12_S_1 37_6-12_H_0 38_6-12_H_0 TRINITY_DN47185_c0_g1_TRINITY_DN47185_c0_g1_i2_g.24943_m.24943 1 . unres N=0 GT .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|. .|.
unres = unresolved.
depth filter of 10 reads per genotype call and filters out genotype prob calls less than prob of 95%.
Summarize data vcf file for us:
vcftools --vcf SSW_bamlist.txt.vcf
output
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf SSW_bamlist.txt.vcf
After filtering, kept 24 out of 24 Individuals
After filtering, kept 7472775 out of a possible 7472775 Sites
Run Time = 22.00 seconds
Check number of individuals. Some are not real.
Finding the number of unresolved:
grep "unres" SSW_bamlist.txt.vcf | wc
5631864 185851488 1028494934
5.6 million are unresolved!!
Check for paralogy
grep "para" SSW_bamlist.txt.vcf | wc
4354 143652 795592
4,354 incidence of paralogy
1.8 million SNPs left!!!
More filtering:
$ vcftools --vcf SSW_bamlist.txt.vcf --min-alleles 2 --max-alleles 2 VCFtools - 0.1.14 (C) Adam Auton and Anthony Marcketta 2009 Parameters as interpreted: --vcf SSW_bamlist.txt.vcf --max-alleles 2 --min-alleles 2 After filtering, kept 24 out of 24 Individuals After filtering, kept 20319 out of a possible 7472775 Sites Run Time = 19.00 seconds
1/48 = 0.02
$ vcftools --vcf SSW_bamlist.txt.vcf --maf 0.02
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf SSW_bamlist.txt.vcf
--maf 0.02
After filtering, kept 24 out of 24 Individuals
After filtering, kept 5656584 out of a possible 7472775 Sites
Run Time = 70.00 seconds
retained 5.6 million SNPs
$ vcftools --vcf SSW_bamlist.txt.vcf --max-missing 0.80
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf SSW_bamlist.txt.vcf
--max-missing 0.8
After filtering, kept 24 out of 24 Individuals
After filtering, kept 100219 out of a possible 7472775 Sites
Run Time = 68.00 seconds
OK! Now we can combine filters! and output it into a file
$ vcftools --vcf SSW_bamlist.txt.vcf --min-alleles 2 --max-alleles 2 --maf 0.02 --max-missing 0.8 --recode --out ~/mydata/2017-03-06_SNPcallfilter
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf SSW_bamlist.txt.vcf
--maf 0.02
--max-alleles 2
--min-alleles 2
--max-missing 0.8
--out /users/a/d/adnguyen/mydata/2017-03-06_SNPcallfilter
--recode
After filtering, kept 24 out of 24 Individuals
Outputting VCF file...
Output of file
2017-03-06_SNPcallfilter.log
2017-03-06_SNPcallfilter.recode.vcf
What the files look like:
head 2017-03-06_SNPcallfilter.recode.vcf
##fileformat=VCFv4.0
##source=reads2snp
##phasing=unphased
##FILTER= <ID=multi,Description="At least one individual shows more than 2 alleles">
##FILTER= <ID=unres,Description="Insufficient sequencing depth or uncertain genotyping in all individuals">
##FILTER= <ID=para,Description="Suspicion of paralogy">
##INFO= <ID=N,Number=1,Type=Integer,Description="Number of genotyped individuals">
##FORMAT= <ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 03_5-08_S_2 07_5-08_S_1 08_5-08_H_0 09_5-08_H_0 10_5-08_H_0 14_5-08_S_2 15_5-08_H_0 19_5-11_H_0 20_5-08_H_0 22_5-08_S_1 23_5-17_S_2 24_5-08_H_0 26_5-08_S_2 27_5-08_H_0 28_5-08_S_1 29_5-08_S_2 31_6-12_H_0 32_6-12_H_0 33_6-12_H_0 34_6-12_H_0 35_6-12_H_0 36_6-12_S_1 37_6-12_H_0 38_6-12_H_0
TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 . PASS . GT 0|0 0|0 0|0 0|0 . 0|0 0|1 0|0 0|0 0|0 0|0 . 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 . 0|0 0|0
Looking at hardy weinburg equilibrium
$ vcftools --vcf 2017-03-06_SNPcallfilter.recode.vcf --hardy
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf 2017-03-06_SNPcallfilter.recode.vcf
--hardy
After filtering, kept 24 out of 24 Individuals
Outputting HWE statistics (but only for biallelic loci)
HWE: Only using fully diploid SNPs.
After filtering, kept 1180 out of a possible 1180 Sites
Run Time = 0.00 seconds
stored as:
out.hwe
out.log
What out.hwe would look like:
head out.hwe
CHR POS OBS(HOM1/HET/HOM2) E(HOM1/HET/HOM2) ChiSq_HWE P_HWE P_HET_DEFICIT P_HET_EXCESS
TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 4566 23/1/0 23.01/0.98/0.011.086464e-02 1.000000e+00 1.000000e+00 1.000000e+00
TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 4665 21/3/0 21.09/2.81/0.091.066667e-01 1.000000e+00 1.000000e+00 9.361702e-01
TRINITY_DN46874_c5_g2_TRINITY_DN46874_c5_g2_i2_g.23776_m.23776 978 23/1/0 23.01/0.98/0.011.086464e-02 1.000000e+00 1.000000e+00 1.000000e+00
TRINITY_DN46874_c5_g2_TRINITY_DN46874_c5_g2_i2_g.23776_m.23776 1404 23/1/0 23.01/0.98/0.011.086464e-02 1.000000e+00 1.000000e+00 1.000000e+00
TRINITY_DN46874_c5_g2_TRINITY_DN46874_c5_g2_i2_g.23776_m.23776 1722 20/4/0 20.17/3.67/0.171.983471e-01 1.000000e+00 1.000000e+00 8.737589e-01
TRINITY_DN46874_c5_g2_TRINITY_DN46874_c5_g2_i2_g.23776_m.23776 3426 23/1/0 23.01/0.98/0.011.086464e-02 1.000000e+00 1.000000e+00 1.000000e+00
TRINITY_DN46874_c5_g2_TRINITY_DN46874_c5_g2_i2_g.23776_m.23776 3729 21/3/0 21.09/2.81/0.091.066667e-01 1.000000e+00 1.000000e+00 9.361702e-01
TRINITY_DN46874_c5_g2_TRINITY_DN46874_c5_g2_i2_g.23776_m.23776 3912 19/5/0 19.26/4.48/0.263.244997e-01 1.000000e+00 1.000000e+00 7.943262e-01
TRINITY_DN46347_c1_g3_TRINITY_DN46347_c1_g3_i1_g.21992_m.21992 115 19/5/0 19.26/4.48/0.263.244997e-01 1.000000e+00 1.000000e+00 7.943262e-01
Can do R in the command line!
b<-read.table("out.hwe",header=TRUE)
Structure of data:
> str(b)
'data.frame': 442 obs. of 8 variables:
$ CHR : Factor w/ 111 levels "TRINITY_DN35598_c0_g1_TRINITY_DN35598_c0_g1_i1_g.5802_m.5802",..: 65 65 100 100 100 100 100 100 88 88 ...
$ POS : int 4566 4665 978 1404 1722 3426 3729 3912 115 141 ...
$ OBS.HOM1.HET.HOM2.: Factor w/ 27 levels "10/11/3","11/0/13",..: 27 22 27 27 20 27 22 18 18 27 ...
$ E.HOM1.HET.HOM2. : Factor w/ 16 levels "10.01/10.98/3.01",..: 14 12 14 14 11 14 12 10 10 14 ...
$ ChiSq_HWE : num 0.0109 0.1067 0.0109 0.0109 0.1983 ...
$ P_HWE : num 1 1 1 1 1 1 1 1 1 1 ...
$ P_HET_DEFICIT : num 1 1 1 1 1 1 1 1 1 1 ...
$ P_HET_EXCESS : num 1 0.936 1 1 0.874 ...
Which ones have heterozygous excess?
> b[which(b$P_HET_EXCESS < 0.001 ),]
[1] CHR POS OBS.HOM1.HET.HOM2. E.HOM1.HET.HOM2.
[5] ChiSq_HWE P_HWE P_HET_DEFICIT P_HET_EXCESS
<0 rows>
(or 0-length row.names)
Go into the rows and tell us the rows that satisfy some criteria
Let's look at deficit
> b[which(b$P_HET_DEFICIT < 0.01 ),]
CHR POS
277 TRINITY_DN45155_c27_g2_TRINITY_DN45155_c27_g2_i2_g.18743_m.18743 216
291 TRINITY_DN45155_c27_g1_TRINITY_DN45155_c27_g1_i1_g.18742_m.18742 99
293 TRINITY_DN45155_c27_g1_TRINITY_DN45155_c27_g1_i1_g.18742_m.18742 138
401 TRINITY_DN39079_c3_g1_TRINITY_DN39079_c3_g1_i1_g.8354_m.8354 244
406 TRINITY_DN39696_c4_g1_TRINITY_DN39696_c4_g1_i1_g.8926_m.8926 283
OBS.HOM1.HET.HOM2. E.HOM1.HET.HOM2. ChiSq_HWE P_HWE P_HET_DEFICIT
277 22/0/2 20.17/3.67/0.17 24 1.418440e-03 1.418440e-03
291 11/0/13 5.04/11.92/7.04 24 9.114786e-08 9.114786e-08
293 19/0/5 15.04/7.92/1.04 24 6.498371e-06 6.498371e-06
401 13/0/11 7.04/11.92/5.04 24 9.114786e-08 9.114786e-08
406 13/0/11 7.04/11.92/5.04 24 9.114786e-08 9.114786e-08
P_HET_EXCESS
277 1
291 1
293 1
401 1
406 1
Evidence of non-random mating; inbreeding
Linkage equilibrium
Association of SNPs at one locus with SNPS at another.
Too many pairs; so you can set windows.
--geno-r2
genotypes of different loci
$ vcftools --vcf 2017-03-06_SNPcallfilter.recode.vcf --geno-r2
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf 2017-03-06_SNPcallfilter.recode.vcf
--max-alleles 2
--min-alleles 2
--geno-r2
After filtering, kept 24 out of 24 Individuals
Outputting Pairwise LD (bi-allelic only)
LD: Only using diploid individuals.
After filtering, kept 1180 out of a possible 1180 Sites
Run Time = 3.00 seconds
look at file
head out.geno.ld CHR POS1 POS2 N_INDV R^2 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2217 19 0.00653595 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2235 19 0.00308642 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2244 19 0.00308642 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2276 19 0.00308642 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2277 19 0.00308642 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2535 20 0.0557276 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2805 19 0.00308642 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2970 20 0.00277008 TRINITY_DN45147_c0_g1_TRINITY_DN45147_c0_g1_i3_g.18680_m.18680 2127 2994 20 0.0430622
Bring file into R
> LD<-read.table("out.geno.ld",header=TRUE)
> names(LD)
[1] "CHR" "POS1" "POS2" "N_INDV" "R.2"
Look at distance between pos 1 and 2
> LD$dist<-abs(LD$POS1 - LD$POS2)
> names(LD)
[1] "CHR" "POS1" "POS2" "N_INDV" "R.2" "dist"
Can the distance predict LD?
Kattia info update: Rate of evolution
effective pop size
mutation
effective pop size + linkage
NERR + spatiotemporal variation
Effective population size
Ne= effective population size; number of individuals that exchange genes; estimate inside the genome
S= selection coefficient
4 methods to estimate NE:
Ne vary across genomes: depending on genetic hitchhiking, background selection, sex chromosomes and autosomes number ()
genetic hitchhiking is also known as selecive sweep.
Males have a higher mutation rate than females.
Hitchhiking and background selection is in the absence of recombination.
Mutation can occur at 2 levels:
whole gene or chromosome
base level = point mutation
substitution rate
synonymous - silent sites; amino acid does not change
non-synonymous - replacement mutations; results in a change in amino acid
classes taht fit into syn and non-syn
neutral - ; no natural selection, drift, more common in large genomes
slightly deleterious; small effect on fitness
slightly advantage
deleterious; negative on ; lower substitution rate
advantageous
mutation rate influenced by
Effect of linkage?
Ne and sub rate could be influenced by:
Skeller adding:
in bacteria
for diploid hemaphrotidic
for diploid separate sex it is
How do you estimate ?
Grab neutral sites!
then rearrange:
Coding session:
change directory into where data are:
cd /data/project_data/snps/reads2snps/
Info update: Dr. Kilpatrick:
Global ancestry
Local ancestry; chromosome is a mosaic. Which ancestry each of these segments come from?
Model based:
Waterson's theta = richness
expected heterozygosity = evenness
and pi diverge when populations are expanding and bottlenecks
is higher when populations are expanding; but smaller when in bottlenecks (double check this)
get final vcf data ; n=24
Filtering
output to home directory
Estimate allele frequencies between healthy and sick individuals (raw differences; (H)-(S))
Output to local machine and analyze in R
estimate , ,
Glossary:
Ascertainment bias- bias introduced by sampling design that induces a nonrandom sample
Sympatric speciation- presence of gene flow, via diversifying selection; alleles under selection are divergent; neutral alleles appear homogenous
Allopatric Speciation-absence of gene flow; physical isolation
Islands of differntiation: genomic regions with increased diff due to natural selection
Allele frequency spectrum- dist of counts of snps with an observed frequency (histogram of allele frequencies)
Gene flow- transfer of genes from one population to another
Diversifying selection- seleciton that favors different alleles in different parts of the species range
Info update from Erin Keller: Inferring history of divergence
Genomic scans:
distribution of differentiation in the genome
gene vs population trees
compare assumed pop trees to gene trees
compare differnt gene trees
measure introgression: D-statistic determines introgression
Limitations- throws out data, requires many genomes, same value multiple explanations
Model-based methods:
Shape of Allele frequency spectrum(AFS) can get a sense of population processes.
Sample | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
1 | 0 | 1 | 1 | 0 | 1 |
etc…to create AFS
Assumptions:
Limitations:
Geneology sampling
multiple regions sampled to produce 1 tree
Assumptions:
Likelihood-free method
ABC- approximate bayesian comp. Use simulations under models of interest.
Historical gene flow and LD patterns
Distribution of haplotype lengths: Assumes a haplotype as a migrant and enters a population of interest. Over time, these genomes recombine and see the slow shortening of sequence lenght. Based on lengths of original mirgrant, you can get a sense of time of when migrant entered populations. Recombination leads to smaller fragments over time. Many demographic events can produce similar pattern. Problems with identifying migrant.
Skeller: single genome- look at blocks of heterozygosity and homozygosity and gives an estimate population size during a recent history.
Approximation of conditional likelihood: Ancestral recombination graphs (ARGs). Gene tree that bifurcates based on recombination. estimate migration and coalescent events.
Limitations: complex-difficult to estimate recombination rates. Multiple completing recombination trees
Advantages of whole genome: recombination rates, lots of information, large "area" for genome scans Limitations of next gen: ascertainment bias, throw out data
Paper discussion:
Coding session: Population genomics part 4
finish part 3 first
output:
cat SSW_bamlist.txt.sum
################################################################################
# Biological Summary #
################################################################################
Selected ingroup species: sp
Number of analyzed individual: 24 (from 1 population(s))
Total number of contig used for sequence analysis: 1113
Total number of SNPs: 5040
Biallelic: 4991
- Triallelic: 49
- Quadriallelic: 0
Fit:
Average Fit: -0.0507419 [-0.06817; -0.031933]
(Fit calculated in 902 contigs)
Weir & Cockerham Fit (Evolution 1984):
Average Weir & Cockerham Fit: 0.00703754 [-0.017669; 0.032047]
(Fit calculated in 902 contigs)
piN/piS ratio:
Average piS in focal species: 0.00585312 [0.005172; 0.006598]
Average piN in focal species: 0.00154546 [0.00133; 0.001782]
Average piN / average piS: 0.264041 [0.223914; 0.310575]
(piS and piN calculated in 902 contigs of average length 50)
FIT is similar to FST but accounts for inbreeding. It ranges from 0-1, whereby 1 = more homozygosity. This number is low, so it suggests no inbreeding and may hint at little or no population structure.
Separate out neutral processes(demography,bottlenecks) and selection. Selection acts locally on the genome, whereas neutral processes act across the whole genome.
Glossary:
Selective sweep: pattern whereby a single adaptive allele sweeps through population(goes to near fixation)
hard sweeps: single adaptive allele in common genetic background
soft sweep: > 1 adaptive alleles in diff genetic backgrounds
Dr. Brody Info Update: Selective Sweeps
Big picture questioN: What maintains variation or produces variation? What reduces genetic variation amongm embers of populations. How do we discover those patterns and what do those patterns mean?
Why do we care? Heuristic perspective. Who else might care? Conservation, human population, We care about selection and other poplations evolve. HOw do we interpret those signatures? Predict when they might occur.
Not only do adaptive alelles become favored, but the genes linked to it are increasing in the population. Linked genes that may not be adaptive are refereed to as hitchhiking genes.
Pleuni Pennings: Stanford University. Her phd students have 2 minute youtube videos for each of their papers.
Sweeps can be hard on a local scale but soft on a global scale!!!
is a rate in which a mutation enters a population
What influences the ability for alleles to become fixed?
Small populations should fix alleles more rapidly.
Suppose we see differences in alleles in gene regions among populations. Populations are habitat or healthy and sick.
What are alternative hypotheses?
Limitations?
Steven leading discussion
Think about genetic targets of adaptation (local adaptaiton). How do selective sweeps happen and how are they distinguished?
Tajima's D:
D = 0 = neutrally evolving
D = negative = all alleles are recent and rare within the population (We expect this under selection)
D = positive = rare mutations are missing relative to what we expect from a random mutation processs.
Lauren Ashlock info update
Admixture coding session:
For a dataset of i individuals in j SNPs.
G = genotype
Q = ancestry (proportion of an individual's genome)
P = allele frequency
Concepts:
Two questions to consider:
Inbreeding coefficient and Structured Populations
F-statistics: created by Sewell Wright
Considers heterozygosity
Heterozygosity: 2 alelles sampled from a population are identical by descent.
Fst:
Fis:
Fit:
Inbreeding influences Fst.
Inbreeding coefficient goes from 0-1; 0 = no inbreeding, 1 is completely inbreeding.
Inbreeding at the local within pop scale, is concave. At the regional level/scale, genetic variation is higher.
Total genetic variation
Melissa info update on annotations:
Annotations take a long time. Imagine a funnel. Broad to specific
For the annotation:
Blast2Go—need to pay
Brute force
Pipelines: Trinotate
BLAST: taking a fasta file and search a database (NCBI non-redundant)
Blastp: proteins
Blastx: translate all 6 reading frames
Announcements:
Info update: SKeller; Metagenomics
What kind of data do we have?
What microbes are present? What are they doing?
What is a microbiome? (Who are they?)
Assemblage of microbial taxa associated with host or environment. This includes: pathogens, symbionts, mutualists, and commensals
Holobiont is host + microbiome
How do you measure a microbiome?
Metagenomics (What are they doing)
Genes being expressed by he microbiome, FUNCTION. The focus is more on RNA-seq or shot gun DNA seq.
How do we process these data?!?!
REVOLUTION!
Most microbes are not culturable.
Glossary:
KTyler info update:
Microbiome can be beneficial
Heritability: proportion of variance in a host-trait
Approaches:
Can be direct: twin pairs- monozygotic = 100% genes;
GWAS - Humans;
Cross study: QTL's
Limitations:
Future:
info update Aburnham:
Outline:
Rarefying vs rarefaction
rarefying = normalization of the data, subsampling depending on lowest N
rarefaction = not normalization; estimate species richness (extrapolation process)
Is what you find depend on method(sampling) or is it real?
Methods:
Past-
Rarefying (reads = sampling effort)
What does this do? Normalizes the data. Removing the large descrepancy in read numbers.
Problems!
USE A MIXTURE MODEL
What does this do?