Generic Exome Analysis Plan

From Genome Analysis Wiki
Latest revision as of 13:07, 21 May 2010

This page outlines a generic plan for analysis of a whole exome sequencing project. The idea is that the points listed here might serve as a starting point for discussion of the analyses needed in a specific project.

The initial version of this document was prepared with input from Shamil Sunyaev, Ron Do and Goncalo Abecasis.

Read Mapping and Variant Calling

The first step in any analysis is to map sequence reads, calibrate base qualities, and call variants. Even at this stage, some simple quality metrics can be evaluated and will help identify potentially problematic samples.

Prior to Mapping

Evaluate Base Composition Along Reads
Calculate the proportion of A, C, G, T bases along each read. Flag runs with evidence of unusual patterns of base composition compared to the target genome.
Evaluate Machine Quality Scores Along Reads
Calculate average quality scores per position. Flag runs with evidence of unusual quality score distributions.
Calculate Number of Reads
Calculate the input number of reads and the number of bases for each sequenced sample.
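As a concrete illustration, the three pre-mapping checks above can be computed in a single pass over a FASTQ file. The sketch below is a minimal stdlib-Python version, assuming four-line FASTQ records and Phred+33 quality encoding; the function and variable names are illustrative, not taken from any particular pipeline tool.

```python
from collections import Counter

def fastq_stats(lines):
    """One-pass FASTQ summary: read/base counts, per-position base
    composition, and per-position mean Phred quality (Phred+33)."""
    base_counts = []   # one Counter of bases per read position
    qual_sums = []     # [summed quality, base count] per position
    n_reads = n_bases = 0
    records = [l.rstrip("\n") for l in lines]
    for i in range(0, len(records), 4):
        seq, qual = records[i + 1], records[i + 3]
        n_reads += 1
        n_bases += len(seq)
        for pos, (base, q) in enumerate(zip(seq, qual)):
            if pos == len(base_counts):
                base_counts.append(Counter())
                qual_sums.append([0, 0])
            base_counts[pos][base] += 1
            qual_sums[pos][0] += ord(q) - 33  # Sanger/Illumina 1.8+ offset
            qual_sums[pos][1] += 1
    composition = [{b: c[b] / sum(c.values()) for b in "ACGT"}
                   for c in base_counts]
    mean_qual = [s / n for s, n in qual_sums]
    return n_reads, n_bases, composition, mean_qual
```

Runs can then be flagged when the composition or quality curves deviate from what is expected for the target genome.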

Read Mapping

Map Reads with Appropriate Read Mapper
Currently, BWA is a convenient, widely used read mapper.
Mark Duplicate Reads
Duplicate reads, whether generated as PCR artifacts during library preparation or optical duplicates during image analysis and base-calling, can mislead variant calling algorithms. To avoid problems, one typically removes from consideration all reads that appear to map at exactly the same location (for paired-end reads, only reads for which both ends map to the same locations are excluded).
Basic Mapping Statistics
We should tally the overall proportion of mapped reads.
We should also tally the proportion of reads that map
  • Inside the target regions
  • Near the target regions (defined as within 200bp of each target)
  • Elsewhere in the genome (defined as regions that are >200bp from each target)
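One way to tally these three categories is to classify each mapped position against the capture targets. A minimal sketch, assuming targets are supplied as 1-based inclusive intervals per chromosome (for a real BAM, interval trees and a proper parser would replace this linear scan):

```python
def classify_position(chrom, pos, targets, flank=200):
    """Classify a mapped position as 'target' (inside a capture region),
    'near' (within `flank` bp of one), or 'off' (anywhere else).
    `targets` maps chromosome -> list of (start, end) 1-based intervals."""
    intervals = targets.get(chrom, [])
    for start, end in intervals:
        if start <= pos <= end:
            return "target"
    for start, end in intervals:
        if start - flank <= pos <= end + flank:
            return "near"
    return "off"
```

Counting the labels over all mapped reads gives the in-target, near-target, and elsewhere proportions directly.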
Recalibrate Base Quality Scores
Base quality scores can be recalibrated by examining mismatch rates at sites that are unlikely to vary (such as those not currently reported as variants in dbSNP or in the most recent 1000 Genomes Project analyses).
Update Base Quality Score Metrics
Generate new curves with base quality scores per position.
Calculate the number of mapped bases that reach at least Q20. Potentially, calculate Q20 equivalent bases by summing the quality scores for bases with base quality >Q20 and dividing the total by 20.
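The two Q20 metrics above are simple to compute once per-base qualities are in hand. A small sketch, which treats "at least Q20" as a >= 20 threshold (the text uses "at least Q20" and ">Q20" interchangeably, so the boundary choice here is an assumption):

```python
def q20_metrics(quals):
    """From per-base Phred qualities of all mapped bases, return
    (count of bases at Q20 or above, Q20-'equivalent' base count:
    the sum of those qualities divided by 20)."""
    q20_plus = [q for q in quals if q >= 20]  # boundary choice: inclusive
    return len(q20_plus), sum(q20_plus) / 20
```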
Evaluate Coverage as Function of GC Content
For each target region, calculate read depth and also the proportion of GC bases in the reference genome. Flag samples where coverage varies strongly as a function of GC content.
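A simple way to flag GC-dependent coverage is to compute the GC fraction of each target from the reference sequence and correlate it with per-target mean depth. The sketch below uses a hand-rolled Pearson correlation so it stays stdlib-only; the flagging threshold would be a project-specific choice.

```python
def gc_fraction(seq):
    """GC fraction of a reference sequence (case-insensitive)."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A sample whose |pearson(gc, depth)| across targets is much larger than the cohort norm would be flagged for review.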

Verify Sample Identities

Verify that Each Sequenced Sample Matches Prior Information
We should verify that each sequenced sample matches previous genotypes for that sample. Ideally, this should be done using a likelihood based approach that can also identify potentially contaminated samples.
Adjudicate Mislabeled Samples
If any samples that don't match prior genotype data are encountered, the read data for these samples should be compared to all available samples to identify potential sample mixups.
Identify Potentially Related Samples
Perhaps this step should be done *after* variant calling?

Generate Variant Calls

Generate Initial Set of Variant Calls
Variant calls should be generated taking into account all available samples simultaneously. We should consider variant calls that fall in target regions, but also those that fall near the target. An open question is whether off-target variant calls will be trustworthy.
Generate Linkage Disequilibrium Aware Set of Variant Calls
A typical exome call set might include many sites where no variant is called due to low coverage. If we can integrate samples with previous GWAS data, we should be able to generate an updated and much improved set of variant calls for each individual. Due to limitations in current calling methods, the quality of variant call sets is expected to increase substantially if multiple variant callers are used and their results are merged democratically.
Annotate Functional Impact of Each Variant
Called variants should be annotated according to their potential function. At a minimum, we should distinguish synonymous, non-synonymous, conserved splice site, 5'UTR, 3'UTR and other variants. Ideally, we should also assess SIFT and PolyPhen scores for conserved variants.
Calculate Overall Frequency Spectrum
Calculate observed frequency spectrum and compare to neutral expectations (which are that the number of variants should be roughly proportional to 1/n, where n is the number of minor alleles).
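The neutral expectation can be made concrete: scale the 1/i weights so the expected counts sum to the number of variants observed. A small illustrative sketch:

```python
def expected_neutral_spectrum(total_variants, max_count):
    """Expected site-frequency spectrum under neutrality: the number
    of variants seen at minor-allele count i is proportional to 1/i.
    Returns expected counts for i = 1..max_count, scaled so they sum
    to `total_variants`."""
    weights = [1 / i for i in range(1, max_count + 1)]
    scale = total_variants / sum(weights)
    return [w * scale for w in weights]
```

Large departures of the observed spectrum from these expectations (for example, a deficit of singletons) often indicate calling or filtering artifacts rather than biology.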
Annotate Overall Variant Characteristics
Calculate overall ratio of transitions to transversions, separately for coding and non-coding variants. Within coding variants, analyse synonymous and non-synonymous variants separately.
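Classifying each substitution as a transition (purine-purine or pyrimidine-pyrimidine) or a transversion is mechanical; a stdlib sketch of the tally:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T; False for the eight transversions."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ts_tv_ratio(variants):
    """Overall transition/transversion ratio for (ref, alt) pairs."""
    ts = sum(1 for r, a in variants if is_transition(r, a))
    tv = len(variants) - ts
    return ts / tv
```

Running this separately on coding and non-coding variants (and on synonymous vs. non-synonymous within coding) gives the stratified ratios described above.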
CpG Sites
Calculate the rate of per base pair heterozygosity at potential CpG sites and compare this to other sites.
Tabulate for Each Sample
  • The number of synonymous and non-synonymous variants
  • The number of unique and shared variants
  • The number of transitions and transversions

Evaluate Variant Calls for Reference Samples

If Previously Sequenced Samples Are Available, Assess Variant Calls There
Assess concordance rate at non-reference sites
If Duplicate Samples are Available, Assess Concordance
Assess concordance rate at non-reference sites.
If Nuclear Families are Available, Assess Mendelian Consistency
When reporting rates of Mendelian inconsistencies, report these not as a fraction of all sites, but as a fraction of sites with at least one non-reference call in the trio.
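The denominator choice matters because all-reference trio sites are abundant and nearly always consistent, so including them makes any error rate look misleadingly small. A sketch of the calculation, assuming genotypes are coded as minor-allele dosages (0/1/2):

```python
def transmitted(g):
    """Alleles a parent with dosage g (0, 1, or 2) can transmit."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[g]

def mendel_error_rate(trios):
    """Mendelian-inconsistency rate per site with at least one
    non-reference call in the trio. `trios` holds per-site
    (father, mother, child) minor-allele dosages."""
    informative = errors = 0
    for f, m, c in trios:
        if f == m == c == 0:
            continue  # all-reference sites are excluded from the denominator
        informative += 1
        possible = {a + b for a in transmitted(f) for b in transmitted(m)}
        if c not in possible:
            errors += 1
    return errors / informative
```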

Variant Filters

Initial sets of SNP calls invariably will include many false positives. The fraction of false positives among all variants called is likely to increase as more and more samples are sequenced. To keep the fraction of false positives under control, it is important to both apply an increasingly strict set of quality control filters but also to experimentally validate some newly discovered variants. Many of these filters are currently implemented in GATK.

Mapping Quality Filter
Consider removing variants at sites with low mapping quality scores. Even if the average mapping quality score for a site is high, consider removing variants at sites where a noticeable fraction of reads have low mapping quality scores.
Allele Balance Filter
Among individuals who are assigned a heterozygous genotype, check the proportion of reads supporting each allele. Consider filtering out variants where one of the alleles accounts for <30% of reads.
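The per-genotype check reduces to comparing the minor supporting read fraction against the threshold; a minimal sketch (the 30% cutoff is the one suggested above, exposed as a parameter):

```python
def allele_balance_fails(ref_reads, alt_reads, min_fraction=0.30):
    """For a heterozygous call, True if either allele is supported by
    fewer than `min_fraction` of the covering reads."""
    total = ref_reads + alt_reads
    return min(ref_reads, alt_reads) / total < min_fraction
```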
Local Realignment Filter
Consider removing single nucleotide variants at sites where local realignment of all covering sequence reads suggests that an indel polymorphism is present in the population.
Read Depth
Consider filtering out variants at sites where total read depth is unusually low. Unfortunately, capture protocols introduce very large amounts of variation in sequencing depth and it is usually not possible to accurately filter out sites sequenced at very high depth.
Strand Bias
Check whether each variant allele is supported by reads that map to both strands. Variants that are supported by reads on only one strand are almost always false positives.

Special Considerations for Admixed Samples

Estimate Local Ancestry Using GWAS Data
For studies that include admixed samples, we should estimate local ancestry using GWAS data. If GWAS data are not available, it is strongly recommended that they be generated. In principle, local ancestry estimates can be generated even before exome sequencing is complete.
Estimate Global Ancestry Covariates Using PCA or MDS Analysis

Initial Association Analyses

We anticipate that, at least early on, the initial association analysis of whole exome datasets in the context of complex trait association studies will focus on identifying and resolving quality control issues that might result in unexpected artifacts.

Initial Single SNP Tests

In principle, these tests only have power for common variants. In practice, particularly when permutation based methods are used to assess significance, there should be little loss in power by testing all variants using single SNP tests. These tests should include:

Logistic Regression Based Tests for Discrete Traits
Discrete outcomes should be evaluated using logistic regression. It is important to include appropriate covariates. For most traits, these might include age and sex and, potentially, principal components of ancestry.
Linear Regression Based Tests for Quantitative Traits
For quantitative traits that are not strongly selected, linear regression based tests can also be used. Again, it is important to include an appropriate set of covariates. For many quantitative traits, it may be a very good idea to normalize traits to minimize the impact of outliers on association results.
Using Genotypes as Outcomes for Selected Quantitative Traits
For both discrete and quantitative traits, analysis can be repeated using genotypes (scored as 0, 1 and 2) as outcomes and phenotypes as predictors.

Q-Q Plots

After carrying out initial single SNP tests, generate Q-Q plots for each analysis. Verify that Q-Q plots are reasonable and that genomic control value is close to 1.0. If not, refine sample and variant filters as needed.
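The genomic control value referenced above is conventionally estimated as the median observed 1-df chi-square association statistic divided by the null median (about 0.455). A small stdlib sketch:

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution, 1 df

def genomic_control_lambda(chisq_stats):
    """Genomic control inflation factor: median observed 1-df chi-square
    statistic over its null median. Values well above 1.0 suggest
    residual stratification or quality-control artifacts."""
    return median(chisq_stats) / CHI2_1DF_MEDIAN
```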

Burden Tests

The same analyses that were originally carried out for single variants should be carried out for groups of rare variants. In principle, one could simply use the presence of a rare variant (or a particular class of rare variant, such as a non-synonymous variant or a newly discovered variant) as a predictor and repeat the logistic regression, linear regression or genotype regression described above. For an initial pass, I think the precise form of this analysis is not critical, because the next step is to...

More Q-Q Plots

After carrying out initial burden tests, generate Q-Q plots for each analysis. Verify that Q-Q plots are reasonable and that genomic control value is close to 1.0. If not, refine sample and variant filters as needed.

Visualize Results

A number of displays will likely be useful. Probably these should include:

Think You Are Done?

No way!!!

Indels and Structural Variants

You still need a plan to call and evaluate short insertions and deletions as well as larger structural variants.

Pathway Based Analyses

Carry out analyses that include groups of genes with similar biological function (for example, according to Gene Ontology or Kyoto Encyclopedia of Genes and Genomes annotations).