Difference between revisions of "Generic Exome Analysis Plan"

From Genome Analysis Wiki

Revision as of 22:48, 20 April 2010

This page outlines a generic plan for analysis of a whole exome sequencing project. The idea is that the points listed here might serve as a starting point for discussion of the analyses needed in a specific project.

Read Mapping and Variant Calling

The first step in any analysis is to map sequence reads, calibrate base qualities, and call variants. Even at this stage, some simple quality metrics can be evaluated and will help identify potentially problematic samples.

Prior to Mapping

Evaluate Base Composition Along Reads
Calculate the proportion of A, C, G, T bases along each read. Flag runs with evidence of unusual patterns of base composition compared to the target genome.
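As a starting point for discussion, the per-position base composition could be computed along the following lines. This is a minimal sketch, assuming reads are available as plain strings; the function name `base_composition` is hypothetical and not part of any existing tool.

```python
from collections import Counter

def base_composition(reads):
    """Return one dict per read position with the proportion of each base."""
    counts = []  # one Counter per read position
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    # Convert counts to proportions; N is reported alongside A, C, G, T.
    return [
        {base: c[base] / sum(c.values()) for base in "ACGTN"}
        for c in counts
    ]

profile = base_composition(["ACGT", "ACGA", "TCGA"])
print(profile[0])  # proportions of each base at the first read position
```

A run could then be flagged when these proportions drift far from the composition expected for the target genome; the threshold for "far" is a project-specific choice.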
Evaluate Machine Quality Scores Along Reads
Calculate average quality scores per position. Flag runs with evidence of unusual quality score distributions.
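The average quality per position might be sketched as below. The sketch assumes quality strings in FASTQ format with a Phred+33 offset; some instruments and pipeline versions use a different offset (for example 64), so `offset` is left as a parameter. The function name is hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred quality at each read position (assumed ASCII offset)."""
    totals, counts = [], []
    for quality in quality_strings:
        for i, char in enumerate(quality):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]

# 'I' encodes Q40 and '5' encodes Q20 under the Phred+33 assumption.
means = mean_quality_per_position(["II5", "I5"])
print(means)
```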
Calculate Number of Reads
Calculate the input number of reads and number of bases for each sequenced sample.

Read Mapping

Map Reads with Appropriate Read Mapper
Currently, BWA (bio-bwa.sourceforge.net/bwa.shtm) is a convenient, widely used read mapper.
Basic Mapping Statistics
We should tally the overall proportion of mapped reads.
We should also tally the proportion of reads that map:
  • Inside the target regions
  • Near the target regions (defined as within 200bp of each target)
  • Elsewhere in the genome (defined as regions that are >200bp from each target)
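The three categories above could be tallied roughly as follows. This is a sketch under several assumptions: targets are half-open `(start, end)` intervals on a single chromosome, each read is classified by a single mapped coordinate, and the function name `classify_position` is hypothetical. A real implementation would work per chromosome and would probably consider the whole read span.

```python
from collections import Counter

def classify_position(pos, targets, flank=200):
    """Label a position as inside, near (within `flank` bp), or elsewhere."""
    nearest = None
    for start, end in targets:
        if start <= pos < end:
            return "inside"
        # Distance to the closest base of this target.
        distance = start - pos if pos < start else pos - (end - 1)
        if nearest is None or distance < nearest:
            nearest = distance
    if nearest is not None and nearest <= flank:
        return "near"
    return "elsewhere"

targets = [(1000, 1100)]
read_starts = [1050, 900, 1300, 5000]
tally = Counter(classify_position(p, targets) for p in read_starts)
proportions = {label: n / len(read_starts) for label, n in tally.items()}
print(proportions)
```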
Recalibrate Base Quality Scores
Base quality scores can be updated by comparing reads to the reference at sites that are unlikely to vary (such as those not currently reported as variants in dbSNP or in the most recent 1000 Genomes Project analyses).
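The idea can be illustrated with a small sketch: at sites assumed not to vary, every mismatch with the reference is treated as a sequencing error, and the observed error rate for each reported quality score is converted back to the Phred scale. The pseudocounts (+1, +2) and the function names are assumptions made for this example, not part of any specific recalibration tool; real recalibration would also stratify by covariates such as read position.

```python
import math

def empirical_quality(mismatches, total):
    """Phred-scaled empirical error rate, with a small pseudocount."""
    error_rate = (mismatches + 1) / (total + 2)
    return -10 * math.log10(error_rate)

def recalibration_table(counts):
    """Map each reported quality to its rounded empirical quality.

    counts: {reported_quality: (mismatches, total_bases)} at non-variant sites.
    """
    return {q: round(empirical_quality(m, n)) for q, (m, n) in counts.items()}

# Hypothetical counts: bases reported as Q30 mismatch about 1% of the time.
table = recalibration_table({30: (9, 998), 40: (0, 998)})
print(table)
```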
Update Base Quality Score Metrics
Generate new curves with base quality scores per position.
Calculate the number of mapped bases that reach at least Q20. Potentially, calculate Q20 equivalent bases by summing the quality scores for bases with base quality >Q20 and dividing the total by 20.
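The two metrics could be computed as in the sketch below, which takes a flat list of integer base qualities for the mapped bases. It follows the wording above literally: the count includes bases at Q20 or higher, while the Q20 equivalent sum uses bases strictly above Q20. The function name is hypothetical.

```python
def q20_metrics(qualities, threshold=20):
    """Return (bases at or above threshold, threshold-equivalent bases)."""
    # Count of bases that reach at least the threshold.
    n_passing = sum(1 for q in qualities if q >= threshold)
    # Sum of qualities strictly above the threshold, divided by the threshold.
    equivalent = sum(q for q in qualities if q > threshold) / threshold
    return n_passing, equivalent

n_q20, q20_equivalent = q20_metrics([10, 20, 30, 40])
print(n_q20, q20_equivalent)
```

Under this definition a single Q40 base counts as two Q20 equivalent bases, which rewards higher-quality data more than the simple count does.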

Verify Sample Identities

Verify that Each Sequenced Sample Matches Prior Information
We should verify that each sequenced sample matches previous genotypes for that sample. Ideally, this should be done using a likelihood-based approach that can also identify potentially contaminated samples.
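One possible shape for such a likelihood is sketched below. It is an illustration only, under stated assumptions: each site carries a prior genotype (0, 1 or 2 copies of the alternate allele), a population alternate allele frequency, and reference and alternate read counts; reads are independent; the error rate is fixed; and contamination is modelled as a fraction `alpha` of reads drawn from the population. All names are hypothetical.

```python
import math

def log_likelihood(sites, alpha=0.0, error=0.01):
    """Log-likelihood of read counts given prior genotypes.

    sites: iterable of (genotype, alt_freq, n_ref, n_alt).
    alpha: assumed fraction of reads coming from a contaminating source.
    """
    total = 0.0
    for genotype, alt_freq, n_ref, n_alt in sites:
        # Expected alternate allele fraction among sequenced molecules.
        mix = (1 - alpha) * genotype / 2 + alpha * alt_freq
        # Allow for sequencing error in either direction.
        p_alt = mix * (1 - error) + (1 - mix) * error
        total += n_alt * math.log(p_alt) + n_ref * math.log(1 - p_alt)
    return total

def best_contamination(sites, grid=None):
    """Grid-search the contamination fraction with the highest likelihood."""
    grid = grid or [i / 100 for i in range(51)]
    return max(grid, key=lambda a: log_likelihood(sites, alpha=a))

matching = [(0, 0.3, 20, 0), (2, 0.3, 0, 20), (1, 0.3, 10, 10)]
swapped = [(2, 0.3, 20, 0), (0, 0.3, 0, 20), (1, 0.3, 10, 10)]
print(log_likelihood(matching) > log_likelihood(swapped))
print(best_contamination(matching))
```

A sample whose likelihood is much higher under another individual's genotypes than under its own would be a candidate mislabel, and a clearly non-zero best-fitting `alpha` would suggest contamination.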
Adjudicate Mislabeled Samples
If any samples that don't match prior genotype data are encountered, the read data for these samples should be compared to all available samples to identify potential sample mixups.
Identify Potentially Related Samples
Perhaps this step should be done after variant calling?

Generate Variant Calls

Generate Initial Set of Variant Calls
Variant calls should be generated taking into account all available samples simultaneously. We should consider variant calls that fall in target regions but also those that fall near the target. An open question is whether off-target variant calls will be trustworthy.
Generate Linkage Disequilibrium Aware Set of Variant Calls
A typical exome call set might include many sites where no variant is called due to low coverage. If we can integrate samples with previous GWAS data, we should be able to generate an updated and much improved set of variant calls for each individual.
Annotate Functional Impact of Each Variant
Called variants should be annotated according to their potential function. At a minimum, we should distinguish synonymous, non-synonymous, conserved splice site, 5'UTR, 3'UTR and other variants. Ideally, we should also assess SIFT and PolyPhen scores for conserved variants.
Calculate Overall Frequency Spectrum
Calculate the observed frequency spectrum and compare it to neutral expectations (which are that the number of variants with n copies of the minor allele should be roughly proportional to 1/n).
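The comparison could be sketched as follows, taking one minor allele count per variant. The expectation simply rescales 1/n so that it sums to the observed number of variants; this is the rough approximation described above rather than an exact folded spectrum, and the function names are hypothetical.

```python
from collections import Counter

def frequency_spectrum(minor_allele_counts):
    """Number of variants observed at each minor allele count."""
    return Counter(minor_allele_counts)

def neutral_expectation(total_variants, max_count):
    """Expected variants per allele count, proportional to 1/n."""
    weights = {n: 1 / n for n in range(1, max_count + 1)}
    scale = total_variants / sum(weights.values())
    return {n: scale * w for n, w in weights.items()}

observed = frequency_spectrum([1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3])
expected = neutral_expectation(sum(observed.values()), max(observed))
for n in sorted(expected):
    print(n, observed[n], round(expected[n], 2))
```

An excess of singletons relative to the expectation, as in this toy input, could reflect population growth, purifying selection, or sequencing errors, so it is worth checking against the quality metrics above.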
Annotate Overall Variant Characteristics
Calculate overall ratio of transitions to transversions, separately for coding and non-coding variants. Within coding variants, analyse synonymous and non-synonymous variants separately.
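For single nucleotide variants, the ratio can be computed as in this short sketch, which assumes each variant is given as a (reference, alternate) pair of bases. The same function could be applied separately to coding, non-coding, synonymous and non-synonymous subsets. The function name is hypothetical.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(variants):
    """Ratio of transitions to transversions for (ref, alt) SNV pairs."""
    n_transitions = sum(1 for pair in variants if pair in TRANSITIONS)
    n_transversions = len(variants) - n_transitions
    if n_transversions == 0:
        return float("inf")
    return n_transitions / n_transversions

ratio = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
print(ratio)
```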
Tabulate for Each Sample
  • The number of synonymous and non-synonymous variants
  • The number of unique and shared variants
  • The number of transitions and transversions

Special Considerations for Admixed Samples

Estimate Local Ancestry Using GWAS Data
For studies that include admixed samples, we should estimate local ancestry using GWAS data. If GWAS data are not available, it is strongly recommended that they be generated. In principle, local ancestry estimates can be generated even before exome sequencing is complete.

Initial Association Analyses

We anticipate that, at least early on, the initial association analysis of whole exome datasets in the context of complex trait association studies will focus on identifying and resolving quality control issues that might result in unexpected artifacts.