Changes

410 bytes added, 13:07, 21 May 2010

→‎Read Mapping

Line 21: Line 21:

; Map Reads with Appropriate Read Mapper

−

: Currently, [bio-bwa.sourceforge.net/bwa.shtm BWA] is a convenient, widely used read mapper.

+

: Currently, [http://bio-bwa.sourceforge.net/bwa.shtm BWA] is a convenient, widely used read mapper.

; Mark Duplicate Reads

Line 34: Line 34:

; Recalibrate Base Quality Scores

−

: Base quality scores can be updated by comparing sites that are unlikely to vary (such as those not currently reported as variants in dbSNP or in the most recent [[1000 ~~Genome~~ Project]] analyses.

+

: Base quality scores can be updated by comparing sites that are unlikely to vary (such as those not currently reported as variants in dbSNP or in the most recent [http://www.1000genomes.org 1000 Genomes Project] analyses).

; Update Base Quality Score Metrics

Line 88: Line 88:

; If Nuclear Families are Available, Assess Mendelian Consistency

−

: When reporting rates of Mendelian inconsistencies, report these not on a ~~per site basis~~ but~~, instead, based on the number~~ of sites with at least one non-reference call in the trio.

+

: When reporting rates of Mendelian inconsistencies, report these not as a fraction of all sites, but as a fraction of sites with at least one non-reference call in the trio.
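
The reporting rule above can be sketched as follows. This is a minimal illustration, not part of the analysis plan itself: it assumes biallelic sites, genotypes coded as the number of non-reference alleles (0, 1 or 2), no missing calls, and autosomal inheritance. The function names are hypothetical.

```python
def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed from one allele per parent.

    Genotypes are counts of non-reference alleles (0, 1 or 2); this coding
    is an assumption made for the sketch.
    """
    # Alleles (0 = reference, 1 = non-reference) each genotype can transmit.
    transmissible = {0: {0}, 1: {0, 1}, 2: {1}}
    return any(
        f + m == child
        for f in transmissible[father]
        for m in transmissible[mother]
    )


def mendelian_inconsistency_rate(trio_genotypes):
    """Fraction of inconsistent sites among sites with a non-reference call.

    trio_genotypes is a list of (father, mother, child) tuples, one per site.
    Sites where all three calls are homozygous reference are excluded from
    the denominator. Returns None if no site qualifies.
    """
    informative = [g for g in trio_genotypes if any(x > 0 for x in g)]
    if not informative:
        return None
    inconsistent = sum(1 for g in informative if not is_mendelian_consistent(*g))
    return inconsistent / len(informative)


sites = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 0, 1), (2, 2, 1), (1, 1, 2)]
rate = mendelian_inconsistency_rate(sites)
```

In this toy input two of the six sites are inconsistent, but only four sites carry a non-reference call, so the reported rate is 2/4 rather than 2/6.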

== Variant Filters ==

−

Initial sets of SNP calls ~~invariable~~ will include many false positives. ~~As a proportion~~ of ~~total~~ variants called~~, the number of false positives~~ is likely to increase as more and more samples are sequenced. To keep the ~~number~~ of false positives under control, it is important to both apply an increasingly strict set of quality control filters but also to experimentally validate some newly discovered variants. Many of these filters are currently implemented in [[GATK]].

+

Initial sets of SNP calls will invariably include many false positives. The fraction of false positives among all variants called is likely to increase as more and more samples are sequenced. To keep this fraction under control, it is important both to apply an increasingly strict set of quality control filters and to experimentally validate some newly discovered variants. Many of these filters are currently implemented in [[GATK]].

; Mapping Quality Filter

−

: Consider removing variants at ~~sides~~ with low mapping quality scores. Even if the average mapping quality score for a site is high, consider removing variants at sites where a ~~large~~ fraction of reads have low mapping quality scores.

+

: Consider removing variants at sites with low mapping quality scores. Even if the average mapping quality score for a site is high, consider removing variants at sites where a noticeable fraction of reads have low mapping quality scores.
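
A rough sketch of the two checks described above, assuming the mapping quality scores of all reads covering a site are already available as a list. The thresholds (`min_mean`, `low_cutoff`, `max_low_fraction`) are placeholders chosen for illustration; the text does not specify values.

```python
def passes_mapping_quality(mapqs, min_mean=30.0, low_cutoff=10, max_low_fraction=0.1):
    """Return True if a site passes both mapping quality checks.

    mapqs: mapping quality scores of the reads covering the site.
    All threshold defaults are illustrative assumptions, not recommendations.
    """
    if not mapqs:
        return False  # no covering reads, nothing to support a call
    mean_mapq = sum(mapqs) / len(mapqs)
    low_fraction = sum(1 for q in mapqs if q < low_cutoff) / len(mapqs)
    # A site must have a high average AND few poorly mapped reads.
    return mean_mapq >= min_mean and low_fraction <= max_low_fraction


# High average (48) but 20% of reads map ambiguously: filtered out.
mixed_site = passes_mapping_quality([60] * 8 + [0] * 2)
clean_site = passes_mapping_quality([60] * 10)
```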

; Allele Balance Filter

: Among individuals who are assigned a heterozygous genotype, check the proportion of reads supporting each allele. Consider filtering out variants where one of the alleles accounts for less than 30% of reads.
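
As an illustration of the allele balance check, the sketch below pools read counts across all individuals called heterozygous at a site. Pooling is an assumption made here; the check could equally be applied per individual. The input layout and the function name are hypothetical.

```python
def fails_allele_balance(het_read_counts, min_fraction=0.30):
    """Return True if one allele accounts for less than min_fraction of reads.

    het_read_counts: list of (ref_reads, alt_reads) pairs, one pair per
    individual called heterozygous at the site. Counts are pooled across
    individuals (an assumption of this sketch).
    """
    ref_total = sum(ref for ref, _ in het_read_counts)
    alt_total = sum(alt for _, alt in het_read_counts)
    total = ref_total + alt_total
    if total == 0:
        return False  # no heterozygotes or no reads: nothing to evaluate
    return min(ref_total, alt_total) / total < min_fraction


balanced = fails_allele_balance([(10, 10), (8, 12)])  # minor fraction 18/40
skewed = fails_allele_balance([(18, 2), (15, 5)])     # minor fraction 7/40
```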

+

; Local Realignment Filter

+

: Consider removing single nucleotide variants at sites where local realignment of all covering sequence reads suggests that an [[indel]] polymorphism is present in the population.

; Read Depth

−

: Consider filtering out variants at sites where total read depth is unusually low. Unfortunately, capture protocols introduce very large amounts of variation in sequencing depth and it is usually not possible to ~~also~~ accurately filter out sites sequenced at very high depth.

+

: Consider filtering out variants at sites where total read depth is unusually low. Unfortunately, capture protocols introduce very large amounts of variation in sequencing depth and it is usually not possible to accurately filter out sites sequenced at very high depth.

+

; Strand Bias

+

: Check whether each variant allele is supported by reads that map to both strands. Variants that are supported by reads on only one strand are almost always false positives.
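
A minimal sketch of the strand check, assuming that per-strand counts of reads carrying the variant allele have already been tallied. The parameter `min_reads_per_strand` is a hypothetical knob added for illustration; the text only asks for support on both strands.

```python
def supported_on_both_strands(forward_alt_reads, reverse_alt_reads, min_reads_per_strand=1):
    """Return True if the variant allele is seen on both strands.

    forward_alt_reads / reverse_alt_reads: number of reads carrying the
    variant allele that map to the forward / reverse strand.
    """
    return (
        forward_alt_reads >= min_reads_per_strand
        and reverse_alt_reads >= min_reads_per_strand
    )


one_strand_only = supported_on_both_strands(12, 0)
both_strands = supported_on_both_strands(7, 5)
```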

= Special Considerations for Admixed Samples =

Line 112: Line 118:

= Initial Association Analyses =

−

We anticipate that, at least early on, the initial association analysis of whole ~~genome~~ datasets in the context of complex trait association studies will focus on identifying and resolving quality control issues that might result in unexpected artifacts.

+

We anticipate that, at least early on, the initial association analysis of whole exome datasets in the context of complex trait association studies will focus on identifying and resolving quality control issues that might result in unexpected artifacts.

== Initial Single SNP Tests ==

Line 133: Line 139:

== Burden Tests ==

−

The same analyses that were originally carried for single variants should be carried out for groups of rare variants. In principle, one could ~~simple~~ use the presence of a rare variant (or a particular class of rare variant, such as a non-synonymous variant or a newly discovered variant) as a predictor and repeat the logistic regression, linear regression or genotype regression described above. For an initial pass, I think the precise form of this analysis is not critical, because the next step is to...

+

The same analyses that were originally carried out for single variants should be repeated for groups of rare variants. In principle, one could simply use the presence of a rare variant (or a particular class of rare variant, such as a non-synonymous variant or a newly discovered variant) as a predictor and repeat the logistic regression, linear regression or genotype regression described above. For an initial pass, I think the precise form of this analysis is not critical, because the next step is to...
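
The collapsing idea can be sketched as follows, assuming a case-control design with no covariates. Each individual is reduced to a carrier indicator (at least one rare variant in the group), and the log odds ratio is computed from the resulting 2x2 table. Without covariates, this matches the coefficient a logistic regression on the indicator would estimate, apart from the 0.5 added to every cell, which is a choice made here to guard against empty cells. All names and the data layout are illustrative.

```python
import math


def carrier_indicator(genotype_rows):
    """Collapse rare variants into one predictor per individual.

    genotype_rows: one list per individual, holding non-reference allele
    counts at each rare variant in the group.
    Returns 1 if the individual carries any rare variant, else 0.
    """
    return [1 if any(g > 0 for g in row) else 0 for row in genotype_rows]


def carrier_log_odds_ratio(carriers, is_case):
    """Log odds ratio of case status for carriers versus non-carriers.

    Adds 0.5 to every cell of the 2x2 table (an assumption of this sketch)
    so the ratio stays finite when a cell is empty.
    """
    case_carrier = sum(1 for c, y in zip(carriers, is_case) if c and y) + 0.5
    case_non = sum(1 for c, y in zip(carriers, is_case) if not c and y) + 0.5
    ctrl_carrier = sum(1 for c, y in zip(carriers, is_case) if c and not y) + 0.5
    ctrl_non = sum(1 for c, y in zip(carriers, is_case) if not c and not y) + 0.5
    return math.log((case_carrier * ctrl_non) / (case_non * ctrl_carrier))


genotypes = [
    [0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 2],  # four cases
    [0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0],  # four controls
]
status = [1, 1, 1, 1, 0, 0, 0, 0]
carriers = carrier_indicator(genotypes)
log_or = carrier_log_odds_ratio(carriers, status)
```

Restricting `genotype_rows` to a particular class of variants (for example, non-synonymous variants only) gives the class-specific version of the same predictor.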

== More Q-Q Plots ==

Goncalo



Generic Exome Analysis Plan (view source)

Revision as of 13:07, 21 May 2010
