Difference between revisions of "ContaminationDetection"

Revision as of 16:48, 24 April 2012

Overview

DNA sample contamination (or sample swap) is quite common in sequencing studies. It is possible to detect DNA contamination from sequence reads or from intensity data from SNP genotyping arrays by modeling the likelihood of the observed data as a mixture of two samples and estimating the fraction contributed by the contaminating sample.
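The mixture idea can be illustrated with a small sketch. This is a simplified toy model, not the VerifyBamID implementation: it assumes biallelic sites, a known genotype for the intended sample, a Hardy-Weinberg prior on the unknown contaminating genotype, a fixed per-base error rate, and a plain grid search. The function names `site_log_likelihood` and `estimate_contamination` are hypothetical.

```python
import math


def site_log_likelihood(alpha, n_ref, n_alt, genotype, alt_freq, error=0.01):
    """Log-likelihood of the read counts at one biallelic site (toy model).

    alpha    : assumed fraction of reads from the contaminating sample
    genotype : number of alt alleles (0, 1, 2) in the intended sample
    alt_freq : population frequency of the alt allele
    error    : assumed per-base error rate (must be > 0)
    """
    # Hardy-Weinberg prior on the unknown genotype of the contaminating sample.
    prior = [(1 - alt_freq) ** 2, 2 * alt_freq * (1 - alt_freq), alt_freq ** 2]
    total = 0.0
    for g2, p_g2 in enumerate(prior):
        # Expected alt-allele fraction among reads from the two-sample mixture.
        p_alt = (1 - alpha) * genotype / 2 + alpha * g2 / 2
        # Probability of observing an alt base once sequencing error is included.
        p_obs = p_alt * (1 - error) + (1 - p_alt) * error
        # The binomial coefficient is omitted because it does not depend on alpha.
        total += p_g2 * p_obs ** n_alt * (1 - p_obs) ** n_ref
    return math.log(total)


def estimate_contamination(sites, grid_size=101):
    """Grid-search the alpha in [0, 0.5] that maximizes the summed log-likelihood.

    sites: list of (n_ref, n_alt, genotype, alt_freq) tuples.
    """
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(grid_size):
        alpha = 0.5 * i / (grid_size - 1)
        ll = sum(site_log_likelihood(alpha, *site) for site in sites)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```

Each site is passed as a `(n_ref, n_alt, genotype, alt_freq)` tuple. A production tool would also use per-base qualities and a more careful numerical treatment at high depth; this sketch only shows how excess alternate-allele reads at homozygous sites translate into a nonzero mixture fraction.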

Depending on the types of available data, one can detect DNA sample contamination in the following ways.

When a sequenced sample also has external array-based genotypes available
- VerifyBamID software can estimate the level of sample contamination and detect sample swaps from aligned sequence reads and the external genotypes
- The basic mathematical concept is described in Verifying_Sample_Identities_-_Implementation
When a sequenced sample does not have external genotypes available
- VerifyBamID software can still estimate sample contamination from aligned sequence reads and population minor allele frequency
- The key idea of the method is to capture the excess heterozygosity in the contaminated sample by modeling the sequence reads as a mixture of independent samples.
When a sample has array-based genotypes but has not yet been sequenced
- VerifyIDintensity software can estimate the level of DNA sample contamination from array intensity data and population allele frequency
  - The key idea is similar to that of VerifyBamID with sequence data alone
  - VerifyIDintensity models array intensity data instead of sequence reads
  - When a large number of samples are genotyped together, the 'multi-sample' option estimates the intensity distribution for each marker across multiple samples. When only one or a few samples are genotyped, the 'per-sample' option estimates the intensity distribution for each sample across all markers.
- BAFRegress software can estimate the level of sample contamination and test for the presence of contamination by regressing the allele frequency with respect to the B allele frequency (BAF).
  - This method is sensitive enough to estimate low levels of contamination.
  - The method provides a well-calibrated Type I error rate when testing the null hypothesis of no contamination in a DNA sample.
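
One possible reading of the regression idea behind BAFRegress is sketched below. It is an illustrative assumption, not the tool's actual procedure: at markers called homozygous AA, an uncontaminated sample should show a BAF near zero, while contamination should raise the BAF roughly in proportion to the population frequency of the B allele, so the slope of an ordinary least-squares fit of BAF on allele frequency serves as a rough contamination signal. The function names and the `(call, baf, b_allele_freq)` input layout are hypothetical.

```python
def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def baf_regression_slope(markers):
    """Slope of BAF regressed on population B allele frequency at AA calls.

    markers: iterable of (call, baf, b_allele_freq), where call is one of
    "AA", "AB", "BB" (an assumed input layout for this sketch).
    """
    # Keep only homozygous-AA markers, where BAF should be near 0 if clean.
    hom = [(freq, baf) for call, baf, freq in markers if call == "AA"]
    freqs = [freq for freq, _ in hom]
    bafs = [baf for _, baf in hom]
    slope, _intercept = ols_slope_intercept(freqs, bafs)
    return slope
```

A slope near zero is what this toy model expects for a clean sample; a clearly positive slope suggests that alleles from another individual are contributing signal. A formal test of the null hypothesis would also need a standard error for the slope, which is omitted here.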

History

The initial description of method 1 can be found at Verifying_Sample_Identities_-_Implementation (last modified on April 29, 2010).

@@ Line 12: / Line 12: @@
 #* The key idea of the method is to capture the excessive heterozygosity in the contaminated sample by modeling the sequence reads as mixture of independent samples.
 # When a sample has array-based genotypes but not yet sequenced
-#* [[VerifyIntensity]] software can estimate the levels of DNA sample contamination from aligned sequence reads and population allele frequency
+#* [[VerifyIDintensity]] software can estimate the levels of DNA sample contamination from aligned sequence reads and population allele frequency
 #** The key idea is similar to that of [[VerifyBamID]] with sequence data alone
-#** [[VerifyIntensity]] models array intensity data instead of sequence reads
+#** [[VerifyIDintensity]] models array intensity data instead of sequence reads
 #** When a large number of samples are genotyped together, the 'multi-sample' option will estimate the intensity distribution for each marker across multiple samples. When only one a few samples are genotyped, 'per-sample' option will estimate the intensity distribution for each sample across all markers.
 #* [[BAFRegress]] software can estimate levels of sample contamination and test the presence of contamination by regressing the allele frequency with respect to the B allele frequency (BAF).
