Difference between revisions of "ContaminationDetection"
From Genome Analysis Wiki
Revision as of 07:10, 24 April 2012
Overview
DNA sample contamination (or sample swap) is quite common in sequencing studies. Contamination can be detected from sequence reads, or from the intensity data of SNP genotyping arrays, by modeling the likelihood of the observed data as a mixture of two samples and estimating the fraction contributed by the contaminating sample.
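The mixture idea can be illustrated with a minimal Python sketch. It is not the VerifyBamID implementation: the function names, the fixed base-call error rate `err`, the Hardy-Weinberg genotype priors, and the grid search are all assumptions made for illustration. Each site is a tuple of reference read count, alternate read count, and population alternate-allele frequency, and the contaminating fraction `alpha` is chosen to maximize the summed log-likelihood.

```python
import math
import random

GENOTYPES = (0, 1, 2)  # number of alternate alleles carried


def hwe_prior(g, f):
    """Hardy-Weinberg genotype probability for alternate-allele frequency f."""
    return ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)[g]


def site_likelihood(n_ref, n_alt, f, alpha, err=0.01):
    """Likelihood of the read counts at one site, summed over both unknown genotypes.

    The binomial coefficient is omitted because it does not depend on alpha.
    """
    total = 0.0
    for g1 in GENOTYPES:      # genotype of the intended sample
        for g2 in GENOTYPES:  # genotype of the contaminating sample
            p_alt = (1 - alpha) * g1 / 2 + alpha * g2 / 2
            q = p_alt * (1 - err) + (1 - p_alt) * err  # allow for base-call error
            total += hwe_prior(g1, f) * hwe_prior(g2, f) * q ** n_alt * (1 - q) ** n_ref
    return total


def log_likelihood(sites, alpha, err=0.01):
    """Sum of per-site log-likelihoods; sites are (n_ref, n_alt, allele_freq)."""
    return sum(math.log(site_likelihood(r, a, f, alpha, err)) for r, a, f in sites)


def estimate_contamination(sites, step=0.01, max_alpha=0.5, err=0.01):
    """Grid-search maximum-likelihood estimate of alpha on [0, max_alpha]."""
    grid = [i * step for i in range(int(round(max_alpha / step)) + 1)]
    return max(grid, key=lambda a: log_likelihood(sites, a, err))


def simulate_sites(n_sites, depth, alpha, err=0.01, seed=0):
    """Hypothetical helper: draw synthetic read counts under the same model."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        f = rng.uniform(0.05, 0.5)
        g1 = sum(rng.random() < f for _ in range(2))
        g2 = sum(rng.random() < f for _ in range(2))
        p_alt = (1 - alpha) * g1 / 2 + alpha * g2 / 2
        q = p_alt * (1 - err) + (1 - p_alt) * err
        n_alt = sum(rng.random() < q for _ in range(depth))
        sites.append((depth - n_alt, n_alt, f))
    return sites


if __name__ == "__main__":
    synthetic = simulate_sites(n_sites=500, depth=30, alpha=0.1, seed=2)
    print("estimated alpha:", estimate_contamination(synthetic))
```

The search is restricted to `alpha <= 0.5` because the two-sample mixture is symmetric in `alpha` and `1 - alpha`. The data in the usage example are simulated, not taken from a real sample.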
Depending on the types of available data, we can detect contamination in the following ways.
- When a sequenced sample also has external array-based genotypes available
  - VerifyBamID software can estimate the level of sample contamination and detect sample swaps from aligned sequence reads and the external genotypes.
    - The basic mathematical concept is described in Verifying_Sample_Identities_-_Implementation.
- When a sequenced sample does not have external genotypes available
  - VerifyBamID software can still estimate sample contamination from aligned sequence reads and population minor allele frequencies.
    - The key idea of the method is to capture the excess heterozygosity in the contaminated sample by modeling the sequence reads as a mixture of independent samples.
- When a sample has array-based genotypes but has not yet been sequenced
  - VerifyIntensity software can estimate the level of DNA sample contamination from array intensity data and population allele frequencies.
    - The key idea is similar to that of VerifyBamID with sequence data alone.
    - VerifyIntensity models array intensity data instead of sequence reads.
    - When a large number of samples are genotyped together, the 'multi-sample' option estimates the intensity distribution for each marker across multiple samples. When only one or a few samples are genotyped, the 'per-sample' option estimates the intensity distribution for each sample across all markers.
  - BAFRegress software can estimate the level of sample contamination and test for the presence of contamination by regressing the B allele frequency (BAF) on the population allele frequency.
    - This method is sensitive enough to estimate low levels of contamination.
    - The method provides a well-calibrated Type I error rate when testing the null hypothesis that a DNA sample is not contaminated.
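The regression idea behind the array-based approach can be sketched in a few lines of Python. This is a simplified illustration, not the BAFRegress code: the function names and the `(genotype_call, baf, freq_b)` marker layout are hypothetical, and the sketch assumes that at markers called homozygous the BAF shifts linearly with the population frequency of the allele the sample does not carry. Under that assumption the slope of an ordinary least-squares fit estimates the contaminating fraction, and the slope divided by its standard error gives a simple test statistic against the null hypothesis of no contamination.

```python
import math


def fold_homozygous(markers):
    """Keep homozygous calls and express both AA and BB markers on one scale.

    markers: iterable of (genotype_call, baf, freq_b), with genotype_call in
    {"AA", "AB", "BB"} and freq_b the population frequency of the B allele.
    Returns x (frequency of the allele absent from the sample) and
    y (fraction of signal coming from that allele).
    """
    xs, ys = [], []
    for call, baf, freq_b in markers:
        if call == "AA":
            xs.append(freq_b)
            ys.append(baf)
        elif call == "BB":
            xs.append(1 - freq_b)
            ys.append(1 - baf)
    return xs, ys


def ols_slope(xs, ys):
    """Ordinary least squares of y on x; returns (slope, intercept, slope_se)."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three homozygous markers")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("allele frequencies must vary across markers")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    slope_se = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope_se
```

Calling `ols_slope(*fold_homozygous(markers))` on the markers of one sample returns the estimated contaminating fraction as the slope. Heterozygous calls are dropped because their expected BAF is near 0.5 regardless of low-level contamination.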
History
- The initial description of method 1 (sequence reads with external genotypes) can be found at Verifying_Sample_Identities_-_Implementation (last modified on April 29, 2010)