Difference between revisions of "ContaminationDetection"

From Genome Analysis Wiki
(Overview)
Line 12:
  #* The key idea of the method is to capture the excessive heterozygosity in the contaminated sample by modeling the sequence reads as mixture of independent samples.
  # When a sample has array-based genotypes but not yet sequenced
- #* [[VerifyIDintensity]] software can estimate the levels of DNA sample contamination from aligned sequence reads and population allele frequency
+ #* [[VerifyIDintensity]] software can estimate the levels of DNA sample contamination from pre-computed intensity data using likelihood-based model.
  #** The key idea is similar to that of [[VerifyBamID]] with sequence data alone
  #** [[VerifyIDintensity]] models array intensity data instead of sequence reads

Revision as of 16:19, 24 April 2012

Overview

DNA sample contamination (or sample swap) is quite common in sequencing studies. It is possible to detect DNA contamination from sequence reads or from intensity data from SNP genotyping arrays by modeling the likelihood of the observed data as a mixture of two samples and estimating the fraction contributed by the contaminating sample.
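
The mixture idea can be illustrated with a small sketch. This is not the implementation used by any of the tools named on this page; it is a simplified model under stated assumptions: biallelic sites, genotypes of the sample and the contaminant drawn independently under Hardy-Weinberg equilibrium from a known population allele frequency, a fixed per-read error rate, and a plain grid search over the contamination fraction. The function names and the `(alt_reads, depth, pop_alt_freq)` site layout are hypothetical.

```python
import math

def hwe_priors(f):
    """Probabilities of carrying 0, 1, or 2 alternate alleles under Hardy-Weinberg."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]

def site_log_likelihood(alt, depth, f, alpha, err=0.01):
    """Log-likelihood of one site when a fraction alpha of reads comes from a contaminant."""
    priors = hwe_priors(f)
    total = 0.0
    for g1, p1 in enumerate(priors):        # genotype of the intended sample
        for g2, p2 in enumerate(priors):    # genotype of the contaminating sample
            # Expected alternate-allele fraction in the read mixture
            p_alt = (1 - alpha) * g1 / 2 + alpha * g2 / 2
            # Allow for sequencing errors flipping the observed allele
            p_obs = p_alt * (1 - err) + (1 - p_alt) * err
            binom = math.comb(depth, alt) * p_obs ** alt * (1 - p_obs) ** (depth - alt)
            total += p1 * p2 * binom
    return math.log(total)

def estimate_alpha(sites, grid_steps=50):
    """Grid-search the contamination fraction in [0, 0.5] that maximizes the likelihood."""
    grid = [0.5 * i / grid_steps for i in range(grid_steps + 1)]
    return max(
        grid,
        key=lambda a: sum(site_log_likelihood(alt, d, f, a) for alt, d, f in sites),
    )

# Each site: (alternate-allele reads, total depth, population alternate-allele frequency)
sites = [(10, 50, 0.3), (5, 50, 0.3), (40, 50, 0.3), (0, 50, 0.3), (25, 50, 0.3)]
print(estimate_alpha(sites))
```

Sites whose alternate-allele fraction sits well away from 0, 0.5, and 1 are what push the estimate above zero; an uncontaminated sample should show read fractions clustered at those three values.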

Depending on the types of available data, one can detect DNA sample contamination in the following ways.

  1. When a sequenced sample also has external array-based genotypes available
  2. When a sequenced sample does not have external genotypes available
    • VerifyBamID software can still estimate sample contamination from aligned sequence reads and population minor allele frequency
    • The key idea of the method is to capture the excess heterozygosity in the contaminated sample by modeling the sequence reads as a mixture of independent samples.
  3. When a sample has array-based genotypes but has not yet been sequenced
    • VerifyIDintensity software can estimate the levels of DNA sample contamination from pre-computed intensity data using a likelihood-based model.
      • The key idea is similar to that of VerifyBamID with sequence data alone
      • VerifyIDintensity models array intensity data instead of sequence reads
      • When a large number of samples are genotyped together, the 'multi-sample' option will estimate the intensity distribution for each marker across multiple samples. When only one or a few samples are genotyped, the 'per-sample' option will estimate the intensity distribution for each sample across all markers.
    • BAFRegress software can estimate levels of sample contamination and test for the presence of contamination by regressing the allele frequency with respect to the B allele frequency (BAF).
      • This method is sensitive enough to estimate low levels of contamination.
      • The method provides a well-calibrated Type I error rate for testing the null hypothesis that a DNA sample is not contaminated.
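
The regression idea in the last item can be sketched as follows. This is an illustration under assumptions, not BAFRegress itself: it assumes that only markers called homozygous for the A allele are used, that contamination at fraction alpha raises the expected BAF at such a marker by roughly alpha times the population B allele frequency, and that an ordinary least-squares slope of BAF on that frequency therefore approximates alpha. The function name and input layout are hypothetical.

```python
def regress_baf_on_frequency(pop_b_freq, baf):
    """Ordinary least-squares fit of BAF on population B allele frequency.

    Inputs are parallel lists restricted to markers called homozygous AA
    (an assumption of this sketch). Returns (slope, intercept); the slope
    serves as a rough contamination estimate.
    """
    n = len(baf)
    mean_x = sum(pop_b_freq) / n
    mean_y = sum(baf) / n
    sxx = sum((x - mean_x) ** 2 for x in pop_b_freq)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pop_b_freq, baf))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical markers: population B allele frequency and observed BAF
freqs = [0.1, 0.3, 0.5, 0.7, 0.9]
bafs = [0.02 + 0.05 * f for f in freqs]
slope, intercept = regress_baf_on_frequency(freqs, bafs)
print(slope, intercept)
```

In this sketch, testing the null hypothesis of no contamination would correspond to testing whether the slope differs from zero.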

History