From Genome Analysis Wiki


DNA sample contamination (or sample swap) is quite common in sequencing studies. It is possible to detect DNA contamination from sequence reads or from the intensity data of SNP genotyping arrays by modeling the likelihood of the observed data as a mixture of two samples and estimating the fraction contributed by the contaminating sample.
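The mixture idea can be illustrated with a minimal sketch for sequence reads. This is not the implementation used by any of the tools listed on this page; it assumes Hardy-Weinberg genotype frequencies, a fixed base error rate of 0.01, independent sites, and a simple grid search over the contamination fraction `alpha`. All function names here are hypothetical.

```python
import math
import random

def hwe_priors(f):
    """Hardy-Weinberg probabilities of carrying 0, 1, or 2 alternate alleles."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]

def alt_read_prob(g1, g2, alpha, err):
    """Chance that one read shows the alternate allele in a two-sample mixture."""
    p = (1 - alpha) * g1 / 2 + alpha * g2 / 2
    return p * (1 - err) + (1 - p) * err  # allow for base-calling errors

def site_log_likelihood(alt, depth, f, alpha, err=0.01):
    """Log-likelihood of `alt` alternate reads out of `depth` at one site.

    Sums over the unknown genotypes of the intended sample (g1) and the
    contaminating sample (g2). The binomial coefficient is dropped because
    it does not depend on alpha.
    """
    priors = hwe_priors(f)
    total = 0.0
    for g1 in range(3):
        for g2 in range(3):
            p = alt_read_prob(g1, g2, alpha, err)
            total += priors[g1] * priors[g2] * p ** alt * (1 - p) ** (depth - alt)
    return math.log(total)

def estimate_contamination(sites, grid_size=51, max_alpha=0.5):
    """Grid-search the alpha in [0, max_alpha] with the highest log-likelihood.

    `sites` is a list of (alt_reads, depth, population_allele_frequency).
    """
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(grid_size):
        alpha = max_alpha * i / (grid_size - 1)
        ll = sum(site_log_likelihood(a, d, f, alpha) for a, d, f in sites)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha

def simulate_sites(n_sites, depth, alpha, seed=0, err=0.01):
    """Simulate read counts under the same model, for illustration only."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        f = rng.uniform(0.05, 0.5)
        g1 = sum(rng.random() < f for _ in range(2))
        g2 = sum(rng.random() < f for _ in range(2))
        p = alt_read_prob(g1, g2, alpha, err)
        alt = sum(rng.random() < p for _ in range(depth))
        sites.append((alt, depth, f))
    return sites

if __name__ == "__main__":
    data = simulate_sites(n_sites=1500, depth=20, alpha=0.10)
    print("estimated contamination:", estimate_contamination(data))
```

A contaminated sample shows reads carrying the unexpected allele at sites where the intended sample is homozygous, which is why the likelihood favors a non-zero `alpha`. The search is capped at 0.5 because a mixture with fraction `alpha` is indistinguishable from one with fraction `1 - alpha` when neither genotype is known.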

Depending on the types of available data, one can detect DNA sample contamination in the following ways.

  1. When a sequenced sample also has external array-based genotypes available
  2. When a sequenced sample does not have external genotypes available
    • VerifyBamID software can still estimate sample contamination from aligned sequence reads and population minor allele frequencies.
    • The key idea of the method is to capture the excess heterozygosity in the contaminated sample by modeling the sequence reads as a mixture of independent samples.
  3. When a sample has array-based genotypes but has not yet been sequenced
    • VerifyIDintensity software can estimate the level of DNA sample contamination from pre-computed intensity data using a likelihood-based model.
      • The key idea is similar to that of VerifyBamID with sequence data alone
      • VerifyIDintensity models array intensity data instead of sequence reads
      • When a large number of samples are genotyped together, the 'multi-sample' option will estimate the intensity distribution for each marker across multiple samples. When only one or a few samples are genotyped, the 'per-sample' option will estimate the intensity distribution for each sample across all markers.
    • BAFRegress software can estimate the level of sample contamination and test for the presence of contamination by regressing the B allele frequency (BAF) on the population allele frequency.
      • This method is sensitive enough to estimate low levels of contamination.
      • The method provides a well-calibrated Type I error rate when testing the null hypothesis that a DNA sample is not contaminated.
    • VICES jointly estimates contamination and its sources from genotyping array intensities.
      • This can be useful for determining where in the process contamination occurred, whether DNA can be salvaged from material left over from an earlier stage of the genotyping pipeline, and whether laboratory protocols need to be revised.
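
The regression idea behind the BAF-based approach can be sketched as follows. This is a simplified illustration under stated assumptions, not the BAFRegress implementation: it assumes that at markers where the sample is called homozygous, contamination shifts the BAF away from 0 or 1 in proportion to the population frequency of the unexpected allele, so that the slope of an ordinary least-squares fit estimates the contamination fraction and its t-statistic tests the null hypothesis of no contamination. The function names and the input layout are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, t-statistic of the slope).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / std_err

def contamination_from_baf(markers):
    """Regress BAF deviation on the frequency of the unexpected allele.

    `markers` is a list of (genotype, baf, freq_b) with genotype in
    {"AA", "AB", "BB"}; heterozygous calls are skipped.
    """
    x, y = [], []
    for genotype, baf, freq_b in markers:
        if genotype == "AA":      # expected BAF is 0; B alleles are unexpected
            x.append(freq_b)
            y.append(baf)
        elif genotype == "BB":    # expected BAF is 1; A alleles are unexpected
            x.append(1 - freq_b)
            y.append(1 - baf)
    return fit_line(x, y)

def simulate_markers(n_markers, alpha, noise_sd=0.02, seed=0):
    """Simulate noisy BAF values for a sample mixed with one contaminant."""
    rng = random.Random(seed)
    names = ["AA", "AB", "BB"]
    markers = []
    for _ in range(n_markers):
        freq_b = rng.uniform(0.05, 0.95)
        g1 = sum(rng.random() < freq_b for _ in range(2))  # intended sample
        g2 = sum(rng.random() < freq_b for _ in range(2))  # contaminant
        baf = (1 - alpha) * g1 / 2 + alpha * g2 / 2 + rng.gauss(0, noise_sd)
        markers.append((names[g1], baf, freq_b))
    return markers

if __name__ == "__main__":
    slope, intercept, t_stat = contamination_from_baf(
        simulate_markers(n_markers=5000, alpha=0.05)
    )
    print("estimated contamination:", slope, "t-statistic:", t_stat)
```

In this sketch the genotype calls are taken as given; real array data would also need handling of miscalled genotypes and marker-specific intensity bias, which is omitted here.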