Difference between revisions of "VerifyBamID"

From Genome Analysis Wiki
Jump to navigationJump to search
 
(44 intermediate revisions by 8 users not shown)
Line 1: Line 1:
 
[[Category:Software|VerifyBamID]]
 
[[Category:Software|VerifyBamID]]
  
'''verifyBamID''' is a software that verifies whether the reads in particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples.
+
'''verifyBamID''' is a software that verifies whether the reads in particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples. '''verifyBamID''' can detect sample contamination and swaps when external genotypes are available. When external genotypes are not available, '''verifyBamID''' still robustly detects sample swaps.
  
 
== Download verifyBamID  ==
 
== Download verifyBamID  ==
  
To get a copy go to the [http://www.sph.umich.edu/csg/kang/verifyBamID/ VerifyBamID Download] download page.
+
To get a copy of verifyBamId, go to: https://github.com/statgen/verifyBamID/releases
 +
 
 +
Select the latest release and download in one of 3 ways:
 +
# Binary expected to run in Ubuntu x64 platform. In other platforms, please download the source distribution and build it.
 +
#* verifyBamID.#.#.#.gz
 +
#* You will need to run "gunzip" on the .gz file
 +
# Souce Code including libStatGen (uses a fixed version of libStatGen)
 +
#* verifyBamIDLibStatGen.#.#.#.tgz
 +
#* Run "tar xvf" on this file.  Cd into the resulting directory & type make.
 +
# Source Code without libStatGen (allows alternative/newer versions of libStatGen)
 +
#* Source code (tar.gz) or Source code (zip)
 +
#* You will need to download libStatGen separately if you do not already have it.
 +
 
 +
 
 +
To get a copy of older releases go to the [http://csg.sph.umich.edu//kang/verifyBamID/download VerifyBamID Download] download page.
 +
 
 +
== Join in verifyBamID mailing list ==
 +
 
 +
Please join in the [http://groups.google.com/group/verifybamid VerifyBamID Google Group] to ask / discuss / comment about verifyBamID.
 +
 
 +
== What's new ==
 +
 
 +
(2014/02/13)
 +
* Put verifyBamID in github.
 +
* Added PhoneHome/Version Checking to VerifyBamID
 +
 
 +
(2012/06/20)
 +
* Fixed a bug of incorrect estimate of contamination when --chip-full option was used (Thanks to Richard Smith)
 +
* Fixed a bug of incorrect per-readgroup output in --chip-* parameter
 +
 
 +
(2012/05/24)
 +
* Fixed a bug of incorrect per-readgroup output (Thanks to Matthew Flickinger)
 +
* '''(IMPORTANT)''' Add an option to remove either side of overlapping fragment. This option is turned on by default, and can be turned off usig --ignoreOverlapPair. If your sequence data has very short insert size, this update may increase the sensitivity of estimated contamination.
 +
* Changes in the directory structure and Makefile
 +
 
 +
(2012/05/18) The new release of verifyBamID have undergone major change since the last version (as of 2011 April). Here are the highlights
 +
* The genotype / allele frequency file is now based on VCF format rather than PLINK format.
 +
* The reference sequence information is no longer required
 +
* Uses Brent's method for precise estimation of contamination parameters
 +
* Generate the depth distribution statistics.
 +
* Estimated reference-bias parameters (useful mostly for ABI SOLiD sequence data)
  
 
== Build verifyBamID  ==
 
== Build verifyBamID  ==
 +
 +
The binary download of verifyBamID is available. You may use that version in Ubuntu 64-bit platform.
 +
 +
If you download the source that includes libStatGen:
 +
tar xvf verifyBamIDLibStatGen.#.#.#.tgz
 +
cd verifyBamID_#.#.#
 +
make
 +
Executable: verifyBamID/bin/verifyBamID
 +
 +
If you download the source without libStatGen:
 +
tar xvf verifyBamID-#.#.#.tar.gz
 +
cd verifyBamID-1.1.0
 +
make cloneLib (if ../libStatGen does not exist)
 +
make
 +
Executable: ./bin/verifyBamID
 +
 +
Note that '''make cloneLib''' command will create a directory ../libStatGen under your verifyBamID directory, and '''make''' will create binary of verifyBamID under verifyBamID/bin/
 +
 +
If you have a different version of libStatGen at that path, then skip the cloneLib step.  If the libStatGen you want to use is at a different location then update verifyBamID's Makefile.inc.  Replace: LIB_PATH_VERIFY_BAM_ID ?= $(LIB_PATH_GENERAL) with
 +
LIB_PATH_VERIFY_BAM_ID = /path/to/libStatGen
 +
  
 
verifyBamID is designed to be reasonably portable.  
 
verifyBamID is designed to be reasonably portable.  
  
However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, and later, there are likely other portability issues.  
+
However, since development occurs only on Ubuntu (9.10-13.10) x86 and x64 platforms, and later, there are likely other portability issues.  
  
Currently we support verifyBamID only on Ubuntu 9.10 and later on 64-bit processors.
 
  
 
== Basic Usage ==
 
== Basic Usage ==
Line 25: Line 85:
 
Here is a typical command line:
 
Here is a typical command line:
  
  verifyBamID --reference [reference.fa] --in [inputReads.bam] --bfile [inputGenotypes] --out [outPrefix] --verbose
+
  verifyBamID --vcf [input.vcf] --bam [input.bam] --out [output.prefix] --verbose --ignoreRG
 
   
 
   
 
  where
 
  where
  [reference.fa] is a FASTA format file, preferably (but not necessarily) indexed a priori with karma or other libcsg-compatible software
+
  [input.bam] is a BAM (Binary Alignment Map) file of a sequence reads
[inputReads.bam] is a BAM (Binary Alignment Map) file of a sequence reads
+
  [input.vcf] is input VCF file containing individual genotypes or AF or AC/AN fields in the INFO field. gzipped VCF is also allowed.
  [inputGenotypes] is input prefix of forward-stranded PLINK binary file
+
  [outPrefix] is output prefix of output files - [outPrefix].{selfRG,selfSM,bestRG,bestSM,depthRG,depthSM} will be created.
  [outPrefix] is output prefix of output files - [outPrefix].{selfRG,selfSM,bestRG,bestSM} will be created.
 
  
More detailed description of command line input is provided below
+
More detailed description of command line input is below
  
 
== Preparing input files ==
 
== Preparing input files ==
  
verifyBamID requires three input files - reference FASTA file, PLINK-format binary genotype file, and BAM file.
+
verifyBamID requires two input files - VCF file containing external genotypes or allele frequency information, and the BAM file.
  
=== Reference FASTA format file ===
+
=== VCF input genotype file ===
  
Reference file should be in FASTA format (typically [reference].fa). verifyBamID uses libcsg-format for efficiently indexing the [reference].fa file, creating [reference]-bs.umfa file in the directory where the FASTA file exists. If your FASTA file is not previously indexed, please make sure that the directory is writable. Otherwise, it will report errors when creating the index. The reference FASTA file must be compatible to the reference file in the original BAM file. For NCBI build 37 sequences, an example fasta file could be found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz (Note that the file size is very large - 874MB)
+
The input VCF file contains (1) external genotype information and/or (2) allele frequency information as AF entry or AC/AN entries in the INFO field. (See [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 | VCF specification] for further details). If neither information is provided, verifyBamID will not work properly.
  
=== PLINK-format binary input genotype file ===
+
If external genotype information is provided, sequence+array method will identify contamination and sample swaps by comparing the concordance between the external genotypes and the sequence reads. Additionally, sequence-only method will provide additional contamination estimates by modeling the sequence reads as mixture of two unknown samples based on the allele frequency information in the VCF file.
  
verifyBamID requires either PLINK-compatible binary genotype file [inputGenotypes].{bim,bed,fam} or allele frequency files [inputGenotypes].{bimp} file. For detailed information of plink format please refer to http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed
+
Input VCF file needs to meet several additional contraints need to meet in order to properly run verifyBamID.
 +
* The VCF is assumed to be well-formed. For example, verifyBamID does not check whether REF allele actually matches with reference sequence. 
 +
* The VCF should only contain SNPs. Current version of verifyBamID does not accept INDELs, MNPs, Structural Variations, or other complex variants.
 +
* The individual IDs in the VCF file, must be identical with the individual identifier in the BAM file. Otherwise, --smID option can override the sample ID information of the BAM file to the ID that matches to the individual IDs in the VCF file.
 +
* IMPORTANT : For targeted sequencing data, it is important to subselect the markers to only include on-target markers in the genotype file. Off-target markers are not likely to have multiple non-duplicated reads at the marker position, and it may create artifacts in the analysis due to overlapping fragments.
 +
* Currently, verifyBamID takes only autosomal chromosomes as input VCF.
  
Input genotype file needs to meet several additional contraints need to meet in order to properly run verifyBamID.
+
An example input VCF file (without external genotype) is provided below. Note that AC and AC entries exists in the INFO field for the allele frequency information.
* The genotype file must be forward-stranded. In other words, in the marker information [inputGenotypes].bim file, either of the two alleles must match to the reference base provided as [reference].fa file. The marker information should also be sorted by genomic positions.
 
* The individual IDs in [inputGenotypes].fam file, must be compatible with the sequenced individual. In the header of the bamfile (which can be view by 'samtools view -H [inputReads].bam'), Each header starts with '@RG' contains 'SM' tag, and the representation of individual ID must be identical to the matching individual's ID in [inputGenotypes].fam file. For more information of BAM file specification, please refer to  http://samtools.sourceforge.net/
 
* The genotype file is assumed to be in SNP-major format (most binary PLINK formats are in SNP-major format by default).
 
* IMPORTANT : For targeted sequencing data, it is important to subselect the markers to only include on-target markers in the genotype file. Off-target markers are not likely to have multiple non-duplicated reads at the marker position, and it may create artifacts in the analysis.
 
  
When the genotype data of the sequenced individual is not available, sample contamination can possibly be detected by using population minor allele frequency. If the sequence reads are better explained as a mixture of multiple individuals rather than a single individual, the sample can be flagged for suspected contamination. verifyBamID takes alternative format for allele-frequency only data, which is the same format to PLINK's .bim format, with one additional column, as follows
+
#CHROM POS ID REF ALT QUAL FILTER INFO
 
+
  20 61651 SNP20-9651 C A . PASS CR=99.86851;GentrainScore=0.7055;HW=0.077647716;AN=2180;AC=11
  [CHROM] [SNP_ID] [CentiMorgan (could be 0)] [BASE-POSITION] [ALLELE1] [ALLELE2] [ALLELE1-FREQ]
+
  20 63231 SNP20-11231 T G . PASS CR=99.93036;GentrainScore=0.7837;HW=0.035481825;AN=2182;AC=275
 
+
  20 63244 rs6139074 A C . PASS CR=98.893394;GentrainScore=0.8001;HW=7.327299E-7;AN=2162;AC=501
Below is example lines of .bimp file, a modified file with ALLELE1-FREQUENCY.
+
  20 63799 rs1418258 C T . PASS CR=99.75217;GentrainScore=0.8170;HW=0.6653377;AN=2182;AC=881
 
 
1 rs3131972 0 752721 A G 0.1652
 
1 rs3131969 0 754182 A G 0.1339
 
  1 rs1048488 0 760912 C T 0.1667
 
1 rs12562034 0 768448 A G 0.1027
 
1 rs12124819 0 776546 G A 0.2902
 
1 rs4040617 0 779322 G A 0.1295
 
  1 rs4970383 0 838555 A C 0.1964
 
  1 rs4475691 0 846808 T C 0.1429
 
1 rs1806509 0 853954 C A 0.3884
 
1 rs7537756 0 854250 G A 0.1607
 
 
 
The HapMap3 CEU .bimp file can be downloaded at [http://www.sph.umich.edu/csg/kang/verifyBamID/ VerifyBamID Download] page
 
  
 
=== Input BAM file ===
 
=== Input BAM file ===
  
 
verifyBamID requires a sorted, indexed, base quality recalibrated, and duplication-marked BAM file. It also requires to contain "@RG" header lines to annotation different readGroups (sequencing runs and lanes). The SM tag in the "@RG" header should match with one of the genotyped sample. Otherwise, verifyBamID may not be able to test whether the sequenced sample matches with genotyped sample, but will try to detect sample mixture from allele frequency, and will try to detect the best-matching sample among the genotyped sample.
 
verifyBamID requires a sorted, indexed, base quality recalibrated, and duplication-marked BAM file. It also requires to contain "@RG" header lines to annotation different readGroups (sequencing runs and lanes). The SM tag in the "@RG" header should match with one of the genotyped sample. Otherwise, verifyBamID may not be able to test whether the sequenced sample matches with genotyped sample, but will try to detect sample mixture from allele frequency, and will try to detect the best-matching sample among the genotyped sample.
 
Two types of input BAM files can be allowed.
 
  
==== Single file, whole genome BAM file ====
+
== What the default option does ==
  
If an alignment file is a single BAM file across the genome, --in [inputReads].bam can be used to take the BAM file as input.
+
The default option of '''verifyBamID''' is the recommended setting for the most sequencing studies to provide a rapid and informative response. The default option provides the following features:
 +
* --free-mix is turned on for estimating contamination using sequence-only method
 +
* --chip-mix is turned on for estimating contamination or swap using sequence+array method, if the external genotype file is provided in the VCF
 +
* --self is turnd on : The default option does not try to compare the sequence reads to identify the best matching individual (which is possible with --best option). It only compares with the external genotypes from the same individual to the sequenced individual.
 +
* --maxDepth 20 is used without --precise option : The default option is intended for whole genome low coverage sequencing. For the targeted exome sequencing, --maxDepth 1000 and --precise is recommended.
 +
* --ignoreRG is not a default option, but a recommended option, when you want to check the contamination for the entire BAM rather than examining each read group separately. This option will increase the computational efficiency especially in the case whether the sequence reads are multiplexed across many sequencing runs.
  
==== Multiple files separated by chromosomes ====
+
== Interpreting output files ==
 
 
If the alignment file is split per chromosome (as in 1000 genomes project BAM file), --inprefix [prefix] --insuffix [suffix] should be used to take the BAM input files.
 
  
For example, if the following series of input BAM files need to be used as input files
+
See also [[Understanding VerifyBamID output]].
NA06985.chrom1.ILLUMINA.bwa.CEU.low_coverage.20100311.bam
 
NA06985.chrom2.ILLUMINA.bwa.CEU.low_coverage.20100311.bam
 
NA06985.chrom3.ILLUMINA.bwa.CEU.low_coverage.20100311.bam
 
....
 
NA06985.chrom22.ILLUMINA.bwa.CEU.low_coverage.20100311.bam
 
 
 
The following option should be used for input files (note dot after --insuffx)
 
--inprefix NA06985.chrom --insuffix .ILLUMINA.bwa.CEU.low_coverage.20100311.bam
 
 
 
== Interpreting output files ==
 
  
 
=== Output files ===
 
=== Output files ===
When verifyBamID runs successfully, it generates the following four (or two) files
+
When verifyBamID runs successfully, the following sets of files may be generated.
 
 
 
* [outPrefix].selfSM - Per-sample statistics describing how well the sample matches to the annotated sample.
 
* [outPrefix].selfSM - Per-sample statistics describing how well the sample matches to the annotated sample.
* [outPrefix].selfRG - Per-readGroup statistics describing how well each lane matches to the annotated sample.
+
* [outPrefix].depthSM - The depth distribution of the sequence reads per sample
* [outPrefix].bestSM - Per-sample best-match statistics with best-matching sample among the genotyped sample (won't be produced if there is no individual to compare to)
+
* [outPrefix].selfRG - Per-readGroup statistics describing how well each lane matches to the annotated sample. (available only without --ignoreRG option)
* [outPrefix].bestRG - Per-readgroup best-match statistics with best-matching sample among the genotyped sample (won't be produced if there is no individual to compare to)
+
* [outPrefix].depthRG - The depth distribution of the sequence reads per readGroup. (available only without --ignoreRG option)
 +
* [outPrefix].bestSM - Per-sample best-match statistics with best-matching sample among the genotyped sample (available only with --best option)
 +
* [outPrefix].bestRG - Per-readgroup best-match statistics with best-matching sample among the genotyped sample (available only with --best and without --ignoreRG option)
  
 
=== Column information in the output files ===
 
=== Column information in the output files ===
Each of these files have the following 27 columns per sample, or per readgroup (lane). When individual genotypes are unavailable,  The values of 3-20 will be "N/A"
+
The .selfSM/.selfRG/.bestSM/.bestRG files have the following 19 columns per sample, or per readgroup (lane).  
  
 
# SEQ_SM : Sample ID of the sequenced sample. Obtained from @RG header / SM tag in the BAM file
 
# SEQ_SM : Sample ID of the sequenced sample. Obtained from @RG header / SM tag in the BAM file
# RG : ReadGroup ID of sequenced lane. For [outPrefix].selfSM and [outPrefix].bestSM, these values are "N/A"
+
# RG : ReadGroup ID of sequenced lane. For [outPrefix].selfSM and [outPrefix].bestSM, these values are "ALL"
# SELF_SM/BEST_SM : Sample ID compared to in the genotype file. For [outPrefix].selfRG and [outPrefix].selfSM, these values should be identical to [SEQ_SM] or "N/A" if the genotype of sequenced samples are unavailable. For [outPrefix].bestRG and [outPrefix].bestSM, these values should be the ID of best-matching sample among the genotype files compared to.
+
# CHIP_ID : Sample ID compared to in the genotype file. For [outPrefix].selfRG and [outPrefix].selfSM, these values should be identical to [SEQ_SM] or "NA" if the genotype of sequenced samples are unavailable. For [outPrefix].bestRG and [outPrefix].bestSM, these values should be the ID of best-matching sample among the genotype files compared to.
# SELFIBD/BESTIBD : Maximum-likelihood estimation of fraction of reads in IBD(identity-by-descent) with SELF_SM/BEST_SM in the genotyped sample. 1-[SELFIBD] can be used as an estimated amount of contamination, unless sample swap has occurred.
+
# # SNPs : # of SNPs passing the criteria from the VCF file
# SELFIBDLLK/BESTIBDLLK : Log likelihood of the sequence reads given the MLE SELFIBD/BESTIBD with SELF_SM/BEST_SM
+
# # READS : Total # of reads loaded from the BAM file
# SELFIBDLK-/BESTIBDLK- : Difference of log-likelihood between SELFIBDLLK/BESTIBDLLK and the likelihood of reads under no contamination (SELFIBD=1).
+
# # AVG_DP : Average sequencing depth at the sites in the VCF file
# #GENOS : Number of non-missing genotypes in SELF_SM/BEST_SM
+
# FREEMIX : Sequence-only estimate of contamination (0-1 scale)
# #BASES : Number of bases in the sequenced reads (passing mapQ, baseQual, maxDepth threshold) matching the marker sites
+
# FREELK1 : Maximum log-likelihood of the sequence reads given estimated contamination under sequence-only method
# %GENREF : Fraction of HomRef alleles in the genotypes across all markers, for SELF_SM/BEST_SM
+
# FREELK0 : Log-likelihood of the sequence reads given no contamination under sequence-only method
# %GENHET : Fraction of Heterozygous alleles in the genotypes across all markers, for SELF_SM/BEST_SM
+
# FREE_RH : Estimated reference bias parameter Pr(refBase|HET) (when --free-refBias or --free-full is used)
# %GENALT : Fraction of HomAlt alleles in the genotypes across all markers, for SELF_SM/BEST_SM
+
# FREE_RA : Estimated reference bias parameter Pr(refBase|HOMALT) (when --free-refBias or --free-full is used)
 +
# CHIPMIX : Sequence+array estimate of contamination (NA if the external genotype is unavailable) (0-1 scale)
 +
# CHIPLK1 : Maximum log-likelihood of the sequence reads given estimated contamination under sequence+array method (NA if the external genotypes are unavailable)
 +
# CHIPLK0 : Log-likelihood of the sequence reads given no contamination under sequence+array method (NA if the external genotypes are unavailable)
 +
# CHIP_RH : Estimated reference bias parameter Pr(refBase|HET) (when --chip-refBias or --chip-full is used)
 +
# CHIP_RA : Estimated reference bias parameter Pr(refBase|HOMALT) (when --chip-refBias or --chip-full is used)
 
# DPREF : Depth (Coverage) of HomRef site (based on the genotypes of (SELF_SM/BEST_SM), passing mapQ, baseQual, maxDepth thresholds.
 
# DPREF : Depth (Coverage) of HomRef site (based on the genotypes of (SELF_SM/BEST_SM), passing mapQ, baseQual, maxDepth thresholds.
# RDPHET : DPHET/DPREF, Relative depth at Heterozygous site.
+
# RDPHET : DPHET/DPREF, Relative depth to HomRef site at Heterozygous site.
# RDPALT : DPHET/DPREF, Relative depth at HomAlt site.
+
# RDPALT : DPHET/DPREF, Relative depth to HomRef site at HomAlt site.
# REF-A1% : Fraction of reference bases in HomRef site
 
# REF-A2% : Fraction of alternative bases in HomRef site
 
# HET-A1% : Fraction of reference bases in Heterozygous site
 
# HET-A2% : Fraction of alternative bases in Heterozygous site
 
# ALT-A1% : Fraction of reference bases in HomAlt site
 
# ALT-A2% : Fraction of alternative bases in HomAlt site
 
# #DP>1 : Number of sites with depth of 2 or greater
 
# %HETAF : Fraction of heterozygous alleles estimated by population allele frequency
 
# %HETSEQ : Fraction of heterozygous alleles estimated from sequence data (without assuming base errors)
 
# EXHET : [%HETSEQ]/[%HETAF]
 
# %MIX : Maximum-likelihood estimate of % of contamination based on two-sample mixture model from population allele frequency.
 
# BESTMIXLLK : Log-likelihood of the data given the MLE %MIX
 
# BESTMIXLLK- : Difference of log-likelihood between BESTMIXLLK and the likelihood of reads under no contamination (%MIX=0).
 
  
 
=== A guideline to interpret output files ===
 
=== A guideline to interpret output files ===
Line 141: Line 172:
 
verifyBamID provides a series of information that is informative to determine whether the sample is possibly contaminated or swapped, but there is no single criteria that works for every circumstances. There are a few unmodeled factor in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the MLE estimation may not always exactly match to the true amount of contamination. Here we provide a guideline to flag potentially contaminated/swapped samples  
 
verifyBamID provides a series of information that is informative to determine whether the sample is possibly contaminated or swapped, but there is no single criteria that works for every circumstances. There are a few unmodeled factor in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the MLE estimation may not always exactly match to the true amount of contamination. Here we provide a guideline to flag potentially contaminated/swapped samples  
  
*  Each sample or lane can be checked in this way. When [SELF-IBD] < 1 AND [%MIX] > 0 AND [REF-A2%] > 0.01, meaning 1% or more of non-reference bases are observed in reference sites, we recommend to examine the data more carefully for the possibility of contamination.
+
*  Each sample or lane can be checked in this way. When [CHIPMIX] >> 0.02 and/or [FREEMIX] >> 0.02, meaning 2% or more of non-reference bases are observed in reference sites, we recommend to examine the data more carefully for the possibility of contamination.
* We recommend to check each lane for the possibility of sample swaps. When [SELF-IBD] << 1 AND [%MIX] ~ 0, then it is possible that the sample is swapped with another sample. When [BEST-IBD] ~ 1, [BEST_SM] might be actually the swapped sample. Otherwise, the swapped sample may not exist in the genotype data you have compared against.  
+
* We recommend to check each lane for the possibility of sample swaps. When [CHIPMIX] ~ 1 AND [FREEMIX] ~ 0, then it is possible that the sample is swapped with another sample. When [CHIPMIX] ~ 0 in .bestSM file, [CHIP_ID] might be actually the swapped sample. Otherwise, the swapped sample may not exist in the genotype data you have compared.  
* When genotype data is not available but allele-frequency-based estimates of [%MIX] >= 0.03 and [BESTMIXLLK-] is large (greater than 100), then it is possible that the sample is contaminated with other sample. We recommend to use per-sample data rather than per-lane data for checking this for low coverage data, because the inference will be more confident when there are large number of bases with depth 2 or higher.
+
* When genotype data is not available but allele-frequency-based estimates of [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, then it is possible that the sample is contaminated with other sample. We recommend to use per-sample data rather than per-lane data for checking this for low coverage data, because the inference will be more confident when there are large number of bases with depth 2 or higher.
  
 
== Command Line Options ==
 
== Command Line Options ==
  
  USAGE:  
+
  The following parameters are available.  Ones with "[]" are in effect:
 
   
 
   
  ./verifyBamID [--ibdUnit <double>] [--mixUnit <double>] [-f <double>]
+
  Available Options
                  [-g <double>] [-d <integer>] [-m <integer>] [-Q
+
                            Input Files : --vcf [], --bam [], --subset [],
                  <integer>] [-q <integer>] [-B <string>] [-b <string>] [-l
+
                                          --smID []
                  <string>] -o <string> [-I <string>] [-s <string>] [-p
+
                    VCF analysis options : --genoError [1.0e-03],
                  <string>] [-i <string>] -r <string> [-M] [-F] [-S] [-n]
+
                                          --minAF [0.01],
                  [-v] [--] [--version] [-h]
+
                                          --minCallRate [0.50]
+
  Individuals to compare with chip data : --site, --self, --best
+
          Chip-free optimization options : --free-none, --free-mix [ON],
Where:  
+
                                          --free-refBias, --free-full
+
          With-chip optimization options : --chip-none, --chip-mix [ON],
  --ibdUnit <double>
+
                                          --chip-refBias, --chip-full
    unit of IBD values (default: 0.01)
+
                    BAM analysis options : --ignoreRG, --ignoreOverlapPair,
+
                                          --noEOF, --precise, --minMapQ [10],
  --mixUnit <double>
+
                                          --maxDepth [20], --minQ [13],
    unit of % mixture (default:0.01)
+
                                          --maxQ [40], --grid [0.05]
+
                Modeling Reference Bias : --refRef [1.00], --refHet [0.50],
  -f <double>, --minAF <double>
+
                                          --refAlt [0.00]
    Minimum allele frequency (default: 0.005). Markers with lower allele
+
                          Output options : --out [], --verbose
    frequencies will be ignored
+
                              PhoneHome : --noPhoneHome,
+
                                          --phoneHomeThinning [50]
  -g <double>, --genoError <double>
+
 
    Error rate in genotype data (default: 0.005)
+
Each option provides the following features:
+
* --vcf : specify required VCF file
  -d <integer>, --maxDepth <integer>
+
* --bam : specify required BAM file (indexed with .bam.bai or .bai file)
    Maximum depth per site (default:20) - bases with higher depth will be
+
* --subset : list of individual IDs to calculate the allele frequency. All individuals are used if unspecified
    ignored due to possible alignment artifacts. For deep coverage data,
+
* --smID : If the individual ID in the BAM file and VCF file does not match, substitute the BAM file's ID into the specified argument
    it is important to set this value to a sufficiently large number (e.g.
+
* --genoError : error rate of the external genotype file
    200)
+
* --minAF : minimum allele frequency of the markers to include
+
* --minAF : minimum call rate of the markers to include
  -m <integer>, --minMapQ <integer>
+
* --site : If set, use only site information in the VCF and do not compare with the actual genotypes
    Minimum mapping quality value (default:10) - reads with lower quality
+
* --self : Only compare the ID-matching individuals between the VCF and BAM file
    will be ignored
+
* --best : Find the best matching individuals (.bestSM and .bestRG files will be produced). This option is substantially longer than the default option
+
* --free-none : Do not perform sequence-only method to estimate parameters
  -Q <integer>, --maxQ <integer>
+
* --free-mix : (default) Estimate contamination using sequence-only method with Brent's single dimensional optimization.
    Maximum Phred-scale base quality value (default:40) - higher base
+
* --free-refBias : Estimate the reference bias parameters using sequence-only method with Simplex method
    quality will be enforced to be this value
+
* --free-full : Estimate both reference bias parameters and the contamination parameters using sequence-only method
+
* --chip-none : Do not perform sequence+array method to estimate parameters
  -q <integer>, --minQ <integer>
+
* --chip-mix : (default) Estimate contamination using sequence+array method with Brent's single dimensional optimization.
    Minimum Phred-scale base quality value (default:20) - bases with lower
+
* --chip-refBias : Estimate the refernece bias parameters using sequence+array method with Simplex method
    quality will be ignored
+
* --chip-full : Estimate both reference bias parameters and the contamination parameters using sequence+array method
+
* --ignoreRG : ignore the read grouup level comparison and compare samples only (recommended for an expedited run)
  -B <string>, --bimpfile <string>
+
* --ignoreOverlapPair : ignore overlapping pair end fragment covering the same base. Disabling this option may decrease the sensitivity of the method when the insert size is short (with slight gain in the computational speed)
    PLINK format BIM file with allele frequency information at the last
+
* --noEOF : do not check the EOF marker of the BAM file (for earlier version of BAM)
    column
+
* --precise : calculate the likelihood in log-scale for high-depth data (recommended when --maxDepth is greater than 20. Can be a little bit slower)
+
* --minMapQ : minimum mapping quality of the sequence reads to compare
  -b <string>, --bfile <string>
+
* --minQ : minimum base quality to include
    Binary PLINK format genotype file prefix. Must be forward-stranded
+
* --maxQ : maximum base quality to cap
+
* --grid : the grid interval to search the optimum before running Brent's algorithm.
  -l <string>, --log <string>
+
* --refRef : Initial Pr(refBase|HOMREFGeno) parameter
    Log file - default: [out].log
+
* --refHet : Initial Pr(refBase|HETGeno) parameter
+
* --refAlt : Initial Pr(refBase|HOMALTGeno) parameter
  -o <string>,  --out <string>
+
* --out : output file prefix (required)
    (required)  Prefix of output files
+
* --verbose : print the progress of the method on the screeen
+
{{PhoneHomeParameters|hdr=====|bullet=1}}
  -I <string>,  --index <string>
 
    Index of input BAM file - [inputBam].bai will be default value
 
 
  -s <string>,  --insuffix <string>
 
    Suffix of input BAM file for multi-chromosome inputs
 
 
  -p <string>, --inprefix <string>
 
    Prefix of input BAM file for multi-chromosome inputs
 
 
  -i <string>,  --in <string>
 
    Input BAM file. Must be sorted and indexed
 
 
  -r <string>,  --reference <string>
 
    (required) Karma's reference sequence
 
 
  -M,  --mmap
 
    toggle whether to use memory map (default:true)
 
 
  -F,  --bimAF
 
    use the allele frequency information by loading .bimp file instead of
 
    .bim file
 
 
  -S,  --selfonly
 
    compare the genotypes with SELF (annotated sample) only (default:off)
 
 
  -n,  --noeof
 
    Do not check EOF marker for BAM file (default:off)
 
 
  -v,  --verbose
 
    Toggle verbose mode (default:off)
 
 
  ----ignore_rest
 
    Ignores the rest of the labeled arguments following this flag.
 
 
  --version
 
    Displays version information and exits.
 
 
  -h,  --help
 
    Displays usage information and exits.
 
  
 
== Principle of Operation ==
 
== Principle of Operation ==
Line 252: Line 244:
  
 
For more about the technical details, see the page [[Verifying Sample Identities - Implementation]]
 
For more about the technical details, see the page [[Verifying Sample Identities - Implementation]]
 +
 +
== Reference ==
 +
 +
Please cite the following paper:
 +
 +
G. Jun, M. Flickinger, K. N. Hetrick, Kurt, J. M. Romm, K. F. Doheny, G. Abecasis, M. Boehnke,and H. M. Kang, ''Detecting and Estimating Contamination of Human DNA Samples in Sequencing and Array-Based Genotype Data'', American journal of human genetics doi:10.1016/j.ajhg.2012.09.004 (volume 91 issue 5 pp.839 - 848)
 +
 +
 +
== Contamination in Array Data ==
 +
 +
[[VerifyIDintensity]] or [[BAFRegress]] can estimate sample contamination from Illumina genotype array data.
  
 
== Acknowledgements ==
 
== Acknowledgements ==
  
VerifyBamID is a result from collaborative effort by Hyun Min Kang, Paul Anderson, Goo Jun, Mary Kate Trost, Wei Chen, Tom Blackwell, and Goncalo Abecasis. Please email to Hyun Min Kang [[mailto:hmkang@umich.edu| hmkang@umich.edu ]] for any questions.
+
VerifyBamID is a result from collaborative effort by Hyun Min Kang, Goo Jun, Matthew Flickinger, Mary Kate Wing, Goncalo Abecasis, and Michael Boehnke. Please email to Hyun Min Kang [hmkang@umich.edu ] for any questions.

Latest revision as of 15:29, 9 September 2017


verifyBamID is a software that verifies whether the reads in particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples. verifyBamID can detect sample contamination and swaps when external genotypes are available. When external genotypes are not available, verifyBamID still robustly detects sample swaps.

Download verifyBamID

To get a copy of verifyBamId, go to: https://github.com/statgen/verifyBamID/releases

Select the latest release and download in one of 3 ways:

  1. Binary expected to run in Ubuntu x64 platform. In other platforms, please download the source distribution and build it.
    • verifyBamID.#.#.#.gz
    • You will need to run "gunzip" on the .gz file
  2. Souce Code including libStatGen (uses a fixed version of libStatGen)
    • verifyBamIDLibStatGen.#.#.#.tgz
    • Run "tar xvf" on this file. Cd into the resulting directory & type make.
  3. Source Code without libStatGen (allows alternative/newer versions of libStatGen)
    • Source code (tar.gz) or Source code (zip)
    • You will need to download libStatGen separately if you do not already have it.


To get a copy of older releases go to the VerifyBamID Download download page.

Join in verifyBamID mailing list

Please join in the VerifyBamID Google Group to ask / discuss / comment about verifyBamID.

What's new

(2014/02/13)

  • Put verifyBamID in github.
  • Added PhoneHome/Version Checking to VerifyBamID

(2012/06/20)

  • Fixed a bug of incorrect estimate of contamination when --chip-full option was used (Thanks to Richard Smith)
  • Fixed a bug of incorrect per-readgroup output in --chip-* parameter

(2012/05/24)

  • Fixed a bug of incorrect per-readgroup output (Thanks to Matthew Flickinger)
  • (IMPORTANT) Add an option to remove either side of overlapping fragment. This option is turned on by default, and can be turned off usig --ignoreOverlapPair. If your sequence data has very short insert size, this update may increase the sensitivity of estimated contamination.
  • Changes in the directory structure and Makefile

(2012/05/18) The new release of verifyBamID have undergone major change since the last version (as of 2011 April). Here are the highlights

  • The genotype / allele frequency file is now based on VCF format rather than PLINK format.
  • The reference sequence information is no longer required
  • Uses Brent's method for precise estimation of contamination parameters
  • Generate the depth distribution statistics.
  • Estimated reference-bias parameters (useful mostly for ABI SOLiD sequence data)

Build verifyBamID

The binary download of verifyBamID is available. You may use that version in Ubuntu 64-bit platform.

If you download the source that includes libStatGen:

tar xvf verifyBamIDLibStatGen.#.#.#.tgz
cd verifyBamID_#.#.#
make
Executable: verifyBamID/bin/verifyBamID

If you download the source without libStatGen:

tar xvf verifyBamID-#.#.#.tar.gz
cd verifyBamID-1.1.0
make cloneLib (if ../libStatGen does not exist)
make
Executable: ./bin/verifyBamID

Note that make cloneLib command will create a directory ../libStatGen under your verifyBamID directory, and make will create binary of verifyBamID under verifyBamID/bin/

If you have a different version of libStatGen at that path, then skip the cloneLib step. If the libStatGen you want to use is at a different location then update verifyBamID's Makefile.inc. Replace: LIB_PATH_VERIFY_BAM_ID ?= $(LIB_PATH_GENERAL) with

LIB_PATH_VERIFY_BAM_ID = /path/to/libStatGen


verifyBamID is designed to be reasonably portable.

However, since development occurs only on Ubuntu (9.10-13.10) x86 and x64 platforms, and later, there are likely other portability issues.


Basic Usage

A key step in any genetic analysis is to verify whether data being generated matches expectations. verifyBamID checks whether reads in a BAM file match previous genotypes for a specific sample. In addition, it detects possible sample mixture from population allele frequency only, which can be particularly useful when the genotype data is not available.

Using a mathematical model that relates observed sequence reads to an hypothetical true genotype, verifyBamID tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.

Basic Usage Example

Here is a typical command line:

verifyBamID --vcf [input.vcf] --bam [input.bam] --out [output.prefix] --verbose --ignoreRG

where
[input.bam] is a BAM (Binary Alignment Map) file of a sequence reads
[input.vcf] is input VCF file containing individual genotypes or AF or AC/AN fields in the INFO field. gzipped VCF is also allowed.
[outPrefix] is output prefix of output files - [outPrefix].{selfRG,selfSM,bestRG,bestSM,depthRG,depthSM} will be created.

More detailed description of command line input is below

Preparing input files

verifyBamID requires two input files - VCF file containing external genotypes or allele frequency information, and the BAM file.

VCF input genotype file

The input VCF file contains (1) external genotype information and/or (2) allele frequency information as AF entry or AC/AN entries in the INFO field. (See | VCF specification for further details). If neither information is provided, verifyBamID will not work properly.

If external genotype information is provided, sequence+array method will identify contamination and sample swaps by comparing the concordance between the external genotypes and the sequence reads. Additionally, sequence-only method will provide additional contamination estimates by modeling the sequence reads as mixture of two unknown samples based on the allele frequency information in the VCF file.

Input VCF file needs to meet several additional contraints need to meet in order to properly run verifyBamID.

  • The VCF is assumed to be well-formed. For example, verifyBamID does not check whether REF allele actually matches with reference sequence.
  • The VCF should only contain SNPs. Current version of verifyBamID does not accept INDELs, MNPs, Structural Variations, or other complex variants.
  • The individual IDs in the VCF file, must be identical with the individual identifier in the BAM file. Otherwise, --smID option can override the sample ID information of the BAM file to the ID that matches to the individual IDs in the VCF file.
  • IMPORTANT : For targeted sequencing data, it is important to subselect the markers to only include on-target markers in the genotype file. Off-target markers are not likely to have multiple non-duplicated reads at the marker position, and it may create artifacts in the analysis due to overlapping fragments.
  • Currently, verifyBamID takes only autosomal chromosomes as input VCF.

An example input VCF file (without external genotype) is provided below. Note that AC and AC entries exists in the INFO field for the allele frequency information.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
20	61651	SNP20-9651	C	A	.	PASS	CR=99.86851;GentrainScore=0.7055;HW=0.077647716;AN=2180;AC=11
20	63231	SNP20-11231	T	G	.	PASS	CR=99.93036;GentrainScore=0.7837;HW=0.035481825;AN=2182;AC=275
20	63244	rs6139074	A	C	.	PASS	CR=98.893394;GentrainScore=0.8001;HW=7.327299E-7;AN=2162;AC=501
20	63799	rs1418258	C	T	.	PASS	CR=99.75217;GentrainScore=0.8170;HW=0.6653377;AN=2182;AC=881

Input BAM file

verifyBamID requires a sorted, indexed, base quality recalibrated, and duplication-marked BAM file. It also requires to contain "@RG" header lines to annotation different readGroups (sequencing runs and lanes). The SM tag in the "@RG" header should match with one of the genotyped sample. Otherwise, verifyBamID may not be able to test whether the sequenced sample matches with genotyped sample, but will try to detect sample mixture from allele frequency, and will try to detect the best-matching sample among the genotyped sample.

What the default option does

The default option of verifyBamID is the recommended setting for the most sequencing studies to provide a rapid and informative response. The default option provides the following features:

  • --free-mix is turned on for estimating contamination using sequence-only method
  • --chip-mix is turned on for estimating contamination or swap using sequence+array method, if the external genotype file is provided in the VCF
  • --self is turnd on : The default option does not try to compare the sequence reads to identify the best matching individual (which is possible with --best option). It only compares with the external genotypes from the same individual to the sequenced individual.
  • --maxDepth 20 is used without --precise option : The default option is intended for whole genome low coverage sequencing. For the targeted exome sequencing, --maxDepth 1000 and --precise is recommended.
  • --ignoreRG is not a default option, but a recommended option, when you want to check the contamination for the entire BAM rather than examining each read group separately. This option will increase the computational efficiency especially in the case whether the sequence reads are multiplexed across many sequencing runs.

Interpreting output files

See also Understanding VerifyBamID output.

Output files

When verifyBamID runs successfully, the following sets of files may be generated.

  • [outPrefix].selfSM - Per-sample statistics describing how well the sample matches to the annotated sample.
  • [outPrefix].depthSM - The depth distribution of the sequence reads per sample
  • [outPrefix].selfRG - Per-readGroup statistics describing how well each lane matches to the annotated sample. (available only without --ignoreRG option)
  • [outPrefix].depthRG - The depth distribution of the sequence reads per readGroup. (available only without --ignoreRG option)
  • [outPrefix].bestSM - Per-sample best-match statistics with best-matching sample among the genotyped sample (available only with --best option)
  • [outPrefix].bestRG - Per-readgroup best-match statistics with best-matching sample among the genotyped sample (available only with --best and without --ignoreRG option)

Column information in the output files

The .selfSM/.selfRG/.bestSM/.bestRG files have the following 19 columns per sample, or per readgroup (lane).

  1. SEQ_SM : Sample ID of the sequenced sample. Obtained from @RG header / SM tag in the BAM file
  2. RG : ReadGroup ID of sequenced lane. For [outPrefix].selfSM and [outPrefix].bestSM, these values are "ALL"
  3. CHIP_ID : Sample ID compared to in the genotype file. For [outPrefix].selfRG and [outPrefix].selfSM, these values should be identical to [SEQ_SM] or "NA" if the genotype of sequenced samples are unavailable. For [outPrefix].bestRG and [outPrefix].bestSM, these values should be the ID of best-matching sample among the genotype files compared to.
  4. # SNPs : # of SNPs passing the criteria from the VCF file
  5. # READS : Total # of reads loaded from the BAM file
  6. # AVG_DP : Average sequencing depth at the sites in the VCF file
  7. FREEMIX : Sequence-only estimate of contamination (0-1 scale)
  8. FREELK1 : Maximum log-likelihood of the sequence reads given estimated contamination under sequence-only method
  9. FREELK0 : Log-likelihood of the sequence reads given no contamination under sequence-only method
  10. FREE_RH : Estimated reference bias parameter Pr(refBase|HET) (when --free-refBias or --free-full is used)
  11. FREE_RA : Estimated reference bias parameter Pr(refBase|HOMALT) (when --free-refBias or --free-full is used)
  12. CHIPMIX : Sequence+array estimate of contamination (NA if the external genotype is unavailable) (0-1 scale)
  13. CHIPLK1 : Maximum log-likelihood of the sequence reads given estimated contamination under sequence+array method (NA if the external genotypes are unavailable)
  14. CHIPLK0 : Log-likelihood of the sequence reads given no contamination under sequence+array method (NA if the external genotypes are unavailable)
  15. CHIP_RH : Estimated reference bias parameter Pr(refBase|HET) (when --chip-refBias or --chip-full is used)
  16. CHIP_RA : Estimated reference bias parameter Pr(refBase|HOMALT) (when --chip-refBias or --chip-full is used)
  17. DPREF : Depth (Coverage) of HomRef site (based on the genotypes of (SELF_SM/BEST_SM), passing mapQ, baseQual, maxDepth thresholds.
  18. RDPHET : DPHET/DPREF, Relative depth to HomRef site at Heterozygous site.
  19. RDPALT : DPHET/DPREF, Relative depth to HomRef site at HomAlt site.

A guideline to interpret output files

verifyBamID provides a series of information that is informative to determine whether the sample is possibly contaminated or swapped, but there is no single criteria that works for every circumstances. There are a few unmodeled factor in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the MLE estimation may not always exactly match to the true amount of contamination. Here we provide a guideline to flag potentially contaminated/swapped samples

  • Each sample or lane can be checked in this way. When [CHIPMIX] >> 0.02 and/or [FREEMIX] >> 0.02, meaning 2% or more of non-reference bases are observed in reference sites, we recommend to examine the data more carefully for the possibility of contamination.
  • We recommend to check each lane for the possibility of sample swaps. When [CHIPMIX] ~ 1 AND [FREEMIX] ~ 0, then it is possible that the sample is swapped with another sample. When [CHIPMIX] ~ 0 in .bestSM file, [CHIP_ID] might be actually the swapped sample. Otherwise, the swapped sample may not exist in the genotype data you have compared.
  • When genotype data is not available but allele-frequency-based estimates of [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, then it is possible that the sample is contaminated with other sample. We recommend to use per-sample data rather than per-lane data for checking this for low coverage data, because the inference will be more confident when there are large number of bases with depth 2 or higher.

Command Line Options

The following parameters are available.  Ones with "[]" are in effect:

Available Options
                            Input Files : --vcf [], --bam [], --subset [],
                                          --smID []
                   VCF analysis options : --genoError [1.0e-03],
                                          --minAF [0.01],
                                          --minCallRate [0.50]
  Individuals to compare with chip data : --site, --self, --best
         Chip-free optimization options : --free-none, --free-mix [ON],
                                          --free-refBias, --free-full
         With-chip optimization options : --chip-none, --chip-mix [ON],
                                          --chip-refBias, --chip-full
                   BAM analysis options : --ignoreRG, --ignoreOverlapPair,
                                          --noEOF, --precise, --minMapQ [10],
                                          --maxDepth [20], --minQ [13],
                                          --maxQ [40], --grid [0.05]
                Modeling Reference Bias : --refRef [1.00], --refHet [0.50],
                                          --refAlt [0.00]
                         Output options : --out [], --verbose
                              PhoneHome : --noPhoneHome,
                                          --phoneHomeThinning [50]

Each option provides the following features:

  • --vcf : specify required VCF file
  • --bam : specify required BAM file (indexed with .bam.bai or .bai file)
  • --subset : list of individual IDs to calculate the allele frequency. All individuals are used if unspecified
  • --smID : If the individual ID in the BAM file and VCF file does not match, substitute the BAM file's ID into the specified argument
  • --genoError : error rate of the external genotype file
  • --minAF : minimum allele frequency of the markers to include
  • --minAF : minimum call rate of the markers to include
  • --site : If set, use only site information in the VCF and do not compare with the actual genotypes
  • --self : Only compare the ID-matching individuals between the VCF and BAM file
  • --best : Find the best matching individuals (.bestSM and .bestRG files will be produced). This option is substantially longer than the default option
  • --free-none : Do not perform sequence-only method to estimate parameters
  • --free-mix : (default) Estimate contamination using sequence-only method with Brent's single dimensional optimization.
  • --free-refBias : Estimate the reference bias parameters using sequence-only method with Simplex method
  • --free-full : Estimate both reference bias parameters and the contamination parameters using sequence-only method
  • --chip-none : Do not perform sequence+array method to estimate parameters
  • --chip-mix : (default) Estimate contamination using sequence+array method with Brent's single dimensional optimization.
  • --chip-refBias : Estimate the refernece bias parameters using sequence+array method with Simplex method
  • --chip-full : Estimate both reference bias parameters and the contamination parameters using sequence+array method
  • --ignoreRG : ignore the read grouup level comparison and compare samples only (recommended for an expedited run)
  • --ignoreOverlapPair : ignore overlapping pair end fragment covering the same base. Disabling this option may decrease the sensitivity of the method when the insert size is short (with slight gain in the computational speed)
  • --noEOF : do not check the EOF marker of the BAM file (for earlier version of BAM)
  • --precise : calculate the likelihood in log-scale for high-depth data (recommended when --maxDepth is greater than 20. Can be a little bit slower)
  • --minMapQ : minimum mapping quality of the sequence reads to compare
  • --minQ : minimum base quality to include
  • --maxQ : maximum base quality to cap
  • --grid : the grid interval to search the optimum before running Brent's algorithm.
  • --refRef : Initial Pr(refBase|HOMREFGeno) parameter
  • --refHet : Initial Pr(refBase|HETGeno) parameter
  • --refAlt : Initial Pr(refBase|HOMALTGeno) parameter
  • --out : output file prefix (required)
  • --verbose : print the progress of the method on the screeen

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

  • --noPhoneHome disables PhoneHome. PhoneHome is enabled by default based on the thinning parameter.
  • --phoneHomeThinning (0-100) adjusts the frequency of PhoneHome.
    • By default, --phoneHomeThinning is set to 50, running 50% of the time.
    • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
    • N/A if --noPhoneHome is set.

Principle of Operation

Each read group in a BAM file is evaluated independently. This means that in file with multiple read groups, problems will be flagged at the read group level (a plus). However, it also means that it might be hard to discern the correct assignment of read groups with very little data.

For each aligned base that overlaps a known genotype, we calculate the probability the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.

Each individual in a pedigree has a different combination of genotypes, and bamGenotypeCheck will systematically search for the individual whose genotypes best match the observed read data.

For more about the technical details, see the page Verifying Sample Identities - Implementation

Reference

Please cite the following paper:

G. Jun, M. Flickinger, K. N. Hetrick, Kurt, J. M. Romm, K. F. Doheny, G. Abecasis, M. Boehnke,and H. M. Kang, Detecting and Estimating Contamination of Human DNA Samples in Sequencing and Array-Based Genotype Data, American journal of human genetics doi:10.1016/j.ajhg.2012.09.004 (volume 91 issue 5 pp.839 - 848)


Contamination in Array Data

VerifyIDintensity or BAFRegress can estimate sample contamination from Illumina genotype array data.

Acknowledgements

VerifyBamID is a result from collaborative effort by Hyun Min Kang, Goo Jun, Matthew Flickinger, Mary Kate Wing, Goncalo Abecasis, and Michael Boehnke. Please email to Hyun Min Kang [hmkang@umich.edu ] for any questions.