VerifyBamID

verifyBamID is a software that verifies whether the reads in particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples. verifyBamID can detect sample contamination and swaps when external genotypes are available. When external genotypes are not available, verifyBamID still robustly detects sample swaps.

Download verifyBamID

To get a copy go to the VerifyBamID Download download page.

Join in verifyBamID mailing list

Please join in the | VerifyBamID Google Group to ask / discuss / comment about verifyBamID.

What's new

(2012/06/20)

Fixed a bug of incorrect estimate of contamination when --chip-full option was used (Thanks to Richard Smith)
Fixed a bug of incorrect per-readgroup output in --chip-* parameter

(2012/05/24)

Fixed a bug of incorrect per-readgroup output (Thanks to Matthew Flickinger)
(IMPORTANT) Add an option to remove either side of overlapping fragment. This option is turned on by default, and can be turned off usig --ignoreOverlapPair. If your sequence data has very short insert size, this update may increase the sensitivity of estimated contamination.
Changes in the directory structure and Makefile

(2012/05/18) The new release of verifyBamID have undergone major change since the last version (as of 2011 April). Here are the highlights

The genotype / allele frequency file is now based on VCF format rather than PLINK format.
The reference sequence information is no longer required
Uses Brent's method for precise estimation of contamination parameters
Generate the depth distribution statistics.
Estimated reference-bias parameters (useful mostly for ABI SOLiD sequence data)

Build verifyBamID

The binary download of verifyBamID is available. You may use the version in Ubuntu 64-bit platform. To build verifyBamID, download the statgen library and run the following series of commands

tar xzvf verifyBamID.20120620.tar.gz
cd verifyBamID
make cloneLib (in the case ../libStatGen does not exist)
make
./bin/verifyBamID

Note that git clone command will create a directory ./libStatGen under your working directory, and make will create binary of verifyBamID under verifyBamID/bin/

verifyBamID is designed to be reasonably portable.

However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, and later, there are likely other portability issues.

Currently we support verifyBamID only on Ubuntu 9.10 and later on 64-bit processors.

Basic Usage

A key step in any genetic analysis is to verify whether data being generated matches expectations. verifyBamID checks whether reads in a BAM file match previous genotypes for a specific sample. In addition, it detects possible sample mixture from population allele frequency only, which can be particularly useful when the genotype data is not available.

Using a mathematical model that relates observed sequence reads to an hypothetical true genotype, verifyBamID tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.

Basic Usage Example

Here is a typical command line:

verifyBamID --vcf [input.vcf] --bam [input.bam] --out [output.prefix] --verbose --ignoreRG

where
[input.bam] is a BAM (Binary Alignment Map) file of a sequence reads
[input.vcf] is input VCF file containing individual genotypes or AF or AC/AN fields in the INFO field. gzipped VCF is also allowed.
[outPrefix] is output prefix of output files - [outPrefix].{selfRG,selfSM,bestRG,bestSM,depthRG,depthSM} will be created.

More detailed description of command line input is below

Preparing input files

verifyBamID requires two input files - VCF file containing external genotypes or allele frequency information, and the BAM file.

VCF input genotype file

The input VCF file contains (1) external genotype information and/or (2) allele frequency information as AF entry or AC/AN entries in the INFO field. (See | VCF specification for further details). If neither information is provided, verifyBamID will not work properly.

If external genotype information is provided, sequence+array method will identify contamination and sample swaps by comparing the concordance between the external genotypes and the sequence reads. Additionally, sequence-only method will provide additional contamination estimates by modeling the sequence reads as mixture of two unknown samples based on the allele frequency information in the VCF file.

Input VCF file needs to meet several additional contraints need to meet in order to properly run verifyBamID.

The VCF is assumed to be well-formed. For example, verifyBamID does not check whether REF allele actually matches with reference sequence.
The VCF should only contain SNPs. Current version of verifyBamID does not accept INDELs, MNPs, Structural Variations, or other complex variants.
The individual IDs in the VCF file, must be identical with the individual identifier in the BAM file. Otherwise, --smID option can override the sample ID information of the BAM file to the ID that matches to the individual IDs in the VCF file.
IMPORTANT : For targeted sequencing data, it is important to subselect the markers to only include on-target markers in the genotype file. Off-target markers are not likely to have multiple non-duplicated reads at the marker position, and it may create artifacts in the analysis due to overlapping fragments.
Currently, verifyBamID takes only autosomal chromosomes as input VCF.

An example input VCF file (without external genotype) is provided below. Note that AC and AC entries exists in the INFO field for the allele frequency information.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
20	61651	SNP20-9651	C	A	.	PASS	CR=99.86851;GentrainScore=0.7055;HW=0.077647716;AN=2180;AC=11
20	63231	SNP20-11231	T	G	.	PASS	CR=99.93036;GentrainScore=0.7837;HW=0.035481825;AN=2182;AC=275
20	63244	rs6139074	A	C	.	PASS	CR=98.893394;GentrainScore=0.8001;HW=7.327299E-7;AN=2162;AC=501
20	63799	rs1418258	C	T	.	PASS	CR=99.75217;GentrainScore=0.8170;HW=0.6653377;AN=2182;AC=881

Input BAM file

verifyBamID requires a sorted, indexed, base quality recalibrated, and duplication-marked BAM file. It also requires to contain "@RG" header lines to annotation different readGroups (sequencing runs and lanes). The SM tag in the "@RG" header should match with one of the genotyped sample. Otherwise, verifyBamID may not be able to test whether the sequenced sample matches with genotyped sample, but will try to detect sample mixture from allele frequency, and will try to detect the best-matching sample among the genotyped sample.

What the default option does

The default option of verifyBamID is the recommended setting for the most sequencing studies to provide a rapid and informative response. The default option provides the following features:

--free-mix is turned on for estimating contamination using sequence-only method
--chip-mix is turned on for estimating contamination or swap using sequence+array method, if the external genotype file is provided in the VCF
--self is turnd on : The default option does not try to compare the sequence reads to identify the best matching individual (which is possible with --best option). It only compares with the external genotypes from the same individual to the sequenced individual.
--maxDepth 20 is used without --precise option : The default option is intended for whole genome low coverage sequencing. For the targeted exome sequencing, --maxDepth 1000 and --precise is recommended.
--ignoreRG is not a default option, but a recommended option, when you want to check the contamination for the entire BAM rather than examining each read group separately. This option will increase the computational efficiency especially in the case whether the sequence reads are multiplexed across many sequencing runs.

Interpreting output files

Output files

When verifyBamID runs successfully, the following sets of files may be generated.

[outPrefix].selfSM - Per-sample statistics describing how well the sample matches to the annotated sample.
[outPrefix].depthSM - The depth distribution of the sequence reads per sample
[outPrefix].selfRG - Per-readGroup statistics describing how well each lane matches to the annotated sample. (available only without --ignoreRG option)
[outPrefix].depthRG - The depth distribution of the sequence reads per readGroup. (available only without --ignoreRG option)
[outPrefix].bestSM - Per-sample best-match statistics with best-matching sample among the genotyped sample (available only with --best option)
[outPrefix].bestRG - Per-readgroup best-match statistics with best-matching sample among the genotyped sample (available only with --best and without --ignoreRG option)

Column information in the output files

The .selfSM/.selfRG/.bestSM/.bestRG files have the following 19 columns per sample, or per readgroup (lane).

SEQ_SM : Sample ID of the sequenced sample. Obtained from @RG header / SM tag in the BAM file
RG : ReadGroup ID of sequenced lane. For [outPrefix].selfSM and [outPrefix].bestSM, these values are "ALL"
CHIP_ID : Sample ID compared to in the genotype file. For [outPrefix].selfRG and [outPrefix].selfSM, these values should be identical to [SEQ_SM] or "NA" if the genotype of sequenced samples are unavailable. For [outPrefix].bestRG and [outPrefix].bestSM, these values should be the ID of best-matching sample among the genotype files compared to.
# SNPs : # of SNPs passing the criteria from the VCF file
# READS : Total # of reads loaded from the BAM file
# AVG_DP : Average sequencing depth at the sites in the VCF file
FREEMIX : Sequence-only estimate of contamination
FREELK1 : Maximum log-likelihood of the sequence reads given estimated contamination under sequence-only method
FREELK0 : Log-ikelihood of the sequence reads given no contamination under sequence-only method
FREE_RH : Estimated reference bias parameter Pr(refBase|HET) (when --free-refBias or --free-full is used)
FREE_RA : Estimated reference bias parameter Pr(refBase|HOMALT) (when --free-refBias or --free-full is used)
CHIPMIX : Sequence+array estimate of contamination (NA if the external genotype is unavailable)
CHIPLK1 : Maximum log-likelihood of the sequence reads given estimated contamination under sequence+array method (NA if the external genotypes are unavailable)
CHIPLK0 : Log-likelihood of the sequence reads given no contamination under sequence+array method (NA if the external genotypes are unavailable)
CHIP_RH : Estimated reference bias parameter Pr(refBase|HET) (when --chip-refBias or --chip-full is used)
CHIP_RA : Estimated reference bias parameter Pr(refBase|HOMALT) (when --chip-refBias or --chip-full is used)
DPREF : Depth (Coverage) of HomRef site (based on the genotypes of (SELF_SM/BEST_SM), passing mapQ, baseQual, maxDepth thresholds.
RDPHET : DPHET/DPREF, Relative depth at Heterozygous site.
RDPALT : DPHET/DPREF, Relative depth at HomAlt site.

A guideline to interpret output files

verifyBamID provides a series of information that is informative to determine whether the sample is possibly contaminated or swapped, but there is no single criteria that works for every circumstances. There are a few unmodeled factor in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the MLE estimation may not always exactly match to the true amount of contamination. Here we provide a guideline to flag potentially contaminated/swapped samples

Each sample or lane can be checked in this way. When [CHIPMIX] >> 0.02 and/or [FREEMIX] >> 0.02, meaning 2% or more of non-reference bases are observed in reference sites, we recommend to examine the data more carefully for the possibility of contamination.
We recommend to check each lane for the possibility of sample swaps. When [CHIPMIX] ~ 1 AND [FREEMIX] ~ 0, then it is possible that the sample is swapped with another sample. When [CHIPMIX] ~ 0 in .bestSM file, [CHIP_ID] might be actually the swapped sample. Otherwise, the swapped sample may not exist in the genotype data you have compared.
When genotype data is not available but allele-frequency-based estimates of [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, then it is possible that the sample is contaminated with other sample. We recommend to use per-sample data rather than per-lane data for checking this for low coverage data, because the inference will be more confident when there are large number of bases with depth 2 or higher.

Command Line Options

The following parameters are available.  Ones with "[]" are in effect:

Available Options
                            Input Files : --vcf [], --bam [], --subset [],
                                          --smID []
                   VCF analysis options : --genoError [1.0e-03],
                                          --minAF [0.01],
                                          --minCallRate [0.50]
  Individuals to compare with chip data : --site, --self, --best
         Chip-free optimization options : --free-none, --free-mix [ON],
                                          --free-refBias, --free-full
         With-chip optimization options : --chip-none, --chip-mix [ON],
                                          --chip-refBias, --chip-full
                   BAM analysis options : --ignoreRG, --ignoreOverlapPair,
                                          --noEOF, --precise, --minMapQ [10],
                                          --maxDepth [20], --minQ [13],
                                          --maxQ [40], --grid [0.05]
                Modeling Reference Bias : --refRef [1.00], --refHet [0.50],
                                          --refAlt [0.00]
                         Output options : --out [], --verbose

Each option provides the following features:

--vcf : specify required VCF file
--bam : specify required BAM file (indexed with .bam.bai or .bai file)
--subset : list of individual IDs to calculate the allele frequency. All individuals are used if unspecified
--smID : If the individual ID in the BAM file and VCF file does not match, substitute the BAM file's ID into the specified argument
--genoError : error rate of the external genotype file
--minAF : minimum allele frequency of the markers to include
--minAF : minimum call rate of the markers to include
--site : If set, use only site information in the VCF and do not compare with the actual genotypes
--self : Only compare the ID-matching individuals between the VCF and BAM file
--best : Find the best matching individuals (.bestSM and .bestRG files will be produced). This option is substantially longer than the default option
--free-none : Do not perform sequence-only method to estimate parameters
--free-mix : (default) Estimate contamination using sequence-only method with Brent's single dimensional optimization.
--free-refBias : Estimate the refernece bias parameters using sequence-only method with Simplex method
--free-full : Estimate both reference bias parameters and the contamination parameters using sequence-only method
--chip-none : Do not perform sequence+array method to estimate parameters
--free-mix : (default) Estimate contamination using sequence+array method with Brent's single dimensional optimization.
--free-refBias : Estimate the refernece bias parameters using sequence+array method with Simplex method
--free-full : Estimate both reference bias parameters and the contamination parameters using sequence+array method
--ignoreRG : ignore the read grouup level comparison and compare samples only (recommended for an expedited run)
--ignoreOverlapPair : ignore overlapping pair end fragment covering the same base. Disabling this option may decrease the sensitivity of the method when the insert size is short (with slight gain in the computational speed)
--noEOF : do not check the EOF marker of the BAM file (for earlier version of BAM)
--precise : calculate the likelihood in log-scale for high-depth data (recommended when --maxDepth is greater than 20. Can be a little bit slower)
--minMapQ : minimum mapping quality of the sequence reads to compare
--minQ : minimum base quality to include
--maxQ : maximum base quality to cap
--grid : the grid interval to search the optimum before running Brent's algorithm.
--refRef : Initial Pr(refBase|HOMREFGeno) parameter
--refHet : Initial Pr(refBase|HETGeno) parameter
--refAlt : Initial Pr(refBase|HOMALTGeno) parameter
--out : output file prefix (required)
--verbose : print the progress of the method on the screeen

Principle of Operation

Each read group in a BAM file is evaluated independently. This means that in file with multiple read groups, problems will be flagged at the read group level (a plus). However, it also means that it might be hard to discern the correct assignment of read groups with very little data.

For each aligned base that overlaps a known genotype, we calculate the probability the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.

Each individual in a pedigree has a different combination of genotypes, and bamGenotypeCheck will systematically search for the individual whose genotypes best match the observed read data.

For more about the technical details, see the page Verifying Sample Identities - Implementation

Reference

Please cite the following paper:

G. Jun, M. Flickinger, K. N. Hetrick, Kurt, J. M. Romm, K. F. Doheny, G. Abecasis, M. Boehnke,and H. M. Kang, Detecting and Estimating Contamination of Human DNA Samples in Sequencing and Array-Based Genotype Data, American journal of human genetics doi:10.1016/j.ajhg.2012.09.004 (volume 91 issue 5 pp.839 - 848)

Acknowledgements

VerifyBamID is a result from collaborative effort by Hyun Min Kang, Goo Jun, Matthew Flickinger, Mary Kate Trost, and Goncalo Abecasis. Please email to Hyun Min Kang [hmkang@umich.edu ] for any questions.

VerifyBamID

Contents