Difference between revisions of "VerifyBamID"

From Genome Analysis Wiki
Jump to navigationJump to search
Line 139: Line 139:
 
=== A guideline to interpret output files ===
 
=== A guideline to interpret output files ===
  
verifyBamID provides a series of information that is informative to determine whether the sample is possibly contaminated or swapped, but there is no single criteria that works for every circumstances. There are a few unmodeled factor in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the MLE estimation may not exactly match to the true amount of contamination. Here we provide a guideline to flag potentially contaminated/swapped samples  
+
verifyBamID provides a series of information that is informative to determine whether the sample is possibly contaminated or swapped, but there is no single criteria that works for every circumstances. There are a few unmodeled factor in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the MLE estimation may not always exactly match to the true amount of contamination. Here we provide a guideline to flag potentially contaminated/swapped samples  
  
 
* When [SELF-IBD] < 1 AND [%MIX] > 0 AND [REF-A2%] > 0.01, meaning 1% or more of non-reference bases are observed in reference sites, we recommend to examine the data more carefully for the possibility of contamination. Each sample or lane can be checked in this way.
 
* When [SELF-IBD] < 1 AND [%MIX] > 0 AND [REF-A2%] > 0.01, meaning 1% or more of non-reference bases are observed in reference sites, we recommend to examine the data more carefully for the possibility of contamination. Each sample or lane can be checked in this way.

Revision as of 14:30, 6 September 2010


verifyBamID is a software that verifies whether the reads in particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples.

Download verifyBamID

To get a copy go to the VerifyBamID Download download page.

Build verifyBamID

verifyBamID is designed to be reasonably portable.

However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, and later, there are likely other portability issues.

Currently we support verifyBamID only on Ubuntu 9.10 and later on 64-bit processors.

Basic Usage

A key step in any genetic analysis is to verify whether data being generated matches expectations. verifyBamID checks whether reads in a BAM file match previous genotypes for a specific sample. In addition, it detects possible sample mixture from population allele frequency only, which can be particularly useful when the genotype data is not available.

Using a mathematical model that relates observed sequence reads to an hypothetical true genotype, verifyBamID tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.

Basic Usage Example

Here is a typical command line:

verifyBamID  --reference [reference.fa] --in [inputReads.bam] --bfile [inputGenotypes] --out [outPrefix] --verbose

where
[reference.fa] is a FASTA format file, preferably (but not necessarily) indexed a priori with karma or other libcsg-compatible software
[inputReads.bam] is a BAM (Binary Alignment Map) file of a sequence reads
[inputGenotypes] is input prefix of forward-stranded PLINK binary file
[outPrefix] is output prefix of output files - [outPrefix].{selfRG,selfSM,bestRG,bestSM} will be created.

More detailed description of command line input is provided below

Preparing input files

verifyBamID requires three input files - reference FASTA file, PLINK-format binary genotype file, and BAM file.

Reference FASTA format file

Reference file should be in FASTA format (typically [reference].fa). verifyBamID uses libcsg-format for efficiently indexing the [reference].fa file, creating [reference]-bs.umfa file in the directory where the FASTA file exists. If your FASTA file is not previously indexed, please make sure that the directory is writable. Otherwise, it will report errors when creating the index.

PLINK-format binary input genotype file

verifyBamID requires either PLINK-compatible binary genotype file [inputGenotypes].{bim,bed,fam} or allele frequency files [inputGenotypes].{bimp} file. For detailed information of plink format please refer to http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed

Input genotype file needs to meet several additional contraints need to meet in order to properly run verifyBamID.

  • The genotype file must be forward-stranded. In other words, in the marker information [inputGenotypes].bim file, either of the two alleles must match to the reference base provided as [reference].fa file. The marker information should also be sorted by genomic positions.
  • The individual IDs in [inputGenotypes].fam file, must be compatible with the sequenced individual. In the header of the bamfile (which can be view by 'samtools view -H [inputReads].bam'), Each header starts with '@RG' contains 'SM' tag, and the representation of individual ID must be identical to the matching individual's ID in [inputGenotypes].fam file. For more information of BAM file specification, please refer to http://samtools.sourceforge.net/
  • The genotype file is assumed to be in SNP-major format (most binary PLINK formats are in SNP-major format by default).
  • IMPORTANT : For targeted sequencing data, it is important to subselect the markers to only include on-target markers in the genotype file. Off-target markers are not likely to have multiple non-duplicated reads at the marker position, and it may create artifacts in the analysis.

When the genotype data of the sequenced individual is not available, sample contamination can possibly be detected by using population minor allele frequency. If the sequence reads are better explained as a mixture of multiple individuals rather than a single individual, the sample can be flagged for suspected contamination. verifyBamID takes alternative format for allele-frequency only data, which is the same format to PLINK's .bim format, with one additional column, as follows

[CHROM] [SNP_ID] [CentiMorgan (could be 0)] [BASE-POSITION] [ALLELE1] [ALLELE2] [ALLELE1-FREQ]

Below is example lines of .bimp file, a modified file with ALLELE1-FREQUENCY.

1	rs3131972	0	752721	A	G	0.1652
1	rs3131969	0	754182	A	G	0.1339
1	rs1048488	0	760912	C	T	0.1667
1	rs12562034	0	768448	A	G	0.1027
1	rs12124819	0	776546	G	A	0.2902
1	rs4040617	0	779322	G	A	0.1295
1	rs4970383	0	838555	A	C	0.1964
1	rs4475691	0	846808	T	C	0.1429
1	rs1806509	0	853954	C	A	0.3884
1	rs7537756	0	854250	G	A	0.1607

The HapMap3 CEU .bimp file can be downloaded at VerifyBamID Download page

Input BAM file

verifyBamID requires a sorted, indexed, base quality recalibrated, and duplication-marked BAM file. It also requires to contain "@RG" header lines to annotation different readGroups (sequencing runs and lanes). The SM tag in the "@RG" header should match with one of the genotyped sample. Otherwise, verifyBamID may not be able to test whether the sequenced sample matches with genotyped sample, but will try to detect sample mixture from allele frequency, and will try to detect the best-matching sample among the genotyped sample.

Two types of input BAM files can be allowed.

Single file, whole genome BAM file

If an alignment file is a single BAM file across the genome, --in [inputReads].bam can be used to take the BAM file as input.

Multiple files separated by chromosomes

If the alignment file is split per chromosome (as in 1000 genomes project BAM file), --inprefix [prefix] --insuffix [suffix] should be used to take the BAM input files.

For example, if the following series of input BAM files need to be used as input files

NA06985.chrom1.ILLUMINA.bwa.CEU.low_coverage.20100311.bam
NA06985.chrom2.ILLUMINA.bwa.CEU.low_coverage.20100311.bam
NA06985.chrom3.ILLUMINA.bwa.CEU.low_coverage.20100311.bam
....
NA06985.chrom22.ILLUMINA.bwa.CEU.low_coverage.20100311.bam

The following option should be used for input files (note dot after --insuffx)

--inprefix NA06985.chrom --insuffix .ILLUMINA.bwa.CEU.low_coverage.20100311.bam

Interpreting output files

Output files

When verifyBamID runs successfully, it generates the following four (or two) files

  • [outPrefix].selfSM - Per-sample statistics describing how well the sample matches to the annotated sample.
  • [outPrefix].selfRG - Per-readGroup statistics describing how well each lane matches to the annotated sample.
  • [outPrefix].bestSM - Per-sample best-match statistics with best-matching sample among the genotyped sample (won't be produced if there is no individual to compare to)
  • [outPrefix].bestRG - Per-readgroup best-match statistics with best-matching sample among the genotyped sample (won't be produced if there is no individual to compare to)

Column information in the output files

Each of these files have the following 27 columns per sample, or per readgroup (lane). When individual genotypes are unavailable, The values of 3-20 will be "N/A"

  1. SEQ_SM : Sample ID of the sequenced sample. Obtained from @RG header / SM tag in the BAM file
  2. RG : ReadGroup ID of sequenced lane. For [outPrefix].selfSM and [outPrefix].bestSM, these values are "N/A"
  3. SELF_SM/BEST_SM : Sample ID compared to in the genotype file. For [outPrefix].selfRG and [outPrefix].selfSM, these values should be identical to [SEQ_SM] or "N/A" if the genotype of sequenced samples are unavailable. For [outPrefix].bestRG and [outPrefix].bestSM, these values should be the ID of best-matching sample among the genotype files compared to.
  4. SELFIBD/BESTIBD : Maximum-likelihood estimation of fraction of reads in IBD(identity-by-descent) with SELF_SM/BEST_SM in the genotyped sample. 1-[SELFIBD] can be used as an estimated amount of contamination, unless sample swap has occurred.
  5. SELFIBDLLK/BESTIBDLLK : Log likelihood of the sequence reads given the MLE SELFIBD/BESTIBD with SELF_SM/BEST_SM
  6. SELFIBDLK-/BESTIBDLK- : Difference of log-likelihood between SELFIBDLLK/BESTIBDLLK and the likelihood of reads under no contamination (SELFIBD=1).
  7. #GENOS : Number of non-missing genotypes in SELF_SM/BEST_SM
  8. #BASES : Number of bases in the sequenced reads (passing mapQ, baseQual, maxDepth threshold) matching the marker sites
  9. %GENREF : Fraction of HomRef alleles in the genotypes across all markers, for SELF_SM/BEST_SM
  10. %GENHET : Fraction of Heterozygous alleles in the genotypes across all markers, for SELF_SM/BEST_SM
  11. %GENALT : Fraction of HomAlt alleles in the genotypes across all markers, for SELF_SM/BEST_SM
  12. DPREF : Depth (Coverage) of HomRef site (based on the genotypes of (SELF_SM/BEST_SM), passing mapQ, baseQual, maxDepth thresholds.
  13. RDPHET : DPHET/DPREF, Relative depth at Heterozygous site.
  14. RDPALT : DPHET/DPREF, Relative depth at HomAlt site.
  15. REF-A1% : Fraction of reference bases in HomRef site
  16. REF-A2% : Fraction of alternative bases in HomRef site
  17. HET-A1% : Fraction of reference bases in Heterozygous site
  18. HET-A2% : Fraction of alternative bases in Heterozygous site
  19. ALT-A1% : Fraction of reference bases in HomAlt site
  20. ALT-A2% : Fraction of alternative bases in HomAlt site
  21. #DP>1 : Number of sites with depth of 2 or greater
  22. %HETAF : Fraction of heterozygous alleles estimated by population allele frequency
  23. %HETSEQ : Fraction of heterozygous alleles estimated from sequence data (without assuming base errors)
  24. EXHET : [%HETSEQ]/[%HETAF]
  25. %MIX : Maximum-likelihood estimate of % of contamination based on two-sample mixture model from population allele frequency.
  26. BESTMIXLLK : Log-likelihood of the data given the MLE %MIX
  27. BESTMIXLLK- : Difference of log-likelihood between BESTMIXLLK and the likelihood of reads under no contamination (%MIX=0).

A guideline to interpret output files

verifyBamID provides a series of information that is informative to determine whether the sample is possibly contaminated or swapped, but there is no single criteria that works for every circumstances. There are a few unmodeled factor in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the MLE estimation may not always exactly match to the true amount of contamination. Here we provide a guideline to flag potentially contaminated/swapped samples

  • When [SELF-IBD] < 1 AND [%MIX] > 0 AND [REF-A2%] > 0.01, meaning 1% or more of non-reference bases are observed in reference sites, we recommend to examine the data more carefully for the possibility of contamination. Each sample or lane can be checked in this way.
  • When [SELF-IBD] << 1 AND [%MIX] ~ 0, then it is possible that the sample is swapped with another sample. When [BEST-IBD] ~ 1, [BEST_SM] might be actually the swapped sample. Otherwise, the swapped sample may not exist in the genotype data you have compared against. We recommend to check each lane for the possibility of sample swpas.
  • When genotype data is not available but allele-frequency-based estimates of [%MIX] >= 0.03 and [BESTMIXLLK-] is large (greater than 100), then it is possible that the sample is contaminated with other sample. We recommend to use per-sample data rather than per-lane data for checking this for low coverage data, because the inference will be more confident when there are large number of bases with depth 2 or higher.

Command Line Options

USAGE: 

  ./verifyBamID  [--ibdUnit <double>] [--mixUnit <double>] [-f <double>]
                 [-g <double>] [-d <integer>] [-m <integer>] [-Q
                 <integer>] [-q <integer>] [-B <string>] [-b <string>] [-l
                 <string>] -o <string> [-I <string>] [-s <string>] [-p
                 <string>] [-i <string>] -r <string> [-M] [-F] [-S] [-n]
                 [-v] [--] [--version] [-h]


Where: 

  --ibdUnit <double>
    unit of IBD values (default: 0.01)

  --mixUnit <double>
    unit of % mixture (default:0.01)

  -f <double>,  --minAF <double>
    Minimum allele frequency (default: 0.005). Markers with lower allele
    frequencies will be ignored

  -g <double>,  --genoError <double>
    Error rate in genotype data (default: 0.005)

  -d <integer>,  --maxDepth <integer>
    Maximum depth per site (default:20) - bases with higher depth will be
    ignored due to possible alignment artifacts. For deep coverage data,
    it is important to set this value to a sufficiently large number (e.g.
    200)

  -m <integer>,  --minMapQ <integer>
    Minimum mapping quality value (default:10) - reads with lower quality
    will be ignored

  -Q <integer>,  --maxQ <integer>
    Maximum Phred-scale base quality value (default:40) - higher base
    quality will be enforced to be this value

  -q <integer>,  --minQ <integer>
    Minimum Phred-scale base quality value (default:20) - bases with lower
    quality will be ignored

  -B <string>,  --bimpfile <string>
    PLINK format BIM file with allele frequency information at the last
    column

  -b <string>,  --bfile <string>
    Binary PLINK format genotype file prefix. Must be forward-stranded

  -l <string>,  --log <string>
    Log file - default: [out].log

  -o <string>,  --out <string>
    (required)  Prefix of output files

  -I <string>,  --index <string>
    Index of input BAM file - [inputBam].bai will be default value

  -s <string>,  --insuffix <string>
    Suffix of input BAM file for multi-chromosome inputs

  -p <string>,  --inprefix <string>
    Prefix of input BAM file for multi-chromosome inputs

  -i <string>,  --in <string>
    Input BAM file. Must be sorted and indexed

  -r <string>,  --reference <string>
    (required)  Karma's reference sequence

  -M,  --mmap
    toggle whether to use memory map (default:true)

  -F,  --bimAF
    use the allele frequency information by loading .bimp file instead of
    .bim file

  -S,  --selfonly
    compare the genotypes with SELF (annotated sample) only (default:off)

  -n,  --noeof
    Do not check EOF marker for BAM file (default:off)

  -v,  --verbose
    Toggle verbose mode (default:off)

  --,  --ignore_rest
    Ignores the rest of the labeled arguments following this flag.

  --version
    Displays version information and exits.

  -h,  --help
    Displays usage information and exits.

Principle of Operation

Each read group in a BAM file is evaluated independently. This means that in file with multiple read groups, problems will be flagged at the read group level (a plus). However, it also means that it might be hard to discern the correct assignment of read groups with very little data.

For each aligned base that overlaps a known genotype, we calculate the probability the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.

Each individual in a pedigree has a different combination of genotypes, and bamGenotypeCheck will systematically search for the individual whose genotypes best match the observed read data.

For more about the technical details, see the page Verifying Sample Identities - Implementation

Acknowledgements

VerifyBamID is a result from collaborative effort by Hyun Min Kang, Paul Anderson, Goo Jun, Mary Kate Trost, Wei Chen, Tom Blackwell, and Goncalo Abecasis. Please email to Hyun Min Kang [hmkang@umich.edu ] for any questions.