VerifyBamID

From Genome Analysis Wiki
Revision as of 14:27, 3 September 2010 by Hmkang


verifyBamID is software that verifies whether the reads in a particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples.

Download verifyBamID

To get a copy, go to the VerifyBamID Download page.

Build verifyBamID

verifyBamID is designed to be reasonably portable.

However, since development occurs only on Ubuntu 9.10 and later (x86 and x64 platforms), portability issues on other platforms are likely.

Currently we support verifyBamID only on Ubuntu 9.10 and later on 64-bit processors.

Usage

A key step in any genetic analysis is to verify whether data being generated matches expectations. This program checks whether reads in a BAM file match previous genotypes for a specific sample.

Using a mathematical model that relates observed sequence reads to a hypothetical true genotype, verifyBamID tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.
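The contamination idea can be illustrated with a small two-sample mixture sketch. This is not verifyBamID's actual model; it assumes biallelic sites, known genotypes for both samples, and a single per-base error rate. All function names and the grid step are hypothetical choices made for illustration.

```python
import math

def read_alt_probability(g_sample, g_other, alpha, error):
    """P(a read shows the alternate allele) under a two-sample mixture.

    g_sample, g_other: alternate-allele counts (0, 1, or 2).
    alpha: assumed fraction of reads that come from the other sample.
    error: assumed probability that a base is read as the other allele.
    """
    alt_fraction = (1 - alpha) * g_sample / 2 + alpha * g_other / 2
    return alt_fraction * (1 - error) + (1 - alt_fraction) * error

def mixture_log_likelihood(observations, alpha):
    """Sum log-probabilities over (g_sample, g_other, is_alt, error) tuples."""
    total = 0.0
    for g_sample, g_other, is_alt, error in observations:
        p_alt = read_alt_probability(g_sample, g_other, alpha, error)
        total += math.log(p_alt if is_alt else 1 - p_alt)
    return total

def best_mixture_fraction(observations, step=0.01, max_alpha=0.5):
    """Grid search for the mixture fraction with the highest likelihood."""
    n_steps = int(round(max_alpha / step))
    grid = [i * step for i in range(n_steps + 1)]
    return max(grid, key=lambda a: mixture_log_likelihood(observations, a))

# Toy data: the sample is homozygous reference, the other sample is
# homozygous alternate, and 10 of 100 reads show the alternate allele.
observations = [(0, 2, i < 10, 0.001) for i in range(100)]
estimate = best_mixture_fraction(observations)
print(f"estimated mixture fraction: {estimate:.2f}")
```

A grid search is used here only because it is easy to follow; any one-dimensional optimizer would serve the same purpose.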

Basic Usage Example

Here is a typical command line:

  verifyBamID  --reference /data/local/ref/karma.ref/human.g1k.v37.fa --in inputReads.bam --bfile inputGenotypes --out outPrefix --verbose
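When the command is launched from a script, it can help to assemble the argument list programmatically. The sketch below only builds and prints the command shown above; the helper name `build_verify_bam_id_command` is hypothetical, and the executable name is assumed to be `verifyBamID` on the `PATH` (the usage text refers to `bin/VerifyBamIDMix`, so adjust as needed).

```python
import shlex

def build_verify_bam_id_command(reference, bam, bfile_prefix, out_prefix,
                                verbose=True, executable="verifyBamID"):
    """Return the argument list for the basic usage example.

    Only options that appear in the command line options section are used.
    """
    command = [
        executable,
        "--reference", reference,
        "--in", bam,
        "--bfile", bfile_prefix,
        "--out", out_prefix,
    ]
    if verbose:
        command.append("--verbose")
    return command

command = build_verify_bam_id_command(
    reference="/data/local/ref/karma.ref/human.g1k.v37.fa",
    bam="inputReads.bam",
    bfile_prefix="inputGenotypes",
    out_prefix="outPrefix",
)
# Print a shell-quoted form for inspection; pass `command` to
# subprocess.run(command, check=True) to actually launch the tool.
print(shlex.join(command))
```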

Command Line Options

USAGE: 

  bin/VerifyBamIDMix  [--ibdUnit <double>] [--mixUnit <double>] [-f
                      <double>] [-g <double>] [-d <integer>] [-m
                      <integer>] [-Q <integer>] [-q <integer>] [-B
                      <string>] [-b <string>] [-l <string>] -o <string>
                      [-I <string>] [-s <string>] [-p <string>] [-i
                      <string>] -r <string> [-F] [-S] [-n] [-v] [--]
                      [--version] [-h]


Where: 

  --ibdUnit <double>
    unit of IBD values

  --mixUnit <double>
    unit of % mixture

  -f <double>,  --minAF <double>
    Minimum allele frequency

  -g <double>,  --genoError <double>
    Genotyping error rate

  -d <integer>,  --maxDepth <integer>
    Maximum depth per site

  -m <integer>,  --minMapQ <integer>
    Minimum mapping quality value

  -Q <integer>,  --maxQ <integer>
    Maximum base quality value

  -q <integer>,  --minQ <integer>
    Minimum base quality value
 
  -B <string>,  --bimpfile <string>
    PLINK format BIMP file with allele frequency information at the last
    column

  -b <string>,  --bfile <string>
    PLINK format genotype file prefix

  -l <string>,  --log <string>
    Log file - [out].log will be default value

  -o <string>,  --out <string>
    (required)  prefix output file

  -I <string>,  --index <string>
    Index of input BAM file - [inputBam].bai will be default value

  -s <string>,  --insuffix <string>
    Suffix of input BAM file - multichr inputs

  -p <string>,  --inprefix <string>
    Prefix of input BAM file - multichr inputs

  -i <string>,  --in <string>
    Input BAM file

  -r <string>,  --reference <string>
    (required)  Karma's reference sequence

  -F,  --bimAF
    use the allele frequency information by loading .bimp file instead of
    .bim file

  -S,  --selfonly
    compare the genotypes with SELF (annotated sample) only to increase the speed

  -n,  --noeof
    Do not check EOF marker for BAM file

  -v,  --verbose
    Verbose mode

  --,  --ignore_rest
    Ignores the rest of the labeled arguments following this flag.

  --version
    Displays version information and exits.

  -h,  --help
    Displays usage information and exits.


Principle of Operation

Each read group in a BAM file is evaluated independently. This means that in a file with multiple read groups, problems will be flagged at the read group level (a plus). However, it also means that it might be hard to discern the correct assignment of read groups with very little data.
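As a rough illustration of per-read-group evaluation, the sketch below partitions base observations by read group and lists the groups that have too little data to judge reliably. The tuple layout, the function names, and the `min_observations` cutoff are assumptions made for this example, not part of verifyBamID.

```python
from collections import defaultdict

def group_by_read_group(observations):
    """Partition (read_group, site, base, quality) tuples by read group."""
    groups = defaultdict(list)
    for read_group, site, base, quality in observations:
        groups[read_group].append((site, base, quality))
    return dict(groups)

def low_data_read_groups(groups, min_observations=3):
    """Return read groups with fewer observations than the (hypothetical) cutoff."""
    return sorted(rg for rg, obs in groups.items() if len(obs) < min_observations)

# Toy observations: read group "RG1" has plenty of data, "RG2" very little.
observations = [
    ("RG1", "rs1", "A", 30),
    ("RG1", "rs2", "C", 28),
    ("RG1", "rs3", "G", 35),
    ("RG2", "rs1", "G", 22),
]
groups = group_by_read_group(observations)
for read_group, obs in groups.items():
    print(read_group, len(obs), "observations")
print("too little data:", low_data_read_groups(groups))
```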

For each aligned base that overlaps a known genotype, we calculate the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.
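One common way to write such a per-base probability is sketched below. It assumes a diploid genotype, a Phred-scaled base quality, and errors spread evenly over the three other bases; the exact model used by verifyBamID may differ. The threshold defaults are placeholders, not the program's defaults.

```python
def phred_to_error(quality):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-quality / 10)

def base_likelihood(base, genotype, quality):
    """P(observed base | diploid genotype), assuming each allele is
    sampled with probability 1/2 and errors are uniform over other bases."""
    error = phred_to_error(quality)

    def given_allele(allele):
        return 1 - error if base == allele else error / 3

    first, second = genotype
    return 0.5 * given_allele(first) + 0.5 * given_allele(second)

def passes_thresholds(base_quality, mapping_quality,
                      min_base_quality=20, min_mapping_quality=10):
    """Keep a base only if both qualities meet the (placeholder) minimums."""
    return (base_quality >= min_base_quality
            and mapping_quality >= min_mapping_quality)

# A base "A" with quality 20 against three candidate genotypes.
for genotype in [("A", "A"), ("A", "G"), ("G", "G")]:
    print(genotype, base_likelihood("A", genotype, 20))
```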

Each individual in a pedigree has a different combination of genotypes, and verifyBamID will systematically search for the individual whose genotypes best match the observed read data.
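The search can be pictured as scoring every candidate individual against the same set of reads and ranking the scores. The following is a minimal sketch under simplifying assumptions (independent bases, diploid genotypes, uniform error over the other three bases); the data layout and function names are hypothetical and do not reflect verifyBamID's internals.

```python
import math

def site_log_likelihood(base, genotype, quality):
    """log P(base | genotype) with a Phred-scaled quality."""
    error = 10 ** (-quality / 10)
    probs = [(1 - error) if base == allele else error / 3 for allele in genotype]
    return math.log(sum(probs) / 2)

def rank_individuals(observations, genotypes):
    """Score each individual by total log-likelihood, best match first.

    observations: list of (site, base, quality) tuples.
    genotypes: {individual: {site: (allele1, allele2)}}.
    """
    scores = {}
    for individual, calls in genotypes.items():
        scores[individual] = sum(
            site_log_likelihood(base, calls[site], quality)
            for site, base, quality in observations
            if site in calls  # skip sites without a known genotype
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy pedigree and reads; the reads are consistent with the child only.
genotypes = {
    "father": {"rs1": ("A", "A"), "rs2": ("C", "T"), "rs3": ("G", "G")},
    "mother": {"rs1": ("A", "G"), "rs2": ("T", "T"), "rs3": ("G", "G")},
    "child": {"rs1": ("A", "G"), "rs2": ("C", "T"), "rs3": ("G", "G")},
}
observations = [
    ("rs1", "A", 30), ("rs1", "G", 30),
    ("rs2", "C", 30), ("rs2", "T", 30),
    ("rs3", "G", 30),
]
ranking = rank_individuals(observations, genotypes)
for individual, score in ranking:
    print(f"{individual}: {score:.2f}")
```

Working in log space keeps the sum numerically stable when many sites are combined.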

For more about the technical details, see the page Verifying Sample Identities - Implementation.

TODO