Difference between revisions of "VerifyBamID"

From Genome Analysis Wiki

Revision as of 10:29, 4 September 2010


verifyBamID is software that verifies whether the reads in a particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples.

Download verifyBamID

To get a copy, go to the VerifyBamID Download page.

Build verifyBamID

verifyBamID is designed to be reasonably portable.

However, development occurs only on Ubuntu 9.10 and later (x86 and x64 platforms), so portability issues on other platforms are likely.

Currently we support verifyBamID only on Ubuntu 9.10 and later on 64-bit processors.

Basic Usage

A key step in any genetic analysis is to verify whether data being generated matches expectations. verifyBamID checks whether reads in a BAM file match previous genotypes for a specific sample. In addition, it detects possible sample mixture from population allele frequency only, which can be particularly useful when the genotype data is not available.

Using a mathematical model that relates observed sequence reads to a hypothetical true genotype, verifyBamID tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.
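To make the idea of detecting a mixture from population allele frequencies more concrete, here is a minimal Python sketch. It is not verifyBamID's actual model. It assumes a simple two-sample mixture in which both genotypes are unknown and follow Hardy-Weinberg proportions, base errors follow the Phred quality, and the mixture fraction is searched on a grid whose step plays the role of --mixUnit. The function names (hwe_priors, site_likelihood, estimate_mixture) and the toy data are hypothetical.

```python
import math

def hwe_priors(f):
    """Hardy-Weinberg prior for 0, 1, 2 copies of the alternate allele."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]

def site_likelihood(reads, f, alpha):
    """Likelihood of one site's reads if a fraction alpha comes from a second sample.

    reads: list of (is_alt, base_q) pairs; f: population alt-allele frequency.
    """
    priors = hwe_priors(f)
    total = 0.0
    for g1, p1 in enumerate(priors):      # genotype of the intended sample
        for g2, p2 in enumerate(priors):  # genotype of the contaminating sample
            p_alt_true = (1 - alpha) * g1 / 2 + alpha * g2 / 2
            lik = p1 * p2
            for is_alt, q in reads:
                e = 10 ** (-q / 10)       # Phred quality -> error probability
                p_alt = p_alt_true * (1 - e) + (1 - p_alt_true) * e / 3
                p_ref = (1 - p_alt_true) * (1 - e) + p_alt_true * e / 3
                lik *= p_alt if is_alt else p_ref
            total += lik
    return total

def estimate_mixture(sites, mix_unit=0.01):
    """Grid search over alpha in [0, 0.5]; sites is a list of (reads, f) pairs."""
    steps = round(0.5 / mix_unit)
    grid = [i * mix_unit for i in range(steps + 1)]

    def log_lik(alpha):
        return sum(math.log(site_likelihood(reads, f, alpha)) for reads, f in sites)

    return max(grid, key=log_lik)

# Toy data (assumed): 20 sites, allele frequency 0.3, base quality 30.
clean = [([(False, 30)] * 10, 0.3)] * 20                      # only reference reads
mixed = [([(False, 30)] * 18 + [(True, 30)] * 2, 0.3)] * 20   # about 10% alternate reads
est_clean = estimate_mixture(clean)
est_mixed = estimate_mixture(mixed)
print(est_clean, est_mixed)
```

The search is restricted to fractions up to 0.5 because, in this simplified model, swapping the two samples gives the same likelihood.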

Basic Usage Example

Here is a typical command line:

verifyBamID  --reference [reference.fa] --in [inputReads.bam] --bfile [inputGenotypes] --out [outPrefix] --verbose

where
[reference.fa] is a FASTA format file, preferably (but not necessarily) indexed a priori with karma or other libcsg-compatible software
[inputReads.bam] is a BAM (Binary Alignment Map) file of sequence reads
[inputGenotypes] is the prefix of a forward-stranded binary PLINK file
[outPrefix] is the prefix of the output files; [outPrefix].{selfRG,selfSM,bestRG,bestSM} will be created.
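When verifyBamID is run from a script, the command line above can be assembled programmatically. The following Python sketch builds the argument list using only options listed on this page and derives the expected output file names from the prefix. The helper names (build_verifybamid_command, expected_outputs) and the file names are placeholders, and the sketch assumes the verifyBamID executable is on the PATH.

```python
import shlex

def build_verifybamid_command(reference, bam, bfile_prefix, out_prefix,
                              max_depth=None, verbose=True,
                              executable="verifyBamID"):
    """Return the argument list for one verifyBamID run."""
    cmd = [executable,
           "--reference", reference,
           "--in", bam,
           "--bfile", bfile_prefix,
           "--out", out_prefix]
    if max_depth is not None:
        # For deep coverage data the default maximum depth (20) is too low.
        cmd += ["--maxDepth", str(max_depth)]
    if verbose:
        cmd.append("--verbose")
    return cmd

def expected_outputs(out_prefix):
    """Output files named on this page: [outPrefix].{selfRG,selfSM,bestRG,bestSM}."""
    return [f"{out_prefix}.{ext}" for ext in ("selfRG", "selfSM", "bestRG", "bestSM")]

cmd = build_verifybamid_command("reference.fa", "inputReads.bam",
                                "inputGenotypes", "outPrefix", max_depth=200)
print(shlex.join(cmd))
print(expected_outputs("outPrefix"))
# To actually run the tool: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.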

A more detailed description of the command line options is provided below.

Command Line Options

USAGE: 

  ./verifyBamID  [--ibdUnit <double>] [--mixUnit <double>] [-f <double>]
                 [-g <double>] [-d <integer>] [-m <integer>] [-Q
                 <integer>] [-q <integer>] [-B <string>] [-b <string>] [-l
                 <string>] -o <string> [-I <string>] [-s <string>] [-p
                 <string>] [-i <string>] -r <string> [-M] [-F] [-S] [-n]
                 [-v] [--] [--version] [-h]


Where: 

  --ibdUnit <double>
    unit of IBD values (default: 0.01)

  --mixUnit <double>
    unit of % mixture (default:0.01)

  -f <double>,  --minAF <double>
    Minimum allele frequency (default: 0.005). Markers with lower allele
    frequencies will be ignored

  -g <double>,  --genoError <double>
    Error rate in genotype data (default: 0.005)

  -d <integer>,  --maxDepth <integer>
    Maximum depth per site (default:20) - bases with higher depth will be
    ignored due to possible alignment artifacts. For deep coverage data,
    it is important to set this value to a sufficiently large number (e.g.
    200)

  -m <integer>,  --minMapQ <integer>
    Minimum mapping quality value (default:10) - reads with lower quality
    will be ignored

  -Q <integer>,  --maxQ <integer>
    Maximum Phred-scale base quality value (default:40) - higher base
    quality will be enforced to be this value

  -q <integer>,  --minQ <integer>
    Minimum Phred-scale base quality value (default:20) - bases with lower
    quality will be ignored

  -B <string>,  --bimpfile <string>
    PLINK format BIM file with allele frequency information at the last
    column

  -b <string>,  --bfile <string>
    Binary PLINK format genotype file prefix. Must be forward-stranded

  -l <string>,  --log <string>
    Log file - default: [out].log

  -o <string>,  --out <string>
    (required)  Prefix of output files

  -I <string>,  --index <string>
    Index of input BAM file - [inputBam].bai will be default value

  -s <string>,  --insuffix <string>
    Suffix of input BAM file for multi-chromosome inputs

  -p <string>,  --inprefix <string>
    Prefix of input BAM file for multi-chromosome inputs

  -i <string>,  --in <string>
    Input BAM file. Must be sorted and indexed

  -r <string>,  --reference <string>
    (required)  Karma's reference sequence

  -M,  --mmap
    toggle whether to use memory map (default:true)

  -F,  --bimAF
    use the allele frequency information by loading .bimp file instead of
    .bim file

  -S,  --selfonly
    compare the genotypes with SELF (annotated sample) only (default:off)

  -n,  --noeof
    Do not check EOF marker for BAM file (default:off)

  -v,  --verbose
    Toggle verbose mode (default:off)

  --,  --ignore_rest
    Ignores the rest of the labeled arguments following this flag.

  --version
    Displays version information and exits.

  -h,  --help
    Displays usage information and exits.
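The filtering options above (-d, -m, -Q, -q) can be read together as a per-site filter. The Python sketch below illustrates that reading with the documented defaults. It is not the tool's source code: the Base class and the function name are hypothetical, and it assumes that depth is counted before any quality filtering and that the thresholds are inclusive, neither of which is stated on this page.

```python
from dataclasses import dataclass

@dataclass
class Base:
    map_q: int   # mapping quality of the read carrying this base
    base_q: int  # Phred-scale base quality

def usable_base_qualities(bases, max_depth=20, min_map_q=10, min_q=20, max_q=40):
    """Return the capped qualities of bases that pass the filters at one site."""
    if len(bases) > max_depth:
        # Assumption: a site deeper than --maxDepth is skipped entirely.
        return []
    kept = []
    for b in bases:
        if b.map_q < min_map_q:   # --minMapQ: ignore poorly mapped reads
            continue
        if b.base_q < min_q:      # --minQ: ignore low-quality bases
            continue
        kept.append(min(b.base_q, max_q))  # --maxQ: cap very high qualities
    return kept

site = [Base(60, 45), Base(5, 30), Base(30, 10), Base(20, 25)]
print(usable_base_qualities(site))

deep_site = [Base(60, 30)] * 21
print(len(usable_base_qualities(deep_site)))                 # skipped at the default depth
print(len(usable_base_qualities(deep_site, max_depth=200)))  # kept with a larger limit
```

The last two calls show why --maxDepth should be raised for deep coverage data: with the default of 20, a site covered by 21 reads contributes nothing.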

Principle of Operation

Each read group in a BAM file is evaluated independently. This means that in a file with multiple read groups, problems will be flagged at the read group level (a plus). However, it also means that it might be hard to discern the correct assignment of read groups with very little data.

For each aligned base that overlaps a known genotype, we calculate the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.
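One common way to express such a per-base probability is sketched below in Python. It is an illustrative model and may differ from the one verifyBamID implements: it assumes a biallelic marker, converts the Phred base quality to an error probability, spreads errors evenly over the three other bases, and ignores the genotyping error rate (--genoError). The function names are hypothetical.

```python
def phred_to_error(base_q):
    """Convert a Phred-scale quality to an error probability."""
    return 10 ** (-base_q / 10)

def prob_base_given_genotype(observed, ref, alt, alt_copies, base_q):
    """P(observed base | genotype with alt_copies in {0, 1, 2} alternate alleles)."""
    e = phred_to_error(base_q)
    p_alt_true = alt_copies / 2  # chance the read was sampled from an alt chromosome

    def p_obs_given_true(true_allele):
        # Correct call with probability 1 - e, otherwise one of three wrong bases.
        return 1 - e if observed == true_allele else e / 3

    return ((1 - p_alt_true) * p_obs_given_true(ref)
            + p_alt_true * p_obs_given_true(alt))

# A reference base "A" observed at quality 20 under each genotype of an A/G marker.
probs = [prob_base_given_genotype("A", "A", "G", k, 20) for k in (0, 1, 2)]
print(probs)
```

Under this model a matching homozygous genotype gives a probability close to 1, a heterozygous genotype close to 0.5, and the opposite homozygote a probability on the order of the base error rate.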

Each individual in a pedigree has a different combination of genotypes, and bamGenotypeCheck will systematically search for the individual whose genotypes best match the observed read data.
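The search for the best-matching individual can be pictured as scoring every candidate by the log-likelihood of the observed reads and keeping the highest score. The Python sketch below does this under assumed conditions: a simple biallelic error model driven by Phred base quality, genotypes coded as the number of alternate alleles per marker, and made-up sample names and data. It is an illustration, not the tool's implementation.

```python
import math

def log_lik_base(is_alt, alt_copies, base_q):
    """Log-probability of one observed base given a genotype (0, 1 or 2 alt copies)."""
    e = 10 ** (-base_q / 10)
    frac_alt = alt_copies / 2
    p_alt = frac_alt * (1 - e) + (1 - frac_alt) * e / 3
    p_ref = (1 - frac_alt) * (1 - e) + frac_alt * e / 3
    return math.log(p_alt if is_alt else p_ref)

def best_matching_individual(reads, genotypes):
    """reads: (site_index, is_alt, base_q) tuples; genotypes: name -> alt copies per site."""
    scores = {
        name: sum(log_lik_base(is_alt, g[site], q) for site, is_alt, q in reads)
        for name, g in genotypes.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical genotypes at four markers for three individuals.
genotypes = {
    "NA001": [0, 1, 2, 0],
    "NA002": [2, 1, 0, 0],
    "NA003": [1, 1, 1, 2],
}
# Hypothetical filtered bases from one read group.
reads = [(0, False, 30), (0, False, 30), (1, True, 30), (1, False, 30),
         (2, True, 30), (2, True, 30), (3, False, 30)]

best, scores = best_matching_individual(reads, genotypes)
print(best)
print(scores)
```

With very few reads the scores of different individuals can be close, which is one way to see why read groups with little data are hard to assign.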

For more technical details, see the page Verifying Sample Identities - Implementation.

TODO