Difference between revisions of "BamGenotypeCheck"

From Genome Analysis Wiki
 
(30 intermediate revisions by 4 users not shown)
Line 1: Line 1:
'''LaneCheck'''
+
{| style="width:100%; background:#FF8989; margin-top:1.2em; border:1px solid #ccc;" |
 +
| style="width:100%; text-align:center; white-space:nowrap; color:#000;" |
 +
<div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">This tool has been DEPRECATED, and replaced by [[VerifyBamID]]</div>
 +
|}
  
<br>
+
'''bamGenotypeCheck''' is a program that verifies whether the reads in a particular file match previously known genotypes for an individual (or group of individuals).
  
== Basic Usage Example ==
 
  
Here is an example of how <code>glfTrio</code> works:
+
== Download bamGenotypeCheck  ==
 +
 
 +
To get a copy go to the [http://csg.sph.umich.edu//pha/karma/download/ Karma Download] download page.
 +
 
 +
== Build bamGenotypeCheck  ==
 +
 
 +
Karma (which includes bamGenotypeCheck) is designed to be reasonably portable.
 +
 
 +
However, since development occurs only on Ubuntu 9.10 and later (x86 and x64 platforms), there are likely to be portability issues on other platforms.
 +
 
 +
We support Karma only on Ubuntu 9.10 and later on 64-bit processors.
 +
 
 +
== Usage ==
  
  lanecheck -f NA19239.chrom20.SLX.glf -m NA19238.chrom20.SLX.glf -c NA19240.chrom20.SLX.glf \
+
A key step in any genetic analysis is to verify whether data being generated matches expectations. This program checks whether reads in a BAM file match previous genotypes for a specific sample.  
        --father NA19239 --mother NA19238 --child NA19240 \
 
        --minMapQuality 30 --minTotalDepth 0 --maxTotalDepth 1000 \
 
        -b YRI.chrom20.SLX.vcf &gt; YRI.chrom20.SLX.log
 
  
== Command Line Options ==
+
Using a mathematical model that relates observed sequence reads to a hypothetical true genotype, bamGenotypeCheck tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.
  
=== Input Files ===
+
== Basic Usage Example ==
  
  -f ''genotype likelihood file''    Father's [[GLF|GLF]]-format genotype likelihood file
+
Here is a typical command line:
-m ''genotype likelihood file''    Mother's [[GLF|GLF]]-format genotype likelihood file
 
-c ''genotype likelihood file''    Child's [[GLF|GLF]]-format genotype likelihood file
 
  
=== Basic Output Options ===
+
  bamGenotypeCheck  -r /data/local/ref/karma.ref/human.g1k.v37.fa \
 +
              -k BAMfiles.txt -p test.ped -d test.dat -m test.map
  
  -b ''baseCallFile''                Specifies the name of the output [[VCF|VCF]]-format base call file
+
== Command Line Options ==
-p ''threshold''                  The threshold for base calling. Base calls will be made when their posterior likelihood exceeds ''threshold''
 
--reference                    Positions called as homozygous reference will be included in the output. 
 
--verbose                      Print debug information to the screen
 
  
=== Filtering According to Depth and Map Quality ===
+
=== Input Files ===
  
  --minMapQuality ''threshold''     Positions where the root-means squared mapping quality falls below this threshold will be excluded.
+
-r  ''genome reference in [http://en.wikipedia.org/wiki/Fasta_format simplified FASTA format]''
  --strict                      When the map quality is interpreted ''strictly'', all three trio individuals must exceed ''minMapQuality''  
+
-a  ''allele frequency file in [[MERLIN format]]''
                              before a call is made. Without the --strict option, reads for individuals below the threshold are ignored.
+
  -p  ''pedigree file in [[MERLIN format]]''
 +
-d  ''data file in [[MERLIN format]]''
 +
-m  ''map file in [[MERLIN format]]''
  
  --minTotalDepth ''threshold''           Positions where the read depth falls below this threshold will be excluded.
+
-k  ''a list of BAM files to check''
  --maxTotalDepth ''threshold''           Positions where the read depth exceeds this threshold will be excluded.
+
-c [int]  ''stop after reading [int] filtered sequence reads''
 +
  -C [int]  ''stop after reading [int] reads, filtered or not''
  
=== Sample Labels ===
+
=== Output Options ===
  
  --father ''fatherLabel''          Specifies a label for the male parent, which will be included in the output VCF file
+
-v ''verbose output''
  --mother ''motherLabel''          Specifies a label for the female parent, which will be included in the output VCF file
 
--child ''childLabel''             Specifies a label for the child, which will be included in the output VCF file
 
  
=== X Chromosome Variant Calling ===
+
=== Filtering ===
  
   --xChr ''chromosomeName''          Label for the 'X' chromosome in the GLF file
+
-b [int]   ''exclude bases with quality less than [int]''
  --xStart ''sexChromosomeStart''   Start of the non-pseudo-autosomal portion of the X (2,709,521 bp in build 36)
+
  -M [int]  ''exclude reads with map quality less than [int]''
  --xStop ''sexChromosomeEnd''       End of the non-pseudo-autosomal portion of the X (154,584,237 bp in build 36)
+
  -f [float] ''drop markers with minor allele frequency smaller than [float]''
 +
-F [int]  ''set custom BAM flags filter (not implemented at the moment)''
  
Principle of operation:
+
=== Other Options ===
  
The overall procedure is that the genotype identity checking program compares internal evidence from the sequence reads themselves to reference genotype information for a panel of candidate individuals. In the case of 1000 Genomes pilot data, these are HapMap genotypes from the same Coriell cell lines that are being sequenced. For each combination of [sequencing run x candidate individual] the program calculates the observed rate of mismatches at both "informative" and "background" locations and reports as "excess mismatch rate"
+
-e [float] ''set the minimum base error rate to [float]''
  
            excess rate  = (informative rate  -  background rate).
+
== Principle of Operation ==
  
"Informative" locations are those where the candidate individual is homozygous, according to the HapMap genotype information, and base calls are compared to the HapMap homozygous allele, rather than to the genome reference sequence. "Background" locations are all sites not known to be polymorphic and not recorded in dbSNP.
+
Each read group in a BAM file is evaluated independently. This means that in a file with multiple read groups, problems are flagged at the read group level, which is an advantage. However, it also means that a read group with very little data may be hard to assign to the correct individual.
  
abc
+
For each aligned base that overlaps a known genotype, we calculate the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.
  
<br>
+
Each individual in a pedigree has a different combination of genotypes, and bamGenotypeCheck will systematically search for the individual whose genotypes best match the observed read data.
  
== '''TODO''' ==
+
For more about the technical details, see the page [[Verifying Sample Identities - Implementation]]
Frequently, we will want to run lane checking on a mapped .bam file which already contains sequence data from many different instrument runs merged together. They are merged because the sequencing center said they all belong to the same individual. In Pilot 3, this was true for all of the Baylor LS 454 sequencing data. In this case, the read identifier in column 1 of the .sam file carries information about which sequencing run each read belongs to, as well as information that uniquely identifies that read within its run. The read identifiers often are dot or colon-separated strings of the form 'run_name&lt;sep&gt;read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alpha-numeric internal run identifier.
 
  
The "Read group classifier" is an extended regular expression such as '\(^[^.:]+\)[.:].*' which matches just the part of each read identifier that is common to all reads from one instrument run and which differs between instrument runs. The regular expression is passed into the lane checking program as an ascii string. The program keeps track of all distinct values it has seen for the matched portion, and must keep a separate tally of matches and mismatches for each combination of [read group x candidate individual]. By itself, the matched portion of each read identifier does not fully specify which original .fastq file a read came from. The 'bitwise flag' value in column 2 of the .sam format has the remaining information. This is able to distinguish between the 'left end', 'right end' and 'single end' reads which come from each Illumina paired-end sequencing run. The Baylor LS 454 data were all single end reads, so I did not have to deal with this complication.<br>
+
== TODO ==

Latest revision as of 11:11, 2 February 2017

This tool has been DEPRECATED, and replaced by VerifyBamID

bamGenotypeCheck is a program that verifies whether the reads in a particular file match previously known genotypes for an individual (or group of individuals).


Download bamGenotypeCheck

To get a copy, go to the Karma Download page.

Build bamGenotypeCheck

Karma (which includes bamGenotypeCheck) is designed to be reasonably portable.

However, since development occurs only on Ubuntu 9.10 and later (x86 and x64 platforms), there are likely to be portability issues on other platforms.

We support Karma only on Ubuntu 9.10 and later on 64-bit processors.

Usage

A key step in any genetic analysis is to verify whether the data being generated matches expectations. This program checks whether reads in a BAM file match previously known genotypes for a specific sample.

Using a mathematical model that relates observed sequence reads to a hypothetical true genotype, bamGenotypeCheck tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.

Basic Usage Example

Here is a typical command line:

  bamGenotypeCheck  -r /data/local/ref/karma.ref/human.g1k.v37.fa \
             -k BAMfiles.txt -p test.ped -d test.dat -m test.map
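
When many samples are checked, it can be convenient to assemble this command line programmatically. The following is a minimal sketch that only builds the argument list shown above and prints it; it does not run the tool. The helper name `build_command` is hypothetical, and the sketch assumes the flags are passed exactly as in the example.

```python
import shlex


def build_command(reference, bam_list, ped, dat, map_file, allele_freq=None):
    """Assemble a bamGenotypeCheck argument list from the documented input flags."""
    args = [
        "bamGenotypeCheck",
        "-r", reference,   # genome reference (FASTA)
        "-k", bam_list,    # list of BAM files to check
        "-p", ped,         # MERLIN pedigree file
        "-d", dat,         # MERLIN data file
        "-m", map_file,    # MERLIN map file
    ]
    if allele_freq is not None:
        args += ["-a", allele_freq]  # optional MERLIN allele frequency file
    return args


cmd = build_command(
    "/data/local/ref/karma.ref/human.g1k.v37.fa",
    "BAMfiles.txt", "test.ped", "test.dat", "test.map",
)
print(shlex.join(cmd))
```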

Command Line Options

Input Files

-r  genome reference in simplified FASTA format
-a  allele frequency file in MERLIN format
-p  pedigree file in MERLIN format
-d  data file in MERLIN format
-m  map file in MERLIN format
-k  a list of BAM files to check
-c [int]  stop after reading [int] filtered sequence reads
-C [int]  stop after reading [int] reads, filtered or not

Output Options

-v  verbose output

Filtering

-b [int]   exclude bases with quality less than [int]
-M [int]   exclude reads with map quality less than [int]
-f [float] drop markers with minor allele frequency smaller than [float]
-F [int]   set custom BAM flags filter (not implemented at the moment)
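
The filtering thresholds above can be read as simple predicates. The sketch below is an illustration only, not the tool's code: the `Filters` class, the function names, and the default values are all hypothetical (the tool's actual defaults are not documented here).

```python
from dataclasses import dataclass


@dataclass
class Filters:
    # Default values are illustrative assumptions, not the tool's defaults.
    min_base_quality: int = 20   # corresponds to -b
    min_map_quality: int = 30    # corresponds to -M
    min_maf: float = 0.01        # corresponds to -f


def keep_base(base_quality, map_quality, filters):
    """Keep a base only if it meets both the base and map quality thresholds."""
    return (base_quality >= filters.min_base_quality
            and map_quality >= filters.min_map_quality)


def keep_marker(allele_freq, filters):
    """Keep a marker only if its minor allele frequency is not below the threshold."""
    maf = min(allele_freq, 1.0 - allele_freq)
    return maf >= filters.min_maf
```

A base with quality 19 would be dropped under `-b 20`, and a marker whose allele frequency is 0.999 (minor allele frequency 0.001) would be dropped under `-f 0.01`.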

Other Options

-e [float]  set the minimum base error rate to [float]

Principle of Operation

Each read group in a BAM file is evaluated independently. This means that in a file with multiple read groups, problems are flagged at the read group level, which is an advantage. However, it also means that a read group with very little data may be hard to assign to the correct individual.
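
Evaluating read groups independently amounts to partitioning the observed bases by read group before any comparison is made. A minimal sketch follows, assuming each observation is a hypothetical tuple of (read group, marker, base, base quality); the function name and tuple layout are not taken from the tool.

```python
from collections import defaultdict


def split_by_read_group(observations):
    """Partition (read_group, marker, base, quality) tuples by read group."""
    groups = defaultdict(list)
    for read_group, marker, base, quality in observations:
        groups[read_group].append((marker, base, quality))
    return dict(groups)


observations = [
    ("rg1", "rs1", "A", 30),
    ("rg2", "rs1", "G", 25),
    ("rg1", "rs2", "C", 35),
]
per_group = split_by_read_group(observations)
```

Each value in `per_group` can then be scored against the known genotypes separately, so a problem in one read group does not hide behind clean data in another.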

For each aligned base that overlaps a known genotype, we calculate the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.
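
One common way to compute such a per-base probability is a symmetric error model, sketched below. This is an assumption for illustration and is not confirmed to be the model bamGenotypeCheck uses. The function names are hypothetical, the error rate is derived from a Phred-scaled base quality, and the `min_error` floor is only a guess at the kind of role the -e option might play.

```python
def phred_to_error(quality, min_error=0.0):
    """Convert a Phred-scaled quality to an error probability, with an optional floor."""
    return max(10 ** (-quality / 10.0), min_error)


def base_likelihood(base, genotype, error):
    """P(observed base | genotype), assuming each allele is sampled with probability 1/2
    and an erroneous call is equally likely to be any of the other three bases."""
    total = 0.0
    for allele in genotype:
        total += 0.5 * ((1.0 - error) if base == allele else error / 3.0)
    return total


error = phred_to_error(20)              # Phred 20 corresponds to an error rate of 0.01
p_hom = base_likelihood("A", "AA", error)
p_het = base_likelihood("A", "AC", error)
p_mismatch = base_likelihood("G", "AA", error)
```

Under this model a matching base is most probable for a homozygous genotype, about half as probable for a heterozygous one, and very unlikely when the genotype carries neither copy of the observed allele.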

Each individual in a pedigree has a different combination of genotypes, and bamGenotypeCheck will systematically search for the individual whose genotypes best match the observed read data.
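
The search for the best-matching individual can be sketched as a comparison of summed log-likelihoods across candidates. The code below is a simplified illustration under the same assumed symmetric error model, not the tool's implementation; the function names, sample names, and marker names are hypothetical, and error rates must be greater than zero.

```python
import math


def log_likelihood(observations, genotypes):
    """Sum log P(base | genotype) over observations at markers with a known genotype.

    observations: list of (marker, base, error) tuples
    genotypes: dict mapping marker name to a two-allele string such as "AC"
    """
    total = 0.0
    for marker, base, error in observations:
        genotype = genotypes.get(marker)
        if genotype is None:
            continue  # no known genotype at this marker for this individual
        p = sum(0.5 * ((1.0 - error) if base == allele else error / 3.0)
                for allele in genotype)
        total += math.log(p)
    return total


def best_match(observations, panel):
    """Return the individual whose genotypes best explain the observed bases."""
    scores = {name: log_likelihood(observations, genotypes)
              for name, genotypes in panel.items()}
    return max(scores, key=scores.get), scores


panel = {
    "sample1": {"rs1": "AA", "rs2": "CC"},
    "sample2": {"rs1": "GG", "rs2": "CT"},
}
observations = [("rs1", "A", 0.01), ("rs2", "C", 0.01), ("rs1", "A", 0.01)]
best, scores = best_match(observations, panel)
```

Because markers without a known genotype are skipped, this sketch is only a fair comparison when all candidates are genotyped at the same markers.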

For more technical details, see the page Verifying Sample Identities - Implementation.

TODO