GlfSingle

From Genome Analysis Wiki
Revision as of 10:34, 26 September 2013 by Yancylo

glfSingle is a GLF-based variant caller for next-generation sequencing data. It takes a GLF-format genotype likelihood file as input and writes a set of variant calls in VCF format as output.

Basic Usage Example

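The following command reads genotype likelihoods from NA19240.chrom20.SLX.glf, writes variant calls to NA19240.chrom20.SLX.vcf, and redirects the program's console output to a log file: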

  glfSingle -g NA19240.chrom20.SLX.glf -b NA19240.chrom20.SLX.vcf > NA19240.chrom20.SLX.log

Command Line Options

 -g genotype likelihood file    Specifies the name of the input GLF-format genotype likelihood file
 -b base call file              Specifies the name of the output VCF-format base call file
 -s sample label                Specifies a label for the sample being analyzed, which will be included in the output VCF file
 -p threshold                   The threshold for base calling. Base calls will be made when their posterior probability exceeds this threshold.
 --minMapQuality threshold      Positions where the root-mean-square mapping quality falls below this threshold will be excluded.
 --minDepth      threshold      Positions where the read depth falls below this threshold will be excluded.
 --maxDepth      threshold      Positions where the read depth exceeds this threshold will be excluded.
 --reference                    Positions called as homozygous reference will be included in the output.

To learn about default values for these options, simply run the program with no arguments.
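
The three site filters above can be sketched in a few lines of Python. This is a minimal illustration of the documented behaviour, not the glfSingle source: the function names are hypothetical, and the thresholds must be passed in explicitly because the default values are not listed on this page.

```python
import math


def rms_mapping_quality(map_quals):
    """Root-mean-square of the mapping qualities of the reads covering a site."""
    if not map_quals:
        return 0.0
    return math.sqrt(sum(q * q for q in map_quals) / len(map_quals))


def passes_filters(depth, rms_mapq, min_depth, max_depth, min_map_quality):
    """Return True if a site survives all three filters.

    Mirrors the option descriptions: a site is excluded when its RMS mapping
    quality or depth falls below the minimum, or when its depth exceeds the
    maximum. Values exactly equal to a threshold are kept.
    """
    if rms_mapq < min_map_quality:   # --minMapQuality
        return False
    if depth < min_depth:            # --minDepth
        return False
    if depth > max_depth:            # --maxDepth
        return False
    return True
```

A site would then be passed to the caller only if passes_filters(...) returns True for its depth and RMS mapping quality.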

Model for Variant Calling

glfSingle uses a likelihood-based model for variant calling. It starts from the genotype likelihoods Pr(reads | genotype) at each genomic position, computed by an appropriate tool (e.g. Samtools with BAQ), and combines them with an individual-based prior Pr(genotype) to obtain posterior probabilities Pr(genotype | reads).
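
The combination of likelihood and prior is an application of Bayes' rule, which can be sketched as follows. The function names are hypothetical, and the sketch assumes the likelihoods have already been converted to plain probabilities; it is an illustration, not the glfSingle implementation.

```python
def posterior_probabilities(likelihoods, prior):
    """Bayes' rule over a set of candidate genotypes.

    likelihoods: dict genotype -> Pr(reads | genotype)
    prior:       dict genotype -> Pr(genotype)
    Returns a dict genotype -> Pr(genotype | reads) that sums to 1.
    """
    joint = {g: likelihoods[g] * prior[g] for g in likelihoods}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("all genotypes have zero joint probability")
    return {g: p / total for g, p in joint.items()}


def call_genotype(posteriors, threshold):
    """Return the most probable genotype if its posterior exceeds the
    threshold (the role played by the -p option), otherwise None."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] > threshold else None
```

With a strict threshold, sites whose best genotype is only weakly supported yield no call at all rather than an unreliable one.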

The prior is built from the following ingredients:

  • All sites have an equal probability of showing polymorphism:
    • P(non-reference base) = 0.001
  • When a site shows polymorphism, it is usually heterozygous:
    • P(non-reference heterozygote) = 0.001 * 2/3
    • P(non-reference homozygote) = 0.001 * 1/3
  • Mutation model: transitions (C <-> T or A <-> G) account for most variants, while transversions account for a minority of variants
    • the transition has probability 2/3
    • each of the two transversions has probability 1/6
  • New implementation: an alternative mutation model with a uniform (uninformative) prior on the transition-to-transversion ratio
    • updated by Yancy Lo, 9/24/2012
    • each of the three possible mutations has probability 1/3
    • add --uniformTsTv to the command line to enable this alternative mutation model
    • glfSingle with this new implementation can be downloaded here: File:Generic-glfSingle-2013-09-25.tar.gz
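
The ingredients listed above can be assembled into a genotype prior along the following lines. This is a sketch under stated assumptions rather than the glfSingle code: the function and parameter names are hypothetical, the overall non-reference probability p_variant is assumed to be split 2/3 to 1/3 between heterozygotes and homozygotes, and genotypes carrying two different non-reference alleles are left out because this page does not say how they are handled.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def mutation_probability(ref, alt, uniform_ts_tv=False):
    """Probability that a variant at a site with reference base `ref`
    carries the alternate base `alt`."""
    if uniform_ts_tv:
        # Alternative model (--uniformTsTv): all three mutations equally likely.
        return 1.0 / 3.0
    # Default model: 2/3 for the transition, 1/6 for each transversion.
    return 2.0 / 3.0 if (ref, alt) in TRANSITIONS else 1.0 / 6.0


def genotype_prior(ref, p_variant=0.001, het_fraction=2.0 / 3.0,
                   uniform_ts_tv=False):
    """Prior over genotypes, keyed by (allele1, allele2).

    Assumption: p_variant is shared between ref/alt heterozygotes
    (het_fraction) and alt/alt homozygotes (1 - het_fraction).
    Genotypes with two different non-reference alleles are omitted.
    """
    prior = {(ref, ref): 1.0 - p_variant}
    for alt in BASES:
        if alt == ref:
            continue
        m = mutation_probability(ref, alt, uniform_ts_tv)
        prior[(ref, alt)] = p_variant * het_fraction * m
        prior[(alt, alt)] = p_variant * (1.0 - het_fraction) * m
    return prior
```

Because the three mutation probabilities sum to 1 under either model, the resulting prior sums to 1 for any choice of p_variant and het_fraction.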

Download

For the current version of glfSingle, please go to our GLF Tools Website.

TODO

Support for X chromosome variant calling.

Support for a two-pass depth filter that uses the data to automatically work out appropriate filtering thresholds.