GlfMultiples

From Genome Analysis Wiki

glfMultiples is a GLF-based variant caller for next-generation sequencing data. It takes a set of GLF format genotype likelihood files as input and generates a VCF-format set of variant calls as output.

Basic Usage Example

In a typical command line, the options controlling variant calling appear first, followed by a list of GLF-format likelihood files. Here is an example invocation, which writes calls to YRI.SLX.vcf and redirects the program's standard output to a log file:

  glfMultiples --minMapQuality 30 --minDepth 60 --maxDepth 240 -b YRI.SLX.vcf YRI/NA*.SLX.glf > YRI.SLX.log

Command Line Options

Basic Output Options

 -b baseCallFile                Specifies the name of the output VCF-format base call file
 -p threshold                   The threshold for base calling. Base calls will be made when their posterior probability exceeds this threshold

Filtering According to Depth and Map Quality

 --minMapQuality threshold      Positions where the root-mean-square mapping quality falls below this threshold will be excluded.
 --strict                       When the map quality is interpreted strictly, every individual must meet minMapQuality
                                before a call is made. Without the --strict option, reads from individuals below the threshold are ignored.
 --minDepth threshold           Positions where the read depth falls below this threshold will be excluded.
 --maxDepth threshold           Positions where the read depth exceeds this threshold will be excluded.
 --hardFilter                   Filtered positions will be completely absent from output. The default is to use a soft filter, where these
                                positions are included in output but annotated as failing specific filters.
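
With the default soft filter, filtered positions stay in the output and can be tallied afterwards. The following is a minimal sketch, not part of glfMultiples, that counts records by the value of the FILTER column (the seventh tab-separated column in VCF, where PASS marks an unfiltered record). The function name count_filters is hypothetical, and the filter labels glfMultiples actually writes are not assumed here.

```python
from collections import Counter


def count_filters(vcf_lines):
    """Count how many VCF records carry each FILTER value."""
    counts = Counter()
    for line in vcf_lines:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1  # FILTER is the 7th VCF column
    return counts
```

For the example above, the function could be applied to an open file handle for YRI.SLX.vcf. With --hardFilter, filtered positions are absent from the file, so the tally would not be expected to include them.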

VCF Output

 --glfAliases filename          By default, GLF filenames are used to label each column in the VCF file. This option allows each filename
                                to be matched to a more specific individual identifier. The aliases file should include two columns per row,
                                the first specifying the GLF filename, the second specifying a sample name.
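
To illustrate the two-column layout of an aliases file, here is a small sketch that reads such rows and derives column labels. It is not the program's own parser. It assumes the columns are separated by whitespace and that a file without an alias keeps its filename as the label. The function names and the sample identifiers in the usage note are hypothetical.

```python
def read_aliases(lines):
    """Map each filename (column 1) to a sample name (column 2)."""
    aliases = {}
    for line in lines:
        parts = line.split()  # assumption: whitespace-separated columns
        if len(parts) < 2:
            continue  # ignore blank or incomplete rows
        aliases[parts[0]] = parts[1]
    return aliases


def column_labels(glf_files, aliases):
    """Label each column with its alias, falling back to the filename."""
    return [aliases.get(name, name) for name in glf_files]
```

For example, a row reading "YRI/NA18501.SLX.glf NA18501" would label that file's column NA18501, while a file with no matching row would keep its filename as the label.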

How It Works

For each position, glfMultiples considers a series of potential polymorphisms. These include transitions and transversions from the reference base, as well as bi-allelic polymorphisms where neither of the alleles present in the sample is the reference base. For each potential polymorphism type, the likelihood of the observed bases is maximized with respect to allele frequency. The decision about which sites are polymorphic takes into account both the maximized likelihood and an overall prior for each type of polymorphism (for example, transitions are assumed to account for ~2/3 of all variants).
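
The idea of maximizing the likelihood over allele frequency and then weighing it against a prior can be sketched as follows. This is an illustration under stated assumptions, not the glfMultiples implementation. It assumes Hardy-Weinberg genotype proportions, a simple grid search over frequency, strictly positive genotype likelihoods, and a single hypothetical prior_variant value for the polymorphism type being tested. All function names are hypothetical.

```python
import math


def site_log_likelihood(freq, genotype_likelihoods):
    """Log-likelihood of one bi-allelic site at alternate-allele frequency freq.

    genotype_likelihoods holds one (hom_ref, het, hom_alt) tuple per individual.
    Genotype weights follow Hardy-Weinberg proportions (assumption of this sketch).
    """
    p, q = 1.0 - freq, freq
    weights = (p * p, 2.0 * p * q, q * q)
    total = 0.0
    for likes in genotype_likelihoods:
        total += math.log(sum(w * l for w, l in zip(weights, likes)))
    return total


def maximize_frequency(genotype_likelihoods, steps=1000):
    """Grid-search the frequency in (0, 1) with the highest log-likelihood."""
    best_freq, best_ll = None, -math.inf
    for i in range(1, steps):
        freq = i / steps
        ll = site_log_likelihood(freq, genotype_likelihoods)
        if ll > best_ll:
            best_freq, best_ll = freq, ll
    return best_freq, best_ll


def log_posterior_odds(genotype_likelihoods, prior_variant):
    """Log odds of 'polymorphic' versus 'everyone homozygous reference'."""
    _, ll_variant = maximize_frequency(genotype_likelihoods)
    ll_reference = sum(math.log(likes[0]) for likes in genotype_likelihoods)
    return (ll_variant + math.log(prior_variant)) - (
        ll_reference + math.log(1.0 - prior_variant)
    )
```

In this sketch, a prior favoring transitions could be expressed by passing a larger prior_variant when the candidate polymorphism is a transition than when it is a transversion. A positive return value means the site looks polymorphic under these assumptions.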

glfMultiples works with log-likelihoods internally to avoid underflows in samples that may include hundreds or thousands of individuals.
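
The underflow problem is easy to reproduce. In the sketch below, the per-individual likelihood value and the sample size are hypothetical numbers chosen only to show the effect: multiplying the likelihoods directly collapses to zero in double precision, while summing their logarithms stays finite.

```python
import math


def direct_product(likelihoods):
    """Multiply likelihoods directly; underflows to 0.0 for large samples."""
    result = 1.0
    for value in likelihoods:
        result *= value
    return result


def summed_logs(likelihoods):
    """Add log-likelihoods instead; remains finite for the same input."""
    return sum(math.log(value) for value in likelihoods)


# Hypothetical data: 2000 individuals, each with likelihood 1e-5.
per_individual = [1e-5] * 2000
print(direct_product(per_individual))  # 0.0, since 1e-10000 is not representable
print(summed_logs(per_individual))     # about -23025.85
```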

Download

The current version is available for download from http://www.sph.umich.edu/csg/abecasis/downloads/generic-glfMultiples-2010-06-16.tar.gz.

TODO

Support for X chromosome variant calling.

Support for two-pass depth filter that looks at the data to work out appropriate thresholds for shallow and deep coverage.