Polymutt: a tool for calling polymorphism and de novo mutations in families for sequencing data
- The program polymutt implements a likelihood-based framework for calling single nucleotide variants and detecting de novo point mutation events in families from next-generation sequencing data. It takes as input genotype likelihood format (GLF) files, which can be generated by following the Creation of GLF files instructions, and outputs the results in the [VCF] format. Variant calling and de novo mutation detection are modelled jointly within families, and both nuclear and extended pedigrees without consanguinity loops can be handled. The input is one GLF file for each family member, and the relationships are specified through the .ped file.
- The evidence for variants and de novo mutations is assessed probabilistically. For a variant, the QUAL value is calculated as -10*log10(1-posterior(Variant | Data)), and for de novo mutation events a de novo quality (DQ) value is defined as log10(lk_denovo / lk_no_denovo), where lk_denovo and lk_no_denovo are the likelihoods of the data allowing and disallowing de novo mutations, respectively. Similarly, for each genotype, a genotype quality (GQ) value is defined as -10*log10(1-posterior(Genotype | Data)).
- Since unrelated individuals are a special case of families, unrelated individuals or a mixture of related and unrelated individuals can also be handled.
- If some individuals in a family are not sequenced, set the corresponding GLF file indices to zero for those family members.
- See below for more details.
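The quality values defined above can be illustrated with a minimal Python sketch. This is not polymutt's own code: the function names and the cap of 255 (used when the posterior is exactly 1) are assumptions made for illustration only.

```python
import math


def phred_quality(posterior: float, cap: float = 255.0) -> float:
    """QUAL or GQ as defined above: -10*log10(1 - posterior).

    The cap is an assumption of this sketch; it avoids log10(0)
    when the posterior is exactly 1.
    """
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must be in [0, 1]")
    error = 1.0 - posterior
    if error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error))


def denovo_quality(lk_denovo: float, lk_no_denovo: float) -> float:
    """DQ as defined above: log10(lk_denovo / lk_no_denovo)."""
    if lk_denovo <= 0.0 or lk_no_denovo <= 0.0:
        raise ValueError("likelihoods must be positive")
    return math.log10(lk_denovo / lk_no_denovo)


if __name__ == "__main__":
    # A posterior of 0.999 corresponds to a quality of about 30.
    print(round(phred_quality(0.999), 2))
    # Data 10 times more likely with a de novo event gives DQ of about 1.
    print(round(denovo_quality(1e-10, 1e-11), 2))
```

Under this definition, a DQ of 1.0 means the data are ten times more likely when a de novo mutation is allowed than when it is not.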
Running the command without any input displays the basic usage:
The following parameters are in effect:
pedfile : test.ped (-pname)
datfile : test.dat (-dname)
glfIndexFile : test.gif (-gname)
posterior cutoff : 0.900 (-c99.999)

Additional Options
Map Quality Filter : --minMapQuality
Depth Filter : --minDepth , --maxDepth , --minPercSampleWithData [0.00]
Scaled mutation rate : --theta [1.0e-03]
Prior of ts/tv ratio : --tstv [2.00]
de novo mutation : --denovo, --rate_denovo [0.00], --tstv_denovo [2.00], --minLLR_denovo [1.00]
Optimization precision : --prec [1.0e-04]
Multiple threading : --nthreads
Output : --vcf [test.vcf], --gl_off
An example command for variant calling looks like the following:
polymutt -p in.ped -d in.dat -g glfIndexFile --vcf out.vcf --nthreads 4
An example command for de novo mutation detection is as follows:
polymutt -p in.ped -d in.dat -g glfIndexFile --denovo --rate_denovo 1.5e-08 --minLLR_denovo 1.0 --vcf out.denovo.vcf --nthreads 4
The required input files are a pedigree file (-p input.ped), a data file (-d input.dat), and a GLF index file (-g glfIndex).
- An example in.ped file looks like the following:
fam1 p1 0 0 1 1
fam1 p2 0 0 2 2
fam1 p3 p1 p2 1 3
fam2 p4 0 0 1 4
fam2 p5 0 0 2 5
fam2 p6 p4 p5 1 6
...
- An example in.dat file looks like the following. It describes the 6th column of the in.ped file above; other traits/markers can also be specified but will be ignored:
- An example glfIndex file looks like the following. Every number (except zero) in the 6th column of the in.ped file above must be present in its first column.
1 /home/me/sample1.glf
2 /home/me/sample2.glf
3 /home/me/sample3.glf
4 /home/me/sample4.glf
...
- If some members are not sequenced but are included in the pedigree because of their relatedness to other members, the GLF_Index column (6th column) in the ped file should be set to zero for them.
- For unrelated individuals, you can either (1) create a family for each unrelated individual as a founder or (2) put all unrelated individuals as founders in a single family.
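The consistency rule between the ped file and the glfIndex file can be checked before running polymutt. The following Python sketch is a hypothetical helper, not part of polymutt; it assumes whitespace-separated files laid out as in the examples above and reports every non-zero GLF index from the 6th ped column that has no entry in the glfIndex file.

```python
def read_ped_glf_indices(ped_lines):
    """Map (family, person) to the GLF index in the 6th ped column."""
    indices = {}
    for line in ped_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        indices[(fields[0], fields[1])] = int(fields[5])
    return indices


def read_glf_index(index_lines):
    """Map GLF index (first column) to GLF file path (second column)."""
    paths = {}
    for line in index_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        paths[int(fields[0])] = fields[1]
    return paths


def missing_glf_indices(ped_lines, index_lines):
    """Return sorted non-zero ped indices absent from the glfIndex file.

    An index of zero marks an unsequenced member and needs no GLF file.
    """
    ped = read_ped_glf_indices(ped_lines)
    known = read_glf_index(index_lines)
    return sorted({i for i in ped.values() if i != 0 and i not in known})


if __name__ == "__main__":
    # A trio whose father (p1) is not sequenced, so his index is 0.
    ped = [
        "fam1 p1 0 0 1 0",
        "fam1 p2 0 0 2 2",
        "fam1 p3 p1 p2 1 3",
    ]
    glf_index = ["2 /home/me/sample2.glf"]
    # Index 3 is used in the ped file but has no GLF file listed.
    print(missing_glf_indices(ped, glf_index))
```

An empty result means every sequenced member has a GLF file listed; members with index zero are ignored by the check.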
Some of the command line options are explained below; the others are self-explanatory.
-c : minimum cutoff of posterior probability to output a variant [Default: 0.5]
--theta : scaled mutation rate per site [Default: 0.001]
--tstv : prior of ts:tv ratio [Default: 2.0]
--nthreads : number of threads to run; 4 threads are recommended for a small number of input files [Default: 1]
--denovo : a boolean flag to turn on de novo mutation detection. The following options take effect only when this flag is ON
--rate_denovo : mutation rate per haplotype per generation [Default: 1.5e-08]
--tstv_denovo : the prior ts/tv ratio of de novo mutations [Default: 2.0]
--minLLR_denovo : minimum value of the log10 likelihood ratio of allowing vs. disallowing de novo mutations in the data to output [Default: 1.0]
--gl_off : do not output genotype likelihood values for each individual. The default is to output 3 GLs for polymorphisms and 10 GLs for de novo mutations
- The output file is a VCF file and the specification can be found [here]
- Since there is no standard to represent de novo mutations in the current VCF specification, actual genotypes (e.g. [ACGT]/[ACGT]) are output in the VCF file for de novo mutations.
- A summary of variant calling statistics is written to STDOUT and may be redirected to a file for the record.
Summary of reference -- 9
Total Entry Count: 141213431
Total Base Cout: 120124735
Total '0' Base Count: 137
Non-Polymorphic Count: 655457
Transition Count: 6556
Transversion Count: 3127
Other Polymorphism Count: 0
Filter counts:
minMapQual 4550
minTotalDepth 1089
maxTotalDepth 736
Hard to call: 0
Skipped bases: 134
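If the summary is redirected to a file, the observed transition/transversion ratio can be read back from it as a quick sanity check. The Python sketch below is a hypothetical helper, not part of polymutt; it assumes the "Transition Count" and "Transversion Count" labels appear exactly as in the summary above.

```python
import re


def tstv_ratio(summary_text: str) -> float:
    """Compute transitions / transversions from a polymutt summary."""
    ts = re.search(r"Transition Count:\s*(\d+)", summary_text)
    tv = re.search(r"Transversion Count:\s*(\d+)", summary_text)
    if ts is None or tv is None:
        raise ValueError("summary lacks transition/transversion counts")
    tv_count = int(tv.group(1))
    if tv_count == 0:
        raise ValueError("transversion count is zero")
    return int(ts.group(1)) / tv_count


if __name__ == "__main__":
    # Counts taken from the example summary above.
    summary = "Transition Count: 6556\nTransversion Count: 3127\n"
    print(round(tstv_ratio(summary), 2))  # prints 2.1
```

For the example summary the ratio is 6556 / 3127, or about 2.1.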
Creation of GLF files
- The current version performs variant calling and de novo mutation detection from files in the genotype likelihood format (GLF). In future versions we plan to take [SAM/BAM] files as input. See the following for instructions on how to create GLF files.
samtools view -bh chr1.bam 1:0 | samtools calmd -Abr - human.v37.fa 2> /dev/null | samtools pileup - -g -f human.v37.fa > chr1.bam.glf
- For other functionalities please refer to the [samtools] website.
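When GLF files are needed for several BAM files, the samtools pipeline shown above can be generated from a template. The Python sketch below only builds the command string and does not run samtools; the function name and its parameters are assumptions made for illustration, and the output name follows the chr1.bam.glf pattern of the example.

```python
import shlex


def glf_command(bam: str, region: str, reference: str) -> str:
    """Build the samtools pipeline from the example above as a string."""
    bam_q = shlex.quote(bam)
    region_q = shlex.quote(region)
    ref_q = shlex.quote(reference)
    out_q = shlex.quote(bam + ".glf")  # output named <bam>.glf
    return (
        f"samtools view -bh {bam_q} {region_q} | "
        f"samtools calmd -Abr - {ref_q} 2> /dev/null | "
        f"samtools pileup - -g -f {ref_q} > {out_q}"
    )


if __name__ == "__main__":
    # Reproduces the command shown above for chr1.bam.
    print(glf_command("chr1.bam", "1:0", "human.v37.fa"))
```

Quoting each argument with shlex.quote keeps the generated command safe to paste into a shell when file names contain spaces.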
Source code with test files can be downloaded here.