Polymutt: a tool for calling polymorphism and de novo mutations in families for sequencing data

From Genome Analysis Wiki

Introduction

  • The program polymutt implements a likelihood-based framework for calling single nucleotide variants and detecting de novo point mutation events in families from next-generation sequencing data. It takes as input genotype likelihood format (GLF) files, which can be generated by following the Creation of GLF files instructions, and outputs the results in the [VCF] format. Variant calling and de novo mutation detection are modelled jointly within families, and both nuclear and extended pedigrees without consanguinity loops can be handled. The input is one GLF file per sequenced family member, and the relationships are specified through the .ped file.
  • The evidence for variants and de novo mutations is assessed probabilistically. For a variant, the QUAL value is calculated as -10*log10(1-posterior(Variant | Data)). For de novo mutation events, a de novo quality (DQ) value is defined as log10(lk_denovo / lk_no_denovo), where lk_denovo and lk_no_denovo are the likelihoods of the data allowing and disallowing de novo mutations, respectively. Similarly, for each genotype, a genotype quality (GQ) value is defined as -10*log10(1-posterior(Genotype | Data)).
  • Since unrelated individuals are a special case of families, unrelated individuals or a mixture of related and unrelated individuals can also be handled.
  • If some individuals in a family are not sequenced, set the GLF file index to zero for those family members.
  • See below for more details.
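
The quality definitions above can be illustrated with a short Python sketch. This is not polymutt's code; the function names and the cap of 255 (used only to avoid an infinite value when the posterior is exactly 1) are assumptions made for illustration.

```python
import math

def phred_quality(posterior, cap=255.0):
    """QUAL or GQ as -10*log10(1 - posterior).

    The cap is an assumption of this sketch, not a documented polymutt value;
    it only guards against log10(0) when the posterior equals 1.
    """
    error = 1.0 - posterior
    if error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error))

def de_novo_quality(lk_denovo, lk_no_denovo):
    """DQ as log10(lk_denovo / lk_no_denovo), computed as a difference of logs."""
    return math.log10(lk_denovo) - math.log10(lk_no_denovo)

# A posterior of 0.999 corresponds to a quality of about 30.
print(phred_quality(0.999))
# Data that are 1000 times more likely with a de novo event give DQ of about 3.
print(de_novo_quality(1e-20, 1e-23))
```

Under these definitions, a DQ of 1 means the data are 10 times more likely when a de novo mutation is allowed than when it is not.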

Usage

Running the command without any arguments displays the basic usage:

polymutt
The following parameters are in effect:
                      pedfile :        test.ped (-pname)
                      datfile :        test.dat (-dname)
                 glfIndexFile :        test.gif (-gname)
             posterior cutoff :           0.900 (-c99.999)
Additional Options
      Map Quality Filter : --minMapQuality
            Depth Filter : --minDepth [150], --maxDepth [200],
                           --minPercSampleWithData [0.00]
    Scaled mutation rate : --theta [1.0e-03]
    Prior of ts/tv ratio : --tstv [2.00]
        de novo mutation : --denovo, --denovo_rate [0.00],
                           --denovo_tstv [0.50], --denovo_min_LLR [1.00]
  Optimization precision : --prec [1.0e-04]
      Multiple threading : --nthreads [4]
                  Output : --vcf [test.vcf], --gl_off


An example command for variant calling looks like the following:

polymutt -p in.ped -d in.dat -g glfIndexFile --vcf out.vcf --nthreads 4

An example command for de novo mutation detection is as follows:

polymutt -p in.ped -d in.dat -g glfIndexFile --denovo --rate_denovo 1.5e-08 --min_denovo_LLR 1.0 --vcf out.denovo.vcf --nthreads 4
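
When polymutt is driven from a script, the argument list can be assembled programmatically. The following is a minimal sketch in which the helper name polymutt_command is hypothetical and the flags are copied from the two example commands above; it only builds the argument list and does not run the program.

```python
def polymutt_command(ped, dat, glf_index, vcf, nthreads=4,
                     denovo=False, rate_denovo=1.5e-08, min_denovo_llr=1.0):
    """Build an argument list mirroring the example commands in this page."""
    args = ["polymutt", "-p", ped, "-d", dat, "-g", glf_index]
    if denovo:
        # De novo options are only added when detection is requested.
        args += ["--denovo",
                 "--rate_denovo", str(rate_denovo),
                 "--min_denovo_LLR", str(min_denovo_llr)]
    args += ["--vcf", vcf, "--nthreads", str(nthreads)]
    return args

# Variant calling only.
print(" ".join(polymutt_command("in.ped", "in.dat", "glfIndexFile", "out.vcf")))
# De novo mutation detection.
print(" ".join(polymutt_command("in.ped", "in.dat", "glfIndexFile",
                                "out.denovo.vcf", denovo=True)))
```

The returned list can be passed to subprocess.run(args, check=True) on a machine where the polymutt binary is on the PATH.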

Input files

The required input files are a pedigree file (-p input.ped), a data file (-d input.dat) and a GLF index file (-g glfIndex).

  • An example in.ped file looks like the following:
fam1 p1  0  0   1  1
fam1 p2  0  0   2  2
fam1 p3  p1 p2  1  3
fam2 p4  0  0   1  4
fam2 p5  0  0   2  5
fam2 p6  p4 p5  1  6
...
  • An example in.dat file looks like the following. It describes the 6th column of the ped file above; other traits or markers can also be specified but will be ignored:
T GLF_Index
  • An example glfIndex file looks like the following. Every number (except zero) in the 6th column of the in.ped file above must be present in its first column:
1  /home/me/sample1.glf
2  /home/me/sample2.glf
3  /home/me/sample3.glf
4  /home/me/sample4.glf
...
  • If some members are not sequenced but are in the pedigree because of their relatedness to other members, the GLF_Index column (6th column) in the ped file should be set to zero for them.
  • For unrelated individuals, you can either (1) create a family for each unrelated individual as a founder or (2) put all unrelated individuals as founders in a single family.
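
The consistency rule between the ped file and the GLF index file can be checked before running polymutt. The sketch below is an illustration only: the function name is hypothetical, and it assumes whitespace-separated files in which the GLF index is the 6th ped column, as in the examples above.

```python
def missing_glf_indices(ped_text, index_text):
    """Return (family, member, index) for nonzero GLF indices that appear in
    the ped text but not in the first column of the GLF index text."""
    known = set()
    for line in index_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            known.add(fields[0])
    missing = []
    for line in ped_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        glf_index = fields[5]
        # Zero marks a member who was not sequenced, so it needs no GLF file.
        if glf_index != "0" and glf_index not in known:
            missing.append((fields[0], fields[1], glf_index))
    return missing

ped = """\
fam1 p1  0  0   1  1
fam1 p2  0  0   2  0
fam1 p3  p1 p2  1  3
"""
index = """\
1  /home/me/sample1.glf
2  /home/me/sample2.glf
"""
print(missing_glf_indices(ped, index))  # [('fam1', 'p3', '3')]
```

Here p2 is treated as not sequenced (index 0), while p3 is reported because index 3 has no entry in the index file.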

Other options

Some of the command-line options are explained below; the others are self-explanatory.

-c : minimum cutoff of posterior probability to output a variant [Default: 0.5]
--theta : scaled mutation rate per site [Default: 0.001]
--tstv : prior of ts:tv ratio [Default: 2.0]
--nthreads : number of threads to run; 4 threads are recommended for a small number of input files [Default: 1]
--denovo : a boolean flag to turn on de novo mutation detection. The following options take effect only when this flag is ON
--rate_denovo : mutation rate per haplotype per generation. [Default: 1e-08]
--tstv_denovo : the prior ts/tv ratio of de novo mutations. [Default: 2.0]
--min_denovo_LLR : minimum log10 likelihood ratio of allowing vs. disallowing de novo mutations required for a de novo event to be output [Default: 1]

Output files

  • The output file is a VCF file; the specification can be found [here].
  • Since there is no standard way to represent de novo mutations in the current VCF specification, actual genotypes (e.g. [ACGT]/[ACGT]) are output in the VCF file for de novo mutations.
  • A summary of variant calling statistics is written to STDOUT and can be redirected to a file to keep a record. A sample summary looks like the following:
Summary of reference -- 9
Total Entry Count: 141213431 
Total Base Cout: 120124735
Total '0' Base Count:       137
Non-Polymorphic Count:   655457
Transition Count:      6556
Transversion Count:      3127
Other Polymorphism Count:         0
Filter counts:
       minMapQual 4550
       minTotalDepth 1089
       maxTotalDepth 736
Hard to call:         0
Skipped bases: 134
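
If the summary is redirected to a file, the counts can be extracted for quick checks such as the observed ts/tv ratio. The following sketch assumes the "label: number" layout of the sample summary above; the function name is hypothetical, and lines without that layout (for example the filter counts) are ignored.

```python
import re

def summary_counts(text):
    """Collect 'label: integer' lines from a polymutt STDOUT summary."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

# A few lines copied from the sample summary above.
sample = """\
Non-Polymorphic Count:   655457
Transition Count:      6556
Transversion Count:      3127
Other Polymorphism Count:         0
"""
counts = summary_counts(sample)
tstv = counts["Transition Count"] / counts["Transversion Count"]
print(round(tstv, 3))  # 2.097
```

For the sample summary, the observed ratio is close to the default prior ts/tv ratio of 2.0.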

Creation of GLF files

  • The current version performs variant calling and de novo mutation detection from files in the genotype likelihood format (GLF). In future versions we plan to take [SAM/BAM] files as input. See the following for instructions on how to create GLF files.
    • Download a modified version of samtools ( [samtools-hybrid] )
    • Prepare the reference genome in fasta format and sequence alignments in [SAM/BAM] format
    • Generate BAQ-adjusted GLF files using the following command:
samtools view -bh chr1.bam 1:0 | samtools calmd -Abr - human.v37.fa 2> /dev/null | samtools pileup - -g -f human.v37.fa > chr1.bam.glf
  • For other functionalities please refer to the [samtools] website.
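
When many BAM files need to be converted, the pipeline above can be generated per file. The sketch below only builds the command string and does not run samtools; the function name is hypothetical, the region default is copied from the example command, and the output name (the BAM name plus .glf) follows the example.

```python
import shlex

def glf_pipeline(bam, reference, region="1:0"):
    """Return the shell pipeline from the example above for one BAM file."""
    bam_q = shlex.quote(bam)
    ref_q = shlex.quote(reference)
    out_q = shlex.quote(bam + ".glf")
    return (
        f"samtools view -bh {bam_q} {shlex.quote(region)} | "
        f"samtools calmd -Abr - {ref_q} 2> /dev/null | "
        f"samtools pileup - -g -f {ref_q} > {out_q}"
    )

print(glf_pipeline("chr1.bam", "human.v37.fa"))
```

The commands assume the modified samtools build mentioned above; shlex.quote is used so that paths containing spaces remain valid in the shell.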

Download

A pre-compiled 64-bit binary executable for Linux, with test files, can be downloaded here. Source code will be available for download soon.

Contact

For questions, please contact the authors (Bingshan Li: bingshan@umich.edu or Goncalo Abecasis: goncalo@umich.edu).