Difference between revisions of "Polymutt: a tool for calling polymorphism and de novo mutations in families for sequencing data"

From Genome Analysis Wiki
(Redirected page to Polymutt)
 
== Updates ==
The latest version, 0.03, is available for [[#Download | download]].
 
 
 
== Introduction ==
 
* The program '''polymutt''' implements a likelihood-based framework for calling '''single nucleotide variants''' and detecting '''''de novo''''' '''point mutation''' events in families from next-generation sequencing data. The program takes as input genotype likelihood format (GLF) files, which can be generated by following the [[#Creation of GLF files | Creation of GLF files]] instructions, and outputs the results in [http://www.1000genomes.org/node/101 VCF] format. Variant calling and ''de novo'' mutation detection are modelled jointly within families, and both nuclear and extended pedigrees without consanguinity loops can be handled. The input is a set of GLF files for the family members, and the relationships are specified through the .ped file.
 
 
 
* The evidence for variants and ''de novo'' mutations is assessed probabilistically. For a variant, the QUAL value is calculated as -10*log10(1-posterior(Variant | Data)). For ''de novo'' mutation events, a ''de novo'' quality (DQ) value is defined as log10(lk_denovo / lk_no_denovo), where lk_denovo and lk_no_denovo are the likelihoods of the data allowing and disallowing ''de novo'' mutations, respectively. Similarly, for each genotype, a genotype quality (GQ) value is defined as -10*log10(1-posterior(Genotype | Data)).
 
 
 
* Since unrelated individuals are a special case of families, unrelated individuals or a mixture of related and unrelated individuals can also be handled.
 
 
 
* If some individuals in a family are not sequenced, set the corresponding GLF file indices to zero for those family members.
 
 
 
* NOTE: This version only works for autosomes. Variant calling for X, Y and MT is being tested and will be available in the next version.
 
 
 
* See below for more details.
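
The quality scores defined above can be illustrated with a short Python sketch. It only restates the formulas and is not polymutt's implementation; the function names and the cap used when the posterior is numerically equal to 1 are assumptions made for this example.

```python
import math

def phred_quality(posterior, cap=255.0):
    """Return -10*log10(1 - posterior), the form used for QUAL and GQ.

    The cap is an assumption of this sketch: it avoids log10(0) when
    the posterior is numerically equal to 1.
    """
    error = 1.0 - posterior
    if error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error))

def denovo_quality(lk_denovo, lk_no_denovo):
    """Return DQ = log10(lk_denovo / lk_no_denovo)."""
    return math.log10(lk_denovo) - math.log10(lk_no_denovo)
```

For example, a posterior of 0.99 corresponds to a quality of about 20, and data that are 100 times more likely when ''de novo'' mutations are allowed give DQ = 2.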
 
 
 
== Usage ==
 
Running the command without any input displays the basic usage:
 
 
 
 polymutt
 
 The following parameters are in effect:
                       pedfile :                 (-pname)
                       datfile :                 (-dname)
                  glfIndexFile :                 (-gname)
              posterior cutoff :           0.500 (-c99.999)
 
 Additional Options
        Map Quality Filter : --minMapQuality
              Depth Filter : --minDepth, --maxDepth,
                             --minPercSampleWithData [0.00]
      Scaled mutation rate : --theta [1.0e-03]
      Prior of ts/tv ratio : --poly_tstv [2.00]
          de novo mutation : --denovo, --rate_denovo [1.5e-08],
                             --tstv_denovo [2.00], --minLLR_denovo [1.00]
    Optimization precision : --prec [1.0e-04]
        Multiple threading : --nthreads [1]
    Chromosomes to process : --chr2process []
                    Output : --vcf [variantCalls.vcf], --gl_off
 
 
 
 
 
 
 
An example command for variant calling looks like the following:
 
polymutt -p in.ped -d in.dat -g glfIndexFile --vcf out.vcf --nthreads 4
 
 
 
An example command for ''de novo'' mutation detection is as follows:
 
polymutt -p in.ped -d in.dat -g glfIndexFile --denovo --rate_denovo 1.5e-08 --minLLR_denovo 1.0 --vcf out.denovo.vcf --nthreads 4
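
When polymutt is driven from a script, the two example commands can be assembled programmatically. The following Python sketch only builds the argument list; the helper name and its parameters are hypothetical, and the flag spellings follow the usage message printed by the program.

```python
def build_polymutt_args(ped, dat, glf_index, vcf, nthreads=1,
                        denovo=False, rate_denovo=1.5e-08, min_llr=1.0):
    """Assemble a polymutt argument list (hypothetical helper)."""
    args = ["polymutt", "-p", ped, "-d", dat, "-g", glf_index]
    if denovo:
        # These options only take effect together with --denovo.
        args += ["--denovo",
                 "--rate_denovo", str(rate_denovo),
                 "--minLLR_denovo", str(min_llr)]
    args += ["--vcf", vcf, "--nthreads", str(nthreads)]
    return args
```

The resulting list can be passed to <code>subprocess.run(args, check=True)</code>, assuming the polymutt executable is on the PATH.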
 
 
 
== Input files ==
 
 
 
The required input files are a pedigree file (-p input.ped), a data file (-d input.dat) and a GLF index file (-g glfIndex).
 
 
 
* An example in.ped file looks like the following:
 
 fam1  p1  0   0   1  1
 fam1  p2  0   0   2  2
 fam1  p3  p1  p2  1  3
 fam2  p4  0   0   1  4
 fam2  p5  0   0   2  5
 fam2  p6  p4  p5  1  6
 ...
 
 
 
* An example in.dat file looks like the following. It describes the 6th column of the ped file above; other traits/markers can be specified in addition but will be ignored:
 
T GLF_Index
 
 
 
* An example glfIndex file looks like the following. Every number (except zero) in the 6th column of the in.ped file above must be present in the first column:
 
 1  /home/me/sample1.glf
 2  /home/me/sample2.glf
 3  /home/me/sample3.glf
 4  /home/me/sample4.glf
 ...
 
 
 
* If some members are not sequenced but are in the pedigree because of their relatedness to other members, their GLF_Index column (6th column) in the ped file should be set to zero.
 
* For unrelated individuals, you can either (1) create a family for each unrelated individual as a founder or (2) put all unrelated individuals as founders in a single family.
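
The consistency rule between the ped file and the glfIndex file can be checked before running polymutt. The Python sketch below is an illustration only: the function names are hypothetical, and it assumes whitespace-delimited files in which the GLF index is an integer in the 6th ped column, as in the examples above.

```python
def read_ped_glf_indices(ped_text):
    """Map (family, person) -> GLF index taken from the 6th ped column."""
    indices = {}
    for line in ped_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        indices[(fields[0], fields[1])] = int(fields[5])
    return indices

def read_glf_index(index_text):
    """Map GLF index -> GLF file path from a two-column glfIndex file."""
    paths = {}
    for line in index_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0].isdigit():
            paths[int(fields[0])] = fields[1].strip()
    return paths

def missing_glf_indices(ped_text, index_text):
    """Return ped entries whose nonzero GLF index is absent from glfIndex."""
    paths = read_glf_index(index_text)
    return sorted(
        (key, idx) for key, idx in read_ped_glf_indices(ped_text).items()
        if idx != 0 and idx not in paths
    )
```

An empty result means every sequenced individual has a GLF file listed; individuals with index zero are treated as not sequenced and are skipped.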
 
 
 
== Other options ==
 
 
 
Some of the command-line options are explained below; the others are self-explanatory.
 
 
-c : minimum cutoff of posterior probability to output a variant [''Default: 0.5'']
 
--theta : scaled mutation rate per site [''Default: 0.001'']
 
--poly_tstv : prior of ts:tv ratio [''Default: 2.0'']
 
--nthreads : number of threads to run; 4 threads are recommended for a small number of input files [''Default: 1'']
 
 
 
--denovo : a boolean flag to turn on ''de novo'' mutation detection. The following options take effect only when this flag is ON
 
--rate_denovo : mutation rate per haplotype per generation. [''Default: 1.5e-08'']
 
--tstv_denovo : the prior ts/tv ratio of ''de novo'' mutations. [''Default: 2.0'']
 
--minLLR_denovo : minimum value of log10 likelihood ratio of allowing vs. disallowing ''de novo'' mutations in the data to output [''Default: 1.0'']
 
 
 
--chr2process : the chromosome names to process. The default is empty, which processes all chromosomes in the input.
 
                If multiple chromosomes are provided, they should be separated by commas, e.g. --chr2process 2,10 or --chr2process chr2,chr10
 
 
 
--gl_off : do not output genotype likelihood values for each individual. The default is to output 3 GLs for polymorphisms and 10 GLs for ''de novo'' mutations.
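
The counts of 3 and 10 GLs correspond to the number of unordered genotypes over two alleles and over the four bases, respectively. The Python sketch below enumerates them; the function name is hypothetical, and the order in which polymutt writes the GLs is not specified here, so the order shown is only that of the enumeration.

```python
from itertools import combinations_with_replacement

def unordered_genotypes(alleles):
    """List unordered diploid genotypes over the given alleles."""
    return ["/".join(pair) for pair in combinations_with_replacement(alleles, 2)]

biallelic = unordered_genotypes("AG")    # 3 genotypes for a biallelic site
all_bases = unordered_genotypes("ACGT")  # 10 genotypes over A, C, G, T
```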
 
 
 
== Output files ==
 
* The output file is a VCF file; the specification can be found [http://www.1000genomes.org/node/101 here].
 
* Since there is no standard to represent ''de novo'' mutations in the current VCF specification, actual genotypes (e.g. [ACGT]/[ACGT]) are output in the VCF file for ''de novo'' mutations.
 
* A summary of variant calling statistics is written to STDOUT; it can be redirected to a file to keep a record.
 
 
 
 Summary of reference -- 9
 Total Entry Count: 141213431
 Total Base Cout: 120124735
 Total '0' Base Count:      137
 Non-Polymorphic Count:  655457
 Transition Count:      6556
 Transversion Count:      3127
 Other Polymorphism Count:        0
 Filter counts:
         minMapQual 4550
         minTotalDepth 1089
         maxTotalDepth 736
 Hard to call:        0
 Skipped bases: 134
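
If the summary is redirected to a file, the observed ts/tv ratio can be derived from the transition and transversion counts. The Python sketch below is a minimal illustration with a hypothetical function name; it assumes the "Label: number" layout shown above and does not guard against a transversion count of zero.

```python
def tstv_ratio(summary_text):
    """Compute transitions / transversions from a polymutt summary."""
    counts = {}
    for line in summary_text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counts[label.strip()] = int(value.strip())
    return counts["Transition Count"] / counts["Transversion Count"]
```

For the summary above, this gives 6556 / 3127, or about 2.1.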
 
 
 
== Creation of GLF files ==
 
 
 
* The current version performs variant calling and ''de novo'' mutation detection from files in the genotype likelihood format (GLF). In future versions we plan to take [http://samtools.sourceforge.net/ SAM/BAM] files as input. See the following instructions for how to create GLF files.
 
** Download a modified version of samtools ([https://github.com/statgen/samtools-0.1.7a-hybrid samtools-hybrid]).
 
** Prepare the reference genome in FASTA format and the sequence alignments in [http://samtools.sourceforge.net/ SAM/BAM] format.
 
** Generate BAQ-adjusted GLF files using the following command:
 
samtools view -bh chr1.bam 1:0 | samtools calmd -Abr - human.v37.fa 2> /dev/null | samtools pileup - -g -f human.v37.fa > chr1.bam.glf
 
* For other functionality, please refer to the [http://samtools.sourceforge.net/ samtools] website.
 
 
 
== Download ==
 
The latest version of the source code (v0.03), with test files, can be [[Media:polymutt.0.03.tar.gz | downloaded here]].
 
 
 
== Contact ==
 
For questions, please contact the authors (Bingshan Li: [mailto:bingshan@umich.edu bingshan@umich.edu] or Goncalo Abecasis: [mailto:goncalo@umich.edu goncalo@umich.edu]).
 
 
 
[[Category:Software]]
 

Latest revision as of 14:45, 3 December 2014
