Changes

From Genome Analysis Wiki
== Updates ==
The latest version, 0.03, is available for [[#Download | download]].

== Introduction ==
* The program '''polymutt''' implements a likelihood-based framework for calling '''single nucleotide variants''' and detecting '''''de novo''''' '''point mutation''' events in families from next-generation sequencing data. The program takes as input genotype likelihood format (GLF) files, which can be generated by following the [[#Creation of GLF files | Creation of GLF files]] instructions, and outputs the results in [http://www.1000genomes.org/node/101 VCF] format. Variant calling and ''de novo'' mutation detection are modelled jointly within families, and both nuclear and extended pedigrees without consanguinity loops can be handled. The input is a set of GLF files, one for each family member, and the relationships are specified through the .ped file.

* The evidence for variants and ''de novo'' mutations is assessed probabilistically. For a variant, the QUAL value is calculated as -10*log10(1-posterior(Variant | Data)). For ''de novo'' mutation events, a ''de novo'' quality (DQ) value is defined as log10(lk_denovo / lk_no_denovo), where lk_denovo and lk_no_denovo are the likelihoods of the data allowing and disallowing ''de novo'' mutations, respectively. Similarly, for each genotype, a genotype quality (GQ) value is defined as -10*log10(1-posterior(Genotype | Data)).

* Since unrelated individuals are a special case of families, unrelated individuals or a mixture of related and unrelated individuals can be handled.

* If some individuals in a family are not sequenced, set the GLF file indices of those family members to zero.

* NOTE: This version only works for autosomes. Variant calling for X, Y and MT is being tested and will be available in the next version.

* See below for more details.

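The quality scores defined above can be illustrated with a short Python sketch. This is not code from polymutt; the function names <code>phred_quality</code> and <code>denovo_quality</code> are hypothetical, and the input values are made up purely for illustration.

```python
import math


def phred_quality(posterior: float) -> float:
    """Return -10*log10(1 - posterior), the form used for QUAL and GQ."""
    error = 1.0 - posterior
    if error <= 0.0:
        # A posterior of exactly 1 leaves no error probability to scale.
        return math.inf
    return -10.0 * math.log10(error)


def denovo_quality(lk_denovo: float, lk_no_denovo: float) -> float:
    """Return DQ = log10(lk_denovo / lk_no_denovo)."""
    return math.log10(lk_denovo / lk_no_denovo)


# Illustrative values only (not taken from real data).
qual = phred_quality(0.999)         # approximately 30
dq = denovo_quality(1e-20, 1e-23)   # approximately 3
print(round(qual, 2), round(dq, 2))
```

Under these definitions, a posterior of 0.5 corresponds to a QUAL of about 3, and a DQ of 1.0 means the data are ten times more likely when a ''de novo'' mutation is allowed.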
== Usage ==
Running the command without any arguments displays the basic usage:

 polymutt

 The following parameters are in effect:
                      pedfile :                (-pname)
                      datfile :                (-dname)
                  glfIndexFile :                (-gname)
              posterior cutoff :          0.500 (-c99.999)

 Additional Options
      Map Quality Filter : --minMapQuality
            Depth Filter : --minDepth, --maxDepth,
                            --minPercSampleWithData [0.00]
    Scaled mutation rate : --theta [1.0e-03]
    Prior of ts/tv ratio : --poly_tstv [2.00]
        de novo mutation : --denovo, --rate_denovo [1.5e-08],
                            --tstv_denovo [2.00], --minLLR_denovo [1.00]
  Optimization precision : --prec [1.0e-04]
      Multiple threading : --nthreads [1]
  Chromosomes to process : --chr2process []
                  Output : --vcf [variantCalls.vcf], --gl_off

An example command for variant calling looks like the following:

 polymutt -p in.ped -d in.dat -g glfIndexFile --vcf out.vcf --nthreads 4

An example command for ''de novo'' mutation detection is as follows:

 polymutt -p in.ped -d in.dat -g glfIndexFile --denovo --rate_denovo 1.5e-08 --minLLR_denovo 1.0 --vcf out.denovo.vcf --nthreads 4

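When polymutt is driven from a script, the two example commands can be assembled programmatically. The sketch below is only an illustration: the helper <code>build_polymutt_command</code> is hypothetical, the file names are the placeholders used in the examples, and the flag spellings follow the usage listing printed by the program.

```python
import shlex


def build_polymutt_command(ped, dat, glf_index, vcf, denovo=False, nthreads=1):
    """Assemble a polymutt argument list (hypothetical helper)."""
    args = [
        "polymutt",
        "-p", ped,
        "-d", dat,
        "-g", glf_index,
        "--vcf", vcf,
        "--nthreads", str(nthreads),
    ]
    if denovo:
        # Values shown here are the defaults reported in the usage listing.
        args += ["--denovo", "--rate_denovo", "1.5e-08", "--minLLR_denovo", "1.0"]
    return args


cmd = build_polymutt_command(
    "in.ped", "in.dat", "glfIndexFile", "out.denovo.vcf", denovo=True, nthreads=4
)
print(shlex.join(cmd))
# To actually run it (assumes polymutt is installed and on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file paths contain spaces.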
== Input files ==
The required input files are given as -p input.ped -d input.dat -g glfIndexFile.

* An example in.ped file looks like the following:

 fam1 p1  0  0  1  1
 fam1 p2  0  0  2  2
 fam1 p3  p1 p2  1  3
 fam2 p4  0  0  1  4
 fam2 p5  0  0  2  5
 fam2 p6  p4 p5  1  6
 ...

* An example in.dat file looks like the following. It describes the 6th column of the ped file above; other traits/markers can be specified in addition but will be ignored:

 T GLF_Index

* An example glfIndex file looks like the following. Every number (except zero) in the 6th column of the in.ped file above has to be present in its first column.

 1  /home/me/sample1.glf
 2  /home/me/sample2.glf
 3  /home/me/sample3.glf
 4  /home/me/sample4.glf
 ...

* If some members are not sequenced but are in the pedigree because of their relatedness to other members, the GLF_Index column (6th column) in the ped file should be set to zero.

* For unrelated individuals, you can either (1) create a family for each unrelated individual as a founder or (2) put all unrelated individuals as founders in a single family.

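The consistency rule between the ped file and the glfIndex file can be checked before running polymutt. The following Python sketch is not part of polymutt; the function names are hypothetical, and the sample data are shortened, made-up variants of the examples above in which p1 is unsequenced and sample 3 is deliberately missing from the index.

```python
def read_ped_glf_indices(ped_text):
    """Map (family, person) to the GLF index in the 6th ped column."""
    indices = {}
    for line in ped_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        indices[(fields[0], fields[1])] = int(fields[5])
    return indices


def read_glf_index(index_text):
    """Map each GLF index (first column) to its GLF file path."""
    paths = {}
    for line in index_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2:
            paths[int(fields[0])] = fields[1].strip()
    return paths


def missing_glf_entries(ped_indices, glf_paths):
    """Return non-zero ped indices that have no entry in the glfIndex file."""
    return sorted({i for i in ped_indices.values() if i != 0 and i not in glf_paths})


PED = """\
fam1 p1  0  0  1  0
fam1 p2  0  0  2  2
fam1 p3  p1 p2  1  3
"""
GLF_INDEX = """\
2  /home/me/sample2.glf
"""

ped_indices = read_ped_glf_indices(PED)
glf_paths = read_glf_index(GLF_INDEX)
missing = missing_glf_entries(ped_indices, glf_paths)
print(missing)  # index 3 is used in the ped file but absent from the index
```

Index 0 is skipped on purpose, since it marks a family member who was not sequenced.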
== Other options ==
Some of the command line options are explained below; the others are self-explanatory.

 -c : minimum posterior probability cutoff for outputting a variant [''Default: 0.5'']
 --theta : scaled mutation rate per site [''Default: 0.001'']
 --poly_tstv : prior of the ts/tv ratio [''Default: 2.0'']
 --nthreads : number of threads to run; 4 threads are recommended for a small number of input files [''Default: 1'']

 --denovo : a boolean flag to turn on ''de novo'' mutation detection. The following options take effect only when this flag is ON
 --rate_denovo : mutation rate per haplotype per generation [''Default: 1.5e-08'']
 --tstv_denovo : the prior ts/tv ratio of ''de novo'' mutations [''Default: 2.0'']
 --minLLR_denovo : minimum log10 likelihood ratio of allowing vs. disallowing ''de novo'' mutations for a site to be output [''Default: 1.0'']

 --chr2process : the chromosome names to process. The default is empty, which processes all chromosomes in the input.
                 If multiple chromosomes are provided, they should be separated by commas, e.g. --chr2process 2,10 or --chr2process chr2,chr10

 --gl_off : do not output genotype likelihood values for each individual. The default is to output 3 GLs for polymorphisms and 10 GLs for ''de novo'' mutations

== Output files ==
* The output file is a VCF file; the specification can be found [http://www.1000genomes.org/node/101 here].
* Since there is no standard way to represent ''de novo'' mutations in the current VCF specification, actual genotypes (e.g. [ACGT]/[ACGT]) are output in the VCF file for ''de novo'' mutations.
* A summary of variant calling statistics is written to STDOUT and can be redirected to a file to keep a record.

 Summary of reference -- 9
 Total Entry Count: 141213431
 Total Base Cout: 120124735
 Total '0' Base Count:      137
 Non-Polymorphic Count:  655457
 Transition Count:      6556
 Transversion Count:      3127
 Other Polymorphism Count:        0
 Filter counts:
         minMapQual 4550
         minTotalDepth 1089
         maxTotalDepth 736
 Hard to call:        0
 Skipped bases: 134

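If the summary is redirected to a file, the counts can be pulled out with a few lines of Python, for example to compute the observed ts/tv ratio. This sketch assumes the "label: count" layout shown above; the function name <code>parse_counts</code> is hypothetical, and only three lines of the example summary are reused here.

```python
import re

SUMMARY = """\
Non-Polymorphic Count:  655457
Transition Count:      6556
Transversion Count:      3127
"""


def parse_counts(summary_text):
    """Collect 'label: integer' lines of the summary into a dict."""
    counts = {}
    for line in summary_text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_counts(SUMMARY)
tstv = counts["Transition Count"] / counts["Transversion Count"]
print(f"ts/tv = {tstv:.2f}")  # prints "ts/tv = 2.10" for the counts above
```

Lines without a trailing integer after a colon, such as the "Filter counts:" header and its indented entries, are ignored by this pattern.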
== Creation of GLF files ==
* The current version performs variant calling and ''de novo'' mutation detection from files in the genotype likelihood format (GLF). In future versions we plan to take [http://samtools.sourceforge.net/ SAM/BAM] files as input. See the following for instructions on how to create GLF files.
** Download a modified version of samtools ([https://github.com/statgen/samtools-0.1.7a-hybrid samtools-hybrid])
** Prepare the reference genome in fasta format and the sequence alignments in [http://samtools.sourceforge.net/ SAM/BAM] format
** Generate BAQ-adjusted GLF files using the following command:
 samtools view -bh chr1.bam 1:0 | samtools calmd -Abr - human.v37.fa 2> /dev/null | samtools pileup - -g -f human.v37.fa > chr1.bam.glf
* For other functionalities please refer to the [http://samtools.sourceforge.net/ samtools] website.

== Download ==
The latest version of the source code (v0.03), with test files, can be [[Media:polymutt.0.03.tar.gz | downloaded here]].

== Contact ==
For questions, please contact the authors (Bingshan Li: [mailto:bingshan@umich.edu bingshan@umich.edu] or Goncalo Abecasis: [mailto:goncalo@umich.edu goncalo@umich.edu]).

[[Category:Software]]
 