|
|
(23 intermediate revisions by 2 users not shown) |
Line 1: |
Line 1: |
− | == Introduction ==
| + | #REDIRECT [[Polymutt]] |
− | * The program '''polymutt''' implements a likelihood-based framework for calling '''single nucleotide variants''' and detecting '''''de novo''''' '''point mutation''' events in families from next-generation sequencing data. The program takes as input genotype likelihood format (GLF) files, which can be generated by following the [[#Creation of GLF files | Creation of GLF files]] instructions, and outputs the results in [[http://www.1000genomes.org/node/101 VCF]] format. Variant calling and ''de novo'' mutation detection are modelled jointly within families, and both nuclear and extended pedigrees without consanguinity loops can be handled. The input is a set of GLF files for the family members, and the relationships are specified through the .ped file.
| |
− | | |
− | * The evidence for variants and ''de novo'' mutations is assessed probabilistically. For a variant, the QUAL value is calculated as -10*log10(1-posterior(Variant | Data)). For ''de novo'' mutation events, a ''de novo'' quality (DQ) value is defined as log10(lk_denovo / lk_no_denovo), where lk_denovo and lk_no_denovo are the likelihoods of the data allowing and disallowing ''de novo'' mutations, respectively. Similarly, for each genotype, a genotype quality (GQ) value is defined as -10*log10(1-posterior(Genotype | Data)).
| |
− | | |
− | * Since unrelated individuals are a special case of families, unrelated individuals, or a mixture of related and unrelated individuals, can also be handled.
| |
− | | |
− | * If some individuals in a family are not sequenced, set the corresponding GLF file indices to zero for those family members.
| |
− | | |
− | * See below for more details.
| |
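The quality scores defined above can be illustrated with a minimal Python sketch. It is not part of polymutt: the function names and the cap applied when the posterior equals 1 are assumptions made for this example, and only the two formulas are taken from the text.

```python
import math


def phred_quality(posterior, cap=100.0):
    """Return -10*log10(1 - posterior), as used for QUAL and GQ.

    The cap is an assumption of this sketch: a posterior of exactly 1
    would otherwise give an infinite value.
    """
    error = 1.0 - posterior
    if error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error))


def denovo_quality(lk_denovo, lk_no_denovo):
    """Return DQ = log10(lk_denovo / lk_no_denovo).

    The ratio is computed as a difference of logs to limit underflow
    when both likelihoods are very small.
    """
    return math.log10(lk_denovo) - math.log10(lk_no_denovo)


if __name__ == "__main__":
    print(phred_quality(0.999))          # QUAL or GQ for a posterior of 0.999
    print(denovo_quality(1e-10, 1e-12))  # DQ for two hypothetical likelihoods
```

Under these definitions a posterior of 0.999 corresponds to a quality of about 30, and a DQ of 2 means the data are about 100 times more likely when a ''de novo'' mutation is allowed.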
− | | |
− | == Usage ==
| |
− | Running the command without any arguments displays the basic usage:
| |
− | | |
− | polymutt
| |
− | | |
− | The following parameters are in effect:
| |
− | pedfile : test.ped (-pname)
| |
− | datfile : test.dat (-dname)
| |
− | glfIndexFile : test.gif (-gname)
| |
− | posterior cutoff : 0.900 (-c99.999)
| |
− | | |
− | Additional Options
| |
− | Map Quality Filter : --minMapQuality
| |
− | Depth Filter : --minDepth [150], --maxDepth [200],
| |
− | --minPercSampleWithData [0.00]
| |
− | Scaled mutation rate : --theta [1.0e-03]
| |
− | Prior of ts/tv ratio : --tstv [2.00]
| |
− | de novo mutation : --denovo, --denovo_rate [0.00],
| |
− | --denovo_tstv [0.50], --denovo_min_LLR [1.00]
| |
− | Optimization precision : --prec [1.0e-04]
| |
− | Multiple threading : --nthreads [4]
| |
− | Output : --vcf [test.vcf], --gl_off
| |
− | | |
− | | |
− | An example command for variant calling looks like the following:
| |
− | polymutt -p in.ped -d in.dat -g glfIndexFile --vcf out.vcf --nthreads 4
| |
− | | |
− | An example command for ''de novo'' mutation detection is as follows:
| |
− | polymutt -p in.ped -d in.dat -g glfIndexFile --denovo --rate_denovo 1.5e-08 --min_denovo_LLR 1.0 --vcf out.denovo.vcf --nthreads 4
| |
− | | |
− | == Input files ==
| |
− | | |
− | Three input files are required, given as -p input.ped -d input.dat -g glfIndexFile.
| |
− | | |
− | * An example in.ped file looks like the following:
| |
− | fam1 p1 0 0 1 1
| |
− | fam1 p2 0 0 2 2
| |
− | fam1 p3 p1 p2 1 3
| |
− | fam2 p4 0 0 1 4
| |
− | fam2 p5 0 0 2 5
| |
− | fam2 p6 p4 p5 1 6
| |
− | ...
| |
− | | |
− | * An example in.dat file looks like the following. It describes the 6th column of the ped file above; other traits or markers can also be specified but will be ignored:
| |
− | T GLF_Index
| |
− | | |
− | * An example glfIndex file looks like the following. Every non-zero number in the 6th column of the in.ped file above must be present in its first column.
| |
− | 1 /home/me/sample1.glf
| |
− | 2 /home/me/sample2.glf
| |
− | 3 /home/me/sample3.glf
| |
− | 4 /home/me/sample4.glf
| |
− | ...
| |
− | | |
− | * If some members are not sequenced but are included in the pedigree because of their relatedness to other members, the GLF_Index column (6th column) in the ped file should be set to zero for them.
| |
− | * For unrelated individuals, you can either (1) create a family for each unrelated individual as a founder or (2) put all unrelated individuals as founders in a single family.
| |
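The consistency rule between the ped file and the glfIndex file can be checked before running polymutt. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes whitespace-separated files with the GLF index in the 6th column of the ped file, as in the examples above.

```python
def read_ped_indices(ped_lines):
    """Map (family, person) to the GLF index found in the 6th ped column."""
    indices = {}
    for line in ped_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or elided lines such as "..."
        indices[(fields[0], fields[1])] = int(fields[5])
    return indices


def read_glf_index(index_lines):
    """Map each GLF index (first column) to its GLF file path (second column)."""
    mapping = {}
    for line in index_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or elided lines
        mapping[int(fields[0])] = fields[1]
    return mapping


def missing_glf_indices(ped_lines, index_lines):
    """Return the non-zero ped indices that have no entry in the glfIndex file."""
    ped = read_ped_indices(ped_lines)
    glf = read_glf_index(index_lines)
    # An index of zero marks a family member who was not sequenced.
    return sorted({i for i in ped.values() if i != 0 and i not in glf})


if __name__ == "__main__":
    ped = ["fam1 p1 0 0 1 1", "fam1 p2 0 0 2 0", "fam1 p3 p1 p2 1 3"]
    idx = ["1 /home/me/sample1.glf"]
    print(missing_glf_indices(ped, idx))
```

In this usage example p2 is marked as not sequenced and is ignored, while index 3 is reported because it has no matching line in the index file.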
− | | |
− | == Other options ==
| |
− | | |
− | Some of the command-line options are explained below; the others are self-explanatory.
| |
− |
| |
− | -c : minimum cutoff of posterior probability to output a variant [''Default: 0.5'']
| |
− | --theta : scaled mutation rate per site [''Default: 0.001'']
| |
− | --tstv: prior of ts:tv ratio [''Default: 2.0'']
| |
− | --nthreads : number of threads to run; 4 threads are recommended for a small number of input files [''Default: 1'']
| |
− | | |
− | --denovo : a boolean flag to turn on ''de novo'' mutation detection. The following options take effect only when this flag is ON
| |
− | --rate_denovo : mutation rate per haplotype per generation. [''Default: 1e-08'']
| |
− | --tstv_denovo : the prior ts/tv ratio of ''de novo'' mutations. [''Default: 2.0'']
| |
− | --min_denovo_LLR : minimum value of log10 likelihood ratio of allowing vs. disallowing ''de novo'' mutations in the data to output [''Default: 1.0'']
| |
− | | |
− | == Output files ==
| |
− | * The output is a VCF file; the specification can be found [[http://www.1000genomes.org/node/101 here]].
| |
− | * Since there is no standard way to represent ''de novo'' mutations in the current VCF specification, actual genotypes (e.g. [ACGT]/[ACGT]) are output in the VCF file for ''de novo'' mutations.
| |
− | * A summary of variant calling statistics is written to STDOUT and can be redirected to a file for the record.
| |
− | | |
− | Summary of reference -- 9
| |
− | Total Entry Count: 141213431
| |
− | Total Base Cout: 120124735
| |
− | Total '0' Base Count: 137
| |
− | Non-Polymorphic Count: 655457
| |
− | Transition Count: 6556
| |
− | Transversion Count: 3127
| |
− | Other Polymorphism Count: 0
| |
− | Filter counts:
| |
− | minMapQual 4550
| |
− | minTotalDepth 1089
| |
− | maxTotalDepth 736
| |
− | Hard to call: 0
| |
− | Skipped bases: 134
| |
− | | |
− | == Creation of GLF files ==
| |
− | | |
− | * The current version performs variant calling and ''de novo'' mutation detection from files in the genotype likelihood format (GLF). In future versions we plan to take [[http://samtools.sourceforge.net/ SAM/BAM]] files as input. See the following for instructions on how to create GLF files.
| |
− | ** Download a modified version of samtools ( [[https://github.com/statgen/samtools-0.1.7a-hybrid samtools-hybrid]] )
| |
− | ** Prepare the reference genome in fasta format and sequence alignments in [[http://samtools.sourceforge.net/ SAM/BAM]] format
| |
− | ** Generate BAQ-adjusted GLF files using the following command:
| |
− | samtools view -bh chr1.bam 1:0 | samtools calmd -Abr - human.v37.fa 2> /dev/null | samtools pileup - -g -f human.v37.fa > chr1.bam.glf
| |
− | * For other functionalities please refer to the [[http://samtools.sourceforge.net/ samtools]] website.
| |
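When GLF files are needed for many BAM files or regions, the pipeline above can be assembled programmatically. The Python sketch below only builds the command string shown above from its parts; the function name and its parameters are assumptions made for this example, and it does not run samtools.

```python
import shlex


def glf_command(bam, region, reference, out_glf):
    """Build the BAQ-adjusted GLF pipeline for one BAM file and region.

    Arguments are shell-quoted so that paths containing spaces remain
    single arguments when the string is passed to a shell.
    """
    return (
        f"samtools view -bh {shlex.quote(bam)} {shlex.quote(region)}"
        f" | samtools calmd -Abr - {shlex.quote(reference)} 2> /dev/null"
        f" | samtools pileup - -g -f {shlex.quote(reference)}"
        f" > {shlex.quote(out_glf)}"
    )


if __name__ == "__main__":
    # Reproduces the command given above for chr1.bam.
    print(glf_command("chr1.bam", "1:0", "human.v37.fa", "chr1.bam.glf"))
```

The resulting string can be written to a shell script or passed to a job scheduler, one line per sample.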
− | | |
− | == Download ==
| |
− | The pre-compiled 64-bit binary executable for Linux, with test files, can be [[Media:polymutt.tar.gz | downloaded here]]. Source code will be available for download soon.
| |
− | | |
− | == Contact ==
| |
− | For questions, please contact the authors (Bingshan Li: [mailto:bingshan@umich.edu bingshan@umich.edu] or Goncalo Abecasis: [mailto:goncalo@umich.edu goncalo@umich.edu]).
| |