Difference between revisions of "Polymutt"

From Genome Analysis Wiki

Latest revision as of 10:12, 27 April 2014

NOTE

If you are interested in calling de novo mutations in trios based on VCF files, we recommend our new tool, triodenovo, which implements an improved algorithm with a more natural interpretation of the de novo quality. Please check it out by following the link below. Thanks for trying it out!

http://genome.sph.umich.edu/wiki/Triodenovo

Updates

The latest version, 0.18, is available in the Download section.

v0.18 fixed a bug in which inbreeding was reported for some pedigrees that are not inbred families.

v0.17 fixed a bug that occurred when some of the samples in the ped file are not in the input VCF file.

v0.16 fixed a bug that occurred when the input is a VCF file with multiple nuclear families and the ped file contains only a single nuclear family.

v0.15 added an option (--mixed_vcf_records) to handle input VCF files in which mixed records with different FORMAT fields are present.

v0.14 implemented both inherited variant calling and de novo mutation detection from VCF input files. If you have a VCF file with PL or GL fields, you can run polymutt on the VCF file to quickly and conveniently call variants and mutations.

  • NOTE: When data are missing for a trio or family in the VCF file, de novo mutation calling is not reliable and is often not possible. Such sites should be ignored for de novo mutations after calling.

v0.13 fixed the bug for generating genotypes when the input is a VCF file and the ped file contains only a single nuclear family. As with unrelated samples (e.g. GATK recommends at least 30 samples, see http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0), it is desirable to use more families, or a mixture of families and unrelated samples, for polymutt.

Introduction

  • The program polymutt implements a likelihood-based framework for calling single nucleotide variants and detecting de novo point mutation events in families from next-generation sequencing data.
  • The program takes as input genotype likelihood format (GLF) files, which can be generated by following the Creation of GLF files instructions, and outputs the result in VCF format (http://www.1000genomes.org/node/101). Alternatively, polymutt can take VCF input in which either the PL or the GL field is present. Commonly used variant callers such as GATK and samtools generate PL values in their VCF files by default. The current version works only on biallelic variants; non-biallelic variants in the VCF files are ignored.
  • Variant calling and de novo mutation detection are modeled jointly within families, and both nuclear and extended pedigrees without consanguinity loops can be handled.
  • Since unrelated individuals are a special case of families, unrelated individuals or a mixture of related and unrelated individuals can also be handled. Relationships are specified in the input .ped file, and each unrelated individual can be assigned a unique family ID.
  • The evidence for variants and de novo mutations is assessed probabilistically. For a variant, the QUAL value is calculated as -10*log10(1-posterior(Variant | Data)). For de novo mutation events, a de novo quality (DQ) value is defined as log10(lk_denovo / lk_no_denovo), where lk_denovo and lk_no_denovo are the likelihoods of the data allowing and disallowing de novo mutations, respectively. Similarly, for each genotype, a genotype quality (GQ) value is defined as -10*log10(1-posterior(Genotype | Data)).
  • If some individuals in a family are not sequenced and the input is GLF files, set the corresponding GLF file indices to zero for those family members. For VCF input, all individuals that are in the .ped file but not in the VCF file are treated as missing data (not sequenced).
  • Variant calling for X, Y and MT has been only lightly tested. Any comments or suggestions about polymutt, and non-autosomal variant calling in particular, are appreciated.
  • See below for more details and see "README" in the download for more info.
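
The quality definitions above can be illustrated with a minimal Python sketch. It only evaluates the stated formulas and is not polymutt's own code; the function names and the cap of 100 applied when the posterior equals 1 are assumptions made for illustration.

```python
import math


def phred_quality(posterior: float, cap: float = 100.0) -> float:
    """QUAL or GQ as defined above: -10*log10(1 - posterior).

    The cap is an assumption of this sketch: a posterior of exactly 1.0
    would otherwise give an infinite quality.
    """
    error = 1.0 - posterior
    if error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error))


def denovo_quality(lk_denovo: float, lk_no_denovo: float) -> float:
    """DQ as defined above: log10(lk_denovo / lk_no_denovo)."""
    return math.log10(lk_denovo / lk_no_denovo)


if __name__ == "__main__":
    # A posterior of 0.99 corresponds to a quality of about 20.
    print(round(phred_quality(0.99), 2))
    # Data 100 times more likely with a de novo event give a DQ of about 2.
    print(round(denovo_quality(1e-10, 1e-12), 2))
```

With these definitions, the default posterior cutoff of 0.5 (-c) corresponds to a QUAL of roughly 3.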

Additional Notes

  • All GLF files (and BAM files) in the input must have IDENTICAL chromosome orders. Polymutt goes through the chromosomes in order until one GLF file has a different chromosome from the others. All results prior to that problematic chromosome are still valid.
    • If the BAM files have different orders or even different numbers of chromosomes, you can create GLF files for individual chromosomes and run polymutt on matched chromosomes.
  • The current version does NOT call de novo mutations on the X, Y and MT chromosomes; please ignore de novo records on these non-autosomes.
  • For de novo mutations, it is usually helpful to explore several mutation rates in addition to the default (1.5x10^-8). At depths lower than 30X, for example, the support for a de novo mutation will be weak under the low default mutation rate. Trying higher mutation rates (e.g. 10^-6 or 10^-7) may pick up such low-depth sites.
  • Some of the features not yet supported will be implemented in future versions.
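
Since de novo records on non-autosomes should be ignored, one way to drop them after calling is sketched below in Python. This is an illustration only, not part of polymutt: the function name is hypothetical, the chromosome labels are the defaults of --chrX, --chrY and --MT, and the two data lines in the demo are made-up records.

```python
# Default non-autosome labels, matching --chrX [X], --chrY [Y], --MT [MT].
NON_AUTOSOMES = frozenset({"X", "Y", "MT"})


def drop_non_autosomes(vcf_lines, labels=NON_AUTOSOMES):
    """Keep header lines and records whose CHROM (first column) is not in labels."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in labels:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Made-up records for illustration only.
    demo = [
        "##fileformat=VCFv4.1",
        "1\t12345\t.\tA\tG\t50\tPASS\t.",
        "X\t67890\t.\tC\tT\t40\tPASS\t.",
    ]
    for line in drop_non_autosomes(demo):
        print(line)
```

If the chromosomes are labeled differently in your data (for example chrX), pass the matching labels instead of the defaults.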

Usage

Running polymutt without any arguments displays the basic usage:

polymutt
The following parameters are in effect:
                      pedfile :                 (-pname)
                      datfile :                 (-dname)
                 glfIndexFile :                 (-gname)
             posterior cutoff :            0.50 (-c99.999)
Additional Options
  Alternative input file : --in_vcf []
    Scaled mutation rate : --theta [1.0e-03], --indel_theta [1.0e-04]
    Prior of ts/tv ratio : --poly_tstv [2.00]
     Non-autosome labels : --chrX [X], --chrY [Y], --MT [MT]
        de novo mutation : --denovo, --rate_denovo [1.5e-08],
                           --tstv_denovo [2.00], --minLLR_denovo [0.01]
  Optimization precision : --prec [1.0e-04]
      Multiple threading : --nthreads [1]
                 Filters : --minMapQuality, --minDepth, --maxDepth,
                           --minPercSampleWithData [0.00]
                  Output : --out_vcf [], --pos [], --all_sites, --gl_off,
                           --quick_call


An example command for variant calling looks like the following:

polymutt -p input.ped -d input.dat -g input.gif --out_vcf out.vcf --nthreads 4

An example command for variant calling taking a VCF file as input looks like the following:

polymutt -p input.ped -d input.dat  --in_vcf input.vcf --out_vcf out.vcf --nthreads 4

Examples for de novo mutation detection:

polymutt -p input.ped -d input.dat -g input.gif --denovo --out_vcf out.denovo.vcf --nthreads 4
polymutt -p input.ped -d input.dat -g input.gif --out_vcf out.vcf --denovo
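
To explore several mutation rates, as suggested in Additional Notes, the de novo command can be repeated with different --rate_denovo values. The Python sketch below only builds the command lines and does not run them; the function name and the output file naming scheme are hypothetical, while the flags are those shown in the usage message above.

```python
import shlex


def denovo_commands(rates, ped="input.ped", dat="input.dat", gif="input.gif"):
    """Build one polymutt de novo command line per mutation rate."""
    commands = []
    for rate in rates:
        rate_text = f"{rate:.1e}"
        # Hypothetical naming scheme: one output VCF per rate.
        out_vcf = f"out.denovo.rate_{rate_text}.vcf"
        args = [
            "polymutt", "-p", ped, "-d", dat, "-g", gif,
            "--denovo", "--rate_denovo", rate_text,
            "--out_vcf", out_vcf,
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # The default rate plus the two higher rates mentioned in Additional Notes.
    for command in denovo_commands([1.5e-08, 1e-07, 1e-06]):
        print(command)
```

The printed lines can be pasted into a shell or a job script; comparing the resulting VCF files shows which candidate sites appear only under the higher rates.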

Examples of calling X, Y and MT (works only for variants but not de novo mutations):

polymutt -p input.ped -d input.dat -g input.gif --out_vcf out.vcf --chrX X --chrY Y --MT MT --nthreads 4
polymutt -p input.ped -d input.dat --in_vcf input.vcf --out_vcf out.vcf --chrX X --chrY Y --MT MT --nthreads 4

Input files

Option 1

Required input files are -p input.ped -d input.dat -g input.gif. The input.ped file specifies the sample relationships, and input.dat and input.gif specify the sequence data.

  • An example in.ped file looks like the following (for more info refer to the merlin documentation[1]):
fam1 p1  0  0   1  1
fam1 p2  0  0   2  2
fam1 p3  p1 p2  1  3
fam2 p4  0  0   1  4
fam2 p5  0  0   2  5
fam2 p6  p4 p5  1  6
...
  • An example in.dat file looks like the following. It describes the 6th column of the .ped file above; other traits/markers can also be specified but will be ignored:
T GLF_Index

The .dat file above declares that the 6th column of the .ped file holds the GLF_Index. If -g input.gif is specified, input.gif looks like the following; every non-zero number in the 6th column of input.ped must appear in the first column of input.gif.

1  /home/me/sample1.glf
2  /home/me/sample2.glf
3  /home/me/sample3.glf
4  /home/me/sample4.glf
...
  • If some of the members are not sequenced but are in the pedigree because of their relatedness to other members, the GLF_Index column (6th column) in the ped file should be set to zero for them.
  • For unrelated individuals, you can either (1) create a family for each unrelated individual as a founder or (2) put all unrelated individuals as founders in a single family.
  • See "examples" in the download for more info.
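
The requirement that every non-zero GLF_Index in the .ped file appears in the first column of the .gif file can be checked before running polymutt. The following Python sketch is a hypothetical helper, not part of polymutt; it assumes whitespace-separated files laid out as in the examples above, with the GLF_Index in the 6th column of the .ped file.

```python
def ped_glf_indices(ped_text):
    """Map (family, individual) to the GLF_Index in the 6th column, skipping zeros."""
    indices = {}
    for line in ped_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # blank or incomplete line
        family, individual, glf_index = fields[0], fields[1], fields[5]
        if glf_index != "0":  # zero means the individual is not sequenced
            indices[(family, individual)] = glf_index
    return indices


def gif_indices(gif_text):
    """Return the set of indices listed in the first column of the .gif file."""
    return {line.split()[0] for line in gif_text.splitlines() if line.split()}


def missing_glf_entries(ped_text, gif_text):
    """Return the individuals whose GLF_Index has no entry in the .gif file."""
    available = gif_indices(gif_text)
    return {who: index for who, index in ped_glf_indices(ped_text).items()
            if index not in available}


if __name__ == "__main__":
    ped = "fam1 p1 0 0 1 1\nfam1 p2 0 0 2 2\nfam1 p3 p1 p2 1 3\n"
    gif = "1 /home/me/sample1.glf\n2 /home/me/sample2.glf\n"
    # p3 refers to index 3, which this shortened .gif does not list.
    print(missing_glf_entries(ped, gif))
```

An empty result means the two files are consistent with each other.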

Option 2

Alternatively, if you want to refine the variant and genotype calling using family relatedness based on your existing VCF files, polymutt can take a VCF file as input. In this case, the VCF file has to have the PL or the GL field, which is usually available from commonly used tools (e.g. GATK and samtools).

In this option, you can specify --in_vcf input.vcf in place of -g input.gif for variant calling. If both the --in_vcf and -g options are specified, --in_vcf takes effect and -g is ignored. The .ped and .dat files are as in Option 1, but only the first 5 columns are used and other columns are ignored. You can remove the GLF_Index column, but the .dat file must currently still be present even if it is empty (this will be made more flexible in future versions).

  • See "examples" in the download for more info.
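
Whether a VCF record carries the PL or GL field can be checked from its FORMAT column, which is the 9th tab-separated column in the VCF format. The Python sketch below is illustrative only; the function name is hypothetical and the demo record is made up.

```python
def has_genotype_likelihoods(vcf_record: str) -> bool:
    """Return True if the FORMAT column (9th column) lists PL or GL."""
    fields = vcf_record.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # no FORMAT column, hence no per-sample fields
    format_keys = set(fields[8].split(":"))
    return bool({"PL", "GL"} & format_keys)


if __name__ == "__main__":
    # Made-up record for illustration only.
    record = "1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:GQ:PL\t0/1:40:50,0,60"
    print(has_genotype_likelihoods(record))
```

Checking a handful of records this way before a run can reveal inputs that lack the required fields, or files with mixed FORMAT fields for which --mixed_vcf_records is intended.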

Other options

Some of the command-line options are explained below; the others are self-explanatory.

-c : minimum cutoff of posterior probability to output a variant [Default: 0.5]
--theta : scaled mutation rate per site for single nucleotide variants [Default: 0.001]
--indel_theta : scaled mutation rate per site for Indels (works only for VCF input with indel calls) [Default: 0.0001]
--poly_tstv: prior of ts:tv ratio [Default: 2.0]
--nthreads : number of threads to run; 4 threads are recommended for a small number of input files [Default: 1]
  • The following applies only to GLF input
--denovo : a boolean flag to turn on de novo mutation detection. The following options take effect only when this flag is ON
--rate_denovo : mutation rate per haplotype per generation. [Default: 1.5e-08]
--tstv_denovo : the prior ts/tv ratio of de novo mutations. [Default: 2.0]
--minLLR_denovo : minimum value of log10 likelihood ratio of allowing vs. disallowing de novo mutations in the data to output [Default: 1.0]
--pos : a file with two columns (chr pos) to output genotypes of all individuals, even if the sites are monomorphic
--all_sites : If turned on, all sites with at least one read coverage will be output.
--gl_off : If turned on, genotype likelihood values are not output for each individual. The default is to output 3 GLs for polymorphisms and 10 GLs for de novo mutations
--quick_call : If turned on, it will perform variant calling assuming that all individuals are unrelated, and if a site is detected as a variant site then the family-aware variant calling will be performed. This will be beneficial for complex pedigrees for which the likelihood calculation may be demanding.

Output files

  • The output file is a VCF file; its format is described in the VCF specification.
  • Since there is no standard to represent de novo mutations in the current VCF specification, actual genotypes (e.g. [ACGT]/[ACGT]) are output in the VCF file for de novo mutations.
  • A summary about variant calling statistics is output to STDOUT and it may be redirected to a file for a record.
Summary of reference -- 9
Total Entry Count: 141213431 
Total Base Cout: 120124735
Total '0' Base Count:       137
Non-Polymorphic Count:   655457
Transition Count:      6556
Transversion Count:      3127
Other Polymorphism Count:         0
Filter counts:
       minMapQual 4550
       minTotalDepth 1089
       maxTotalDepth 736
Hard to call:         0
Skipped bases: 134
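
The summary can be post-processed to obtain simple statistics such as the transition/transversion ratio. The Python sketch below assumes the "Name: count" layout shown above; the function names are hypothetical and this is not a tool shipped with polymutt.

```python
def parse_summary(text):
    """Collect the 'Name: integer' lines of a summary into a dictionary."""
    counts = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        value = value.strip()
        if separator and value.isdigit():
            counts[key.strip()] = int(value)
    return counts


def ts_tv_ratio(counts):
    """Transition count divided by transversion count."""
    return counts["Transition Count"] / counts["Transversion Count"]


if __name__ == "__main__":
    summary = (
        "Non-Polymorphic Count:   655457\n"
        "Transition Count:      6556\n"
        "Transversion Count:      3127\n"
    )
    # For the counts shown above the ratio is about 2.10.
    print(round(ts_tv_ratio(parse_summary(summary)), 2))
```

A ratio far from the value expected for the data set can be a hint that the filters or the posterior cutoff need adjusting.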

Creation of GLF files

  • The current version performs variant calling and de novo mutation detection from files in the genotype likelihood format (GLF). In future versions we plan to take SAM/BAM files as input. The following command shows how to create GLF files.
samtools-hybrid view -bh chr1.bam 1:0 | samtools-hybrid calmd -Abr - human.v37.fa 2> /dev/null | samtools-hybrid pileup - -g -f human.v37.fa > chr1.bam.glf
    • If you want to clip overlapping reads, you can add the clipOverlap command:
samtools-hybrid view -bh chr1.bam 1:0 |samtools-hybrid calmd -Abr - human.v37.fa 2> /dev/null | bam clipOverlap --in -.bam --out -.ubam| samtools-hybrid pileup - -g -f human.v37.fa > chr1.bam.glf
  • For other functionalities please refer to the samtools website.

Download

The latest version of source code v0.18 with test files can be downloaded here.

Contact

For questions please contact the authors (Bingshan Li: bingshan@umich.edu)

Citation

Li B, Chen W, Zhan X, Busonero F, Sanna S, et al. (2012) A Likelihood-Based Framework for Variant Calling and De Novo Mutation Detection in Families. PLoS Genet 8(10): e1002944. doi:10.1371/journal.pgen.1002944