Difference between revisions of "Bayesdenovo"

From Genome Analysis Wiki
 

Revision as of 12:30, 25 October 2016

Update

v0.01 is available for download

Compilation

  • After downloading the source code, unzip and untar it, cd into the bayesdenovo directory, and then type make.
  • If you encounter errors related to deprecated syntax, try commenting out the following line in core/Makefile:
CXXFLAGS += -Werror -Wno-unused-variable -Wno-unused-result

Introduction

  • The program bayesdenovo implements a Bayesian framework for calling de novo mutations in nuclear families (including trios, quartets, and families with more siblings) from next-generation sequencing data.
  • It takes as input a standard VCF file with PL or GL fields (storing genotype likelihoods). Commonly used callers, e.g. GATK and samtools, generate VCF files with PL values.
  • It calculates the likelihood of the model with de novo mutations, denoted L1, and the likelihood of Mendelian transmission, denoted L0, and represents the evidence for a de novo mutation with a Bayes factor BF=L1/L0. As in TrioDeNovo, the de novo quality is reported as DQ=log10(BF)=log10(L1/L0).
  • DQ is the main parameter controlling the output, along with others. See the example output file below.
  • We recommend basic filtering of the calls and, where possible, more advanced filtering; see the Filtering section.
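
To make the DQ definition concrete, here is a minimal Python sketch of the arithmetic. It is not taken from the bayesdenovo source: the function names are hypothetical, and the default threshold of 5 is simply the --minDQ default shown in the usage message.

```python
import math


def de_novo_quality(l1: float, l0: float) -> float:
    """Return DQ = log10(BF) = log10(L1 / L0) for two positive likelihoods.

    l1: likelihood of the model with a de novo mutation (L1)
    l0: likelihood of Mendelian transmission (L0)
    """
    if l1 <= 0 or l0 <= 0:
        raise ValueError("likelihoods must be positive")
    # Subtract the logs instead of dividing first, to avoid underflow
    # when both likelihoods are very small.
    return math.log10(l1) - math.log10(l0)


def passes_min_dq(dq: float, min_dq: float = 5.0) -> bool:
    """Hypothetical helper: keep a call only if DQ reaches the threshold."""
    return dq >= min_dq


if __name__ == "__main__":
    dq = de_novo_quality(1e-3, 1e-10)
    print(round(dq, 2), passes_min_dq(dq))
```

A DQ of 7 therefore means the de novo model is about 10^7 times more likely than Mendelian transmission under the genotype likelihoods supplied.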

Usage

Running bayesdenovo without any arguments displays the following message:

   *** This build (v.0.1) was compiled on Oct 25 2016, 11:16:22 ***
                      pedfile :                 (-pname)
                      datfile :                 (-dname)
                      mapfile :                 (-mname)
Additional Options
                       Input : --in_vcf [], --submap [1.00]
  Denovo mutation parameters : --tstv_ratio [2.00], --minDQ [5.00]
             Multi-threading : --nthreads [1]
                      Output : --out_prefix []
  • Example 1: using default parameters
bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo
  • Example 2: using --minDQ 7 to output only de novo calls with a DQ of at least 7.
bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo --minDQ 7
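
For scripted runs over many families, the command line can be assembled programmatically. The following Python sketch is only an illustration: build_command is a hypothetical helper, the option spellings are copied from the usage message above, and it assumes the bayesdenovo executable is on the PATH.

```python
def build_command(ped, dat, map_file, in_vcf, out_prefix,
                  min_dq=None, nthreads=None, exe="bayesdenovo"):
    """Assemble an argument list using only options shown in the usage message."""
    cmd = [exe, "-p", ped, "-d", dat, "-m", map_file,
           "--in_vcf", in_vcf, "--out_prefix", out_prefix]
    if min_dq is not None:
        cmd += ["--minDQ", str(min_dq)]      # minimum DQ to report
    if nthreads is not None:
        cmd += ["--nthreads", str(nthreads)]  # number of threads
    return cmd


if __name__ == "__main__":
    cmd = build_command("sim.vcf.ped", "sim.vcf.dat", "sim.vcf.map",
                        "sim.vcf", "sim.vcf.denovo", min_dq=7)
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.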

Input files

  • A pedigree (ped) file describing the family structure, passed with -p. For example:
quartet1 p1  0  0   1
quartet1 p2  0  0   2
quartet1 p3  p1 p2  1
  • A VCF file [VCF specs]. It can contain variant information for more individuals than in the ped file.
    • Note: In the VCF file either PL or GL has to be provided, and only the PL (or GL) field is used in the calling.
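
Before running the caller it can help to sanity-check the inputs. The Python sketch below assumes the five pedigree columns are family ID, individual ID, father ID, mother ID and sex, with 0 marking a founder's missing parent; this reading is inferred from the example rows, and the helper names are hypothetical.

```python
from collections import namedtuple

Person = namedtuple("Person", "family id father mother sex")


def parse_ped(lines):
    """Parse whitespace-separated pedigree rows (first five columns only)."""
    people = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        people.append(Person(*fields[:5]))
    return people


def founders_and_offspring(people):
    """Split individuals into founders (both parents '0') and offspring."""
    founders = [p.id for p in people if p.father == "0" and p.mother == "0"]
    offspring = [p.id for p in people if p.father != "0" or p.mother != "0"]
    return founders, offspring


def has_likelihoods(format_field):
    """True if a VCF FORMAT string such as 'GT:DP:PL' carries PL or GL."""
    keys = format_field.split(":")
    return "PL" in keys or "GL" in keys


if __name__ == "__main__":
    ped = ["quartet1 p1  0  0   1",
           "quartet1 p2  0  0   2",
           "quartet1 p3  p1 p2  1"]
    print(founders_and_offspring(parse_ped(ped)))
    print(has_likelihoods("GT:DP:PL"))
```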

Output

  • One output file is written per family; the prefix of the file names is specified via --out_prefix.

An example output file is as follows:

##fileformat=VCFv4.1 
##ProgramStart=Tue Oct 25 11:20:39 2016
##BayesDeNovo=../bin/bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo 
##Note=VCF file modified by polymutt2. Updated fileds include: QUAL, GT and GQ, and AC. NOTE: modification was applied only to biallelic variants
##FILTER=<ID=LOWDP,Description="Low Depth filter when the average depth per sample is lessn than 1">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Alternative Allele Frequency">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Alternative Allele Count">
##INFO=<ID=FDQ,Number=1,Type=Integer,Description="Family-wise De Novo Mutatoin Quality in log10(BF) format">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DQ,Number=1,Type=Integer,Description="De Novo Mutation Quality in log10(BF) format">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=IV,Number=2,Type=Integer,Description="Best path inheritance vector. Founder alleles are arbiturally labeled (1 to 2*nFounders) and L1|L2 for non-founders indicated L1 and L2 from founders are transmitted">
##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Phred-scaled Genotype Likelihoods">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Rep1_1_1        Rep1_1_2        Rep1_1_3        Rep1_1_4
1       118     .       C       A       100     .       AF=0.333333;AC=7;DP=1629;FDQ=8.667414   GT:DQ:DP:PL     0/0:.:202:1|2:0,100,255 0/0:.:234:3|4:0,100,255 0/1:8.97:203:1|3:100,0,255      0/0:.:206:2|4:0,100,255
1       858     .       C       A       100     .       AF=0.333333;AC=10;DP=1592;FDQ=8.688325  GT:DQ:DP:PL     0/0:.:184:1|2:0,100,255 0/0:.:208:3|4:0,100,255 0/0:.:220:1|3:0,100,255 0/1:8.99:197:2|4:100,0,255
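
The per-sample DQ and the family-wise FDQ can be pulled out of a record with a few lines of Python. This is a sketch based only on the example above, not on a documented schema: it assumes whitespace-separated columns in standard VCF order, looks up the position of DQ in the FORMAT column, and treats "." as "no de novo call". The function name is hypothetical.

```python
def parse_denovo_record(line, sample_names):
    """Extract position, FDQ and per-sample DQ values from one output record."""
    cols = line.split()
    # INFO is the 8th column: semicolon-separated key=value pairs.
    info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
    # FORMAT is the 9th column; find where DQ sits within each sample field.
    dq_index = cols[8].split(":").index("DQ")
    calls = {}
    for name, sample in zip(sample_names, cols[9:]):
        value = sample.split(":")[dq_index]
        if value != ".":  # "." means no de novo call for this sample
            calls[name] = float(value)
    return {"chrom": cols[0], "pos": int(cols[1]),
            "fdq": float(info["FDQ"]), "dq": calls}


if __name__ == "__main__":
    record = ("1 118 . C A 100 . AF=0.333333;AC=7;DP=1629;FDQ=8.667414 "
              "GT:DQ:DP:PL 0/0:.:202:1|2:0,100,255 0/0:.:234:3|4:0,100,255 "
              "0/1:8.97:203:1|3:100,0,255 0/0:.:206:2|4:0,100,255")
    names = ["Rep1_1_1", "Rep1_1_2", "Rep1_1_3", "Rep1_1_4"]
    print(parse_denovo_record(record, names))
```

In the first example record, only the third sample (Rep1_1_3) carries a DQ value, which marks it as the individual with the candidate de novo mutation.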


Filtering

We recommend two filtering strategies: a simple filter and a more advanced one. Please see the TrioDeNovo page below for more information:

http://genome.sph.umich.edu/wiki/Triodenovo

3. Further thoughts about filtering SNVs without BAM files (step 2, the advanced filtering described on the TrioDeNovo page, requires BAM files). There is no consensus on filtering, so this step can be very flexible.

  • If you have a multi-sample VCF, it may be helpful to select mutation candidates that appear only once in the VCF (AC=1, for example). These can form the top tier to consider. Relaxing AC to 2 or 3 can recover more real mutations but also increases false positives.
  • If filtering out all known sites is too stringent, it may be helpful to select candidates with low (e.g. <0.002) 1000G or ESP allele frequencies. Some mutations can occur at known variant sites, but mutations with high population frequencies may not be of great interest even if they are real.
  • Candidates in segmental duplications, low-complexity regions or other copy number regions may be flagged for further analysis.
  • Candidates for which the parents are not hom-ref, or for which the offspring is a double mutant, are more likely to be artifacts, so interpreting them may require additional QC if they appear interesting to the investigators.
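
The first two suggestions above can be sketched in Python as follows. The thresholds are the ones quoted in the text (AC=1 and 0.002). How a population allele frequency is obtained is an assumption: here it is passed in as a plain number from whatever external annotation is available, and the function names are hypothetical.

```python
def info_to_dict(info):
    """Turn a VCF INFO string such as 'AF=0.3;AC=7' into a dict of strings."""
    return dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)


def passes_allele_count(info, max_ac=1):
    """Keep candidates whose alternative allele count is at most max_ac."""
    return int(info_to_dict(info)["AC"]) <= max_ac


def passes_population_af(pop_af, max_af=0.002):
    """Keep candidates that are absent from (None) or rare in a reference panel.

    pop_af is an externally annotated frequency, e.g. from 1000G or ESP.
    """
    return pop_af is None or pop_af < max_af


if __name__ == "__main__":
    print(passes_allele_count("AF=0.001;AC=1;DP=1600"))  # singleton candidate
    print(passes_allele_count("AF=0.333333;AC=7;DP=1629"))
    print(passes_population_af(0.0005), passes_population_af(0.01))
```

Raising max_ac to 2 or 3 corresponds to the relaxed tier described above, trading more recovered mutations for more false positives.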

Download

The source code of v0.01 (bayesdenovo.0.01.tar.gz) is available for download here.

Contact

For questions please contact the authors (Bingshan Li: bingshan.li@vanderbilt.edu)