Difference between revisions of "Bayesdenovo"

From Genome Analysis Wiki
 
http://genome.sph.umich.edu/wiki/Triodenovo
 
 
3. Further thoughts on filtering SNVs without BAM files (step 2 requires BAM files). There is no consensus on filtering, so these criteria can be applied flexibly.
 
* If you have a multi-sample VCF, it may help to select mutation candidates that appear only once in it (for example, AC=1). These can form the top tier to consider. Relaxing AC to 2 or 3 can recover more real mutations but also increases false positives.
 
* If it is too stringent to filter out known sites, it may be helpful to select candidates that have low (e.g. <0.002) 1000G or ESP allele frequencies. Some mutations can occur at known variant sites, but mutations with high population frequencies may not be of great interest, even if they are real.
 
* Candidates in segmental duplications, low complexity regions or other copy number regions may be flagged for further analysis.
 
* Candidates for which the parents are not hom-ref, or for which the offspring is a double mutant, are more likely to be artifacts, so they may require additional QC if they appear interesting to the investigators.
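
The allele-count and population-frequency criteria above can be sketched as a small tiering function. This is only an illustration, not part of bayesdenovo: the function name, the tier labels, and the assumption that a population allele frequency (e.g. from 1000G or ESP) has already been looked up for each candidate are all hypothetical.

```python
def tier_candidate(ac, pop_af=None, max_pop_af=0.002):
    """Assign a coarse, hypothetical tier label to a de novo candidate.

    ac      -- alternative allele count in the multi-sample VCF
    pop_af  -- population allele frequency from an external annotation,
               or None if the site is not annotated (assumption)
    """
    # Candidates at sites with a high population frequency are set aside.
    if pop_af is not None and pop_af >= max_pop_af:
        return "common"
    # Singletons in the call set form the top tier.
    if ac == 1:
        return "top"
    # Relaxing AC to 2 or 3 recovers more calls at the cost of false positives.
    if ac <= 3:
        return "relaxed"
    return "low"


if __name__ == "__main__":
    for ac, af in [(1, 0.0005), (3, None), (1, 0.01), (7, None)]:
        print(ac, af, tier_candidate(ac, af))
```

The 0.002 threshold is the example value mentioned above and can be changed through `max_pop_af`.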
 
  
 
== Download ==
 

Latest revision as of 10:22, 26 October 2016

Update

v0.01 is available for download

Compilation

  • After downloading the source code, unzip and untar it, cd into bayesdenovo, and then type make.
  • If you encounter errors related to deprecated usage of some syntax, try commenting out the following line in core/Makefile:
CXXFLAGS += -Werror -Wno-unused-variable -Wno-unused-result

Introduction

  • The program bayesdenovo implements a Bayesian framework for calling de novo mutations in nuclear families (including trios, quartets, and families with more siblings) from next-generation sequencing data. It infers identity-by-descent (IBD) allele sharing to increase the accuracy of de novo mutation calling. As a result, the IBD sharing for the called de novo mutations is also available in the output file.
  • It takes as input a standard VCF file with PL or GL fields (storing genotype likelihoods). Commonly used callers, e.g. GATK and samtools, generate VCF files with PL values.
  • It calculates the likelihood of the model with de novo mutations, denoted as L1, and the likelihood of Mendelian transmission, denoted as L0, and represents the de novo evidence using a Bayes factor BF=L1/L0. As in TrioDeNovo, the de novo quality is reported as DQ = log10(BF) = log10(L1/L0).
  • DQ is the main parameter controlling the output, along with others. See the example output file below.
  • We recommend a basic filtering step and also a more advanced one (see Filtering).
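
As a worked illustration of the DQ definition above, the following sketch computes DQ = log10(L1/L0) and compares it with a minimum DQ threshold. The function names are hypothetical and the likelihood values are made up for the example; the sketch does not reproduce how bayesdenovo computes L1 and L0. The default threshold of 5.0 mirrors the --minDQ default shown in the usage message.

```python
import math


def dq_from_log10(log10_l1, log10_l0):
    """DQ = log10(L1/L0), computed from log10 likelihoods to avoid underflow."""
    return log10_l1 - log10_l0


def dq_from_likelihoods(l1, l0):
    """Same quantity, starting from raw (positive) likelihoods."""
    return dq_from_log10(math.log10(l1), math.log10(l0))


def passes_min_dq(dq, min_dq=5.0):
    """True if the de novo quality reaches the minimum DQ threshold."""
    return dq >= min_dq


if __name__ == "__main__":
    # Made-up likelihoods: the de novo model is 10^9 times more likely.
    dq = dq_from_likelihoods(1e-3, 1e-12)
    print(round(dq, 2), passes_min_dq(dq))
```

Working with log10 likelihoods is a design choice of this sketch: likelihoods for many reads can be very small, and subtracting logs avoids dividing two numbers that are close to zero.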

Usage

Running bayesdenovo without any arguments displays the following message:

   *** This build (v.0.1) was compiled on Oct 25 2016, 11:16:22 ***
                      pedfile :                 (-pname)
                      datfile :                 (-dname)
                      mapfile :                 (-mname)
Additional Options
                       Input : --in_vcf [], --submap [1.00]
  Denovo mutation parameters : --tstv_ratio [2.00], --minDQ [5.00]
             Multi-threading : --nthreads [1]
                      Output : --out_prefix []
  • Example 1: using default parameters
bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo
  • Example 2: using --minDQ 7 to output only de novo calls with a DQ of at least 7
bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo --minDQ 7

Input files

  • A ped file with 5 columns [see merlin documentation]. An example ped file is as follows (note that you can mix trios with other nuclear families in the same VCF file):
quartet1 p1  0  0   1
quartet1 p2  0  0   2
quartet1 p3  p1 p2  1
quartet1 p4  p1 p2  1
nuc1 p5  0  0   1
nuc1 p6  0  0   2
nuc1 p7  p5 p6  1
nuc1 p8  p5 p6  1
nuc1 p9  p5 p6  1
trio1 p10  0  0   1
trio1 p11  0  0   2
trio1 p12  p10 p11  1
trio2 p13  0  0   1
trio2 p14  0  0   2
trio2 p15  p13 p14  1
  • A VCF file [VCF specs]. It can contain variant information for more individuals than in the ped file.
    • Note: In the VCF file either PL or GL has to be provided, and only the PL (or GL) field is used in the calling.
  • A map file in the PLINK format. See below for examples of how to generate a map file with common, high-quality variants.
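
Before running the caller, it can be useful to check that a ped file is internally consistent. The sketch below is a hypothetical helper, not part of bayesdenovo. It assumes the five columns are family ID, individual ID, father ID, mother ID, and sex, as in the Merlin pedigree format, and that 0 marks a founder's missing parent. It reports any non-founder whose parent is not listed in the same family.

```python
def parse_ped(text):
    """Parse 5-column ped text into {family: {individual: (father, mother, sex)}}."""
    families = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"line {lineno}: expected 5 columns, got {len(fields)}")
        fam, ind, father, mother, sex = fields
        families.setdefault(fam, {})[ind] = (father, mother, sex)
    return families


def missing_parents(families):
    """Return (family, individual, parent) triples whose parent is not in the family."""
    problems = []
    for fam, members in families.items():
        for ind, (father, mother, _sex) in members.items():
            for parent in (father, mother):
                if parent != "0" and parent not in members:
                    problems.append((fam, ind, parent))
    return problems


if __name__ == "__main__":
    ped = """\
trio1 p10 0   0   1
trio1 p11 0   0   2
trio1 p12 p10 p11 1
"""
    print(missing_parents(parse_ped(ped)))
```

A mistyped family ID or a parent ID copied from another family shows up as an entry in the returned list.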

Examples of generating the map file

  • vcf2map: generate a sparse map file (see Download for files genetic_map_GRCh37_chr1.txt and 1000G.SNV.clean.MAF0.05.tbl.gz)
 vcf2map --vcf input.vcf --ped input.ped --map genetic_map_GRCh37_chr1.txt --include_list 1000G.SNV.clean.MAF0.05.tbl.gz --out_map chr1.map
  • With a user-defined r2 cutoff for LD pruning and a minimum average depth for filtering:
vcf2map --vcf input.vcf --ped input.ped --map genetic_map_GRCh37_chr1.txt --include_list 1000G.SNV.clean.MAF0.05.tbl.gz --max_r2 0.2 --min_avg_dp 2 --out_map chr1.r0.2.map

Output

  • One output file is produced per family; the file name prefix is specified via --out_prefix.

An example output file is as follows:

##fileformat=VCFv4.1 
##ProgramStart=Tue Oct 25 11:20:39 2016
##BayesDeNovo=../bin/bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo 
##Note=VCF file modified by polymutt2. Updated fileds include: QUAL, GT and GQ, and AC. NOTE: modification was applied only to biallelic variants
##FILTER=<ID=LOWDP,Description="Low Depth filter when the average depth per sample is lessn than 1">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Alternative Allele Frequency">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Alternative Allele Count">
##INFO=<ID=FDQ,Number=1,Type=Integer,Description="Family-wise De Novo Mutatoin Quality in log10(BF) format">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DQ,Number=1,Type=Integer,Description="De Novo Mutation Quality in log10(BF) format">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=IV,Number=2,Type=Integer,Description="Best path inheritance vector. Founder alleles are arbiturally labeled (1 to 2*nFounders) and L1|L2 for non-founders indicated L1 and L2 from founders are transmitted">
##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Phred-scaled Genotype Likelihoods">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Rep1_1_1        Rep1_1_2        Rep1_1_3        Rep1_1_4
1       118     .       C       A       100     .       AF=0.333333;AC=7;DP=1629;FDQ=8.667414   GT:DQ:DP:PL     0/0:.:202:1|2:0,100,255 0/0:.:234:3|4:0,100,255 0/1:8.97:203:1|3:100,0,255      0/0:.:206:2|4:0,100,255
1       858     .       C       A       100     .       AF=0.333333;AC=10;DP=1592;FDQ=8.688325  GT:DQ:DP:PL     0/0:.:184:1|2:0,100,255 0/0:.:208:3|4:0,100,255 0/0:.:220:1|3:0,100,255 0/1:8.99:197:2|4:100,0,255
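
The family-wise quality FDQ in the INFO column of the records above can be pulled out with a few lines of Python. This is a minimal sketch with hypothetical function names, assuming whitespace- or tab-separated columns in the standard VCF order (INFO is the eighth column); it reads only the fixed columns and does not interpret the per-sample fields.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'AF=0.33;AC=7' into a dict."""
    out = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True  # flag entries carry no value
    return out


def family_dq(record_line):
    """Return (chrom, pos, FDQ) for one data line; FDQ is None if absent."""
    fields = record_line.split()
    info = parse_info(fields[7])
    fdq = float(info["FDQ"]) if "FDQ" in info else None
    return fields[0], int(fields[1]), fdq


def records_above(lines, min_fdq):
    """Yield (chrom, pos, FDQ) for data lines whose FDQ is at least min_fdq."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, fdq = family_dq(line)
        if fdq is not None and fdq >= min_fdq:
            yield chrom, pos, fdq


if __name__ == "__main__":
    line = "1\t118\t.\tC\tA\t100\t.\tAF=0.333333;AC=7;DP=1629;FDQ=8.667414\tGT:DQ:DP:PL"
    print(family_dq(line))
```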


Filtering

We recommend two filtering strategies: a simple one and a more advanced one. Please see the triodenovo page below for more information:

http://genome.sph.umich.edu/wiki/Triodenovo

Download

The source code of v0.01 can be downloaded here.

Contact

For questions please contact the authors (Bingshan Li: bingshan.li@vanderbilt.edu)