Bayesdenovo

From Genome Analysis Wiki

Update

v0.01 is available for download

Compilation

  • After downloading the source code, unzip and untar it, cd into bayesdenovo, and type make
  • If you encounter errors related to deprecated syntax, try commenting out the following line in core/Makefile
CXXFLAGS += -Werror -Wno-unused-variable -Wno-unused-result

Introduction

  • The program bayesdenovo implements a Bayesian framework for calling de novo mutations in nuclear families (including trios, quartets, and families with more siblings) from next-generation sequencing data. It infers identity-by-descent (IBD) allele sharing to increase de novo mutation calling accuracy. As a result, the IBD sharing for the called de novo mutations is also available in the output file.
  • It takes as input a standard VCF file with PL or GL fields (storing genotype likelihoods). Commonly used callers, e.g. GATK and samtools, generate VCF files with PL values.
  • It calculates the likelihood of the model with de novo mutations, denoted L1, and the likelihood of Mendelian transmission, denoted L0, and represents the de novo evidence using a Bayes factor BF=L1/L0. As in TrioDeNovo, the de novo quality is reported as DQ=log10(BF)=log10(L1/L0).
  • DQ is the main parameter controlling which calls are written to the output (see --minDQ under Usage). See the example output file below.
  • We recommend a basic and a more advanced filtering strategy (see Filtering).
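The DQ definition above can be illustrated with a minimal Python sketch. This is not the program's implementation: the function names are hypothetical, the input likelihoods are made-up numbers, and the PL conversion simply follows the VCF convention that PL = -10*log10(likelihood).

```python
import math

def pl_to_likelihoods(pl):
    """Convert Phred-scaled genotype likelihoods (PL) to relative likelihoods."""
    return [10 ** (-p / 10.0) for p in pl]

def de_novo_quality(l1, l0):
    """DQ = log10(BF) = log10(L1 / L0), computed as a difference of logs."""
    return math.log10(l1) - math.log10(l0)

# Illustrative likelihoods only (hypothetical values, not from real data).
dq = de_novo_quality(1e-3, 1e-10)
print("DQ =", round(dq, 2))
print("passes default --minDQ of 5.00:", dq >= 5.0)
```

Working with the difference of logs rather than the raw ratio avoids underflow when both likelihoods are very small.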

Usage

Running bayesdenovo without any arguments displays the following message

   *** This build (v.0.1) was compiled on Oct 25 2016, 11:16:22 ***
                      pedfile :                 (-pname)
                      datfile :                 (-dname)
                      mapfile :                 (-mname)
Additional Options
                       Input : --in_vcf [], --submap [1.00]
  Denovo mutation parameters : --tstv_ratio [2.00], --minDQ [5.00]
             Multi-threading : --nthreads [1]
                      Output : --out_prefix []
  • Example 1: using default parameters
bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo
  • Example 2: using --minDQ 7 to output only de novo calls with a DQ of at least 7.
bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo --minDQ 7
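For scripted runs, the command lines above can be assembled programmatically. The following is a sketch only: build_command is a hypothetical helper, it uses just the options listed in the help message, and it assumes the bayesdenovo binary is on your PATH.

```python
import subprocess

def build_command(ped, dat, map_file, in_vcf, out_prefix, min_dq=None, nthreads=None):
    """Assemble a bayesdenovo argument list from the options shown in the help message."""
    cmd = ["bayesdenovo", "-p", ped, "-d", dat, "-m", map_file,
           "--in_vcf", in_vcf, "--out_prefix", out_prefix]
    if min_dq is not None:
        cmd += ["--minDQ", str(min_dq)]
    if nthreads is not None:
        cmd += ["--nthreads", str(nthreads)]
    return cmd

cmd = build_command("sim.vcf.ped", "sim.vcf.dat", "sim.vcf.map",
                    "sim.vcf", "sim.vcf.denovo", min_dq=7)
print(" ".join(cmd))
# Uncomment to actually run the program (assumes bayesdenovo is installed):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names.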

Input files

  • A ped file with 5 columns [see merlin documentation]. An example ped file is as follows (note that you can mix trios with other nuclear families in the same VCF file):
quartet1 p1  0  0   1
quartet1 p2  0  0   2
quartet1 p3  p1 p2  1
quartet1 p4  p1 p2  1
nuc1 p5  0  0   1
nuc1 p6  0  0   2
nuc1 p7  p5 p6  1
nuc1 p8  p5 p6  1
nuc1 p9  p5 p6  1
trio1 p10  0  0   1
trio1 p11  0  0   2
trio1 p12  p10 p11  1
trio2 p13  0  0   1
trio2 p14  0  0   2
trio2 p15  p13 p14  1
  • A VCF file [VCF specs]. It can contain variant information for more individuals than in the ped file.
    • Note: In the VCF file either PL or GL has to be provided, and only the PL (or GL) field is used in the calling.
  • A map file in the PLINK format. See below for examples of how to generate a high-quality map file.
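Because each non-founder must name parents from its own family, a quick sanity check of the ped file before running the caller can save time. The sketch below is a hypothetical helper, not part of bayesdenovo; it assumes the Merlin column order (family, individual, father, mother, sex) and that founders use 0 for both parents.

```python
from collections import defaultdict

def parse_ped(lines):
    """Parse 5-column ped records: family, individual, father, mother, sex."""
    families = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        fam, ind, father, mother, sex = fields
        families[fam][ind] = (father, mother, sex)
    return families

def check_parents(families):
    """Return (family, individual, parent) triples whose parent is missing from the family."""
    problems = []
    for fam, members in families.items():
        for ind, (father, mother, _sex) in members.items():
            for parent in (father, mother):
                if parent != "0" and parent not in members:
                    problems.append((fam, ind, parent))
    return problems

ped_text = """\
trio1 p10 0   0   1
trio1 p11 0   0   2
trio1 p12 p10 p11 1
"""
print(check_parents(parse_ped(ped_text.splitlines())))  # empty list: all parents found
```

A misspelled family ID or a parent ID copied from another family shows up as an entry in the returned list.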

Examples of generating the map file

  • vcf2map: generate a sparse map file (see Download for files genetic_map_GRCh37_chr1.txt and 1000G.SNV.clean.MAF0.05.tbl.gz)
 vcf2map --vcf input.vcf --ped input.ped --map genetic_map_GRCh37_chr1.txt --include_list 1000G.SNV.clean.MAF0.05.tbl.gz --out_map chr1.map
  • User-defined r2 cutoff for LD pruning and minimum average depth for filtering
vcf2map --vcf input.vcf --ped input.ped --map genetic_map_GRCh37_chr1.txt --include_list 1000G.SNV.clean.MAF0.05.tbl.gz --max_r2 0.2 --min_avg_dp 2 --out_map chr1.r0.2.map

Output

  • The output is one file per family; the prefix of the file names is specified via --out_prefix

An example output file is as follows

##fileformat=VCFv4.1 
##ProgramStart=Tue Oct 25 11:20:39 2016
##BayesDeNovo=../bin/bayesdenovo -p sim.vcf.ped -d sim.vcf.dat -m sim.vcf.map --in_vcf sim.vcf --out_prefix sim.vcf.denovo 
##Note=VCF file modified by polymutt2. Updated fileds include: QUAL, GT and GQ, and AC. NOTE: modification was applied only to biallelic variants
##FILTER=<ID=LOWDP,Description="Low Depth filter when the average depth per sample is lessn than 1">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Alternative Allele Frequency">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Alternative Allele Count">
##INFO=<ID=FDQ,Number=1,Type=Integer,Description="Family-wise De Novo Mutatoin Quality in log10(BF) format">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DQ,Number=1,Type=Integer,Description="De Novo Mutation Quality in log10(BF) format">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=IV,Number=2,Type=Integer,Description="Best path inheritance vector. Founder alleles are arbiturally labeled (1 to 2*nFounders) and L1|L2 for non-founders indicated L1 and L2 from founders are transmitted">
##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Phred-scaled Genotype Likelihoods">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Rep1_1_1        Rep1_1_2        Rep1_1_3        Rep1_1_4
1       118     .       C       A       100     .       AF=0.333333;AC=7;DP=1629;FDQ=8.667414   GT:DQ:DP:PL     0/0:.:202:1|2:0,100,255 0/0:.:234:3|4:0,100,255 0/1:8.97:203:1|3:100,0,255      0/0:.:206:2|4:0,100,255
1       858     .       C       A       100     .       AF=0.333333;AC=10;DP=1592;FDQ=8.688325  GT:DQ:DP:PL     0/0:.:184:1|2:0,100,255 0/0:.:208:3|4:0,100,255 0/0:.:220:1|3:0,100,255 0/1:8.99:197:2|4:100,0,255
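To pull the de novo calls out of such a file, a small parser is enough. The following is a sketch under assumptions: the helper names are hypothetical, and it relies only on the FDQ key in INFO and on the position of DQ in the FORMAT column. In the example above the FORMAT column lists four keys while each sample carries five values (the inheritance vector appears before PL), but DQ is the second field in both, so the sketch does not touch the later fields.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'AF=0.33;AC=7' into a dict of strings."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def de_novo_calls(line, sample_names):
    """Return (FDQ, {sample: DQ}) for samples whose DQ field is not '.'."""
    cols = line.split()
    info = parse_info(cols[7])
    dq_index = cols[8].split(":").index("DQ")
    calls = {}
    for name, sample in zip(sample_names, cols[9:]):
        dq = sample.split(":")[dq_index]
        if dq != ".":
            calls[name] = float(dq)
    return float(info["FDQ"]), calls

samples = ["Rep1_1_1", "Rep1_1_2", "Rep1_1_3", "Rep1_1_4"]
record = ("1 118 . C A 100 . AF=0.333333;AC=7;DP=1629;FDQ=8.667414 GT:DQ:DP:PL "
          "0/0:.:202:1|2:0,100,255 0/0:.:234:3|4:0,100,255 "
          "0/1:8.97:203:1|3:100,0,255 0/0:.:206:2|4:0,100,255")
print(de_novo_calls(record, samples))
```

For the first record of the example, this reports the family-wise FDQ together with the one child (Rep1_1_3) that carries a DQ value.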


Filtering

We recommend two filtering strategies. The first is a simple filter and the second is more advanced. Please see the triodenovo page below for more information:

http://genome.sph.umich.edu/wiki/Triodenovo

3. Further thoughts on filtering SNVs without BAM files (step 2 requires BAM files). There is no consensus on filtering, so this can be very flexible.

  • If you have a multi-sample call VCF it may be helpful to select those mutation candidates that appear only once in your VCF (AC=1 for example). This can be the top tier to consider. Relaxing AC to 2 or 3 can recover more real mutations but also increase false positives.
  • If filtering out all known sites is too stringent, it may be helpful to select candidates that have low (e.g. <0.002) 1000G or ESP allele frequencies. Some mutations can occur at known variant sites, but mutations with high population frequencies may not be of great interest, even if they are indeed real.
  • Candidates in segmental duplications, low complexity regions or other copy number regions may be flagged for further analysis.
  • Candidates for which the parents are not hom-ref, or for which the offspring is a double mutant, are more likely to be artifacts, so interpreting them may require additional QC if they appear interesting to the investigators.
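The first two suggestions above (allele count within the call set, and population allele frequency) can be sketched as a simple tiering function. Everything here is hypothetical: the record layout, the key names ac and pop_af, and the example values are illustrative, and the thresholds are just the ones mentioned in the text.

```python
def tier_candidates(candidates, max_ac=1, max_pop_af=0.002):
    """Keep candidates with call-set AC <= max_ac and population AF below max_pop_af."""
    kept = []
    for c in candidates:
        pop_af = c.get("pop_af")  # None: site not found in the population panel
        if c["ac"] <= max_ac and (pop_af is None or pop_af < max_pop_af):
            kept.append(c)
    return kept

# Made-up candidates for illustration only.
candidates = [
    {"site": "1:118", "ac": 1, "pop_af": None},
    {"site": "1:858", "ac": 3, "pop_af": 0.0005},
    {"site": "2:501", "ac": 1, "pop_af": 0.04},
]
print([c["site"] for c in tier_candidates(candidates)])
print([c["site"] for c in tier_candidates(candidates, max_ac=3)])
```

Raising max_ac to 2 or 3 recovers more candidates at the cost of more false positives, as noted above.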

Download

The source code of v0.01 can be downloaded here.

Contact

For questions please contact the authors (Bingshan Li: bingshan.li@vanderbilt.edu)