Triodenovo

From Genome Analysis Wiki
Revision as of 14:27, 20 June 2016 by Bingshan (talk | contribs)
Jump to navigationJump to search

Note

Recently VCF files tend to have records with different FORMAT. To avoid the problem please use --mixed_vcf_records on the command line when running triodenovo.

Update

v0.04 is available for download

Compilation

  • After downloading the source code, unzip and untar it, and cd triodenovo, and then type Make
  • If you encountered errors related to deprecated usage of some syntax please try to comment out the following in the core/Makefile
CXXFLAGS += -Werror -Wno-unused-variable -Wno-unused-result

Introduction

  • The program triodenovo implemented a Bayesian framework for calling de novo mutations in trios for next-generation sequencing data.
  • It takes as input a standard VCF file with PL or GL fields (storing genotype likelihoods). Commonly used callers, e.g. GATK and samtools, generate VCF files with PL values.
  • It calculates the likelihood of the model with de novo mutations, denoted as L1, and the likelihood of Mendelian transmission, denoted as L0, and represent the de novo evidence using a Bayesian factor BF=L1/L0. In TrioDeNovo the de novo quality is represented as DQ=log10(BF) = log10(L1/L0).
  • DQ is the major parameter to control the output, along with others. See the example output file below
  • We recommend some basic and also a more advanced filtering

Usage

A command without any input will invoke triodenovo and display the following message

 The following parameters are in effect:
                   Input files : --ped [], --in_vcf []
                  Output files : --out_vcf []
          Denovo mutation rate : --mu [1.0e-07]
          Scaled mutation rate : --theta [1.0e-03], --indel_theta [1.0e-04]
  Prior of de novo ts/tv ratio : --denovo_tstv [2.00]
           Non-autosome labels : --chrX [X]
                       Filters : --minDQ [5.00], --minTotalDepth,
                                 --maxTotalDepth, --minDepth [5], --maxDepth,
                                 --mixed_vcf_records
  • Example 1: using default parameters
triodenovo --ped trio.ped --in_vcf input.vcf --out denovo.vcf
  • Example 2: using --minDQ 7 to output de novo calls which are a minimum DQ of 7.
triodenovo --ped trio.ped --in_vcf input.vcf --out denovo.vcf --minDQ 7
  • Example 3: using --minDepth 10 to output de novo calls for which all three individuals (father, mother and child) have depth >=10.
triodenovo --ped trio.ped --in_vcf input.vcf --out denovo.vcf --minDepth 10
  • Example 4: using --minTotalDepth 1000 to filter sites that have a total reads less than 1000. This option will filter the entire site.
triodenovo --ped trio.ped --in_vcf input.vcf --out denovo.vcf --minTotalDepth 1000
  • Example 5: using --mixed_vcf_records when not all records have the same FORMAT. If NOT all FORMAT columns are the same, it will report error if this option is not provided.
triodenovo --ped trio.ped --in_vcf input.vcf --out denovo.vcf --mixed_vcf_records

Input files

trio1 p1  0  0   1
trio1 p2  0  0   2
trio1 p3  p1 p2  1
  • A VCF file [VCF specs]. It can contain variant information for more individuals than in the ped file.
    • Note: In the VCF file either PL or GL has to be provided, and only the PL (or GL) field is used in the calling.

Output

  • The output file is specified via --out_vcf

An example use is as follows

##fileformat=VCFv4.1
##triodenovo=../src/triodenovo --ped trio.denovo.ped --in_vcf trio.denovo.vcf --out_vcf trio.denovo.vcf.out 
##Note=VCF file modified by polymutt. Updated fileds include: QUAL, GT and GQ, AF and AC. NOTE: modification was applied only to biallelic variants
##FILTER=<ID=LOWDP,Description="Low Depth filter when the average depth per sample is lessn than 1">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Alternative Allele Frequency">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Alternative Allele Count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DQ,Number=1,Type=Denovo Quality,Description="Denovo Quality: log10(BF) where BF is Bayes Factor calculated as L(M1)/L(M0) for M1 and M0 representing models with and without de novo mutations">
##FORMAT=<ID=DGQ,Number=1,Type=Integer,Description="Denovo Genotype Quality: -10*log10(post) where post is the posterior probability of the called trio genotypes among all trios with de novo mutations">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Phred-scaled Genotype Likelihoods">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Log10 Genotype Likelihoods">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  3       2       1
1       10      .       C       T       20.71   .       DP=47   GT:DQ:DGQ:DP:PL C/C:7.52:100:15:0,44,255        C/C:7.52:100:16:0,44,255        C/T:7.52:100:16:100,0,199
1       13      .       C       T       117.18  .       DP=34   GT:DQ:DGQ:DP:PL C/C:5.67:100:8:0,23,197 C/C:5.67:100:11:0,32,255        C/T:5.67:100:15:178,0,103
1       17      .       C       A       111.15  .       DP=36   GT:DQ:DGQ:DP:PL C/C:5.37:100:12:0,35,255        C/C:5.37:100:10:0,26,222        A/C:5.37:100:14:181,0,82
1       82      .       C       G       118.40  .       DP=43   GT:DQ:DGQ:DP:PL C/C:6.45:100:14:0,41,255        C/C:6.45:100:13:0,38,255        C/G:6.45:100:16:199,0,100
1       91      .       C       T       28.69   .       DP=36   GT:DQ:DGQ:DP:PL C/C:6.02:100:9:0,26,222 C/C:6.02:100:17:0,50,255        C/T:6.02:100:10:93,0,93

Filtering

We recommend two filtering strategies. The first is a simple filtering and the second one is more advance

1. Basic filtering for SNVs. The following filter will retain sites of single nucleotides with only two alleles, QUAL>=30, and mutations in which parents are homozygous references and child is heterozygote with the heterozygote PL being zero, and the minimum PL of the other two genotypes in offering is 30 (i.e. the genotype likelihood, defined as P(R|G) in which R represents the aligned bases and G is the underlying genotype, of the called het mutation is >1000 than the genotype likelihood of the other two genotypes). These filtering parameters can be tuned as needed in the following command.

less trio.vcf.out | egrep "DQ|#" | perl -lane 'print if /#/; next if length($F[3])>1 || length($F[4])>1 || $F[4]=~/,/; next if $F[5]<30; $F[9] =~ /([A-Z])\/([A-Z])/; next if $1 ne $2; next if $F[10] !~ /$1\/$1/; $F[11]=~/([A-Z])\/([A-Z])/; next if $1 eq $2; $F[11] =~ /(\d+),(\d+),(\d+)/; next if $2 != 0 || $1<30 || $3<30; print' | less

2. Advanced filtering using a machine-learning approach (i.e. DNMFilter in the following webpage)

http://humangenome.duke.edu/software

3. Further thoughts about filtering for SNVs without bam files (step 2 requires bam files). There is no consensus on filtering so this can be very flexible.

  • If you have a multi-sample call VCF it may be helpful to select those mutation candidates that appear only once in your VCF (AC=1 for example). This can be the top tier to consider. Relaxing AC to 2 or 3 can recover more real mutations but also increase false positives.
  • If it is too stringent to filter out known sites, it may be helpful to select candidates that have low (e.g. <0.002)1000G or ESP allele frequencies. Some mutations can occur on know variant sites but mutations with high population frequencies may not be of great interest, if indeed they are real.
  • Candidates in segmental duplications, low complexity regions or other copy number regions may be flagged for further analysis.
  • Candidates for which parents are not hom-ref or offspring is a double mutant are more likely to be due to artifacts so the interpretation of these candidates may require additional QC if they appear to be interesting to the investigators.

Download

Source code of v0.04 download here.

Contact

For questions please contact the authors (Bingshan Li: bingshan.li@vanderbilt.edu)