Changes

From Genome Analysis Wiki
828 bytes removed ,  10:22, 26 October 2016
Line 8: Line 8:     
== Introduction ==
 
== Introduction ==
* The program '''bayesdenovo''' implements a Bayesian framework for calling '''''de novo''''' mutations in '''nuclear families''' (including trios, quartets, and families with more siblings) from next-generation sequencing data.
+
* The program '''bayesdenovo''' implements a Bayesian framework for calling '''''de novo''''' mutations in '''nuclear families''' (including trios, quartets, and families with more siblings) from next-generation sequencing data. It infers identity-by-descent (IBD) allele sharing to increase the accuracy of '''''de novo''''' mutation calling. As a result, the IBD sharing for the called '''''de novo''''' mutations is also available in the output file.
 
* It takes as input a standard VCF file with PL or GL fields (storing genotype likelihoods). Commonly used callers, e.g. GATK and samtools, generate VCF files with PL values.
 
* It takes as input a standard VCF file with PL or GL fields (storing genotype likelihoods). Commonly used callers, e.g. GATK and samtools, generate VCF files with PL values.
 
* It calculates the likelihood of the model with '''''de novo''''' mutations, denoted L1, and the likelihood of Mendelian transmission, denoted L0, and represents the '''''de novo''''' evidence as a Bayes factor BF=L1/L0. In TrioDeNovo, the '''''de novo''''' quality is represented as DQ=log10(BF)=log10(L1/L0).
 
* It calculates the likelihood of the model with '''''de novo''''' mutations, denoted L1, and the likelihood of Mendelian transmission, denoted L0, and represents the '''''de novo''''' evidence as a Bayes factor BF=L1/L0. In TrioDeNovo, the '''''de novo''''' quality is represented as DQ=log10(BF)=log10(L1/L0).
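As a rough illustration of the DQ definition above, the sketch below computes DQ from two log10 likelihoods. The function names and the example likelihood values are hypothetical and are not taken from the bayesdenovo source; working in log10 space is an assumption made here to avoid underflow with very small likelihoods.

```python
def de_novo_quality(log10_l1, log10_l0):
    """Return DQ = log10(BF) = log10(L1) - log10(L0)."""
    return log10_l1 - log10_l0


def bayes_factor(dq):
    """Recover BF = L1 / L0 from a DQ value."""
    return 10.0 ** dq


# Hypothetical inputs: L1 = 1e-12 (de novo model), L0 = 1e-19 (Mendelian model).
dq = de_novo_quality(-12.0, -19.0)
print(dq)  # 7.0, i.e. the de novo model is 10^7 times more likely
```

A positive DQ favours the ''de novo'' model and a negative DQ favours Mendelian transmission.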
Line 35: Line 35:     
== Input files ==
 
== Input files ==
* A ped file with 5 columns [[http://www.sph.umich.edu/csg/abecasis/merlin/tour/ see merlin documentation]]. An example ped file is as follows
+
* A ped file with 5 columns [[http://www.sph.umich.edu/csg/abecasis/merlin/tour/ see merlin documentation]]. An example ped file is as follows (note that you can mix trios with other nuclear families in the same VCF file):
 
  quartet1 p1  0  0  1
 
  quartet1 p1  0  0  1
 
  quartet1 p2  0  0  2
 
  quartet1 p2  0  0  2
 
  quartet1 p3  p1 p2  1
 
  quartet1 p3  p1 p2  1
 +
quartet1 p4  p1 p2  1
 +
nuc1 p5  0  0  1
 +
nuc1 p6  0  0  2
 +
nuc1 p7  p5 p6  1
 +
nuc1 p8  p5 p6  1
 +
nuc1 p9  p5 p6  1
 +
trio1 p10  0  0  1
 +
trio1 p11  0  0  2
 +
trio1 p12  p10 p11  1
 +
trio2 p13  0  0  1
 +
trio2 p14  0  0  2
 +
trio2 p15  p13 p14  1
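The 5-column ped layout above (family ID, individual ID, father ID, mother ID, sex) can be checked before running the caller. The following is a minimal sketch, not part of bayesdenovo: the function names and the sample records are hypothetical, and it assumes that founders are coded with "0" for both parents and that each child's parents must appear in the same family.

```python
from collections import defaultdict

# Hypothetical ped content mixing a quartet and a trio.
PED_TEXT = """\
quartet1 p1 0 0 1
quartet1 p2 0 0 2
quartet1 p3 p1 p2 1
quartet1 p4 p1 p2 1
trio1 p10 0 0 1
trio1 p11 0 0 2
trio1 p12 p10 p11 1
"""


def parse_ped(text):
    """Group 5-column ped records by family ID."""
    families = defaultdict(dict)
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(fields)}")
        family, individual, father, mother, sex = fields
        families[family][individual] = (father, mother, sex)
    return dict(families)


def check_parents(families):
    """List (family, child, parent) triples whose parent is absent from the family."""
    problems = []
    for family, members in families.items():
        for individual, (father, mother, _sex) in members.items():
            for parent in (father, mother):
                if parent != "0" and parent not in members:
                    problems.append((family, individual, parent))
    return problems


families = parse_ped(PED_TEXT)
print(sorted(families))         # family IDs found in the ped text
print(check_parents(families))  # an empty list means every parent was found
```

A check like this catches a misspelled family ID or a parent ID copied from another family, both of which would otherwise split or break a pedigree.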
    
* A VCF file [[http://www.1000genomes.org/node/101 VCF specs]]. It can contain variant information for more individuals than in the ped file.
 
* A VCF file [[http://www.1000genomes.org/node/101 VCF specs]]. It can contain variant information for more individuals than in the ped file.
 
** Note: In the VCF file either PL or GL has to be provided, and only the PL (or GL) field is used in the calling.
 
** Note: In the VCF file either PL or GL has to be provided, and only the PL (or GL) field is used in the calling.
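For readers unfamiliar with the two fields: in the VCF specification, GL holds log10-scaled genotype likelihoods and PL holds the same quantities phred-scaled, rounded to integers and normalized so that the most likely genotype has PL 0. The sketch below converts between them; the function names are hypothetical and this is not the code used inside bayesdenovo.

```python
def pl_to_relative_likelihoods(pl_values):
    """Convert phred-scaled PL values to likelihoods relative to the best genotype."""
    return [10.0 ** (-pl / 10.0) for pl in pl_values]


def gl_to_pl(gl_values):
    """Convert log10 likelihoods (GL) to normalized integer PL values."""
    raw = [-10.0 * gl for gl in gl_values]
    best = min(raw)  # phred score of the most likely genotype
    return [int(round(value - best)) for value in raw]


# Hypothetical biallelic site, genotype order: hom-ref, het, hom-alt.
print(gl_to_pl([-5.0, -2.0, -6.5]))             # [30, 0, 45]
print(pl_to_relative_likelihoods([30, 0, 45]))  # het is the most likely genotype
```

A PL of 30 therefore corresponds to a genotype that is 1000 times less likely than the best one.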
 +
 +
* A map file in the PLINK format. See below for examples of how to generate a map file from common, high-quality variants.
 +
 +
== Examples of generating the map file ==
 +
* vcf2map: generate a sparse map file (see [[#Download|Download]] for files genetic_map_GRCh37_chr1.txt  and 1000G.SNV.clean.MAF0.05.tbl.gz)
 +
  vcf2map --vcf input.vcf --ped input.ped --map genetic_map_GRCh37_chr1.txt --include_list 1000G.SNV.clean.MAF0.05.tbl.gz --out_map chr1.map
 +
 +
* User-defined r2 cutoff for LD pruning and minimum average depth for filtering
 +
vcf2map --vcf input.vcf --ped input.ped --map genetic_map_GRCh37_chr1.txt --include_list 1000G.SNV.clean.MAF0.05.tbl.gz --max_r2 0.2 --min_avg_dp 2 --out_map chr1.r0.2.map
    
== Output ==
 
== Output ==
Line 68: Line 89:     
== Filtering ==
 
== Filtering ==
We recommend two filtering strategies. The first is a simple filtering and the second one is more advance
+
We recommend two filtering strategies. The first is a simple filtering and the second one is more advanced. Please see the triodenovo page below for more information:
 
  −
1. Basic filtering for SNVs. The following filter retains single-nucleotide sites with only two alleles and QUAL>=30, at which both parents are homozygous reference and the child is heterozygous, with a heterozygote PL of zero and a minimum PL of 30 for the other two genotypes in the offspring (i.e. the genotype likelihood of the called het mutation, defined as P(R|G) where R represents the aligned bases and G the underlying genotype, is at least 1000 times that of either of the other two genotypes). These filtering parameters can be tuned as needed in the following command.
  −
 
  −
less trio.vcf.out | egrep "DQ|#" | perl -lane 'print if /#/; next if length($F[3])>1 || length($F[4])>1 || $F[4]=~/,/; next if $F[5]<30; $F[9] =~ /([A-Z])\/([A-Z])/; next if $1 ne $2; next if $F[10] !~ /$1\/$1/; $F[11]=~/([A-Z])\/([A-Z])/; next if $1 eq $2; $F[11] =~ /(\d+),(\d+),(\d+)/; next if $2 != 0 || $1<30 || $3<30; print' | less
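The same criteria can be written as a small predicate, which may be easier to tune than the one-liner. This is a sketch under stated assumptions rather than a replacement for the command above: the function name, its arguments and the thresholds' parameter names are hypothetical, genotypes are assumed to be given as pairs of allele letters, and the child PL values are assumed to be ordered hom-ref, het, hom-alt. Parsing the output file is left out.

```python
def passes_basic_snv_filter(ref, alt, qual, father_gt, mother_gt, child_gt,
                            child_pl, min_qual=30.0, min_pl=30):
    """Apply the basic SNV filter described in the text to one candidate site."""
    # Single-nucleotide, biallelic site (a multi-allelic ALT such as "G,T" is longer than 1).
    if len(ref) != 1 or len(alt) != 1:
        return False
    if qual < min_qual:
        return False
    # Both parents must be homozygous for the reference allele.
    hom_ref = (ref, ref)
    if tuple(father_gt) != hom_ref or tuple(mother_gt) != hom_ref:
        return False
    # The child must carry one reference and one alternate allele.
    if sorted(child_gt) != sorted((ref, alt)):
        return False
    # Het PL must be 0 and both other genotypes must have PL >= min_pl.
    pl_hom_ref, pl_het, pl_hom_alt = child_pl
    return pl_het == 0 and pl_hom_ref >= min_pl and pl_hom_alt >= min_pl


# Hypothetical candidate: A>G, QUAL 50, parents A/A, child A/G with PL 40,0,60.
print(passes_basic_snv_filter("A", "G", 50.0, ("A", "A"), ("A", "A"),
                              ("A", "G"), (40, 0, 60)))  # True
```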
  −
 
  −
2. Advanced filtering using a machine-learning approach (i.e. DNMFilter in the following webpage)
  −
 
  −
http://humangenome.duke.edu/software
     −
3. Further thoughts about filtering for SNVs without bam files (step 2 requires bam files). There is no consensus on filtering so this can be very flexible.
+
http://genome.sph.umich.edu/wiki/Triodenovo
* If you have a multi-sample call VCF it may be helpful to select those mutation candidates that appear only once in your VCF (AC=1 for example). This can be the top tier to consider. Relaxing AC to 2 or 3 can recover more real mutations but also increase false positives.
  −
* If it is too stringent to filter out known sites, it may be helpful to select candidates that have low (e.g. <0.002) 1000G or ESP allele frequencies. Some mutations can occur at known variant sites, but mutations with high population frequencies may not be of great interest, even if they are indeed real.
  −
* Candidates in segmental duplications, low complexity regions or other copy number regions may be flagged for further analysis.
  −
* Candidates for which parents are not hom-ref or offspring is a double mutant are more likely to be due to artifacts so the interpretation of these candidates may require additional QC if they appear to be interesting to the investigators.
      
== Download ==
 
== Download ==
Source code of v0.05 [[Media:triodenovo.0.05.tar.gz | download]] here.
+
Source code of v0.01 [[Media:bayesdenovo.0.01.tar.gz | download]] here.
    
== Contact ==
 
== Contact ==
 
For questions please contact the authors (Bingshan Li:  [mailto:bingshan.li@vanderbilt.edu bingshan.li@vanderbilt.edu])
 
For questions please contact the authors (Bingshan Li:  [mailto:bingshan.li@vanderbilt.edu bingshan.li@vanderbilt.edu])