Changes

From Genome Analysis Wiki
2,994 bytes added ,  18:12, 15 June 2014
Line 5: Line 5:  
=Tools=
 
=Tools=
   −
You can download [[vt|vt]] and have some working knowledge of PERL to do stuff that vt does not support.
+
This walkthrough requires  [[vt|vt]].
    
=Analyses=
 
=Analyses=
 +
 +
The file generated from the indel calling is in [http://www.1000genomes.org/wiki/analysis/variant-call-format/bcf-binary-vcf-version-2 BCFv2.1], a binary version of the Variant Call Format (VCF).  BCFv2.1 is more efficient to process because the data is already stored on disk in a machine-readable binary format.  It is, however, not necessarily more compact than VCF4.2, especially when the format fields are rich in detail.
    
==File Preparation==
 
==File Preparation==
Line 14: Line 16:     
To convert to BCF format which will work fast with vt:
 
To convert to BCF format which will work fast with vt:
 +
 +
    
   vt view mills.vcf -o mills.bcf
 
   vt view mills.vcf -o mills.bcf
Line 194: Line 198:  
Also remember to index this file and extract the sites.
 
Also remember to index this file and extract the sites.
   −
==Coding regions==
+
==Insertion/Deletion ratios, Coding Regions and Overlap analysis==
 
  −
The proportion of frameshift Indels amongst coding region indels is a potential indicator of quality.
     −
You can obtain it by using the profile_indels analysis.
+
You can obtain measures of insertion/deletion ratios, coding region Indels and overlap with reference data sets by using the profile_indels analysis.
 
   
 
   
   vt profile_indels -g ~/ref/vt/grch37/indel.reference.txt -r ~/ref/vt/grch37/hs37d5.fa mills.normalized.sites.bcf
+
   vt profile_indels -g indel.reference.txt -r ~/ref/vt/grch37/hs37d5.fa mills.normalized.sites.bcf
    
The indel.reference.txt file contains the required reference to perform the overlap analysis.
 
The indel.reference.txt file contains the required reference to perform the overlap analysis.
   −
==STR ==
+
  data set
 
+
    No Indels :      8904 [0.93]  //#variants in your data set [ins/del ratio]
Annotation of STRs is really important. Show example of a deceptive single base pair variant
+
      FS/NFS :      0.66 (67/35)  //Proportion of frameshift Indels among coding region Indels, i.e. FS/(FS+NFS). (#Frameshift Indels/#Nonframeshift Indels)<br> 
 
+
  dbsnp  //A represents the data set you input, B represents dbsnp
 
+
    A-B      2975 [1.06]  //#variants in A only [ins/del ratio]
 
+
    A&B      5929 [0.86]  //#variants in A and B
==Annotation of Indels==
+
    B-A    2059845 [1.51]
 
+
    Precision    66.6%    //A&B/A, the proportion of your variants also present in B; a lower value means a more novel data set.
 
+
    Sensitivity  0.3%    //A&B/B this represents sensitivity somewhat if dbsnp is considered a high quality Indel
 
+
                          //set and the samples are the same in both data sets (which they usually are not; this is still
==Examining Mendelian Errors==
+
                          //nonetheless a useful indicator)<br>
 
+
  mills
 
+
    A-B      5705 [0.81]
==Useful to have call sets from several different callers==
+
    A&B      3199 [1.18]
 
+
    B-A    203819 [0.98]
 
+
    Precision    35.9%
 
+
    Sensitivity  1.5% <br>
==Concordance==
+
  mills.chip
 +
    A-B          0 [-nan]
 +
    A&B      8904 [0.93]
 +
    B-A          0 [-nan]
 +
    Precision    100.0%
 +
    Sensitivity  100.0% <br>
 +
  affy.exome.chip
 +
    A-B      8821 [0.93]
 +
    A&B        83 [0.69]
 +
    B-A      34011 [0.47]
 +
    Precision    0.9%
 +
    Sensitivity  0.2%
   −
Can check concordance of genotypes between callers
+
Ins/Del ratios:  Reference alignment based methods tend to be biased towards the detection of deletions.  The ins/del ratio is therefore a useful measure for discovery Indel sets, as it shows their varying degrees of bias.  It also appears that as coverage increases, the ins/del ratio tends to 1.
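As a minimal sketch of how an ins/del ratio such as the bracketed values in the output can be computed, the Python below counts insertions and deletions from REF/ALT allele lengths. The function name and the example alleles are hypothetical, biallelic Indels are assumed, and this is not taken from vt's own implementation.

```python
import math


def ins_del_ratio(variants):
    """Return #insertions / #deletions for (ref, alt) allele pairs.

    Assumes biallelic Indels; pairs with equal-length alleles are ignored.
    """
    insertions = sum(1 for ref, alt in variants if len(alt) > len(ref))
    deletions = sum(1 for ref, alt in variants if len(alt) < len(ref))
    if deletions == 0:
        # Undefined ratio, comparable to the -nan shown for empty sets.
        return math.nan
    return insertions / deletions


# Hypothetical alleles: two insertions and two deletions.
example = [("A", "AT"), ("AT", "A"), ("ACG", "A"), ("G", "GCC")]
print(f"ins/del ratio: {ins_del_ratio(example):.2f}")
```

A ratio below 1 means the set contains more deletions than insertions, which is the direction of bias described above.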
   −
==Overlapping percentages with known data sets==
+
Coding region analysis:  Coding region Indels may be categorised as frameshift Indels and non-frameshift Indels.  A lower proportion of frameshift Indels may indicate a better quality data set, but this also depends on the individuals sequenced.
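The frameshift proportion can be sketched in Python as follows. The sketch assumes the variants have already been restricted to coding regions and uses the usual rule that an Indel is a frameshift when its length change is not a multiple of 3. The function name and example alleles are hypothetical; this is an illustration, not vt's implementation.

```python
import math


def frameshift_proportion(variants):
    """Return (proportion, n_frameshift, n_nonframeshift) for (ref, alt) pairs.

    Assumes every variant lies in a coding region.
    """
    fs = 0
    nfs = 0
    for ref, alt in variants:
        length_change = abs(len(alt) - len(ref))
        if length_change == 0:
            continue  # not an Indel
        if length_change % 3 != 0:
            fs += 1  # reading frame is shifted
        else:
            nfs += 1  # whole codons inserted or deleted
    total = fs + nfs
    proportion = fs / total if total else math.nan
    return proportion, fs, nfs


# Hypothetical coding Indels: 1 bp insertion, 3 bp deletion, 2 bp deletion.
coding = [("A", "AT"), ("ACGT", "A"), ("ACG", "A")]
proportion, fs, nfs = frameshift_proportion(coding)
print(f"FS/NFS : {proportion:.2f} ({fs}/{nfs})")
```

The counts reported in the example output are consistent with this definition: 67 frameshift and 35 non-frameshift Indels give 67/(67+35), which rounds to 0.66.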
With Mills
  −
with dbSNP
  −
with exome chips
  −
with genotyping chips if available
      +
Overlap analysis:  Overlap with other data sets is an indicator of sensitivity.
   −
==Useful stratifying features==
+
* dbsnp: contains Indels submitted from everywhere; I am not sure exactly what this represents.  But assuming most are real, precision is a useful quantity to estimate from this reference data set.
 +
* Mills:  contains doublehit common Indels from the Mills et al. paper and is a relatively good measure of sensitivity for common variants.  Because not all Indels in this set are expected to be present in your sample, this actually gives you an underestimate of sensitivity.
 +
* Mills chip:  This is a subset of the Mills data set.  It includes genotypes, which are useful for selecting the polymorphic subset of variants present in samples shared with your data set; this can potentially provide a better estimate of sensitivity.  In general it is not very useful unless you happen to be working on 1000 Genomes data or any data set whose individuals are commonly studied.
 +
* Affy Exome Chip:  This contains somewhat rare variants in exonic regions and is useful for exome chip analysis. You should subset your exome data to exome region Indels before comparing against this data set.
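The precision and sensitivity figures reported for each reference data set follow directly from the three overlap counts. A minimal sketch in Python, using the counts from the example output (the function name is hypothetical and this is not vt's own code):

```python
import math


def overlap_stats(a_only, both, b_only):
    """Return (precision, sensitivity) as fractions.

    precision   = A&B / A, where A = (A-B) + (A&B)
    sensitivity = A&B / B, where B = (A&B) + (B-A)
    """
    a_total = a_only + both
    b_total = both + b_only
    precision = both / a_total if a_total else math.nan
    sensitivity = both / b_total if b_total else math.nan
    return precision, sensitivity


# (A-B, A&B, B-A) counts copied from the example output.
reference_sets = {
    "dbsnp": (2975, 5929, 2059845),
    "mills": (5705, 3199, 203819),
    "affy.exome.chip": (8821, 83, 34011),
}

for name, counts in reference_sets.items():
    precision, sensitivity = overlap_stats(*counts)
    print(f"{name}: precision {precision:.1%}, sensitivity {sensitivity:.1%}")
```

For dbsnp this gives 5929/8904 and 5929/2065774, matching the 66.6% and 0.3% shown in the output.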
   −
AF - rare versus common
+
This analysis supports filters too.
Indel length - computed naively versus tract length
  −
Allele frequency bins
  −
Type of Indels - homopolymer types and STR types and isolated
  −
Adjacent SNPs
  −
Adjacent MNPs
  −
Clumping variants
     −
==Other useful evaluations==
+
==to document==
   −
genotype likelihood concordance
+
* Annotation of STRs is really important.  Show example of a deceptive single base pair variant
concordance stratified by indel length or tract length
+
* Mendelian analysis
mendelian concordance by tract length
+
* AFS
 +
* Can check concordance of genotypes between callers - partition
 +
* Type of Indels - homopolymer types and STR types and isolated, Adjacent SNPs, Adjacent MNPs, Clumping variants
 +
* genotype likelihood concordance
 +
* concordance stratified by indel length or tract length
 +
* mendelian concordance by tract length