Changes

From Genome Analysis Wiki
2,765 bytes added, 21:44, 25 February 2016
Line 3: Line 3:  
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types ranging from SNPs, indels to copy number variations.  However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.  
 
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types ranging from SNPs, indels to copy number variations.  However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.  
   −
On this wiki page, we describe a a variant classification system for VCF variants.
+
On this wiki page, we describe a variant classification system for VCF entries that is invariant to [http://genome.sph.umich.edu/wiki/Variant_Normalization normalization] except for the case of MNPs.
    
= Definitions =
 
= Definitions =
   −
The definition of a variant is based on the definition of each allele with respect to the reference sequence.  We consider 5 major types loosely defined as follows.
+
The definition of a variant is based on the definition of each allele with respect to the reference sequence.  We consider 5 major types loosely described as follows.
    
;1. SNP
 
;1. SNP
 
: The reference and alternate sequences are of length 1 and the base nucleotide is different from one another.
 
: The reference and alternate sequences are of length 1 and the base nucleotide is different from one another.
 
;2. MNP
 
;2. MNP
: a.The reference and alternate sequences are of the same length and have to be greater than 1 and all nucleotides in the sequences differ from one another.
+
: The reference and alternate sequences are of the same length, that length is greater than 1, and the two sequences differ at every position.
 
: OR
 
: OR
: b. all reference and alternate sequences have the same length.
+
: All reference and alternate sequences have the same length (this is applicable to all alleles).
 
;3. INDEL
 
;3. INDEL
: a. The reference and alternate sequence are not the same length.
+
: The reference and alternate sequences are not of the same length.
: AND
  −
: b. The removal of a subsequence of the longer sequence would reduce the longer sequence to the smaller sequence.
   
;4. CLUMPED
 
;4. CLUMPED
a. A clumping of nearby SNPs, MNPs or Indels.
+
:  A clumping of nearby SNPs, MNPs or Indels.
 
;5. SV
 
;5. SV
: The alternate sequence is represented by a angled bracket tag.
+
: The alternate sequence is represented by an angled bracket tag.
 +
 
 +
= Classification Procedure =
 +
 
 +
#Trim each allele with respect to the reference sequence individually
 +
#Inspect length, defined as length of alternate allele minus length of reference allele.
 +
##if length = 0
 +
###if length(ref) = 1 and nucleotides differ, classify as SNP  (count ts and tv too)
 +
###if length(ref) > 1
 +
####if all nucleotides differ, classify as MNP  (count ts and tv too)
 +
####if not all nucleotides differ, classify as CLUMPED  (count ts and tv too)
 +
##if length <math>\ne</math> 0, classify as INDEL
 +
###if shorter allele is of length 1
 +
####if shorter allele does not match either of the end nucleotides of the longer allele, add SNP classification
 +
###if shorter allele length > 1
 +
####compare the shorter allele sequence with the subsequence in the 5' end of the longer allele (count ts and tv too)
 +
#####if all nucleotides differ, add MNP classification
 +
#####if not all nucleotides differ, add CLUMPED classification
 +
#Variant classification is the union of the classifications of each allele present in the variant.
 +
#If all alleles are the same length, add MNP classification.
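The steps above can be sketched in Python as follows. This is a minimal illustration of a literal reading of the procedure, not the implementation used by any particular tool. The function names are hypothetical, and two points are assumptions that the text does not spell out: trimming removes shared trailing bases and then shared leading bases without reducing either allele below one base, and the final "all alleles are the same length" rule is applied to the untrimmed alleles only when that length is greater than 1 (so that a plain A/G SNP is not also labelled MNP). Some worked examples on this page attach labels that these steps, read literally, may not produce.

```python
def trim(ref, alt):
    """Strip shared trailing, then shared leading, bases.

    Assumption: each allele always keeps at least one base.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def classify_allele(ref, alt):
    """Return the set of labels for one alternate allele."""
    if alt.startswith("<") and alt.endswith(">"):
        return {"SV"}  # angled bracket tag
    ref, alt = trim(ref.upper(), alt.upper())
    labels = set()
    if len(ref) == len(alt):
        mismatches = sum(r != a for r, a in zip(ref, alt))
        if len(ref) == 1:
            if mismatches == 1:
                labels.add("SNP")
        elif mismatches == len(ref):
            labels.add("MNP")
        else:
            labels.add("CLUMPED")
    else:
        labels.add("INDEL")
        shorter, longer = sorted((ref, alt), key=len)
        if len(shorter) == 1:
            # Shorter allele matches neither end of the longer allele.
            if shorter not in (longer[0], longer[-1]):
                labels.add("SNP")
        else:
            # Compare against the 5' end of the longer allele.
            prefix = longer[:len(shorter)]
            mismatches = sum(s != p for s, p in zip(shorter, prefix))
            labels.add("MNP" if mismatches == len(shorter) else "CLUMPED")
    return labels


def classify_variant(ref, alts):
    """Union of the per-allele labels, plus the same-length MNP rule."""
    labels = set()
    for alt in alts:
        labels |= classify_allele(ref, alt)
    # Assumption: the same-length rule applies to untrimmed alleles of length > 1.
    if len(ref) > 1 and all(len(alt) == len(ref) for alt in alts):
        labels.add("MNP")
    return labels


print(sorted(classify_variant("AT", ["G"])))                   # ['INDEL', 'SNP']
print(sorted(classify_variant("ATTTG", ["GTTTC", "ATTTC"])))  # ['CLUMPED', 'MNP', 'SNP']
```

Sets are used for the labels so that the per-allele union in the penultimate step is a single `|=` per allele.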
    
= Examples =
 
= Examples =
   −
We present the following examples to explain the concepts explained earlier.
+
We present the following examples to illustrate the classification described above.
    
== Legend for examples ==
 
== Legend for examples ==
Line 32: Line 49:  
     &lt;variant classification&gt;<br>
 
     &lt;variant classification&gt;<br>
 
     REF &lt;reference sequence&gt;       
 
     REF &lt;reference sequence&gt;       
     ALT &lt;alternative sequence&gt;      #&lt;allele classification&gt; , &lt;contribution to transition, transversion, insertion or deletion count&gt;   
+
     ALT &lt;alternative sequence 1&gt;      #&lt;allele classification&gt;, &lt;contribution to transition, transversion, insertion or deletion count&gt;   
     ALT &lt;alternative sequence&gt;      #&lt;another allele classification&gt;
+
     ALT &lt;alternative sequence 2&gt;      #&lt;allele classification&gt;, &lt;contribution to transition, transversion, insertion or deletion count&gt;
    
== Simple Biallelic Examples ==
 
== Simple Biallelic Examples ==
Line 39: Line 56:  
     SNP<br>
 
     SNP<br>
 
     REF  A       
 
     REF  A       
     ALT  G   #SNP, 1 ts   
+
     ALT  G     #SNP, 1 ts   
    
     MNP<br>
 
     MNP<br>
 
     REF  AT     
 
     REF  AT     
     ALT  GC    #MPN, 2 ts
+
     ALT  GC    #MNP, 2 ts
    
     INDEL<br>
 
     INDEL<br>
Line 52: Line 69:  
     REF  AT       
 
     REF  AT       
 
     ALT  T    #INDEL, 1 del
 
     ALT  T    #INDEL, 1 del
 +
              #Note that although the padding base differs - A vs T, this is actually a simple indel because it is simply a deletion of an A base. 
 +
              #If you right align this instead of left aligning, then the padding will be T on both the reference and alternative alleles.
 +
              #Simple Indel classification should be invariant whether it is left or right aligned.
    
     SV<br>
 
     SV<br>
Line 61: Line 81:  
     SNP|INDEL<br>
 
     SNP|INDEL<br>
 
     REF  AT           
 
     REF  AT           
     ALT  G          #SNP,INDEL, 1 ts
+
     ALT  G          #SNP, INDEL, 1 ts
 +
                    #Note that it is ambiguous as to which pairing should be a SNP, as such, the transition or transversion contribution is actually
 +
                    #not defined.  In this case, assuming it is a A/G SNP, we get a transition, but we may also consider this as a T/G SNP which
 +
                    #is a transversion.  In such ambiguous cases, we simply consider the aligned bases after left alignment to get the transition
 +
                    #and transversion contribution.  But please be very clear that this is an ambiguous case.  It is better to consider this simply
 +
                    #as a complex variant.
    
     MNP|INDEL<br>
 
     MNP|INDEL<br>
 
     REF  ATT           
 
     REF  ATT           
     ALT  GG          #MNP,INDEL, ts, 1 tv, 1 del
+
     ALT  GG          #MNP, INDEL, 1 ts, 1 tv, 1 del
    
     MNP|CLUMPED<br>
 
     MNP|CLUMPED<br>
 
     REF  ATTTT         
 
     REF  ATTTT         
     ALT  GTTTC      #MNP,CLUMEPD, 2 ts
+
     ALT  GTTTC      #MNP, CLUMPED, 2 ts
    #since all the alleles are of the sample length, classified as MNP too.
+
                    #since all the alleles are of the same length, classified as MNP too.
    
     INDEL|CLUMPED<br>
 
     INDEL|CLUMPED<br>
Line 80: Line 105:  
     SNP<br>
 
     SNP<br>
 
     REF  A       
 
     REF  A       
     ALT  G       #1 ts
+
     ALT  G           #SNP, 1 ts
     ALT  C       #1 tv
+
     ALT  C           #SNP, 1 tv
    
     MNP<br>
 
     MNP<br>
 
     REF  AG       
 
     REF  AG       
     ALT  GC     #1 ts, 1 tv
+
     ALT  GC         #MNP, 1 ts, 1 tv
     ALT  CT     #2 tv  
+
     ALT  CT         #MNP, 2 tv  
    
     INDEL<br>
 
     INDEL<br>
 
     REF  ATTT     
 
     REF  ATTT     
     ALT  ATT     #1 del
+
     ALT  ATT         #INDEL, 1 del
     ALT  ATTTT   #1 ins
+
     ALT  ATTTT       #INDEL, 1 ins
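The ts and tv contributions quoted in the examples can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch assumes alleles contain only A, C, G and T and are compared position by position from the 5' end, so that for alleles of unequal length only the leading bases of the longer allele are compared.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def count_ts_tv(ref, alt):
    """Count transitions and transversions over position-wise aligned bases.

    zip() stops at the shorter allele, so the shorter allele is compared
    with the 5' end of the longer one.
    """
    ts = tv = 0
    for r, a in zip(ref.upper(), alt.upper()):
        if r == a:
            continue
        pair = {r, a}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1  # purine <-> pyrimidine
    return ts, tv


print(count_ts_tv("AG", "GC"))  # (1, 1)
print(count_ts_tv("AG", "CT"))  # (0, 2)
```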
    
== Complex Multiallelic Examples ==
 
== Complex Multiallelic Examples ==
Line 97: Line 122:  
     SNP|MNP<br>
 
     SNP|MNP<br>
 
     REF  AT     
 
     REF  AT     
     ALT  GT     #SNP, 1 ts
+
     ALT  GT         #SNP, 1 ts
     ALT  AC     #SNP, 1 ts
+
     ALT  AC         #SNP, 1 ts
    #since all the alleles are of the sample length, classified as MNP too.
+
                    #since all the alleles are of the same length, classified as MNP too.
    
     SNP|MNP|CLUMPED<br>
 
     SNP|MNP|CLUMPED<br>
 
     REF  ATTTG     
 
     REF  ATTTG     
     ALT  GTTTC     #CLUMPED, 1 ts, 1 tv
+
     ALT  GTTTC       #CLUMPED, 1 ts, 1 tv
     ALT  ATTTC     #SNP, 1 tv
+
     ALT  ATTTC       #SNP, 1 tv, note that we get the SNP after truncating the bases ATTT to reveal a G/C transversion SNP
    #since all the alleles are of the sample length, classified as MNP too.
+
                    #since all the alleles are of the same length, classified as MNP too.
    
     SNP|MNP|INDEL<br>
 
     SNP|MNP|INDEL<br>
 
     REF  GT     
 
     REF  GT     
     ALT  CT   #SNP, 1 tv
+
     ALT  CT         #SNP, 1 tv
     ALT  AG   #MNP, 2 tv
+
     ALT  AG         #MNP, 1 ts, 1 tv
     ALT  GTT   #INDEL, 1 ins
+
     ALT  GTT         #INDEL, 1 ins
    
     SNP|MNP|INDEL|CLUMPED<br>
 
     SNP|MNP|INDEL|CLUMPED<br>
 
     REF  GTTT     
 
     REF  GTTT     
     ALT  CG   #MNP|INDEL, 2 tv, 1 del
+
     ALT  CG         #MNP, INDEL, 2 tv, 1 del
     ALT  AG   #MNP|INDEL, 1 ts, 1 tv
+
     ALT  AG         #MNP, INDEL, 1 ts, 1 tv, 1 del
     ALT  GTGTG #SNP|INDEL|CLUMPED, 1 tv
+
     ALT  GTGTG       #SNP, INDEL, CLUMPED, 1 tv, 1 ins
 +
 
 +
== Structural Variant Examples ==
 +
 
 +
    SV<br>
 +
    REF  G   
 +
    ALT &lt;INS:ME:LINE1&gt;    #SV
 +
   
 +
    SV<br>
 +
    REF  G   
 +
    ALT &lt;CN4&gt;            #SV
 +
    ALT &lt;CN12&gt;            #SV
 +
 
 +
= Interesting Variant Types =
 +
 
 +
    Adjacent Tandem Repeats from lobSTR's tandem repeat finder panel. <br>
 +
   
 +
 
 +
    20 9538655 <span style="color:#FF0000">ATTTATTTATTTATTTATTTATTTATTTATTTATTTATT</span><span style="color:#0000FF">CATTCATTCATTCATTCATTCATTC </span> <STR>
 +
 
 +
    This can be represented as
 +
   
 +
    one record considering only the ATTT repeats
 +
    20 9538655 <span style="color:#FF0000">ATTTATTTATTT </span> <span style="color:#FF0000">ATTT </span>
 +
 
 +
    one record with CATT repeats
 +
    20 9538695 <span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span>
 +
 
 +
    one record with a mix of both repeat types
 +
    20 9538695 <span style="color:#FF0000">TATT</span><span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span>
 +
 
 +
= Representation of close by variants =
 +
 
 +
    1:124001690
 +
    TTTCTTT--CAAAAAAAGATAAAAAGGTATTTCATGG
 +
    TTTCTTTAAAAAAAAAAGATAAAAAGGAATTTCATGG
   −
== Weird Examples ==
+
    a single complex variant
 +
    CHROM POS        REF  ALT
 +
    1    124001690  C    AAA
   −
== Structured Variants Examples ==
+
    an Indel and SNP adjacent to one another
 +
    CHROM POS        REF  ALT
 +
    1    124001689  T    TAA
 +
    1    124001690  C    A
   −
          no. of structural variants        :      41217       
+
Representing it as a single complex variant enforces that both "indel" and "SNP" are always together.
              2 alleles                     :          38079
+
Representing it as 2 separate variants allows both alleles to segregate independently.
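As a rough check that the two representations describe the same alternate haplotype when both changes occur together, the records can be applied to a short reference window. The following Python sketch is illustrative only: `apply_variants` is a hypothetical helper, the window is the first 16 ungapped reference bases of the alignment above, and the window start of 124001683 is an assumption chosen so that the T at 124001689 and the C at 124001690 line up with the records shown.

```python
def apply_variants(reference, start, variants):
    """Apply non-overlapping (pos, ref, alt) records to a reference window.

    `start` is the 1-based coordinate of reference[0].
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        offset = pos - start
        if reference[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF does not match the reference at {pos}")
        pieces.append(reference[cursor:offset])  # unchanged bases before the record
        pieces.append(alt)
        cursor = offset + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


window = "TTTCTTTCAAAAAAAG"  # assumed to start at 1:124001683
start = 124001683

single_complex = [(124001690, "C", "AAA")]
indel_plus_snp = [(124001689, "T", "TAA"), (124001690, "C", "A")]

hap_complex = apply_variants(window, start, single_complex)
hap_separate = apply_variants(window, start, indel_plus_snp)
print(hap_complex == hap_separate)  # True
```

The sequences agree only when both separate records are carried on the same haplotype; the two-record form can additionally describe haplotypes carrying just one of the changes.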
                  deletion                  :                13135          &lt;DEL&gt;
  −
                  insertion                  :                16451          &lt;INS&gt;
  −
                      mobile element          :                    16253      &lt;INS:ME&gt;
  −
                        ALU                  :                        12513  &lt;INS:ME:ALU&gt;
  −
                        LINE1                :                        2911  &lt;INS:ME:LINE1&gt;
  −
                        SVA                  :                          829  &lt;INS:ME:SVA&gt;
  −
                      numt                    :                      198      &lt;INS:MT&gt;
  −
                  duplication                :                  664          &lt;DUP&gt;
  −
                  inversion                  :                  100          &lt;INV&gt;
  −
                  copy number variation      :                7729          &lt;CN4&gt;
  −
              >=3 alleles                    :            3138
  −
                  copy number variation      :                3138          &lt;CN4&gt;,&lt;CN8&gt;
      
= Output  =
 
= Output  =
Line 154: Line 207:  
           3 alleles                      :            273 (0.89) [537/601]
 
           3 alleles                      :            273 (0.89) [537/601]
 
           4 alleles                      :              3 (1.00) [9/9] <br>
 
           4 alleles                      :              3 (1.00) [9/9] <br>
       no. of Indel                      :    6600770
+
       no. of Indel                      :    6600770   #also referred to as simple Indels
 
           2 alleles                      :        6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
 
           2 alleles                      :        6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
 
           3 alleles                      :          280892 (8.72) [503977/57807]
 
           3 alleles                      :          280892 (8.72) [503977/57807]
Line 208: Line 261:  
           4 alleles                      :              34 (1.16) [109/94]  
 
           4 alleles                      :              34 (1.16) [109/94]  
 
           >=5 alleles                    :              4 (0.76) [13/17]  <br>
 
           >=5 alleles                    :              4 (0.76) [13/17]  <br>
       no. of Complex Substitutions      :    159298 #equivalent to categories not including simple SNPs, Block Substitutions and Simple Indels
+
       no. of Complex Substitutions      :    159298 #equivalent to categories not including SNPs, Block Substitutions and Simple Indels
 
           2 alleles                      :          81508 (0.61) [60312/98113] (0.66) [32479/49029]
 
           2 alleles                      :          81508 (0.61) [60312/98113] (0.66) [32479/49029]
 
           3 alleles                      :          71003 (0.69) [35811/51840] (0.34) [34268/100942]
 
           3 alleles                      :          71003 (0.69) [35811/51840] (0.34) [34268/100942]