Changes

From Genome Analysis Wiki
3,533 bytes added, 21:44, 25 February 2016
Line 3: Line 3:  
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types ranging from SNPs, indels to copy number variations.  However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.  
 
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types ranging from SNPs, indels to copy number variations.  However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.  
   −
On this wiki page, we describe a a variant classification system for VCF variants.
+
On this wiki page, we describe a variant classification system for VCF entries that is invariant to [http://genome.sph.umich.edu/wiki/Variant_Normalization normalization] except for the case of MNPs.
    
= Definitions =
 
= Definitions =
   −
The definition of a variant is based on the definition of each allele with respect to the reference sequence.  We consider 5 major types loosely defined as follows.
+
The definition of a variant is based on the definition of each allele with respect to the reference sequence.  We consider 5 major types loosely described as follows.
    
;1. SNP
 
;1. SNP
 
: The reference and alternate sequences are both of length 1 and their nucleotides differ.
 
: The reference and alternate sequences are both of length 1 and their nucleotides differ.
 
;2. MNP
 
;2. MNP
: a.The reference and alternate sequences are of the same length and have to be greater than 1 and all nucleotides in the sequences differ from one another.
+
: The reference and alternate sequences have the same length, which must be greater than 1, and all nucleotides in the sequences differ from one another.
 
: OR
 
: OR
: b. all reference and alternate sequences have the same length.
+
: All reference and alternate sequences have the same length (this is applicable to all alleles).
 
;3. INDEL
 
;3. INDEL
: a. The reference and alternate sequence are not the same length.
+
: The reference and alternate sequences are not of the same length.
: AND
  −
: b. The removal of a subsequence of the longer sequence would reduce the longer sequence to the smaller sequence.
   
;4. CLUMPED
 
;4. CLUMPED
a. A clumping of nearby SNPs, MNPs or Indels.
+
:  A clumping of nearby SNPs, MNPs or Indels.
 
;5. SV
 
;5. SV
: The alternate sequence is represented by a angled bracket tag.
+
: The alternate sequence is represented by an angled bracket tag.
 +
 
 +
= Classification Procedure =
 +
 
 +
#Trim each allele with respect to the reference sequence individually
 +
#Inspect length, defined as length of alternate allele minus length of reference allele.
 +
##if length = 0
 +
###if length(ref) = 1 and nucleotides differ, classify as SNP  (count ts and tv too)
 +
###if length(ref) > 1
 +
####if all nucleotides differ, classify as MNP  (count ts and tv too)
 +
####if not all nucleotides differ, classify as CLUMPED  (count ts and tv too)
 +
##if length <math>\ne</math> 0, classify as INDEL
 +
###if shorter allele is of length 1
 +
####if shorter allele does not match either of the end nucleotides of the longer allele, add SNP classification
 +
###if shorter allele length > 1
 +
####compare the shorter allele sequence with the subsequence in the 5' end of the longer allele (count ts and tv too)
 +
#####if all nucleotides differ, add MNP classification
 +
#####if not all nucleotides differ, add CLUMPED classification
 +
#Variant classification is the union of the classifications of each allele present in the variant.
 +
#If all alleles are the same length, add MNP classification.
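
The procedure above can be sketched in Python as follows. This is a minimal illustration, not the wiki authors' implementation. The function names `trim`, `classify_allele` and `classify_variant` are hypothetical. Trimming is assumed to remove shared bases from the right and then from the left while keeping at least one base in each allele. The final same-length rule is assumed to apply only when the alleles are longer than 1 base and none of them is an angle-bracket tag, since otherwise a plain multiallelic SNP would also be labelled MNP. Transition and transversion counting is left out.

```python
def trim(ref, alt):
    """Assumed trimming: drop shared bases from the right, then the left,
    always keeping at least one base in each allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def classify_allele(ref, alt):
    """Return the set of classifications for one alternate allele."""
    if alt.startswith("<"):
        return {"SV"}  # angle-bracket tag, per the SV definition
    ref, alt = trim(ref, alt)
    if len(ref) == len(alt):
        if len(ref) == 1:
            return {"SNP"} if ref != alt else set()
        all_differ = all(r != a for r, a in zip(ref, alt))
        return {"MNP"} if all_differ else {"CLUMPED"}
    types = {"INDEL"}
    shorter, longer = sorted((ref, alt), key=len)
    if len(shorter) == 1:
        # Add SNP only if the single base matches neither end of the longer allele.
        if shorter not in (longer[0], longer[-1]):
            types.add("SNP")
    else:
        # zip() stops at the shorter allele, so this compares it with the
        # 5' end of the longer allele.
        all_differ = all(s != l for s, l in zip(shorter, longer))
        types.add("MNP" if all_differ else "CLUMPED")
    return types


def classify_variant(ref, alts):
    """Union of the allele classifications, plus the same-length MNP rule."""
    types = set()
    for alt in alts:
        types |= classify_allele(ref, alt)
    has_tag = any(alt.startswith("<") for alt in alts)
    # Assumption: the same-length rule needs untrimmed alleles longer than 1 base.
    if not has_tag and len(ref) > 1 and all(len(alt) == len(ref) for alt in alts):
        types.add("MNP")
    return types


print(sorted(classify_variant("ATTTG", ["GTTTC", "ATTTC"])))
```

Under these assumptions the sketch agrees with many of the examples on this page, but not necessarily with all of them; the complex multiallelic cases in particular depend on how trimming is done.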
    
= Examples =
 
= Examples =
   −
We present the following examples to explain the concepts explained earlier.
+
We present the following examples to illustrate the classification described above.
 +
 
 +
== Legend for examples ==
 +
 
 +
    &lt;variant classification&gt;<br>
 +
    REF &lt;reference sequence&gt;     
 +
    ALT &lt;alternative sequence 1&gt;      #&lt;allele classification&gt;, &lt;contribution to transition, transversion, insertion or deletion count&gt; 
 +
    ALT &lt;alternative sequence 2&gt;      #&lt;allele classification&gt;, &lt;contribution to transition, transversion, insertion or deletion count&gt;
    
== Simple Biallelic Examples ==
 
== Simple Biallelic Examples ==
    
     SNP<br>
 
     SNP<br>
     REF  A      #1 ts
+
     REF  A       
     ALT  G    
+
     ALT  G     #SNP, 1 ts 
    
     MNP<br>
 
     MNP<br>
     REF  AT     #2 ts
+
     REF  AT    
     ALT  GC
+
     ALT  GC   #MNP, 2 ts
    
     INDEL<br>
 
     INDEL<br>
     REF  AT      #1 del
+
     REF  AT       
     ALT  A
+
    ALT  A    #INDEL, 1 del
 +
 
 +
    INDEL<br>
 +
    REF  AT     
 +
     ALT  T    #INDEL, 1 del
 +
              #Note that although the padding base differs (A vs T), this is actually a simple indel because it is simply a deletion of an A base. 
 +
              #If you right align this instead of left aligning, then the padding will be T on both the reference and alternative alleles.
 +
              #Simple Indel classification should be invariant whether it is left or right aligned.
    
     SV<br>
 
     SV<br>
 
     REF  A     
 
     REF  A     
     ALT &lt;DEL&gt;
+
     ALT &lt;DEL&gt; #SV
    
== Complex Biallelic Examples ==
 
== Complex Biallelic Examples ==
    
     SNP|INDEL<br>
 
     SNP|INDEL<br>
     REF  AT  
+
     REF  AT          
     ALT  G
+
     ALT  G           #SNP, INDEL, 1 ts
 +
                    #Note that it is ambiguous which pairing should be treated as the SNP, so the transition or transversion contribution is
 +
                    #not well defined.  Treating it as an A/G SNP gives a transition, but treating it as a T/G SNP gives
 +
                    #a transversion.  In such ambiguous cases, we simply use the aligned bases after left alignment to get the transition
 +
                    #and transversion contribution.  Please be aware that this is an ambiguous case; it is better to consider it simply
 +
                    #as a complex variant.
    
     MNP|INDEL<br>
 
     MNP|INDEL<br>
     REF  ATT  
+
     REF  ATT        
     ALT  GC
+
     ALT  GG          #MNP, INDEL, 1 ts, 1 tv, 1 del
    
     MNP|CLUMPED<br>
 
     MNP|CLUMPED<br>
     REF  ATTTT  
+
     REF  ATTTT      
     ALT  GTTTC
+
     ALT  GTTTC       #MNP, CLUMPED, 2 ts
    #since all the alleles are of the sample length, classified as MNP too.
+
                    #since all the alleles are of the same length, classified as MNP too.
    
     INDEL|CLUMPED<br>
 
     INDEL|CLUMPED<br>
 
     REF  ATTTTTTTT     
 
     REF  ATTTTTTTT     
     ALT  GTTTC
+
     ALT  GTTTC       #INDEL, CLUMPED, 2 ts, 1 del
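
The transition and transversion contributions in the examples above can be reproduced with a short Python sketch. It is illustrative only: the function name `count_ts_tv` is hypothetical, the alleles are assumed to be uppercase A/C/G/T strings, and bases are simply paired position by position from the 5' end, which is one reading of "the aligned bases after left alignment". A change within purines (A, G) or within pyrimidines (C, T) is counted as a transition, anything else as a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(ref, alt):
    """Count (transitions, transversions) over the positions shared by
    ref and alt, pairing bases from the 5' end."""
    ts = tv = 0
    for r, a in zip(ref, alt):  # zip() stops at the shorter allele
        if r == a:
            continue
        if {r, a} <= PURINES or {r, a} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv


print(count_ts_tv("AT", "G"))          # SNP|INDEL example, A paired with G
print(count_ts_tv("T", "G"))           # the alternative T/G pairing
print(count_ts_tv("ATT", "GG"))        # MNP|INDEL example
print(count_ts_tv("ATTTTTTTT", "GTTTC"))  # INDEL|CLUMPED example
```

The first two calls show why the SNP|INDEL case is ambiguous: the count depends entirely on which reference base the single alternate base is paired with.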
    
== Simple Multiallelic Examples ==
 
== Simple Multiallelic Examples ==
    
     SNP<br>
 
     SNP<br>
     REF  A  
+
     REF  A      
     ALT  G
+
     ALT  G           #SNP, 1 ts
     ALT  C
+
     ALT  C           #SNP, 1 tv
    
     MNP<br>
 
     MNP<br>
     REF  AG  
+
     REF  AG    
     ALT  GC
+
     ALT  GC         #MNP, 1 ts, 1 tv
     ALT  CT
+
     ALT  CT         #MNP, 2 tv
    
     INDEL<br>
 
     INDEL<br>
 
     REF  ATTT     
 
     REF  ATTT     
     ALT  ATT
+
     ALT  ATT         #INDEL, 1 del
     ALT  AT
+
     ALT  ATTTT      #INDEL, 1 ins
    
== Complex Multiallelic Examples ==
 
== Complex Multiallelic Examples ==
Line 86: Line 122:  
     SNP|MNP<br>
 
     SNP|MNP<br>
 
     REF  AT     
 
     REF  AT     
     ALT  GT     #SNP
+
     ALT  GT         #SNP, 1 ts
     ALT  AC     #SNP
+
     ALT  AC         #SNP, 1 ts
    #since all the alleles are of the sample length, classified as MNP too.
+
                    #since all the alleles are of the same length, classified as MNP too.
   −
     MNP|CLUMPED<br>
+
     SNP|MNP|CLUMPED<br>
 
     REF  ATTTG     
 
     REF  ATTTG     
     ALT  GTTTC     #CLUMPED
+
     ALT  GTTTC       #CLUMPED, 1 ts, 1 tv
     ALT  ATTTC     #CLUMPED
+
     ALT  ATTTC       #SNP, 1 tv, note that we get the SNP after truncating the bases ATTT to reveal a G/C transversion SNP
    #since all the alleles are of the sample length, classified as MNP too.
+
                    #since all the alleles are of the same length, classified as MNP too.
    
     SNP|MNP|INDEL<br>
 
     SNP|MNP|INDEL<br>
 
     REF  GT     
 
     REF  GT     
     ALT  CT   #SNP
+
     ALT  CT         #SNP, 1 tv
     ALT  AG   #MNP
+
     ALT  AG         #MNP, 1 ts, 1 tv
     ALT  GTT   #INDEL
+
     ALT  GTT         #INDEL, 1 ins
    
     SNP|MNP|INDEL|CLUMPED<br>
 
     SNP|MNP|INDEL|CLUMPED<br>
 
     REF  GTTT     
 
     REF  GTTT     
     ALT  CG   #MNP|INDEL
+
     ALT  CG         #MNP, INDEL, 2 tv, 1 del
     ALT  AG   #MNP
+
     ALT  AG         #MNP, INDEL, 1 ts, 1 tv, 1 del
     ALT  GTGTG #SNP|INDEL|CLUMPED
+
     ALT  GTGTG       #SNP, INDEL, CLUMPED, 1 tv, 1 ins
    
== Structural Variants Examples ==
 
== Structural Variants Examples ==
   −
          no. of structural variants        :      41217       
+
    SV<br>
              2 alleles                      :          38079
+
    REF  G   
                  deletion                  :                13135          &lt;DEL&gt;
+
    ALT &lt;INS:ME:LINE1&gt;   #SV
                  insertion                  :                16451          &lt;INS&gt;
+
   
                      mobile element          :                    16253      &lt;INS:ME&gt;
+
    SV<br>
                        ALU                  :                       12513  &lt;INS:ME:ALU&gt;
+
    REF  G   
                        LINE1                :                         2911  &lt;INS:ME:LINE1&gt;
+
    ALT &lt;CN4&gt;             #SV
                        SVA                  :                         829  &lt;INS:ME:SVA&gt;
+
    ALT &lt;CN12&gt;           #SV
                      numt                    :                     198      &lt;INS:MT&gt;
+
 
                  duplication                :                  664          &lt;DUP&gt;
+
= Interesting Variant Types =
                  inversion                  :                  100          &lt;INV&gt;
+
 
                  copy number variation     :                7729          &lt;CN4&gt;
+
    Adjacent Tandem Repeats from lobSTR's tandem repeat finder panel. <br>
              >=3 alleles                    :            3138
+
   
                  copy number variation     :                3138          &lt;CN4&gt;,&lt;CN8&gt;
+
 
 +
    20 9538655 <span style="color:#FF0000">ATTTATTTATTTATTTATTTATTTATTTATTTATTTATT</span><span style="color:#0000FF">CATTCATTCATTCATTCATTCATTC </span> <STR>
 +
 
 +
    This can be induced as
 +
   
 +
    one record considering only the ATTT repeats
 +
    20 9538655 <span style="color:#FF0000">ATTTATTTATTT </span> <span style="color:#FF0000">ATTT </span>
 +
 
 +
    one record with CATT repeats
 +
    20 9538695 <span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span>
 +
 
 +
    one record with a mix of both repeat types
 +
    20 9538695 <span style="color:#FF0000">TATT</span><span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span>
 +
 
 +
= Representation of close by variants =
 +
 
 +
    1:124001690
 +
    TTTCTTT--CAAAAAAAGATAAAAAGGTATTTCATGG
 +
    TTTCTTTAAAAAAAAAAGATAAAAAGGAATTTCATGG
 +
 
 +
    a single complex variant
 +
    CHROM POS        REF  ALT
 +
    1    124001690  C    AAA
 +
 
 +
    an Indel and SNP adjacent to one another
 +
     CHROM POS        REF  ALT
 +
    1    124001689  T    TAA
 +
     1    124001690  C    A
 +
 
 +
Representing it as a single complex variant enforces that both "indel" and "SNP" are always together.
 +
Representing it as 2 separate variants allows both alleles to segregate independently.
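
That the two representations describe the same alternate haplotype can be checked with a small Python sketch. The helper `apply_variants` is hypothetical, and the offsets are 0-based positions within a short window cut from the ungapped reference line of the alignment above, not genomic coordinates; the window and offsets are assumptions made for illustration.

```python
def apply_variants(seq, variants):
    """Apply (offset, ref, alt) substitutions to seq; offsets are 0-based."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for offset, ref, alt in sorted(variants, reverse=True):
        if seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"reference mismatch at offset {offset}")
        seq = seq[:offset] + alt + seq[offset + len(ref):]
    return seq


# Window of the ungapped reference: flank, the C at local offset 7, 7 As, flank.
window = "TTTCTTT" + "C" + "AAAAAAA" + "GATAAAAAGG"

complex_form = [(7, "C", "AAA")]                # a single complex variant
split_form = [(6, "T", "TAA"), (7, "C", "A")]   # an Indel plus an adjacent SNP

haplotype_complex = apply_variants(window, complex_form)
haplotype_split = apply_variants(window, split_form)
print(haplotype_complex == haplotype_split)
```

Both forms yield the same sequence when all alternate alleles are applied together; the difference lies only in whether the two changes can be genotyped separately.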
    
= Output  =
 
= Output  =
Line 141: Line 207:  
           3 alleles                      :            273 (0.89) [537/601]
 
           3 alleles                      :            273 (0.89) [537/601]
 
           4 alleles                      :              3 (1.00) [9/9] <br>
 
           4 alleles                      :              3 (1.00) [9/9] <br>
       no. of Indel                      :    6600770
+
       no. of Indel                      :    6600770   #also referred to as simple Indels
 
           2 alleles                      :        6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
 
           2 alleles                      :        6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
 
           3 alleles                      :          280892 (8.72) [503977/57807]
 
           3 alleles                      :          280892 (8.72) [503977/57807]
Line 195: Line 261:  
           4 alleles                      :              34 (1.16) [109/94]  
 
           4 alleles                      :              34 (1.16) [109/94]  
 
           >=5 alleles                    :              4 (0.76) [13/17]  <br>
 
           >=5 alleles                    :              4 (0.76) [13/17]  <br>
       no. of Complex Substitutions      :    159298 #equivalent to categories not including simple SNPs, Block Substitutions and Simple Indels
+
       no. of Complex Substitutions      :    159298 #equivalent to categories not including SNPs, Block Substitutions and Simple Indels
 
           2 alleles                      :          81508 (0.61) [60312/98113] (0.66) [32479/49029]
 
           2 alleles                      :          81508 (0.61) [60312/98113] (0.66) [32479/49029]
 
           3 alleles                      :          71003 (0.69) [35811/51840] (0.34) [34268/100942]
 
           3 alleles                      :          71003 (0.69) [35811/51840] (0.34) [34268/100942]
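
As a reading aid for the output above, the value in parentheses appears to be the first bracketed count divided by the second, rounded to two decimals (for Indels, insertions over deletions, as the comment in the output states). The Python helper below, with the hypothetical name `ratio_field`, reproduces that formatting under this assumption.

```python
def ratio_field(first, second):
    """Format a pair of counts the way the output appears to: (ratio) [first/second]."""
    return f"({first / second:.2f}) [{first}/{second}]"


print(ratio_field(2937096, 3348765))  # biallelic Indel line
print(ratio_field(537, 601))
```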