Variant classification

From Genome Analysis Wiki

Latest revision as of 21:44, 25 February 2016

Introduction

The Variant Call Format (VCF) is a flexible file format specification that can represent many different variant types, ranging from SNPs and indels to copy number variations. However, the VCF representation of a variant is not unique when its reference and alternate sequences are expressed explicitly.

On this wiki page, we describe a variant classification system for VCF entries that is invariant to normalization, except in the case of MNPs.

Definitions

The definition of a variant is based on the definition of each allele with respect to the reference sequence. We consider 5 major types, loosely described as follows.

1. SNP
The reference and alternate sequences are both of length 1 and the two bases differ.
2. MNP
The reference and alternate sequences are of the same length, that length is greater than 1, and every base differs between the two sequences.
OR
The reference and all alternate sequences have the same length (this condition is evaluated across all alleles of the variant).
3. INDEL
The reference and alternate sequences are not of the same length.
4. CLUMPED
A clumping of nearby SNPs, MNPs or Indels.
5. SV
The alternate sequence is represented by an angled bracket tag, for example <DEL>.

Classification Procedure

  1. Trim each allele with respect to the reference sequence individually
  2. Inspect length, defined as length of alternate allele minus length of reference allele.
    1. if length = 0
      1. if length(ref) = 1 and nucleotides differ, classify as SNP (count ts and tv too)
      2. if length(ref) > 1
        1. if all nucleotides differ, classify as MNP (count ts and tv too)
        2. if not all nucleotides differ, classify as CLUMPED (count ts and tv too)
    2. if length ≠ 0, classify as INDEL
      1. if shorter allele is of length 1
        1. if shorter allele does not match either of the end nucleotides of the longer allele, add SNP classification
      2. if shorter allele length > 1
        1. compare the shorter allele sequence with the subsequence in the 5' end of the longer allele (count ts and tv too)
          1. if all nucleotides differ, add MNP classification
          2. if not all nucleotides differ, add CLUMPED classification
  3. Variant classification is the union of the classifications of each allele present in the variant.
  4. If all alleles are the same length, add MNP classification.
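
The steps above can be sketched in Python roughly as follows. This is only an illustration of the written procedure, not the vt implementation. The function names (trim, classify_allele, classify_variant) are hypothetical. Trimming is assumed to strip shared trailing bases before shared leading bases while keeping at least one base per allele, and the final same-length rule is assumed to apply only when the reference is longer than one base (otherwise every SNP would also be an MNP). Transition and transversion counting is left out.

```python
def is_symbolic(alt):
    """Angled bracket alleles such as <DEL> are treated as SVs."""
    return alt.startswith("<") and alt.endswith(">")


def trim(ref, alt):
    """Assumed trimming rule: drop shared trailing bases, then shared
    leading bases, always keeping at least one base in each allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def classify_allele(ref, alt):
    """Classify one alternate allele against the reference (steps 1 and 2)."""
    if is_symbolic(alt):
        return {"SV"}
    ref, alt = trim(ref, alt)
    if len(ref) == len(alt):
        if len(ref) == 1:
            return {"SNP"} if ref != alt else set()
        all_differ = all(r != a for r, a in zip(ref, alt))
        return {"MNP"} if all_differ else {"CLUMPED"}
    types = {"INDEL"}
    shorter, longer = sorted((ref, alt), key=len)
    if len(shorter) == 1:
        # Padding base matches neither end of the longer allele.
        if shorter != longer[0] and shorter != longer[-1]:
            types.add("SNP")
    else:
        # Compare the shorter allele with the 5' end of the longer allele.
        prefix = longer[:len(shorter)]
        all_differ = all(s != p for s, p in zip(shorter, prefix))
        types.add("MNP" if all_differ else "CLUMPED")
    return types


def classify_variant(ref, alts):
    """Union of the allele classifications (step 3) plus the same-length rule (step 4)."""
    types = set()
    for alt in alts:
        types |= classify_allele(ref, alt)
    explicit = not any(is_symbolic(a) for a in alts)
    # Assumption: the same-length rule only applies when length(ref) > 1.
    if explicit and len(ref) > 1 and all(len(a) == len(ref) for a in alts):
        types.add("MNP")
    return types


print(sorted(classify_variant("ATTTT", ["GTTTC"])))  # ['CLUMPED', 'MNP']
```

Because the written steps leave some details open (the order of trimming, for instance), a literal sketch like this one may not reproduce every annotation that vt assigns.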

Examples

We present the following examples to explain the classification described.

Legend for examples

    <variant classification>
    REF <reference sequence>
    ALT <alternative sequence 1>      #<allele classification>, <contribution to transition, transversion, insertion or deletion count>
    ALT <alternative sequence 2>      #<allele classification>, <contribution to transition, transversion, insertion or deletion count>

Simple Biallelic Examples

    SNP
    REF  A
    ALT  G      #SNP, 1 ts

    MNP
    REF  AT
    ALT  GC     #MNP, 2 ts

    INDEL
    REF  AT
    ALT  A      #INDEL, 1 del

    INDEL
    REF  AT
    ALT  T      #INDEL, 1 del
                #Note that although the padding base differs (A vs T), this is a simple indel because it is just the deletion of an A base.
                #If this is right aligned instead of left aligned, the padding base is T on both the reference and alternative alleles.
                #Simple Indel classification should be invariant to whether the variant is left or right aligned.

    SV
    REF  A
    ALT  <DEL>  #SV
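
The note on the second INDEL example says that a simple indel stays a simple indel whichever side the padding base sits on. One possible way to check this, sketched here under the assumption that "simple" means the shorter allele equals the longer allele with one contiguous run of bases removed, is the hypothetical helper below. It is not taken from vt.

```python
def is_simple_indel(ref, alt):
    """Return True if the shorter allele is the longer allele minus one
    contiguous run of bases (an assumed reading of "simple indel")."""
    shorter, longer = sorted((ref, alt), key=len)
    gap = len(longer) - len(shorter)
    if gap == 0:
        return False  # same length, so not an indel at all
    # Try removing `gap` bases starting at every possible offset k.
    return any(
        longer[:k] + longer[k + gap:] == shorter
        for k in range(len(shorter) + 1)
    )


print(is_simple_indel("AT", "A"))  # True, padding base on the left
print(is_simple_indel("AT", "T"))  # True, padding base on the right
print(is_simple_indel("AT", "G"))  # False, a substitution is also needed
```

Both alignments of the one-base deletion pass the check, while REF AT / ALT G does not, which matches its SNP|INDEL classification among the complex examples.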

Complex Biallelic Examples

    SNP|INDEL
    REF  AT
    ALT  G           #SNP, INDEL, 1 ts
                     #Note that it is ambiguous which pairing should be the SNP, so the transition or transversion contribution is
                     #not well defined. Treating it as an A/G SNP gives a transition, but treating it as a T/G SNP gives
                     #a transversion. In such ambiguous cases, we simply use the aligned bases after left alignment to obtain the
                     #transition and transversion contribution. Be aware that this is an ambiguous case; it is better to consider
                     #it simply as a complex variant.

    MNP|INDEL
    REF  ATT
    ALT  GG          #MNP, INDEL, 1 ts, 1 tv, 1 del

    MNP|CLUMPED
    REF  ATTTT
    ALT  GTTTC       #MNP, CLUMPED, 2 ts
                     #since all the alleles are of the same length, classified as MNP too.

    INDEL|CLUMPED
    REF  ATTTTTTTT
    ALT  GTTTC       #INDEL, CLUMPED, 2 ts, 1 del
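
The transition and transversion contributions quoted in these examples can be recomputed with a small sketch. It assumes uppercase A, C, G, T alleles and follows the convention described in the note above: bases are paired position by position from the 5' end, and unpaired bases of the longer allele are ignored. The names is_transition and ts_tv_counts are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """A substitution within purines (A, G) or within pyrimidines (C, T)."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def ts_tv_counts(ref, alt):
    """Count transitions and transversions over the aligned 5' bases.

    zip stops at the end of the shorter allele, which corresponds to
    comparing the shorter allele with the 5' end of the longer one.
    """
    ts = tv = 0
    for r, a in zip(ref, alt):
        if r == a:
            continue
        if is_transition(r, a):
            ts += 1
        else:
            tv += 1
    return ts, tv


print(ts_tv_counts("AT", "G"))             # (1, 0)
print(ts_tv_counts("ATT", "GG"))           # (1, 1)
print(ts_tv_counts("ATTTTTTTT", "GTTTC"))  # (2, 0)
```

For REF AT / ALT G the pairing is A/G and the result is one transition. Pairing T with G instead would give a transversion, which is exactly the ambiguity the note warns about.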

Simple Multiallelic Examples

    SNP
    REF  A
    ALT  G           #SNP, 1 ts
    ALT  C           #SNP, 1 tv

    MNP
    REF  AG
    ALT  GC          #MNP, 1 ts, 1 tv
    ALT  CT          #MNP, 2 tv

    INDEL
    REF  ATTT
    ALT  ATT         #INDEL, 1 del
    ALT  ATTTT       #INDEL, 1 ins

Complex Multiallelic Examples

    SNP|MNP
    REF  AT
    ALT  GT          #SNP, 1 ts
    ALT  AC          #SNP, 1 ts
                     #since all the alleles are of the same length, classified as MNP too.

    SNP|MNP|CLUMPED
    REF  ATTTG
    ALT  GTTTC       #CLUMPED, 1 ts, 1 tv
    ALT  ATTTC       #SNP, 1 tv, note that we get the SNP after truncating the bases ATTT to reveal a G/C transversion SNP
                     #since all the alleles are of the same length, classified as MNP too.

    SNP|MNP|INDEL
    REF  GT
    ALT  CT          #SNP, 1 tv
    ALT  AG          #MNP, 1 ts, 1 tv
    ALT  GTT         #INDEL, 1 ins

    SNP|MNP|INDEL|CLUMPED
    REF  GTTT
    ALT  CG          #MNP, INDEL, 2 tv, 1 del
    ALT  AG          #MNP, INDEL, 1 ts, 1 tv, 1 del
    ALT  GTGTG       #SNP, INDEL, CLUMPED, 1 tv, 1 ins

Structural Variant Examples

    SV
    REF  G
    ALT  <INS:ME:LINE1>    #SV

    SV
    REF  G
    ALT  <CN4>             #SV
    ALT  <CN12>            #SV

Interesting Variant Types

   Adjacent Tandem Repeats from lobSTR's tandem repeat finder panel. 
   20	9538655		ATTTATTTATTTATTTATTTATTTATTTATTTATTTATTCATTCATTCATTCATTCATTCATTC 	<STR>
   This can be represented as
   
   one record considering only the ATTT repeats
   20	9538655	ATTTATTTATTT  ATTT 
   one record with CATT repeats
   20	9538695	CATTCATT  CATT 
   one record with a mix of both repeat types
   20	9538695	TATTCATTCATT  CATT 

Representation of close by variants

    1:124001690
    TTTCTTT--CAAAAAAAGATAAAAAGGTATTTCATGG
    TTTCTTTAAAAAAAAAAGATAAAAAGGAATTTCATGG
    a single complex variant
    CHROM POS         REF   ALT
    1     124001690   C     AAA
    an Indel and SNP adjacent to one another
    CHROM POS         REF   ALT
    1     124001689   T     TAA
    1     124001690   C     A

Representing it as a single complex variant enforces that the "indel" and the "SNP" always occur together. Representing it as 2 separate variants allows the two alleles to segregate independently.
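
To see that the two representations describe the same alternate haplotype when both of the separate records are present, one can apply each set of records to a reference window. The sketch below is only illustrative: apply_variants is a hypothetical helper, the window CTTTCAAAA is cut from the ungapped top row of the alignment above, and its start coordinate (124001686) is an assumption chosen so that the C next to the gap falls on 124001690.

```python
def apply_variants(seq, start, variants):
    """Apply (pos, ref, alt) records to seq, whose first base is at `start`.

    Records are applied from the highest position down so that earlier
    coordinates stay valid after an insertion or deletion.
    """
    for pos, ref, alt in sorted(variants, reverse=True):
        i = pos - start
        if seq[i:i + len(ref)] != ref:
            raise ValueError(f"REF {ref} does not match the sequence at {pos}")
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq


window = "CTTTCAAAA"      # assumed reference window
window_start = 124001686  # assumed coordinate of its first base

complex_variant = [(124001690, "C", "AAA")]
indel_and_snp = [(124001689, "T", "TAA"), (124001690, "C", "A")]

hap_complex = apply_variants(window, window_start, complex_variant)
hap_split = apply_variants(window, window_start, indel_and_snp)
print(hap_complex)               # CTTTAAAAAAA
print(hap_complex == hap_split)  # True
```

The difference between the two representations only shows up in genotypes: with separate records, a sample can carry the indel without the SNP, or the SNP without the indel.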

Output

This is the annotated output of peek in the vt suite.

stats:no. of samples                     :          0 #number of genotype fields in VCF file, this is a site list so it is 0
      no. of chromosomes                 :         25 #no. of chromosomes observed in this file.
      ========== Micro variants ==========

      no. of SNP                         :   54247827 #total number of SNPs
          2 alleles                      :       53487808 (1.99) [35616038/17871770] #ts/tv ratio and the respective counts
          3 alleles                      :         389190 (0.60) [291224/487156]
          4 alleles                      :         370828 (0.50) [370828/741656]
          >=5 alleles                    :              1 (0.33) [1/3]

      no. of MNP                         :     122125
          2 alleles                      :         121849 (1.56) [152383/97816]
          3 alleles                      :            273 (0.89) [537/601]
          4 alleles                      :              3 (1.00) [9/9]

      no. of Indel                       :    6600770 #also referred to as simple Indels
          2 alleles                      :        6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
          3 alleles                      :         280892 (8.72) [503977/57807]
          4 alleles                      :          28245 (131.19) [84094/641]
          >=5 alleles                    :           5772 (3847.00) [23082/6]

      no. of SNP/MNP                     :       1161
          3 alleles                      :           1143 (1.57) [1565/994]
          4 alleles                      :             15 (1.36) [34/25]
          >=5 alleles                    :              3 (0.67) [8/12]

      no. of SNP/Indel                   :     115153
          2 alleles                      :          42717 (0.65) [16778/25939] (0.57) [15441/27276] #ts/tv and ins/del ratios
          3 alleles                      :          66401 (0.72) [29681/41397] (0.33) [31458/96168]
          4 alleles                      :           4631 (0.55) [2420/4386] (0.25) [2602/10306]
          >=5 alleles                    :           1404 (0.62) [1197/1926] (0.10) [513/4989]

      no. of MNP/Indel                   :      15619
          2 alleles                      :          12820 (0.51) [12099/23648] (0.77) [5594/7226]
          3 alleles                      :           2455 (0.40) [1796/4469] (0.45) [1144/2546]
          4 alleles                      :            292 (0.24) [215/891] (1.42) [415/292]
          >=5 alleles                    :             52 (0.43) [96/225] (2.47) [126/51]

      no. of SNP/MNP/Indel               :        273
          3 alleles                      :            167 (0.63) [201/321] (0.38) [70/184]
          4 alleles                      :             85 (0.35) [71/203] (0.28) [31/111]
          >=5 alleles                    :             21 (0.35) [24/68] (0.68) [25/37]

      no. of MNP/Clumped                 :      61175
          2 alleles                      :          60617 (1.68) [84410/50220]
          3 alleles                      :            549 (1.23) [1777/1449]
          4 alleles                      :              8 (1.43) [53/37]
          >=5 alleles                    :              1 (1.00) [5/5]

      no. of SNP/MNP/Clumped             :        290
          3 alleles                      :            282 (1.35) [665/494]
          4 alleles                      :              8 (0.57) [13/23]

      no. of Indel/Clumped               :      27638
          2 alleles                      :          25971 (0.65) [31435/48526] (0.79) [11444/14527]
          3 alleles                      :           1585 (0.74) [3568/4793] (0.87) [1383/1582]
          4 alleles                      :             70 (0.55) [96/175] (1.61) [124/77]
          >=5 alleles                    :             12 (0.59) [37/63] (4.71) [33/7]

      no. of SNP/Indel/Clumped           :        456
          3 alleles                      :            257 (0.84) [332/394] (0.33) [111/340]
          4 alleles                      :            174 (0.38) [105/279] (0.58) [186/321]
          >=5 alleles                    :             25 (0.19) [12/63] (0.94) [44/47]

      no. of MNP/Indel/Clumped           :        153
          3 alleles                      :            138 (0.50) [233/466] (0.84) [102/122]
          4 alleles                      :             12 (0.35) [14/40] (1.42) [17/12]
          >=5 alleles                    :              3 (0.64) [7/11] (0.67) [4/6]

      no. of SNP/MNP/Indel/Clumped       :          6
          4 alleles                      :              1 (3.00) [3/1] (0.00) [0/3]
          >=5 alleles                    :              5 (0.62) [8/13] (2.00) [12/6]

      no. of Reference                   :          0

      ====== Other useful categories =====

      no. of Block Substitutions         :     184751 #equivalent to categories with allele lengths that are the same.
          2 alleles                      :         182466 (1.60) [236793/148036]
          3 alleles                      :           2247 (1.28) [4544/3538]
          4 alleles                      :             34 (1.16) [109/94]
          >=5 alleles                    :              4 (0.76) [13/17]

      no. of Complex Substitutions       :     159298 #equivalent to categories not including SNPs, Block Substitutions and Simple Indels
          2 alleles                      :          81508 (0.61) [60312/98113] (0.66) [32479/49029]
          3 alleles                      :          71003 (0.69) [35811/51840] (0.34) [34268/100942]
          4 alleles                      :           5265 (0.49) [2924/5975] (0.30) [3375/11122]
          >=5 alleles                    :           1522 (0.58) [1381/2369] (0.15) [757/5143]

      ======= Structural variants ========

      no. of structural variants         :      41217
          2 alleles                      :          38079
              deletion                   :              13135
              insertion                  :              16451
                  mobile element         :                  16253
                      ALU                :                      12513
                      LINE1              :                       2911
                      SVA                :                        829
                  numt                   :                    198
              duplication                :                664
              inversion                  :                100
              copy number variation      :               7729
          >=3 alleles                    :           3138
              copy number variation      :               3138

      ========= General summary ==========

      no. of observed variants           :   79449759
      no. of unclassified variants       :          0
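
Each ratio in parentheses is the quotient of the two counts in the square brackets that follow it (ts/tv for substitutions, ins/del for indels). The sketch below recomputes the ratios from a summary line as a sanity check. The regular expression and the name recompute_ratios are hypothetical, and the line layout is assumed to match the output shown above, with ratios printed to two decimal places.

```python
import re

# Matches "(ratio) [numerator/denominator]" pairs on a summary line.
RATIO_PATTERN = re.compile(r"\((?P<ratio>[\d.]+)\)\s*\[(?P<num>\d+)/(?P<den>\d+)\]")


def recompute_ratios(line):
    """Return (reported, recomputed) ratio strings for each pair on the line."""
    results = []
    for match in RATIO_PATTERN.finditer(line):
        num, den = int(match["num"]), int(match["den"])
        recomputed = f"{num / den:.2f}" if den else "nan"
        results.append((match["ratio"], recomputed))
    return results


snp_line = "2 alleles : 53487808 (1.99) [35616038/17871770]"
snp_indel_line = "2 alleles : 42717 (0.65) [16778/25939] (0.57) [15441/27276]"

print(recompute_ratios(snp_line))        # [('1.99', '1.99')]
print(recompute_ratios(snp_indel_line))  # [('0.65', '0.65'), ('0.57', '0.57')]
```

Lines for mixed categories such as SNP/Indel carry two pairs, the ts/tv ratio first and the ins/del ratio second, and the function returns one tuple per pair.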

Implementation

This is implemented in vt.

Maintained by

This page is maintained by Adrian.