Difference between revisions of "Variant classification"

From Genome Analysis Wiki

Revision as of 14:27, 5 September 2014

Introduction

The Variant Call Format (VCF) is a flexible file format specification that can represent many different variant types, from SNPs and indels to copy number variations. However, the VCF representation of a variant is not unique when the reference and alternate sequences are expressed explicitly.

This wiki page describes a classification system for VCF variants.

Definitions

A variant is classified by comparing each of its alleles with the reference sequence. We consider 5 major types, loosely defined as follows.

1. SNP
The reference and alternate sequences are both of length 1 and the nucleotides differ.
2. MNP
a. The reference and alternate sequences have the same length, which must be greater than 1, and the nucleotides differ at every position.
OR
b. All reference and alternate sequences have the same length.
3. INDEL
a. The reference and alternate sequences are not the same length.
AND
b. Removing a subsequence from the longer sequence reduces it to the shorter sequence.
4. CLUMPED
a. A clumping of nearby SNPs, MNPs or indels.
5. SV
The alternate sequence is represented by an angle-bracket tag.
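
The simple cases of these definitions can be sketched in a few lines of Python. This is only an illustration of the loose definitions above, not the code used in vt. The function names and the `OTHER` label are hypothetical, and the sketch assumes that the "subsequence" removed in the INDEL rule is one contiguous run of bases. Alleles that mix several types (the complex examples) are deliberately left as `OTHER`.

```python
def is_simple_indel(ref: str, alt: str) -> bool:
    """True if deleting one contiguous run from the longer allele gives the shorter one.

    Assumption: the removed "subsequence" is contiguous.
    """
    longer, shorter = (ref, alt) if len(ref) > len(alt) else (alt, ref)
    gap = len(longer) - len(shorter)
    if gap == 0:
        return False
    # Try every position at which a run of `gap` bases could be removed.
    return any(longer[:i] + longer[i + gap:] == shorter
               for i in range(len(shorter) + 1))


def classify_allele(ref: str, alt: str) -> str:
    """Label one REF/ALT pair as SV, SNP, MNP, INDEL or OTHER (hypothetical label)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "SV"  # angle-bracket tag, e.g. <DEL>
    if len(ref) == len(alt):
        if len(ref) == 1 and ref != alt:
            return "SNP"
        if len(ref) > 1 and all(r != a for r, a in zip(ref, alt)):
            return "MNP"  # rule 2a: every position differs
        return "OTHER"
    return "INDEL" if is_simple_indel(ref, alt) else "OTHER"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("AT", "GC"), ("AT", "A"), ("A", "<DEL>"), ("AT", "G")]:
        print(ref, alt, classify_allele(ref, alt))
```

A pair such as REF AT ALT G falls through to `OTHER` here because no single deletion turns AT into G; a full classifier would have to decompose such an allele into its component types.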

Examples

The following examples illustrate the definitions above.

Simple Biallelic Examples

   SNP
REF A ALT G
   MNP
REF AT ALT GC
   INDEL
REF AT ALT A
   SV
REF A ALT <DEL>

Complex Biallelic Examples

   SNP|INDEL
REF AT ALT G
   MNP|INDEL
REF ATT ALT GC
   MNP|CLUMPED
REF ATTTT ALT GTTTC #since all the alleles are of the same length, classified as MNP too.
   INDEL|CLUMPED
REF ATTTTTTTT ALT GTTTC

Simple Multiallelic Examples

   SNP
REF A ALT G ALT C
   MNP
REF AG ALT GC ALT CT
   INDEL
REF ATTT ALT ATT ALT AT

Complex Multiallelic Examples

   SNP|MNP
REF AT ALT GT #SNP ALT AC #SNP #since all the alleles are of the same length, classified as MNP too.
   MNP|CLUMPED
REF ATTTG ALT GTTTC #CLUMPED ALT ATTTC #CLUMPED #since all the alleles are of the same length, classified as MNP too.
   SNP|MNP|INDEL
REF GT ALT CT #SNP ALT AG #MNP ALT GTT #INDEL
   SNP|MNP|INDEL|CLUMPED
REF GTTT ALT CG #MNP|INDEL ALT AG #MNP ALT GTGTG #SNP|INDEL|CLUMPED
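
The "all alleles are of the same length" rule (definition 2b) used in several of the examples above can be checked mechanically. The sketch below is an illustration only; the function name is hypothetical, and it assumes the rule applies only when the shared length is greater than 1, since the multiallelic SNP example (REF A ALT G ALT C) is not labelled MNP.

```python
from typing import List


def is_mnp_by_length(ref: str, alts: List[str]) -> bool:
    """Rule 2b: REF and every ALT share one length.

    Assumption: the shared length must exceed 1, otherwise the site is a plain SNP.
    """
    return len(ref) > 1 and all(len(alt) == len(ref) for alt in alts)


if __name__ == "__main__":
    print(is_mnp_by_length("AT", ["GT", "AC"]))         # the SNP|MNP example
    print(is_mnp_by_length("GT", ["CT", "AG", "GTT"]))  # GTT has a different length
```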

SV Examples

          no. of structural variants         :      41217         
              2 alleles                      :           38079
                  deletion                   :                13135           <DEL>
                  insertion                  :                16451           <INS>
                     mobile element          :                    16253       <INS:ME>
                        ALU                  :                        12513   <INS:ME:ALU>
                        LINE1                :                         2911   <INS:ME:LINE1>
                        SVA                  :                          829   <INS:ME:SVA>
                     numt                    :                      198       <INS:MT>
                  duplication                :                  664           <DUP>
                  inversion                  :                  100           <INV>
                  copy number variation      :                 7729           <CN4>
              >=3 alleles                    :            3138
                  copy number variation      :                 3138           <CN4>,<CN8>
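
The nesting in the table above follows the colon-separated structure of the angle-bracket tags: `<INS:ME:ALU>` is an insertion, more specifically a mobile element, more specifically an ALU. A minimal sketch of splitting such a tag into its levels is shown below; the function name is hypothetical and this is not how vt itself parses the tags.

```python
from typing import List


def parse_symbolic_allele(alt: str) -> List[str]:
    """Split an angle-bracket ALT tag into its type and subtypes."""
    if not (alt.startswith("<") and alt.endswith(">")):
        raise ValueError(f"not an angle-bracket tag: {alt!r}")
    return alt[1:-1].split(":")


if __name__ == "__main__":
    for tag in ["<DEL>", "<INS:ME>", "<INS:ME:ALU>", "<INS:MT>"]:
        print(tag, parse_symbolic_allele(tag))
```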

Output

This is the annotated output of peek in the vt suite.

 stats: no. of samples                     :          0 #number of genotype fields in VCF file, this is a site list so it is 0
        no. of chromosomes                 :         25 #no. of chromosomes observed in this file.

        ========== Micro variants ==========

        no. of SNP                         :   54247827 #total number of SNPs
            2 alleles                      :       53487808 (1.99) [35616038/17871770] #ts/tv ratio and the respective counts
            3 alleles                      :         389190 (0.60) [291224/487156]
            4 alleles                      :         370828 (0.50) [370828/741656]
            >=5 alleles                    :              1 (0.33) [1/3]

        no. of MNP                         :     122125
            2 alleles                      :         121849 (1.56) [152383/97816]
            3 alleles                      :            273 (0.89) [537/601]
            4 alleles                      :              3 (1.00) [9/9]

        no. of Indel                       :    6600770
            2 alleles                      :        6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
            3 alleles                      :         280892 (8.72) [503977/57807]
            4 alleles                      :          28245 (131.19) [84094/641]
            >=5 alleles                    :           5772 (3847.00) [23082/6]

        no. of SNP/MNP                     :       1161
            3 alleles                      :           1143 (1.57) [1565/994]
            4 alleles                      :             15 (1.36) [34/25]
            >=5 alleles                    :              3 (0.67) [8/12]

        no. of SNP/Indel                   :     115153
            2 alleles                      :          42717 (0.65) [16778/25939] (0.57) [15441/27276] #ts/tv and ins/del ratios
            3 alleles                      :          66401 (0.72) [29681/41397] (0.33) [31458/96168]
            4 alleles                      :           4631 (0.55) [2420/4386] (0.25) [2602/10306]
            >=5 alleles                    :           1404 (0.62) [1197/1926] (0.10) [513/4989]

        no. of MNP/Indel                   :      15619
            2 alleles                      :          12820 (0.51) [12099/23648] (0.77) [5594/7226]
            3 alleles                      :           2455 (0.40) [1796/4469] (0.45) [1144/2546]
            4 alleles                      :            292 (0.24) [215/891] (1.42) [415/292]
            >=5 alleles                    :             52 (0.43) [96/225] (2.47) [126/51]

        no. of SNP/MNP/Indel               :        273
            3 alleles                      :            167 (0.63) [201/321] (0.38) [70/184]
            4 alleles                      :             85 (0.35) [71/203] (0.28) [31/111]
            >=5 alleles                    :             21 (0.35) [24/68] (0.68) [25/37]

        no. of MNP/Clumped                 :      61175
            2 alleles                      :          60617 (1.68) [84410/50220]
            3 alleles                      :            549 (1.23) [1777/1449]
            4 alleles                      :              8 (1.43) [53/37]
            >=5 alleles                    :              1 (1.00) [5/5]

        no. of SNP/MNP/Clumped             :        290
            3 alleles                      :            282 (1.35) [665/494]
            4 alleles                      :              8 (0.57) [13/23]

        no. of Indel/Clumped               :      27638
            2 alleles                      :          25971 (0.65) [31435/48526] (0.79) [11444/14527]
            3 alleles                      :           1585 (0.74) [3568/4793] (0.87) [1383/1582]
            4 alleles                      :             70 (0.55) [96/175] (1.61) [124/77]
            >=5 alleles                    :             12 (0.59) [37/63] (4.71) [33/7]

        no. of SNP/Indel/Clumped           :        456
            3 alleles                      :            257 (0.84) [332/394] (0.33) [111/340]
            4 alleles                      :            174 (0.38) [105/279] (0.58) [186/321]
            >=5 alleles                    :             25 (0.19) [12/63] (0.94) [44/47]

        no. of MNP/Indel/Clumped           :        153
            3 alleles                      :            138 (0.50) [233/466] (0.84) [102/122]
            4 alleles                      :             12 (0.35) [14/40] (1.42) [17/12]
            >=5 alleles                    :              3 (0.64) [7/11] (0.67) [4/6]

        no. of SNP/MNP/Indel/Clumped       :          6
            4 alleles                      :              1 (3.00) [3/1] (0.00) [0/3]
            >=5 alleles                    :              5 (0.62) [8/13] (2.00) [12/6]

        no. of Reference                   :          0

        ====== Other useful categories =====

        no. of Block Substitutions         :     184751 #equivalent to categories with allele lengths that are the same.
            2 alleles                      :         182466 (1.60) [236793/148036]
            3 alleles                      :           2247 (1.28) [4544/3538]
            4 alleles                      :             34 (1.16) [109/94]
            >=5 alleles                    :              4 (0.76) [13/17]

        no. of Complex Substitutions       :     159298 #equivalent to categories not including simple SNPs, Block Substitutions and Simple Indels
            2 alleles                      :          81508 (0.61) [60312/98113] (0.66) [32479/49029]
            3 alleles                      :          71003 (0.69) [35811/51840] (0.34) [34268/100942]
            4 alleles                      :           5265 (0.49) [2924/5975] (0.30) [3375/11122]
            >=5 alleles                    :           1522 (0.58) [1381/2369] (0.15) [757/5143]

        ======= Structural variants ========

        no. of structural variants         :      41217
            2 alleles                      :          38079
                deletion                   :              13135
                insertion                  :              16451
                   mobile element          :                  16253
                      ALU                  :                      12513
                      LINE1                :                       2911
                      SVA                  :                        829
                   numt                    :                    198
                duplication                :                664
                inversion                  :                100
                copy number variation      :               7729
            >=3 alleles                    :           3138
                copy number variation      :               3138

        ========= General summary ==========

        no. of observed variants           :   79449759
        no. of unclassified variants       :          0
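
The figures in parentheses can be re-derived from the bracketed counts, and the two "other useful" categories can be re-derived from the micro variant categories. The sketch below is an illustration, not vt code; the function and variable names are hypothetical. The grouping of categories into Block and Complex Substitutions is inferred from the comments in the output and from the fact that the counts add up, so it should be read as an observation about this particular output.

```python
def format_ratio(numerator: int, denominator: int) -> str:
    """Format a ts/tv or ins/del ratio to two decimals; 'n/a' when undefined."""
    if denominator == 0:
        return "n/a"
    return f"{numerator / denominator:.2f}"


# Categories whose alleles all share one length, excluding plain SNPs.
block_parts = {
    "MNP": 122125,
    "SNP/MNP": 1161,
    "MNP/Clumped": 61175,
    "SNP/MNP/Clumped": 290,
}

# Remaining mixed categories: not simple SNPs, block substitutions or simple indels.
complex_parts = {
    "SNP/Indel": 115153,
    "MNP/Indel": 15619,
    "SNP/MNP/Indel": 273,
    "Indel/Clumped": 27638,
    "SNP/Indel/Clumped": 456,
    "MNP/Indel/Clumped": 153,
    "SNP/MNP/Indel/Clumped": 6,
}

if __name__ == "__main__":
    print(format_ratio(35616038, 17871770))  # biallelic SNP ts/tv, reported as 1.99
    print(format_ratio(2937096, 3348765))    # biallelic indel ins/del, reported as 0.88
    print(sum(block_parts.values()))         # reported Block Substitutions: 184751
    print(sum(complex_parts.values()))       # reported Complex Substitutions: 159298
```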

Implementation

This is implemented in vt.

Maintained by

This page is maintained by Adrian.