Line 3: |
Line 3: |
| The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types ranging from SNPs, indels to copy number variations. However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences. | | The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types ranging from SNPs, indels to copy number variations. However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences. |
| | | |
− | On this wiki page, we describe a a variant classification system for VCF variants. | + | On this wiki page, we describe a variant classification system for VCF entries that is invariant to [http://genome.sph.umich.edu/wiki/Variant_Normalization normalization] except for the case of MNPs. |
| | | |
| = Definitions = | | = Definitions = |
| | | |
− | The definition of a variant is based on the definition of each allele with respect to the reference sequence. We consider 5 major types loosely defined as follows. | + | The definition of a variant is based on the definition of each allele with respect to the reference sequence. We consider 5 major types loosely described as follows. |
| | | |
| ;1. SNP | | ;1. SNP |
| : The reference and alternate sequences are of length 1 and the base nucleotide is different from one another. | | : The reference and alternate sequences are of length 1 and the base nucleotide is different from one another. |
| ;2. MNP | | ;2. MNP |
− | : a.The reference and alternate sequences are of the same length and have to be greater than 1 and all nucleotides in the sequences differ from one another. | + | : The reference and alternate sequences are of the same length and have to be greater than 1 and all nucleotides in the sequences differ from one another. |
| : OR | | : OR |
− | : b. all reference and alternate sequences have the same length. | + | : All reference and alternate sequences have the same length (this is applicable to all alleles). |
| ;3. INDEL | | ;3. INDEL |
− | : a. The reference and alternate sequence are not the same length. | + | : The reference and alternate sequences are not of the same length. |
− | : AND
| |
− | : b. The removal of a subsequence of the longer sequence would reduce the longer sequence to the smaller sequence.
| |
| ;4. CLUMPED | | ;4. CLUMPED |
− | : a. A clumping of nearby SNPs, MNPs or Indels. | + | : A clumping of nearby SNPs, MNPs or Indels. |
| ;5. SV | | ;5. SV |
− | : The alternate sequence is represented by a angled bracket tag. | + | : The alternate sequence is represented by an angled bracket tag. |
| + | |
| + | = Classification Procedure = |
| + | |
| + | #Trim each allele with respect to the reference sequence individually |
| + | #Inspect length, defined as length of alternate allele minus length of reference allele. |
| + | ##if length = 0 |
| + | ###if length(ref) = 1 and nucleotides differ, classify as SNP (count ts and tv too) |
| + | ###if length(ref) > 1 |
| + | ####if all nucleotides differ, classify as MNP (count ts and tv too) |
| + | ####if not all nucleotides differ, classify as CLUMPED (count ts and tv too) |
| + | ##if length <math>\ne</math> 0, classify as INDEL |
| + | ###if shorter allele is of length 1 |
| + | ####if shorter allele does not match either of the end nucleotides of the longer allele, add SNP classification |
| + | ###if shorter allele length > 1 |
| + | ####compare the shorter allele sequence with the subsequence in the 5' end of the longer allele (count ts and tv too) |
| + | #####if all nucleotides differ, add MNP classification |
| + | #####if not all nucleotides differ, add CLUMPED classification |
| + | #Variant classification is the union of the classifications of each allele present in the variant. |
| + | #If all alleles are the same length, add MNP classification. |
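The steps above can be sketched in Python as follows. This is one reading of the procedure, not vt's implementation. The function names (trim, classify_allele, classify_variant) are hypothetical, alleles are assumed to be uppercase A/C/G/T strings, and the final MNP rule is assumed to apply only when the alleles are longer than 1 base, following the MNP definition. Insertion and deletion counts are left out for brevity.

```python
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for A<->G or C<->T; any other mismatch is a transversion."""
    return (a in PURINES) == (b in PURINES)


def trim(ref, alt):
    """Strip shared trailing, then shared leading bases (keep >= 1 base each)."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def classify_allele(ref, alt):
    """Return (classes, ts, tv) for one ALT allele against REF."""
    if alt.startswith("<"):
        return {"SV"}, 0, 0
    if ref == alt:  # assumption: an allele identical to REF gets no class
        return set(), 0, 0
    ref, alt = trim(ref, alt)
    short, long_ = sorted((ref, alt), key=len)
    classes, ts, tv = set(), 0, 0
    if len(ref) != len(alt):
        classes.add("INDEL")
        # A single padding base that matches either end is a simple indel.
        if len(short) == 1 and short in (long_[0], long_[-1]):
            return classes, ts, tv
    # zip() pairs the shorter allele with the 5' end of the longer one.
    mismatches = [(a, b) for a, b in zip(ref, alt) if a != b]
    for a, b in mismatches:
        if is_transition(a, b):
            ts += 1
        else:
            tv += 1
    if len(short) == 1:
        classes.add("SNP")
    elif len(mismatches) == len(short):
        classes.add("MNP")
    else:
        classes.add("CLUMPED")
    return classes, ts, tv


def classify_variant(ref, alts):
    """Union of the per-allele classes, plus the same-length MNP rule."""
    classes = set()
    for alt in alts:
        classes |= classify_allele(ref, alt)[0]
    # Assumption: the rule needs sequence alleles longer than 1 base.
    if len(ref) > 1 and all(
        not a.startswith("<") and len(a) == len(ref) for a in alts
    ):
        classes.add("MNP")
    return classes
```

For example, classify_allele("ATT", "GG") gives the MNP and INDEL classes with 1 ts and 1 tv, and classify_variant("AT", ["GT", "AC"]) gives SNP and MNP. The sketch has only been checked against a handful of the examples on this page, so it may not reproduce every annotation.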
| | | |
| = Examples = | | = Examples = |
| | | |
− | We present the following examples to explain the concepts explained earlier. | + | We present the following examples to explain the classification described. |
| + | |
| + | == Legend for examples == |
| + | |
| + | <variant classification><br> |
| + | REF <reference sequence> |
| + | ALT <alternative sequence 1> #<allele classification>, <contribution to transition, transversion, insertion or deletion count> |
| + | ALT <alternative sequence 2> #<allele classification>, <contribution to transition, transversion, insertion or deletion count> |
| | | |
| == Simple Biallelic Examples == | | == Simple Biallelic Examples == |
Line 32: |
Line 56: |
| SNP<br> | | SNP<br> |
| REF A | | REF A |
− | ALT G | + | ALT G #SNP, 1 ts |
| | | |
| MNP<br> | | MNP<br> |
− | REF AT | + | REF AT |
− | ALT GC | + | ALT GC #MNP, 2 ts |
| + | |
| + | INDEL<br> |
| + | REF AT |
| + | ALT A #INDEL, 1 del |
| | | |
| INDEL<br> | | INDEL<br> |
− | REF AT | + | REF AT |
− | ALT A | + | ALT T #INDEL, 1 del |
| + | #Note that although the padding base differs (A vs T), this is still a simple indel because it is simply a deletion of an A base. |
| + | #If you right align this instead of left aligning, then the padding will be T on both the reference and alternative alleles. |
| + | #Simple Indel classification should be invariant whether it is left or right aligned. |
| | | |
| SV<br> | | SV<br> |
| REF A | | REF A |
− | ALT <DEL> | + | ALT <DEL> #SV |
| | | |
| == Complex Biallelic Examples == | | == Complex Biallelic Examples == |
| | | |
| SNP|INDEL<br> | | SNP|INDEL<br> |
− | REF AT | + | REF AT |
− | ALT G | + | ALT G #SNP, INDEL, 1 ts |
| + | #Note that it is ambiguous as to which pairing should be a SNP, as such, the transition or transversion contribution is actually |
| + | #not defined. In this case, assuming it is a A/G SNP, we get a transition, but we may also consider this as a T/G SNP which |
| + | #is a transversion. In such ambiguous cases, we simply consider the aligned bases after left alignment to get the transition |
| + | #and transversion contribution. But please be very clear that this is an ambiguous case. It is better to consider this simply |
| + | #as a complex variant. |
| | | |
| MNP|INDEL<br> | | MNP|INDEL<br> |
− | REF ATT | + | REF ATT |
− | ALT GC | + | ALT GG #MNP, INDEL, 1 ts, 1 tv, 1 del |
| | | |
| MNP|CLUMPED<br> | | MNP|CLUMPED<br> |
− | REF ATTTT | + | REF ATTTT |
− | ALT GTTTC | + | ALT GTTTC #MNP, CLUMPED, 2 ts |
− | #since all the alleles are of the sample length, classified as MNP too.
| + | #since all the alleles are of the same length, classified as MNP too. |
| | | |
| INDEL|CLUMPED<br> | | INDEL|CLUMPED<br> |
| REF ATTTTTTTT | | REF ATTTTTTTT |
− | ALT GTTTC | + | ALT GTTTC #INDEL, CLUMPED, 2 ts, 1 del |
| | | |
| == Simple Multiallelic Examples == | | == Simple Multiallelic Examples == |
| | | |
| SNP<br> | | SNP<br> |
− | REF A | + | REF A |
− | ALT G | + | ALT G #SNP, 1 ts |
− | ALT C | + | ALT C #SNP, 1 tv |
| | | |
| MNP<br> | | MNP<br> |
− | REF AG | + | REF AG |
− | ALT GC | + | ALT GC #MNP, 1 ts, 1 tv |
− | ALT CT | + | ALT CT #MNP, 2 tv |
| | | |
| INDEL<br> | | INDEL<br> |
| REF ATTT | | REF ATTT |
− | ALT ATT | + | ALT ATT #INDEL, 1 del |
− | ALT AT | + | ALT ATTTT #INDEL, 1 ins |
| | | |
| == Complex Multiallelic Examples == | | == Complex Multiallelic Examples == |
Line 86: |
Line 122: |
| SNP|MNP<br> | | SNP|MNP<br> |
| REF AT | | REF AT |
− | ALT GT #SNP | + | ALT GT #SNP, 1 ts |
− | ALT AC #SNP | + | ALT AC #SNP, 1 ts |
− | #since all the alleles are of the sample length, classified as MNP too.
| + | #since all the alleles are of the same length, classified as MNP too. |
| | | |
− | MNP|CLUMPED<br> | + | SNP|MNP|CLUMPED<br> |
| REF ATTTG | | REF ATTTG |
− | ALT GTTTC #CLUMPED | + | ALT GTTTC #CLUMPED, 1 ts, 1 tv |
− | ALT ATTTC #CLUMPED | + | ALT ATTTC #SNP, 1 tv, note that we get the SNP after truncating the bases ATTT to reveal a G/C transversion SNP |
− | #since all the alleles are of the sample length, classified as MNP too.
| + | #since all the alleles are of the same length, classified as MNP too. |
| | | |
| SNP|MNP|INDEL<br> | | SNP|MNP|INDEL<br> |
| REF GT | | REF GT |
− | ALT CT #SNP | + | ALT CT #SNP, 1 tv |
− | ALT AG #MNP | + | ALT AG #MNP, 1 ts, 1 tv |
− | ALT GTT #INDEL | + | ALT GTT #INDEL, 1 ins |
| | | |
| SNP|MNP|INDEL|CLUMPED<br> | | SNP|MNP|INDEL|CLUMPED<br> |
| REF GTTT | | REF GTTT |
− | ALT CG #MNP|INDEL | + | ALT CG #MNP, INDEL, 2 tv, 1 del |
− | ALT AG #MNP | + | ALT AG #MNP, INDEL, 1 ts, 1 tv, 1 del |
− | ALT GTGTG #SNP|INDEL|CLUMPED | + | ALT GTGTG #SNP, INDEL, CLUMPED, 1 tv, 1 ins |
| | | |
− | == SV Examples == | + | == Structural Variant Examples == |
| | | |
− | no. of structural variants : 41217
| + | SV<br> |
− | 2 alleles : 38079
| + | REF G |
− | deletion : 13135 <DEL>
| + | ALT <INS:ME:LINE1> #SV |
− | insertion : 16451 <INS>
| + | |
− | mobile element : 16253 <INS:ME>
| + | SV<br> |
− | ALU : 12513 <INS:ME:ALU>
| + | REF G |
− | LINE1 : 2911 <INS:ME:LINE1>
| + | ALT <CN4> #SV |
− | SVA : 829 <INS:ME:SVA>
| + | ALT <CN12> #SV |
− | numt : 198 <INS:MT>
| + | |
− | duplication : 664 <DUP>
| + | =Interesting Variant Types = |
− | inversion : 100 <INV>
| + | |
− | copy number variation : 7729 <CN4>
| + | Adjacent Tandem Repeats from lobSTR's tandem repeat finder panel. <br> |
− | >=3 alleles : 3138
| + | |
− | copy number variation : 3138 <CN4>,<CN8>
| + | |
| + | 20 9538655 <span style="color:#FF0000">ATTTATTTATTTATTTATTTATTTATTTATTTATTTATT</span><span style="color:#0000FF">CATTCATTCATTCATTCATTCATTC </span> <STR> |
| + | |
| + | This can be induced as |
| + | |
| + | one record considering only the ATTT repeats |
| + | 20 9538655 <span style="color:#FF0000">ATTTATTTATTT </span> <span style="color:#FF0000">ATTT </span> |
| + | |
| + | one record with CATT repeats |
| + | 20 9538695 <span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span> |
| + | |
| + | one record with a mix of both repeat types |
| + | 20 9538695 <span style="color:#FF0000">TATT</span><span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span> |
| + | |
| + | = Representation of close by variants = |
| + | |
| + | 1:124001690 |
| + | TTTCTTT--CAAAAAAAGATAAAAAGGTATTTCATGG |
| + | TTTCTTTAAAAAAAAAAGATAAAAAGGAATTTCATGG |
| + | |
| + | a single complex variant |
| + | CHROM POS REF ALT |
| + | 1 124001690 C AAA |
| + | |
| + | an Indel and SNP adjacent to one another |
| + | CHROM POS REF ALT |
| + | 1 124001689 T TAA |
| + | 1 124001690 C A |
| + | |
| + | Representing it as a single complex variant enforces that both "indel" and "SNP" are always together. |
| + | Representing it as 2 separate variants allows both alleles to segregate independently. |
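As a check, the following Python sketch applies both representations to a stretch of the reference shown above and confirms that they produce the same haplotype when both records are present. The helper apply_variants and the constants REFERENCE and START are hypothetical names. START is inferred from the records (the C at offset 7 is position 124001690), and the records are assumed not to overlap.

```python
REFERENCE = "TTTCTTTCAAAAAAAGATAAAAAGG"  # ungapped reference bases from the alignment
START = 124001683                        # position of REFERENCE[0] (inferred)


def apply_variants(reference, start, variants):
    """Apply non-overlapping (pos, ref, alt) records to a reference string.

    Records are applied from right to left so that earlier edits do not
    shift the offsets of records that lie further to the left.
    """
    seq = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        i = pos - start
        if seq[i:i + len(ref)] != ref:
            raise ValueError(f"REF {ref} does not match reference at {pos}")
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq


# A single complex variant.
complex_hap = apply_variants(REFERENCE, START, [(124001690, "C", "AAA")])

# An indel and a SNP adjacent to one another.
split_hap = apply_variants(
    REFERENCE, START, [(124001689, "T", "TAA"), (124001690, "C", "A")]
)

print(complex_hap == split_hap)  # True
```

The two forms differ only in what they allow: with separate records, a sample may carry the insertion without the SNP or the SNP without the insertion, which the single complex record cannot express.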
| | | |
| = Output = | | = Output = |
Line 134: |
Line 200: |
| no. of SNP : 54247827 #total number of SNPs | | no. of SNP : 54247827 #total number of SNPs |
| 2 alleles : 53487808 (1.99) [35616038/17871770] #ts/tv ratio and the respective counts | | 2 alleles : 53487808 (1.99) [35616038/17871770] #ts/tv ratio and the respective counts |
− | 3 alleles : 389190 (0.60) [291224/487156] (-nan) [0/0] | + | 3 alleles : 389190 (0.60) [291224/487156] |
− | 4 alleles : 370828 (0.50) [370828/741656] (-nan) [0/0] | + | 4 alleles : 370828 (0.50) [370828/741656] |
− | >=5 alleles : 1 (0.33) [1/3] (-nan) [0/0] <br> | + | >=5 alleles : 1 (0.33) [1/3] <br> |
| no. of MNP : 122125 | | no. of MNP : 122125 |
| 2 alleles : 121849 (1.56) [152383/97816] | | 2 alleles : 121849 (1.56) [152383/97816] |
| 3 alleles : 273 (0.89) [537/601] | | 3 alleles : 273 (0.89) [537/601] |
| 4 alleles : 3 (1.00) [9/9] <br> | | 4 alleles : 3 (1.00) [9/9] <br> |
− | no. of Indel : 6600770 | + | no. of Indel : 6600770 #also referred to as simple Indels |
| 2 alleles : 6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts | | 2 alleles : 6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts |
| 3 alleles : 280892 (8.72) [503977/57807] | | 3 alleles : 280892 (8.72) [503977/57807] |
Line 165: |
Line 231: |
| >=5 alleles : 21 (0.35) [24/68] (0.68) [25/37]<br> | | >=5 alleles : 21 (0.35) [24/68] (0.68) [25/37]<br> |
| no. of MNP/Clumped : 61175 | | no. of MNP/Clumped : 61175 |
− | 2 alleles : 60617 (1.68) [84410/50220] (-nan) [0/0] | + | 2 alleles : 60617 (1.68) [84410/50220] |
− | 3 alleles : 549 (1.23) [1777/1449] (-nan) [0/0] | + | 3 alleles : 549 (1.23) [1777/1449] |
− | 4 alleles : 8 (1.43) [53/37] (-nan) [0/0] | + | 4 alleles : 8 (1.43) [53/37] |
− | >=5 alleles : 1 (1.00) [5/5] (-nan) [0/0] <br> | + | >=5 alleles : 1 (1.00) [5/5] <br> |
| no. of SNP/MNP/Clumped : 290 | | no. of SNP/MNP/Clumped : 290 |
− | 3 alleles : 282 (1.35) [665/494] (-nan) [0/0] | + | 3 alleles : 282 (1.35) [665/494] |
− | 4 alleles : 8 (0.57) [13/23] (-nan) [0/0] <br> | + | 4 alleles : 8 (0.57) [13/23] <br> |
| no. of Indel/Clumped : 27638 | | no. of Indel/Clumped : 27638 |
| 2 alleles : 25971 (0.65) [31435/48526] (0.79) [11444/14527] | | 2 alleles : 25971 (0.65) [31435/48526] (0.79) [11444/14527] |
Line 191: |
Line 257: |
| ====== Other useful categories ===== <br> | | ====== Other useful categories ===== <br> |
| no. of Block Substitutions : 184751 #equivalent to categories with allele lengths that are the same. | | no. of Block Substitutions : 184751 #equivalent to categories with allele lengths that are the same. |
− | 2 alleles : 182466 (1.60) [236793/148036] (-nan) [0/0] | + | 2 alleles : 182466 (1.60) [236793/148036] |
− | 3 alleles : 2247 (1.28) [4544/3538] (-nan) [0/0] | + | 3 alleles : 2247 (1.28) [4544/3538] |
− | 4 alleles : 34 (1.16) [109/94] (-nan) [0/0] | + | 4 alleles : 34 (1.16) [109/94] |
− | >=5 alleles : 4 (0.76) [13/17] (-nan) [0/0] <br> | + | >=5 alleles : 4 (0.76) [13/17] <br> |
− | no. of Complex Substitutions : 159298 #equivalent to categories not including simple SNPs, Block Substitutions and Simple Indels | + | no. of Complex Substitutions : 159298 #equivalent to categories not including SNPs, Block Substitutions and Simple Indels |
| 2 alleles : 81508 (0.61) [60312/98113] (0.66) [32479/49029] | | 2 alleles : 81508 (0.61) [60312/98113] (0.66) [32479/49029] |
| 3 alleles : 71003 (0.69) [35811/51840] (0.34) [34268/100942] | | 3 alleles : 71003 (0.69) [35811/51840] (0.34) [34268/100942] |
| 4 alleles : 5265 (0.49) [2924/5975] (0.30) [3375/11122] | | 4 alleles : 5265 (0.49) [2924/5975] (0.30) [3375/11122] |
| >=5 alleles : 1522 (0.58) [1381/2369] (0.15) [757/5143] <br> | | >=5 alleles : 1522 (0.58) [1381/2369] (0.15) [757/5143] <br> |
− | ======= Structural variants ========<br>
| + | ======= Structural variants ========<br> |
− | no. of structural variants : 41217
| + | no. of structural variants : 41217 |
− | 2 alleles : 38079
| + | 2 alleles : 38079 |
− | deletion : 13135
| + | deletion : 13135 |
− | insertion : 16451
| + | insertion : 16451 |
− | mobile element : 16253
| + | mobile element : 16253 |
− | ALU : 12513
| + | ALU : 12513 |
− | LINE1 : 2911
| + | LINE1 : 2911 |
− | SVA : 829
| + | SVA : 829 |
− | numt : 198
| + | numt : 198 |
− | duplication : 664
| + | duplication : 664 |
− | inversion : 100
| + | inversion : 100 |
− | copy number variation : 7729
| + | copy number variation : 7729 |
− | >=3 alleles : 3138
| + | >=3 alleles : 3138 |
− | copy number variation : 3138 <br>
| + | copy number variation : 3138 <br> |
− | ========= General summary ========== <br>
| + | ========= General summary ========== <br> |
− | no. of reference : 0 <br>
| + | no. of observed variants : 79449759 |
− | no. of observed variants : 79449759
| + | no. of unclassified variants : 0 |
− | no. of unclassified variants : 0
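The figures in parentheses in the listing above are the ratios of the two bracketed counts (ts/tv or ins/del). A small Python sketch of that arithmetic, assuming the ratio is rounded to two decimal places and is undefined when the second count is zero. The function name ratio is hypothetical.

```python
def ratio(first, second):
    """Ratio of two counts rounded to 2 decimals; nan if the second count is 0."""
    if second == 0:
        return float("nan")
    return round(first / second, 2)


print(ratio(35616038, 17871770))  # 1.99, ts/tv for biallelic SNPs
print(ratio(2937096, 3348765))    # 0.88, ins/del for biallelic Indels
```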
| + | |
| + | = Implementation = |
| + | |
| + | This is implemented in [http://genome.sph.umich.edu/wiki/Vt#Peek vt]. |
| | | |
| = Maintained by = | | = Maintained by = |
| | | |
| This page is maintained by [mailto:atks@umich.edu Adrian]. | | This page is maintained by [mailto:atks@umich.edu Adrian]. |