Variant classification
Latest revision as of 20:44, 25 February 2016
Introduction
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, ranging from SNPs and indels to copy number variations. However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.
On this wiki page, we describe a variant classification system for VCF entries that is invariant to normalization (http://genome.sph.umich.edu/wiki/Variant_Normalization) except for the case of MNPs.
Definitions
The definition of a variant is based on the definition of each allele with respect to the reference sequence. We consider 5 major types, loosely described as follows.
- 1. SNP
- The reference and alternate sequences are both of length 1 and their nucleotides differ.
- 2. MNP
- The reference and alternate sequences have the same length, which must be greater than 1, and the nucleotides differ at every position.
- OR
- All reference and alternate sequences in the variant have the same length (this condition is evaluated across all alleles).
- 3. INDEL
- The reference and alternate sequences are not of the same length.
- 4. CLUMPED
- A clumping of nearby SNPs, MNPs or Indels.
- 5. SV
- The alternate sequence is represented by an angle-bracket tag, for example <DEL>.
Classification Procedure
- 1. Trim each allele with respect to the reference sequence individually.
- 2. Inspect length, defined as the length of the alternate allele minus the length of the reference allele.
  - if length = 0
    - if length(ref) = 1 and the nucleotides differ, classify as SNP (count ts and tv too)
    - if length(ref) > 1
      - if all nucleotides differ, classify as MNP (count ts and tv too)
      - if not all nucleotides differ, classify as CLUMPED (count ts and tv too)
  - if length ≠ 0, classify as INDEL
    - if the shorter allele is of length 1
      - if the shorter allele does not match either of the end nucleotides of the longer allele, add SNP classification
    - if the shorter allele is of length > 1
      - compare the shorter allele sequence with the subsequence at the 5' end of the longer allele (count ts and tv too)
        - if all nucleotides differ, add MNP classification
        - if not all nucleotides differ, add CLUMPED classification
- 3. Variant classification is the union of the classifications of each allele present in the variant.
- 4. If all alleles are the same length, add MNP classification.
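The procedure above can be sketched in Python. This is only an illustrative sketch, not vt's implementation: the function names are hypothetical, the trimming rule (strip shared bases from the right, then from the left, always keeping at least one base per allele) is an assumption, and the final MNP rule is applied only when the alleles are longer than one base so that plain multiallelic SNPs stay SNPs. Transition and transversion counting is left out.

```python
def trim(ref, alt):
    """Assumed trimming rule: drop shared bases on the right, then on the left,
    keeping at least one base in each allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def classify_allele(ref, alt):
    """Return the set of classifications for one alternate allele."""
    if alt.startswith("<"):          # angle-bracket tag
        return {"SV"}
    ref, alt = trim(ref, alt)
    types = set()
    if len(ref) == len(alt):
        differs = [r != a for r, a in zip(ref, alt)]
        if len(ref) == 1:
            if differs[0]:
                types.add("SNP")
        elif all(differs):
            types.add("MNP")
        else:
            types.add("CLUMPED")
    else:
        types.add("INDEL")
        shorter, longer = sorted((ref, alt), key=len)
        if len(shorter) == 1:
            # SNP only if the single base matches neither end of the longer allele
            if shorter not in (longer[0], longer[-1]):
                types.add("SNP")
        else:
            # compare with the 5' end of the longer allele
            prefix = longer[:len(shorter)]
            if all(s != p for s, p in zip(shorter, prefix)):
                types.add("MNP")
            else:
                types.add("CLUMPED")
    return types


def classify_variant(ref, alts):
    """Union of the allele classifications, plus the same-length MNP rule."""
    types = set()
    for alt in alts:
        types |= classify_allele(ref, alt)
    # Assumption: the same-length rule applies only to alleles longer than 1 base.
    if "SV" not in types and len(ref) > 1 and all(len(a) == len(ref) for a in alts):
        types.add("MNP")
    return types


print(sorted(classify_variant("ATT", ["GG"])))
print(sorted(classify_variant("ATTTG", ["GTTTC", "ATTTC"])))
```

Under these assumptions the sketch agrees with the biallelic examples and with the first three complex multiallelic examples on this page; vt itself may trim and count differently.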
Examples
We present the following examples to explain the classification described.
Legend for examples
<variant classification>
 REF <reference sequence>
 ALT <alternative sequence 1> #<allele classification>, <contribution to transition, transversion, insertion or deletion count>
 ALT <alternative sequence 2> #<allele classification>, <contribution to transition, transversion, insertion or deletion count>
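For alleles of equal length, the transition (ts) and transversion (tv) contributions in the legend can be counted position by position. The sketch below assumes the usual definition (A<->G and C<->T are transitions, every other substitution is a transversion); the function name count_ts_tv is hypothetical and not part of vt.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(ref, alt):
    """Count transitions and transversions between two equal-length alleles."""
    if len(ref) != len(alt):
        raise ValueError("alleles must have the same length")
    ts = tv = 0
    for r, a in zip(ref, alt):
        if r == a:
            continue  # matching position, no substitution
        pair = {r, a}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ts += 1   # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1   # purine<->pyrimidine
    return ts, tv


print(count_ts_tv("AT", "GC"))        # (2, 0)
print(count_ts_tv("ATTTT", "GTTTC"))  # (2, 0)
```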
Simple Biallelic Examples
SNP
 REF A
 ALT G #SNP, 1 ts

MNP
 REF AT
 ALT GC #MNP, 2 ts

INDEL
 REF AT
 ALT A #INDEL, 1 del

INDEL
 REF AT
 ALT T #INDEL, 1 del
 #Note that although the padding base differs (A vs T), this is a simple indel because it is just a deletion of an A base.
 #If you right align this instead of left aligning it, the padding base is T on both the reference and alternative alleles.
 #Simple Indel classification should be the same whether the variant is left or right aligned.

SV
 REF A
 ALT <DEL> #SV
Complex Biallelic Examples
SNP|INDEL
 REF AT
 ALT G #SNP, INDEL, 1 ts
 #Note that it is ambiguous which pairing should be treated as the SNP, so the transition or transversion
 #contribution is not strictly defined. Treating this as an A/G SNP gives a transition, but treating it as a
 #T/G SNP gives a transversion. In such ambiguous cases, we use the aligned bases after left alignment to
 #assign the transition and transversion contribution. Because the case is ambiguous, it is better to consider
 #this simply as a complex variant.

MNP|INDEL
 REF ATT
 ALT GG #MNP, INDEL, 1 ts, 1 tv, 1 del

MNP|CLUMPED
 REF ATTTT
 ALT GTTTC #MNP, CLUMPED, 2 ts
 #since all the alleles are of the same length, classified as MNP too.

INDEL|CLUMPED
 REF ATTTTTTTT
 ALT GTTTC #INDEL, CLUMPED, 2 ts, 1 del
Simple Multiallelic Examples
SNP
 REF A
 ALT G #SNP, 1 ts
 ALT C #SNP, 1 tv

MNP
 REF AG
 ALT GC #MNP, 1 ts, 1 tv
 ALT CT #MNP, 2 tv

INDEL
 REF ATTT
 ALT ATT #INDEL, 1 del
 ALT ATTTT #INDEL, 1 ins
Complex Multiallelic Examples
SNP|MNP
 REF AT
 ALT GT #SNP, 1 ts
 ALT AC #SNP, 1 ts
 #since all the alleles are of the same length, classified as MNP too.

SNP|MNP|CLUMPED
 REF ATTTG
 ALT GTTTC #CLUMPED, 1 ts, 1 tv
 ALT ATTTC #SNP, 1 tv
 #note that we get the SNP after truncating the bases ATTT to reveal a G/C transversion SNP
 #since all the alleles are of the same length, classified as MNP too.

SNP|MNP|INDEL
 REF GT
 ALT CT #SNP, 1 tv
 ALT AG #MNP, 1 ts, 1 tv
 ALT GTT #INDEL, 1 ins

SNP|MNP|INDEL|CLUMPED
 REF GTTT
 ALT CG #MNP, INDEL, 2 tv, 1 del
 ALT AG #MNP, INDEL, 1 ts, 1 tv
 ALT GTGTG #SNP, INDEL, CLUMPED, 1 tv, 1 ins
Structural Variant Examples
SV
 REF G
 ALT <INS:ME:LINE1> #SV

SV
 REF G
 ALT <CN4> #SV
 ALT <CN12> #SV
Interesting Variant Types
Adjacent Tandem Repeats from lobSTR's tandem repeat finder panel.
20 9538655 ATTTATTTATTTATTTATTTATTTATTTATTTATTTATTCATTCATTCATTCATTCATTCATTC <STR>
This can be represented as:
one record considering only the ATTT repeats
 20 9538655 ATTTATTTATTT ATTT
one record with CATT repeats
 20 9538695 CATTCATT CATT
one record with a mix of both repeat types
 20 9538695 TATTCATTCATT CATT
Representation of close by variants
1:124001690
 TTTCTTT--CAAAAAAAGATAAAAAGGTATTTCATGG
 TTTCTTTAAAAAAAAAAGATAAAAAGGAATTTCATGG

a single complex variant
 CHROM POS       REF ALT
 1     124001690 C   AAA

an Indel and SNP adjacent to one another
 CHROM POS       REF ALT
 1     124001689 T   TAA
 1     124001690 C   A

Representing it as a single complex variant enforces that the "indel" and the "SNP" always occur together.
Representing it as 2 separate variants allows the two alleles to segregate independently.
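Both representations describe the same alternate haplotype when all of their records are applied together. The sketch below checks this on a short reference window taken from the alignment above. The helper apply_variants and the window start coordinate (chosen so that the C sits at position 124001690) are assumptions made for illustration only.

```python
def apply_variants(window, window_start, variants):
    """Apply (pos, ref, alt) records to a reference window.

    Records are applied from right to left so that earlier coordinates
    stay valid after length-changing edits.
    """
    seq = window
    for pos, ref, alt in sorted(variants, reverse=True):
        offset = pos - window_start
        if seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF {ref} does not match window at {pos}")
        seq = seq[:offset] + alt + seq[offset + len(ref):]
    return seq


# Reference bases around the site, gaps removed; start coordinate is assumed.
WINDOW = "TTTCTTTCAAAA"
WINDOW_START = 124001683

complex_form = [(124001690, "C", "AAA")]
split_form = [(124001689, "T", "TAA"), (124001690, "C", "A")]

hap_complex = apply_variants(WINDOW, WINDOW_START, complex_form)
hap_split = apply_variants(WINDOW, WINDOW_START, split_form)
print(hap_complex == hap_split)
```

The two forms only coincide when both split records are carried on the same haplotype; applying just one of them gives a different sequence, which is the independent segregation described above.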
Output
This is the annotated output of peek in the vt suite.
stats: no. of samples     : 0  #number of genotype fields in VCF file, this is a site list so it is 0
       no. of chromosomes : 25 #no. of chromosomes observed in this file.
========== Micro variants ==========
no. of SNP : 54247827 #total number of SNPs
  2 alleles : 53487808 (1.99) [35616038/17871770] #ts/tv ratio and the respective counts
  3 alleles : 389190 (0.60) [291224/487156]
  4 alleles : 370828 (0.50) [370828/741656]
  >=5 alleles : 1 (0.33) [1/3]

no. of MNP : 122125
  2 alleles : 121849 (1.56) [152383/97816]
  3 alleles : 273 (0.89) [537/601]
  4 alleles : 3 (1.00) [9/9]

no. of Indel : 6600770 #also referred to as simple Indels
  2 alleles : 6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
  3 alleles : 280892 (8.72) [503977/57807]
  4 alleles : 28245 (131.19) [84094/641]
  >=5 alleles : 5772 (3847.00) [23082/6]

no. of SNP/MNP : 1161
  3 alleles : 1143 (1.57) [1565/994]
  4 alleles : 15 (1.36) [34/25]
  >=5 alleles : 3 (0.67) [8/12]

no. of SNP/Indel : 115153
  2 alleles : 42717 (0.65) [16778/25939] (0.57) [15441/27276] #ts/tv and ins/del ratios
  3 alleles : 66401 (0.72) [29681/41397] (0.33) [31458/96168]
  4 alleles : 4631 (0.55) [2420/4386] (0.25) [2602/10306]
  >=5 alleles : 1404 (0.62) [1197/1926] (0.10) [513/4989]

no. of MNP/Indel : 15619
  2 alleles : 12820 (0.51) [12099/23648] (0.77) [5594/7226]
  3 alleles : 2455 (0.40) [1796/4469] (0.45) [1144/2546]
  4 alleles : 292 (0.24) [215/891] (1.42) [415/292]
  >=5 alleles : 52 (0.43) [96/225] (2.47) [126/51]

no. of SNP/MNP/Indel : 273
  3 alleles : 167 (0.63) [201/321] (0.38) [70/184]
  4 alleles : 85 (0.35) [71/203] (0.28) [31/111]
  >=5 alleles : 21 (0.35) [24/68] (0.68) [25/37]

no. of MNP/Clumped : 61175
  2 alleles : 60617 (1.68) [84410/50220]
  3 alleles : 549 (1.23) [1777/1449]
  4 alleles : 8 (1.43) [53/37]
  >=5 alleles : 1 (1.00) [5/5]

no. of SNP/MNP/Clumped : 290
  3 alleles : 282 (1.35) [665/494]
  4 alleles : 8 (0.57) [13/23]

no. of Indel/Clumped : 27638
  2 alleles : 25971 (0.65) [31435/48526] (0.79) [11444/14527]
  3 alleles : 1585 (0.74) [3568/4793] (0.87) [1383/1582]
  4 alleles : 70 (0.55) [96/175] (1.61) [124/77]
  >=5 alleles : 12 (0.59) [37/63] (4.71) [33/7]

no. of SNP/Indel/Clumped : 456
  3 alleles : 257 (0.84) [332/394] (0.33) [111/340]
  4 alleles : 174 (0.38) [105/279] (0.58) [186/321]
  >=5 alleles : 25 (0.19) [12/63] (0.94) [44/47]

no. of MNP/Indel/Clumped : 153
  3 alleles : 138 (0.50) [233/466] (0.84) [102/122]
  4 alleles : 12 (0.35) [14/40] (1.42) [17/12]
  >=5 alleles : 3 (0.64) [7/11] (0.67) [4/6]

no. of SNP/MNP/Indel/Clumped : 6
  4 alleles : 1 (3.00) [3/1] (0.00) [0/3]
  >=5 alleles : 5 (0.62) [8/13] (2.00) [12/6]

no. of Reference : 0
====== Other useful categories =====
no. of Block Substitutions : 184751 #equivalent to categories with allele lengths that are the same.
  2 alleles : 182466 (1.60) [236793/148036]
  3 alleles : 2247 (1.28) [4544/3538]
  4 alleles : 34 (1.16) [109/94]
  >=5 alleles : 4 (0.76) [13/17]

no. of Complex Substitutions : 159298 #equivalent to categories not including SNPs, Block Substitutions and Simple Indels
  2 alleles : 81508 (0.61) [60312/98113] (0.66) [32479/49029]
  3 alleles : 71003 (0.69) [35811/51840] (0.34) [34268/100942]
  4 alleles : 5265 (0.49) [2924/5975] (0.30) [3375/11122]
  >=5 alleles : 1522 (0.58) [1381/2369] (0.15) [757/5143]
======= Structural variants ========
no. of structural variants : 41217
  2 alleles : 38079
    deletion : 13135
    insertion : 16451
      mobile element : 16253
        ALU : 12513
        LINE1 : 2911
        SVA : 829
      numt : 198
    duplication : 664
    inversion : 100
    copy number variation : 7729
  >=3 alleles : 3138
    copy number variation : 3138
========= General summary ==========
no. of observed variants : 79449759
no. of unclassified variants : 0
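The value in parentheses on each allele line is the ratio of the two counts in the square brackets that follow it (ts/tv, or ins/del). As a small sketch, the ratios can be recomputed from a line of the output above; the helper name ratios_from_line is hypothetical and the regular expression only assumes the [x/y] layout shown here.

```python
import re

# Matches the bracketed count pairs, e.g. [35616038/17871770]
COUNT_PAIR = re.compile(r"\[(\d+)/(\d+)\]")


def ratios_from_line(line):
    """Recompute the ratio for every [numerator/denominator] pair on a line."""
    result = []
    for num, den in COUNT_PAIR.findall(line):
        num, den = int(num), int(den)
        # A zero denominator has no defined ratio.
        result.append(round(num / den, 2) if den else None)
    return result


snp_line = "2 alleles : 53487808 (1.99) [35616038/17871770]"
snp_indel_line = "2 alleles : 42717 (0.65) [16778/25939] (0.57) [15441/27276]"

print(ratios_from_line(snp_line))        # [1.99]
print(ratios_from_line(snp_indel_line))  # [0.65, 0.57]
```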
Implementation
This is implemented in vt.
Maintained by
This page is maintained by Adrian.