Changes

From Genome Analysis Wiki

Variant classification

3,395 bytes added, 20:44, 25 February 2016
Representation of close by variants
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, ranging from SNPs and indels to copy number variations. However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.
On this wiki page, we describe a variant classification system for VCF entries that is invariant to [http://genome.sph.umich.edu/wiki/Variant_Normalization normalization] except for the case of MNPs.
= Definitions =
The definition of a variant is based on the definition of each allele with respect to the reference sequence. We consider 5 major types, loosely described as follows.
;1. SNP
: The reference and alternate sequences are both of length 1 and their nucleotides differ.
;2. MNP
: a. The reference and alternate sequences are of the same length, which must be greater than 1, and all nucleotides in the sequences differ from one another.
: OR
: b. All reference and alternate sequences have the same length (this is applicable to all alleles).
;3. INDEL
: a. The reference and alternate sequences are not of the same length.
: AND
: b. The removal of a subsequence of the longer sequence would reduce the longer sequence to the shorter sequence.
;4. CLUMPED
: a. A clumping of nearby SNPs, MNPs or Indels.
;5. SV
: The alternate sequence is represented by an angled bracket tag.
= Classification Procedure =
#Trim each allele with respect to the reference sequence individually.
#Inspect length, defined as length of alternate allele minus length of reference allele.
##if length = 0
###if length(ref) = 1 and nucleotides differ, classify as SNP (count ts and tv too)
###if length(ref) > 1
####if all nucleotides differ, classify as MNP (count ts and tv too)
####if not all nucleotides differ, classify as CLUMPED (count ts and tv too)
##if length <math>\ne</math> 0, classify as INDEL
###if shorter allele is of length 1
####if shorter allele does not match either of the end nucleotides of the longer allele, add SNP classification
###if shorter allele length > 1
####compare the shorter allele sequence with the subsequence in the 5' end of the longer allele (count ts and tv too)
#####if all nucleotides differ, add MNP classification
#####if not all nucleotides differ, add CLUMPED classification
#Variant classification is the union of the classifications of each allele present in the variant.
#If all alleles are the same length, add MNP classification.
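The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of any particular tool: the function names are hypothetical, trimming is assumed to remove shared bases from the right and then the left while keeping at least one base per allele, and the final same-length MNP rule is assumed to apply only when the untrimmed alleles are longer than one base (otherwise a plain SNP would also be labelled MNP). The sketch follows the written steps literally, so it may not reproduce every annotation given in the more complex examples.

```python
def trim(ref, alt):
    """Assumed trimming: drop shared bases from the right, then the left,
    always keeping at least one base in each allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def classify_allele(ref, alt):
    """Return the set of classifications for one alternate allele (hypothetical helper)."""
    if alt.startswith("<"):
        return {"SV"}  # angled bracket tag
    ref, alt = trim(ref, alt)
    if len(ref) == len(alt):
        diffs = sum(r != a for r, a in zip(ref, alt))
        if diffs == 0:
            return set()  # identical alleles: nothing to classify
        if len(ref) == 1:
            return {"SNP"}
        return {"MNP"} if diffs == len(ref) else {"CLUMPED"}
    types = {"INDEL"}
    shorter, longer = sorted((ref, alt), key=len)
    if len(shorter) == 1:
        # a single remaining base that matches neither end of the longer allele
        if shorter != longer[0] and shorter != longer[-1]:
            types.add("SNP")
    else:
        # compare the shorter allele with the 5' end of the longer allele
        diffs = sum(s != l for s, l in zip(shorter, longer))
        if diffs == len(shorter):
            types.add("MNP")
        elif diffs > 0:
            types.add("CLUMPED")
    return types


def classify_variant(ref, alts):
    """Union of the per-allele classifications, plus the same-length MNP rule."""
    types = set()
    for alt in alts:
        types |= classify_allele(ref, alt)
    # Assumption: the rule applies only to sequence alleles longer than one base.
    if len(ref) > 1 and all(
        not alt.startswith("<") and len(alt) == len(ref) for alt in alts
    ):
        types.add("MNP")
    return types
```

For instance, <code>classify_variant("GT", ["CT", "AG", "GTT"])</code> yields the SNP, MNP and INDEL classes.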
= Examples =
We present the following examples to explain the classification described.
== Legend for examples ==
&lt;variant classification&gt;<br>
REF &lt;reference sequence&gt;
ALT &lt;alternative sequence 1&gt; #&lt;allele classification&gt;, &lt;contribution to transition, transversion, insertion or deletion count&gt;
ALT &lt;alternative sequence 2&gt; #&lt;allele classification&gt;, &lt;contribution to transition, transversion, insertion or deletion count&gt;
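The transition (ts) and transversion (tv) contributions can be counted by comparing aligned bases. A minimal sketch, assuming both sequences are already aligned position by position and contain only A, C, G and T; the name <code>count_ts_tv</code> is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(ref, alt):
    """Count transitions and transversions between aligned bases.

    A transition swaps two purines (A, G) or two pyrimidines (C, T);
    any other mismatch is counted as a transversion. Bases beyond the
    length of the shorter sequence are ignored.
    """
    ts = tv = 0
    for r, a in zip(ref, alt):
        if r == a:
            continue
        pair = {r, a}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv
```

For example, <code>count_ts_tv("AG", "GC")</code> counts one transition (A/G) and one transversion (G/C).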
== Simple Biallelic Examples ==
SNP<br>
REF A
ALT G #SNP, 1 ts
MNP<br>
REF AT
ALT GC #MNP, 2 ts
INDEL<br>
REF AT
ALT A #INDEL, 1 del
INDEL<br>
REF AT
ALT T #INDEL, 1 del
#Note that although the padding base differs (A vs T), this is actually a simple indel because it is simply a deletion of an A base.
#If you right align this instead of left aligning, then the padding will be T on both the reference and alternative alleles.
#Simple Indel classification should be invariant whether it is left or right aligned.
SV<br>
REF A
ALT &lt;DEL&gt; #SV
== Complex Biallelic Examples ==
SNP|INDEL<br>
REF AT
ALT G #SNP, INDEL, 1 ts
#Note that it is ambiguous as to which pairing should be a SNP; as such, the transition or transversion contribution is actually
#not defined. In this case, assuming it is an A/G SNP, we get a transition, but we may also consider this as a T/G SNP, which
#is a transversion. In such ambiguous cases, we simply consider the aligned bases after left alignment to get the transition
#and transversion contribution. But please be very clear that this is an ambiguous case. It is better to consider this simply
#as a complex variant.
MNP|INDEL<br>
REF ATT
ALT GG #MNP, INDEL, 1 ts, 1 tv, 1 del
MNP|CLUMPED<br>
REF ATTTT
ALT GTTTC #MNP, CLUMPED, 2 ts
#since all the alleles are of the same length, classified as MNP too.
INDEL|CLUMPED<br>
REF ATTTTTTTT
ALT GTTTC #INDEL, CLUMPED, 2 ts, 1 del
== Simple Multiallelic Examples ==
SNP<br>
REF A
ALT G #SNP, 1 ts
ALT C #SNP, 1 tv
MNP<br>
REF AG
ALT GC #MNP, 1 ts, 1 tv
ALT CT #MNP, 2 tv
INDEL<br>
REF ATTT
ALT ATT #INDEL, 1 del
ALT ATTTT #INDEL, 1 ins
== Complex Multiallelic Examples ==
SNP|MNP<br>
REF AT
ALT GT #SNP, 1 ts
ALT AC #SNP, 1 ts
#since all the alleles are of the same length, classified as MNP too.
SNP|MNP|CLUMPED<br>
REF ATTTG
ALT GTTTC #CLUMPED, 1 ts, 1 tv
ALT ATTTC #SNP, 1 tv, note that we get the SNP after truncating the bases ATTT to reveal a G/C transversion SNP
#since all the alleles are of the same length, classified as MNP too.
SNP|MNP|INDEL<br>
REF GT
ALT CT #SNP, 1 tv
ALT AG #MNP, 1 ts, 1 tv
ALT GTT #INDEL, 1 ins
SNP|MNP|INDEL|CLUMPED<br>
REF GTTT
ALT CG #MNP, INDEL, 2 tv, 1 del
ALT AG #MNP, INDEL, 1 ts, 1 tv
ALT GTGTG #SNP, INDEL, CLUMPED, 1 tv, 1 ins
== Structured Variants Examples ==
SV<br>
REF G
ALT &lt;INS:ME:LINE1&gt; #SV
SV<br>
REF G
ALT &lt;CN4&gt; #SV
ALT &lt;CN12&gt; #SV
= Interesting Variant Types =
Adjacent Tandem Repeats from lobSTR's tandem repeat finder panel. <br>
20 9538655 <span style="color:#FF0000">ATTTATTTATTTATTTATTTATTTATTTATTTATTTATT</span><span style="color:#0000FF">CATTCATTCATTCATTCATTCATTC </span> &lt;STR&gt;

This can be induced as one record considering only the ATTT repeats
20 9538655 <span style="color:#FF0000">ATTTATTTATTT </span> <span style="color:#FF0000">ATTT </span>

one record with CATT repeats
20 9538695 <span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span>

one record with a mix of both repeat types
20 9538695 <span style="color:#FF0000">TATT</span><span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span>
= Representation of close by variants =
1:124001690
TTTCTTT--CAAAAAAAGATAAAAAGGTATTTCATGG
TTTCTTTAAAAAAAAAAGATAAAAAGGAATTTCATGG

a single complex variant
CHROM POS REF ALT
1 124001690 C AAA

an Indel and SNP adjacent to one another
CHROM POS REF ALT
1 124001689 T TAA
1 124001690 C A

Representing it as a single complex variant enforces that both "indel" and "SNP" are always together.
Representing it as 2 separate variants allows both alleles to segregate independently.
= Output =
no. of structural variants : 41217
2 alleles : 38079
deletion : 13135 &lt;DEL&gt;
insertion : 16451 &lt;INS&gt;
mobile element : 16253 &lt;INS:ME&gt;
ALU : 12513 &lt;INS:ME:ALU&gt;
LINE1 : 2911 &lt;INS:ME:LINE1&gt;
SVA : 829 &lt;INS:ME:SVA&gt;
numt : 198 &lt;INS:MT&gt;
duplication : 664 &lt;DUP&gt;
inversion : 100 &lt;INV&gt;
copy number variation : 7729 &lt;CN4&gt;
>=3 alleles : 3138
copy number variation : 3138 &lt;CN4&gt;,&lt;CN8&gt; <br>
3 alleles : 273 (0.89) [537/601]
4 alleles : 3 (1.00) [9/9] <br>
no. of Indel : 6600770 #also referred to as simple Indels
2 alleles : 6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
3 alleles : 280892 (8.72) [503977/57807]
4 alleles : 34 (1.16) [109/94]
>=5 alleles : 4 (0.76) [13/17] <br>
no. of Complex Substitutions : 159298 #equivalent to categories not including simple SNPs, Block Substitutions and Simple Indels
2 alleles : 81508 (0.61) [60312/98113] (0.66) [32479/49029]
3 alleles : 71003 (0.69) [35811/51840] (0.34) [34268/100942]
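
Each count line above pairs a ratio with the two counts it was computed from, for example the ins/del ratio for biallelic Indels. A minimal sketch of that formatting; <code>format_ratio</code> is a hypothetical helper and not part of any tool named on this page.

```python
def format_ratio(numerator, denominator):
    """Render '(ratio) [numerator/denominator]' with the ratio to two decimals."""
    if denominator == 0:
        # Assumption: report 'nan' rather than fail when the second count is zero.
        return f"(nan) [{numerator}/{denominator}]"
    return f"({numerator / denominator:.2f}) [{numerator}/{denominator}]"


# Insertion and deletion counts for biallelic Indels, taken from the output above.
print(format_ratio(2937096, 3348765))
```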