
From Genome Analysis Wiki

Variant classification

8,929 bytes added, 20:44, 25 February 2016
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, from SNPs and indels to copy number variations. However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.
On this wiki page, we describe a variant classification system for VCF entries that is invariant to [http://genome.sph.umich.edu/wiki/Variant_Normalization normalization] except for the case of MNPs.
= Definitions =
The definition of a variant is based on the definition of each allele with respect to the reference sequence. We consider 5 major types, loosely described as follows.
;1. SNP
: The reference and alternate sequences are of length 1 and their nucleotides differ.
;2. MNP
: a. The reference and alternate sequences have the same length, which must be greater than 1, and all nucleotides in the sequences differ from one another.
: OR
: b. All reference and alternate sequences have the same length (this applies to all alleles).
;3. INDEL
: a. The reference and alternate sequences are not of the same length.
: AND
: b. The removal of a subsequence of the longer sequence would reduce the longer sequence to the shorter sequence.
;4. CLUMPED
: A clumping of nearby SNPs, MNPs or Indels.
;5. SV
: The alternate sequence is represented by an angle-bracket tag, for example &lt;DEL&gt;.
= Example Classification Procedure =
#Trim each allele with respect to the reference sequence individually.
#Inspect length, defined as length of alternate allele minus length of reference allele.
##if length = 0
###if length(ref) = 1 and nucleotides differ, classify as SNP (count ts and tv too)
###if length(ref) > 1
####if all nucleotides differ, classify as MNP (count ts and tv too)
####if not all nucleotides differ, classify as CLUMPED (count ts and tv too)
##if length <math>\ne</math> 0, classify as INDEL
###if shorter allele is of length 1
####if shorter allele does not match either of the end nucleotides of the longer allele, add SNP classification
###if shorter allele length > 1
####compare the shorter allele sequence with the subsequence in the 5' end of the longer allele (count ts and tv too)
#####if all nucleotides differ, add MNP classification
#####if not all nucleotides differ, add CLUMPED classification
#Variant classification is the union of the classifications of each allele present in the variant.
#If all alleles are the same length, add MNP classification.
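The procedure above can be sketched in Python as follows. This is a minimal illustration, not vt's implementation: the function names are hypothetical, the trimming rule (right trim, then left trim, always keeping at least one base per allele) is an assumption because the page does not spell it out, and the "same length" MNP rule is assumed to apply only when the reference is longer than one base. Transition and transversion counting is left out. The sketch follows the written steps literally, so it may not reproduce every annotation in the examples.

```python
def trim(ref, alt):
    """Trim shared bases, keeping at least one base in each allele (assumed rule)."""
    # Right trim first, then left trim.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def is_symbolic(alt):
    """True for angle-bracket alleles such as <DEL>."""
    return alt.startswith("<") and alt.endswith(">")


def classify_allele(ref, alt):
    """Return the set of classifications for one alternate allele."""
    if is_symbolic(alt):
        return {"SV"}
    ref, alt = trim(ref, alt)
    types = set()
    if len(ref) == len(alt):
        differs = [r != a for r, a in zip(ref, alt)]
        if len(ref) == 1:
            if differs[0]:
                types.add("SNP")
        elif all(differs):
            types.add("MNP")
        else:
            types.add("CLUMPED")
    else:
        types.add("INDEL")
        shorter, longer = sorted((ref, alt), key=len)
        if len(shorter) == 1:
            # A padding base that matches neither end of the longer allele
            # implies an additional substitution.
            if shorter not in (longer[0], longer[-1]):
                types.add("SNP")
        else:
            # Compare against the 5' end of the longer allele.
            differs = [s != l for s, l in zip(shorter, longer)]
            types.add("MNP" if all(differs) else "CLUMPED")
    return types


def classify_variant(ref, alts):
    """Union of the per-allele classifications, plus the same-length MNP rule."""
    types = set()
    for alt in alts:
        types |= classify_allele(ref, alt)
    seq_alts = [a for a in alts if not is_symbolic(a)]
    if seq_alts and len(ref) > 1 and all(len(a) == len(ref) for a in seq_alts):
        types.add("MNP")
    return types
```

For instance, under these assumptions `classify_variant("AT", ["G"])` returns the SNP and INDEL classifications, and `classify_variant("AT", ["T"])` returns INDEL only.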
= Examples =
 
We present the following examples to explain the classification described.
 
== Legend for examples ==
 
&lt;variant classification&gt;<br>
REF &lt;reference sequence&gt;
ALT &lt;alternative sequence 1&gt; #&lt;allele classification&gt;, &lt;contribution to transition, transversion, insertion or deletion count&gt;
ALT &lt;alternative sequence 2&gt; #&lt;allele classification&gt;, &lt;contribution to transition, transversion, insertion or deletion count&gt;
 
== Simple Biallelic Examples ==
 
SNP<br>
REF A
ALT G #SNP, 1 ts
 
MNP<br>
REF AT
ALT GC #MNP, 2 ts
 
INDEL<br>
REF AT
ALT A #INDEL, 1 del
 
INDEL<br>
REF AT
ALT T #INDEL, 1 del
#Note that although the padding base differs (A vs T), this is actually a simple indel because it is simply a deletion of an A base.
#If you right align this instead of left aligning, then the padding base will be T in both the reference and alternate alleles.
#Simple indel classification should be invariant to whether the variant is left or right aligned.
 
SV<br>
REF A
ALT &lt;DEL&gt; #SV
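
The ts and tv annotations in the examples can be checked with a small Python sketch. It assumes alleles contain only A, C, G and T, and it compares bases position by position from the 5' end; the function name `count_ts_tv` is hypothetical and not part of vt. A transition is a change within the purines (A, G) or within the pyrimidines (C, T); any other substitution is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(ref, alt):
    """Count transitions and transversions between position-aligned bases.

    Bases are paired from the 5' end; any bases of the longer allele beyond
    the length of the shorter one are ignored.
    """
    ts = tv = 0
    for r, a in zip(ref.upper(), alt.upper()):
        if r == a:
            continue
        pair = {r, a}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts, tv
```

With this sketch, `count_ts_tv("AT", "GC")` gives 2 transitions and no transversions, matching the MNP example above.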
 
== Complex Biallelic Examples ==
 
SNP|INDEL<br>
REF AT
ALT G #SNP, INDEL, 1 ts
#Note that it is ambiguous which pairing should be the SNP, so the transition or transversion contribution is not actually
#defined. In this case, assuming it is an A/G SNP, we get a transition, but we may also consider this as a T/G SNP, which
#is a transversion. In such ambiguous cases, we simply consider the aligned bases after left alignment to get the transition
#and transversion contribution. But please be very clear that this is an ambiguous case. It is better to consider this simply
#as a complex variant.
 
MNP|INDEL<br>
REF ATT
ALT GG #MNP, INDEL, 1 ts, 1 tv, 1 del
 
MNP|CLUMPED<br>
REF ATTTT
ALT GTTTC #MNP, CLUMPED, 2 ts
#since all the alleles are of the same length, classified as MNP too.
 
INDEL|CLUMPED<br>
REF ATTTTTTTT
ALT GTTTC #INDEL, CLUMPED, 2 ts, 1 del
 
== Simple Multiallelic Examples ==
 
SNP<br>
REF A
ALT G #SNP, 1 ts
ALT C #SNP, 1 tv
 
MNP<br>
REF AG
ALT GC #MNP, 1 ts, 1 tv
ALT CT #MNP, 2 tv
 
INDEL<br>
REF ATTT
ALT ATT #INDEL, 1 del
ALT ATTTT #INDEL, 1 ins
 
== Complex Multiallelic Examples ==
 
SNP|MNP<br>
REF AT
ALT GT #SNP, 1 ts
ALT AC #SNP, 1 ts
#since all the alleles are of the same length, classified as MNP too.
 
SNP|MNP|CLUMPED<br>
REF ATTTG
ALT GTTTC #CLUMPED, 1 ts, 1 tv
ALT ATTTC #SNP, 1 tv, note that we get the SNP after truncating the bases ATTT to reveal a G/C transversion SNP
#since all the alleles are of the same length, classified as MNP too.
 
SNP|MNP|INDEL<br>
REF GT
ALT CT #SNP, 1 tv
ALT AG #MNP, 2 tv
ALT GTT #INDEL, 1 ins
 
SNP|MNP|INDEL|CLUMPED<br>
REF GTTT
ALT CG #MNP, INDEL, 2 tv, 1 del
ALT AG #MNP, INDEL, 1 ts, 1 tv, 1 del
ALT GTGTG #SNP, INDEL, CLUMPED, 1 tv, 1 ins
 
== Structural Variant Examples ==
 
SV<br>
REF G
ALT &lt;INS:ME:LINE1&gt; #SV
 
SV<br>
REF G
ALT &lt;CN4&gt; #SV
ALT &lt;CN12&gt; #SV
 
= Interesting Variant Types =
 
Adjacent Tandem Repeats from lobSTR's tandem repeat finder panel. <br>
 
20 9538655 <span style="color:#FF0000">ATTTATTTATTTATTTATTTATTTATTTATTTATTTATT</span><span style="color:#0000FF">CATTCATTCATTCATTCATTCATTC </span> <STR>
 
This can be induced as
one record considering only the ATTT repeats
20 9538655 <span style="color:#FF0000">ATTTATTTATTT </span> <span style="color:#FF0000">ATTT </span>
 
one record with CATT repeats
20 9538695 <span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span>
 
one record with a mix of both repeat types
20 9538695 <span style="color:#FF0000">TATT</span><span style="color:#0000FF">CATTCATT </span> <span style="color:#0000FF">CATT </span>
 
= Representation of close by variants =
 
1:124001690
TTTCTTT--CAAAAAAAGATAAAAAGGTATTTCATGG
TTTCTTTAAAAAAAAAAGATAAAAAGGAATTTCATGG
 
a single complex variant
CHROM POS REF ALT
1 124001690 C AAA
 
an Indel and SNP adjacent to one another
CHROM POS REF ALT
1 124001689 T TAA
1 124001690 C A
 
Representing it as a single complex variant enforces that both "indel" and "SNP" are always together.
Representing it as 2 separate variants allows both alleles to segregate independently.
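
That both representations describe the same alternate haplotype can be checked with a short Python sketch. The helper `apply_variants` is hypothetical, and the reference fragment is taken from the alignment above, truncated after GAT; its start coordinate is derived from the C at position 124001690.

```python
def apply_variants(ref_seq, start, variants):
    """Apply (pos, ref, alt) records to a reference fragment.

    `start` is the 1-based coordinate of ref_seq[0]. Records are applied from
    the rightmost position to the leftmost so earlier offsets stay valid.
    """
    seq = ref_seq
    for pos, ref, alt in sorted(variants, reverse=True):
        i = pos - start
        if seq[i:i + len(ref)] != ref:
            raise ValueError(f"REF {ref} does not match the sequence at {pos}")
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq


# Reference fragment from the alignment; the C sits at 1:124001690.
fragment = "TTTCTTTCAAAAAAAGAT"
start = 124001690 - fragment.index("C", 4)

# A single complex variant.
complex_hap = apply_variants(fragment, start, [(124001690, "C", "AAA")])

# An indel and a SNP adjacent to one another.
split_hap = apply_variants(
    fragment, start, [(124001689, "T", "TAA"), (124001690, "C", "A")]
)

print(complex_hap == split_hap)
```

The two records only yield the same sequence when both are applied together, which is the case the single complex variant enforces.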
= Output =
This is the annotated output of peek, which summarizes the variants in a VCF file.

 #summarizes the variants found in mills.vcf
 vt peek mills.vcf
 
 stats: no. of samples : 0 #number of genotype fields in VCF file, this is a site list so it is 0
        no. of chromosomes : 25 #no. of chromosomes observed in this file.
 
 ========== Micro variants ==========
 
 no. of SNP : 54247827 #total number of SNPs
    2 alleles : 53487808 (1.99) [35616038/17871770] #ts/tv ratio and the respective counts
    3 alleles : 389190 (0.60) [291224/487156]
    4 alleles : 370828 (0.50) [370828/741656]
    >=5 alleles : 1 (0.33) [1/3]
 
 no. of MNP : 122125
    2 alleles : 121849 (1.56) [152383/97816]
    3 alleles : 273 (0.89) [537/601]
    4 alleles : 3 (1.00) [9/9]
 
 no. of Indel : 6600770 #also referred to as simple Indels
    2 alleles : 6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
    3 alleles : 280892 (8.72) [503977/57807]
    4 alleles : 28245 (131.19) [84094/641]
    >=5 alleles : 5772 (3847.00) [23082/6]
 
 no. of SNP/MNP : 1161
    3 alleles : 1143 (1.57) [1565/994]
    4 alleles : 15 (1.36) [34/25]
    >=5 alleles : 3 (0.67) [8/12]
 
 no. of SNP/Indel : 115153
    2 alleles : 42717 (0.65) [16778/25939] (0.57) [15441/27276] #ts/tv and ins/del ratios
    3 alleles : 66401 (0.72) [29681/41397] (0.33) [31458/96168]
    4 alleles : 4631 (0.55) [2420/4386] (0.25) [2602/10306]
    >=5 alleles : 1404 (0.62) [1197/1926] (0.10) [513/4989]
 
 no. of MNP/Indel : 15619
    2 alleles : 12820 (0.51) [12099/23648] (0.77) [5594/7226]
    3 alleles : 2455 (0.40) [1796/4469] (0.45) [1144/2546]
    4 alleles : 292 (0.24) [215/891] (1.42) [415/292]
    >=5 alleles : 52 (0.43) [96/225] (2.47) [126/51]
 
 no. of SNP/MNP/Indel : 273
    3 alleles : 167 (0.63) [201/321] (0.38) [70/184]
    4 alleles : 85 (0.35) [71/203] (0.28) [31/111]
    >=5 alleles : 21 (0.35) [24/68] (0.68) [25/37]
 
 no. of MNP/Clumped : 61175
    2 alleles : 60617 (1.68) [84410/50220]
    3 alleles : 549 (1.23) [1777/1449]
    4 alleles : 8 (1.43) [53/37]
    >=5 alleles : 1 (1.00) [5/5]
 
 no. of SNP/MNP/Clumped : 290
    3 alleles : 282 (1.35) [665/494]
    4 alleles : 8 (0.57) [13/23]
 
 no. of Indel/Clumped : 27638
    2 alleles : 25971 (0.65) [31435/48526] (0.79) [11444/14527]
    3 alleles : 1585 (0.74) [3568/4793] (0.87) [1383/1582]
    4 alleles : 70 (0.55) [96/175] (1.61) [124/77]
    >=5 alleles : 12 (0.59) [37/63] (4.71) [33/7]
 
 no. of SNP/Indel/Clumped : 456
    3 alleles : 257 (0.84) [332/394] (0.33) [111/340]
    4 alleles : 174 (0.38) [105/279] (0.58) [186/321]
    >=5 alleles : 25 (0.19) [12/63] (0.94) [44/47]
 
 no. of MNP/Indel/Clumped : 153
    3 alleles : 138 (0.50) [233/466] (0.84) [102/122]
    4 alleles : 12 (0.35) [14/40] (1.42) [17/12]
    >=5 alleles : 3 (0.64) [7/11] (0.67) [4/6]
 
 no. of SNP/MNP/Indel/Clumped : 6
    4 alleles : 1 (3.00) [3/1] (0.00) [0/3]
    >=5 alleles : 5 (0.62) [8/13] (2.00) [12/6]
 
 no. of Reference : 0
 
 ====== Other useful categories =====
 
 no. of Block Substitutions : 184751 #equivalent to categories with allele lengths that are the same.
    2 alleles : 182466 (1.60) [236793/148036]
    3 alleles : 2247 (1.28) [4544/3538]
    4 alleles : 34 (1.16) [109/94]
    >=5 alleles : 4 (0.76) [13/17]
 
 no. of Complex Substitutions : 159298 #equivalent to categories not including SNPs, Block Substitutions and Simple Indels
    2 alleles : 81508 (0.61) [60312/98113] (0.66) [32479/49029]
    3 alleles : 71003 (0.69) [35811/51840] (0.34) [34268/100942]
    4 alleles : 5265 (0.49) [2924/5975] (0.30) [3375/11122]
    >=5 alleles : 1522 (0.58) [1381/2369] (0.15) [757/5143]
 
 ======= Structural variants ========
 
 no. of structural variants : 41217
    2 alleles : 38079
       deletion : 13135
       insertion : 16451
          mobile element : 16253
             ALU : 12513
             LINE1 : 2911
             SVA : 829
          numt : 198
       duplication : 664
       inversion : 100
       copy number variation : 7729
    >=3 alleles : 3138
       copy number variation : 3138
 
 ========= General summary ==========
 
 no. of observed variants : 79449759
 no. of unclassified variants : 0

<div class=" mw-collapsible mw-collapsed">
 usage : vt peek [options] &lt;in.vcf&gt;
<div class="mw-collapsible-content">
 options : -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help
</div></div>
= Implementation =
This is implemented in [http://genome.sph.umich.edu/wiki/Vt#Peek vt].

 #This is a sample output of a peek command, which summarizes the variants found in a VCF file.
 
 stats: no. of samples : 0
        no. of chromosomes : 22
 
 ========== Micro variants ==========
 
 no. of SNPs : 77228885
    2 alleles (ts/tv) : 77011302 (2.11) [52287790/24723512]
    3 alleles (ts/tv) : 216560 (0.75) [185520/247600]
    4 alleles (ts/tv) : 1023 (0.50) [1023/2046]
 
 no. of MNPs : 0
    2 alleles (ts/tv) : 0 (-nan) [0/0]
    >=3 alleles (ts/tv) : 0 (-nan) [0/0]
 
 no. Indels : 2147564
    2 alleles (ins/del) : 2124842 (0.47) [683250/1441592]
    >=3 alleles (ins/del) : 22722 (2.12) [32411/15286]
 
 no. SNP/MNP : 0
    3 alleles (ts/tv) : 0 (-nan) [0/0]
    >=4 alleles (ts/tv) : 0 (-nan) [0/0]
 
 no. SNP/Indels : 12913
    2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88]
    >=3 alleles (ts/tv) (ins/del) : 12501 (0.43) [7670/17649] (18.64) [12434/667]
 
 no. MNP/Indels : 153
    2 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]
    >=3 alleles (ts/tv) (ins/del) : 153 (0.30) [138/465] (0.27) [67/248]
 
 no. SNP/MNP/Indels : 2
    3 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]
    4 alleles (ts/tv) (ins/del) : 2 (0.00) [3/5] (1.00) [3/3]
    >=5 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]
 
 no. of clumped variants : 19025
    2 alleles : 0 (-nan) [0/0] (-nan) [0/0]
    3 alleles : 18508 (0.16) [12152/75366] (0.00) [93/18653]
    4 alleles : 451 (0.15) [369/2390] (0.33) [201/609]
    >=5 alleles : 66 (0.09) [37/414] (1.19) [107/90]
 
 ====== Other useful categories =====
 
 no. complex variants : 32093
    2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88]
    >=3 alleles (ts/tv) (ins/del) : 31681 (0.21) [20369/96289] (0.64) [12905/20270]
 
 ======= Structural variants ========
 
 no. of structural variants : 41217
    2 alleles : 38079
       deletion : 13135
       insertion : 16451
          mobile element : 16253
             ALU : 12513
             LINE1 : 2911
             SVA : 829
          numt : 198
       duplication : 664
       inversion : 100
       copy number variation : 7729
    >=3 alleles : 3138
       copy number variation : 3138
 
 ========= General summary ==========
 
 no. of reference : 0
 
 no. of observed variants : 79449759
 no. of unclassified variants : 0
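
Each ratio shown in parentheses in the peek output is the first bracketed count divided by the second (ts/tv or ins/del). The following Python sketch recomputes the ratios for one output line so that they can be compared with the printed values. The regular expression and the helper name `check_ratios` are assumptions made for illustration; they are not part of vt.

```python
import re

# Matches "(ratio) [numerator/denominator]" pairs as printed by peek.
PEEK_RATIO = re.compile(r"\((-?[\w.]+)\) \[(\d+)/(\d+)\]")


def check_ratios(line):
    """Return (shown, recomputed) ratio strings for every pair on a line."""
    results = []
    for shown, num, den in PEEK_RATIO.findall(line):
        num, den = int(num), int(den)
        if den == 0:
            # The sample output prints -nan for 0/0; "inf" for a nonzero
            # numerator is an assumption of this sketch.
            computed = "-nan" if num == 0 else "inf"
        else:
            computed = f"{num / den:.2f}"
        results.append((shown, computed))
    return results
```

A line with both ts/tv and ins/del ratios yields two pairs, one per bracketed count.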
= Maintained by =
This page is maintained by [mailto:atks@umich.edu Adrian].