From Genome Analysis Wiki
The Variant Call Format (VCF) is a flexible file format specification that can represent many different variant types, ranging from SNPs and indels to copy number variations. However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.
On this wiki page, we describe a variant classification system for VCF.
= Definitions =
The definition of a variant is based on the definition of each allele with respect to the reference sequence. We consider 5 major types, loosely defined as follows.
; SNP
: The reference and alternate sequences are both of length 1 and the nucleotides differ.
; MNP
: a. The reference and alternate sequences are of the same length, which must be greater than 1, and every nucleotide differs between the two sequences.
: OR
: b. All reference and alternate sequences have the same length.
; INDEL
: a. The reference and alternate sequences are not the same length.
: AND
: b. Removing a subsequence from the longer sequence reduces it to the shorter sequence.
; CLUMPED
: a. A clumping of nearby SNPs, MNPs or Indels.
; SV
: The alternate sequence is represented by an angled bracket tag.
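The rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any particular tool: the function names `classify_allele` and `classify_record` are hypothetical, alleles are assumed to be already trimmed of shared padding bases, and any explicit sequence pair that fits none of the simple rules is assumed to be CLUMPED.

```python
def is_simple_indel(ref, alt):
    """True if deleting one contiguous subsequence from the longer
    sequence yields the shorter sequence."""
    short, long_ = sorted((ref, alt), key=len)
    if len(short) == len(long_):
        return False
    # Try every split point: a prefix of `short` must start `long_`
    # and the remaining suffix of `short` must end it.
    return any(
        long_.startswith(short[:i]) and long_.endswith(short[i:])
        for i in range(len(short) + 1)
    )


def classify_allele(ref, alt):
    """Label one alternate allele relative to the reference (hypothetical helper)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "SV"  # angled bracket tag
    if ref == alt:
        raise ValueError("alternate allele is identical to the reference")
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "SNP"
        if all(r != a for r, a in zip(ref, alt)):
            return "MNP"  # rule (a): every position differs
        return "CLUMPED"  # assumption: nearby simple variants
    return "INDEL" if is_simple_indel(ref, alt) else "CLUMPED"


def classify_record(ref, alts):
    """Collect the labels of all alternate alleles of one record."""
    labels = {classify_allele(ref, alt) for alt in alts}
    # Rule (b) for MNP: all explicit sequences share the same length (> 1).
    seqs = [ref] + [a for a in alts if not a.startswith("<")]
    if len(seqs) > 1 and len(ref) > 1 and all(len(s) == len(ref) for s in seqs):
        labels.add("MNP")
    return labels
```

For example, `classify_record("ATTTT", ["GTTTC"])` yields both CLUMPED and MNP, since the two differing bases are separated by matching ones but all alleles have the same length.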
= Examples =
We present the following examples to illustrate the concepts defined above.
== Simple Biallelic Examples ==
REF AT ALT
== Complex Biallelic Examples ==
ALT GTTTC #since all the alleles are of the same length, classified as MNP too.
== Simple Multiallelic Examples ==
ALT G ALT C
ALT GC ALT CT
ALT ATT ALT
== Complex Multiallelic Examples ==
#SNP ALT AC #SNP #since all the alleles are of the same length, classified as MNP too.
#CLUMPED ALT ATTTC # CLUMPED #since all the alleles are of the same length, classified as MNP too.
#SNP ALT AG #MNP ALT GTT #INDEL
#MNP |INDEL ALT AG #MNP ALT GTGTG #SNP |INDEL |CLUMPED
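
The multiallelic examples above carry several labels at once, joined by `|` (for instance `SNP |INDEL |CLUMPED`). One way to model this, sketched below, is as a union of type flags. The `VariantType` enum and the `combine` and `describe` helpers are hypothetical names chosen for illustration, and the per-allele types are supplied by hand rather than computed.

```python
from enum import Flag, auto


class VariantType(Flag):
    """Hypothetical bit flags for the five major types."""
    SNP = auto()
    MNP = auto()
    INDEL = auto()
    CLUMPED = auto()
    SV = auto()


def combine(allele_types):
    """Union the type flags of every alternate allele in a record."""
    combined = VariantType(0)
    for t in allele_types:
        combined |= t
    return combined


def describe(vtype):
    """Render a combined type as 'SNP|INDEL|...' in definition order."""
    return "|".join(t.name for t in VariantType if t in vtype)


record_type = combine([VariantType.SNP, VariantType.INDEL, VariantType.CLUMPED])
print(describe(record_type))
```

Using bit flags keeps a record's classification in a single value while still allowing a membership test for each individual type.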
= Output =
3 alleles : 273 (0.89) [537/601]
4 alleles : 3 (1.00) [9/9]

no. of Indel : 6600770
2 alleles : 6285861 (0.88) [2937096/3348765] #ins/del ratio and the respective counts
3 alleles : 280892 (8.72) [503977/57807]
>=5 alleles : 21 (0.35) [24/68] (0.68) [25/37]

no. of MNP/Clumped : 61175
2 alleles : 60617 (1.68) [84410/50220] (-nan) [0/0]
3 alleles : 549 (1.23) [1777/1449] (-nan) [0/0]
4 alleles : 8 (1.43) [53/37] (-nan) [0/0]
>=5 alleles : 1 (1.00) [5/5] (-nan) [0/0]

no. of SNP/MNP/Clumped : 290
3 alleles : 282 (1.35) [665/494] (-nan) [0/0]
4 alleles : 8 (0.57) [13/23] (-nan) [0/0]

no. of Indel/Clumped : 27638
2 alleles : 25971 (0.65) [31435/48526] (0.79) [11444/14527]
====== Other useful categories =====

no. of Block Substitutions : 184751 #equivalent to categories with allele lengths that are the same.
2 alleles : 182466 (1.60) [236793/148036] (-nan) [0/0]
3 alleles : 2247 (1.28) [4544/3538] (-nan) [0/0]
4 alleles : 34 (1.16) [109/94] (-nan) [0/0]
>=5 alleles : 4 (0.76) [13/17] (-nan) [0/0]

no. of Complex Substitutions : 159298 #equivalent to categories not including simple SNPs, Block Substitutions and Simple Indels
2 alleles : 81508 (0.61) [60312/98113] (0.66) [32479/49029]
3 alleles : 71003 (0.69) [35811/51840] (0.34) [34268/100942]
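
In the output above, each value in parentheses is the ratio of the two counts in the brackets that follow it, and `-nan` appears where both counts are zero. The sketch below reads such lines and recomputes the ratios. The helpers `ratio` and `parse_pairs` are hypothetical names, and the regular expression assumes the `(ratio) [a/b]` layout shown here.

```python
import math
import re

# Matches "(0.88) [2937096/3348765]" as well as "(-nan) [0/0]".
PAIR = re.compile(r"\((-?nan|\d+(?:\.\d+)?)\)\s*\[(\d+)/(\d+)\s*\]")


def ratio(first, second):
    """first / second, following floating point conventions for a zero divisor."""
    if second == 0:
        return math.nan if first == 0 else math.inf
    return first / second


def parse_pairs(line):
    """Return (reported_ratio, first_count, second_count) for each pair on a line."""
    return [
        (float(reported), int(first), int(second))
        for reported, first, second in PAIR.findall(line)
    ]


line = "2 alleles : 6285861 (0.88) [2937096/3348765]"
for reported, first, second in parse_pairs(line):
    # The recomputed value should agree with the reported one to two decimals.
    print(reported, round(ratio(first, second), 2))
```

Recomputing the ratios this way is a quick consistency check when the output has been copied or reformatted.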