Variant classification
From Genome Analysis Wiki
Introduction
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, ranging from SNPs and indels to copy number variations. However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences.
On this wiki page, we describe a variant classification system for VCF variants.
Definitions
The definition of a variant is based on the definition of each allele with respect to the reference sequence. We consider 5 major types as follows.
- 1. SNP
- The reference and alternate sequences are both of length 1 and the nucleotides differ.
- 2. MNP
- a. The reference and alternate sequences are of the same length, which must be greater than 1, and the nucleotides differ at every position.
- OR
- b. All reference and alternate sequences have the same length.
- 3. INDEL
- a. The reference and alternate sequences are not of the same length.
- AND
- b. Removing a subsequence from the longer sequence reduces it to the shorter sequence.
- 4. CLUMPED
- 5. SV
- The alternate sequence is represented by an angle-bracket tag.
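The definitions above can be sketched for a single REF/ALT pair as follows. This is a minimal illustration rather than vt's implementation: the function names are hypothetical, "subsequence" in the INDEL rule is assumed to mean one contiguous block, only MNP rule (a) is applied, and CLUMPED is not handled because no definition is given for it. Alleles that fit none of the simple rules return None.

```python
def is_simple_indel(ref, alt):
    """INDEL rules 3a and 3b: lengths differ and deleting one contiguous
    block (an assumption) from the longer allele yields the shorter one."""
    if len(ref) == len(alt):
        return False
    longer, shorter = (ref, alt) if len(ref) > len(alt) else (alt, ref)
    gap = len(longer) - len(shorter)
    # Try removing a block of length `gap` at every possible offset.
    return any(longer[:i] + longer[i + gap:] == shorter
               for i in range(len(shorter) + 1))


def classify_allele(ref, alt):
    """Return 'SNP', 'MNP', 'INDEL', 'SV', or None for anything else."""
    if alt.startswith("<") and alt.endswith(">"):
        return "SV"
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    if len(ref) == len(alt) and len(ref) > 1 and all(r != a for r, a in zip(ref, alt)):
        return "MNP"
    if is_simple_indel(ref, alt):
        return "INDEL"
    return None  # complex or clumped alleles are outside this sketch


for ref, alt in [("A", "G"), ("AT", "GC"), ("AT", "A"), ("A", "<DEL>")]:
    print(ref, alt, classify_allele(ref, alt))
```

The four pairs in the loop are the simple biallelic examples listed on this page.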
Simple Biallelic Examples
SNP
REF A ALT G
MNP
REF AT ALT GC
INDEL
REF AT ALT A
SV
REF A ALT <DEL>
Complex Biallelic Examples
SNP|INDEL
REF AT ALT G
MNP|INDEL
REF ATT ALT GC
MNP|CLUMPED
REF ATTTT ALT GTTTC
INDEL|CLUMPED
REF ATTTTTTTT ALT GTTTC
Simple Multiallelic Examples
SNP
REF A ALT G ALT C
MNP
REF AG ALT GC ALT CT
INDEL
REF ATTT ALT ATT ALT AT
Complex Multiallelic Examples
SNP|MNP
REF AT
ALT GT #SNP
ALT AC #SNP
#since all the alleles are of the same length, classified as MNP too.
SNP|MNP|INDEL
REF GT
ALT CT #SNP
ALT AG #MNP
ALT GTT #INDEL
SNP|MNP|INDEL|CLUMPED
REF GTTT
ALT CG #MNP|INDEL
ALT AG #MNP
ALT GTGTG #SNP|INDEL|CLUMPED
SV Examples
======= Structural variants ========
no. of structural variants : 41217
    2 alleles : 38079
        deletion : 13135 <DEL>
        insertion : 16451 <INS>
            mobile element : 16253 <INS:ME>
                ALU : 12513 <INS:ME:ALU>
                LINE1 : 2911 <INS:ME:LINE1>
                SVA : 829 <INS:ME:SVA>
            numt : 198 <INS:MT>
        duplication : 664 <DUP>
        inversion : 100 <INV>
        copy number variation : 7729 <CN4>
    >=3 alleles : 3138
        copy number variation : 3138 <CN4>,<CN8>
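The angle-bracket tags in the SV examples are colon-separated, with the top-level type first and subtypes after it (for example INS, then ME, then ALU). A small sketch of splitting such a tag into its levels is shown here; the helper name is hypothetical and this is not how vt itself is known to parse the tags.

```python
def parse_sv_tag(alt):
    """Split a symbolic ALT such as '<INS:ME:ALU>' into its type levels."""
    if not (alt.startswith("<") and alt.endswith(">")):
        raise ValueError(f"not an angle-bracket tag: {alt!r}")
    # Drop the surrounding brackets, then split on the level separator.
    return alt[1:-1].split(":")


for tag in ["<DEL>", "<INS:ME>", "<INS:ME:ALU>", "<INS:MT>"]:
    print(tag, parse_sv_tag(tag))
```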
Output
Summarizes the variants in a VCF file
#summarizes the variants found in mills.vcf
vt peek mills.vcf

usage : vt peek [options] <in.vcf>

options : -o  output VCF file [-]
          -I  file containing list of intervals []
          -i  intervals []
          -r  reference sequence fasta file []
          --  ignores the rest of the labeled arguments following this flag
          -h  displays help
#This is a sample output of a peek command which summarizes the variants found in a VCF file.
stats: no. of samples : 0
       no. of chromosomes : 22

========== Micro variants ==========

no. of SNPs : 77228885
    2 alleles (ts/tv) : 77011302 (2.11) [52287790/24723512]
    3 alleles (ts/tv) : 216560 (0.75) [185520/247600]
    4 alleles (ts/tv) : 1023 (0.50) [1023/2046]

no. of MNPs : 0
    2 alleles (ts/tv) : 0 (-nan) [0/0]
    >=3 alleles (ts/tv) : 0 (-nan) [0/0]

no. Indels : 2147564
    2 alleles (ins/del) : 2124842 (0.47) [683250/1441592]
    >=3 alleles (ins/del) : 22722 (2.12) [32411/15286]

no. SNP/MNP : 0
    3 alleles (ts/tv) : 0 (-nan) [0/0]
    >=4 alleles (ts/tv) : 0 (-nan) [0/0]

no. SNP/Indels : 12913
    2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88]
    >=3 alleles (ts/tv) (ins/del) : 12501 (0.43) [7670/17649] (18.64) [12434/667]

no. MNP/Indels : 153
    2 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]
    >=3 alleles (ts/tv) (ins/del) : 153 (0.30) [138/465] (0.27) [67/248]

no. SNP/MNP/Indels : 2
    3 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]
    4 alleles (ts/tv) (ins/del) : 2 (0.00) [3/5] (1.00) [3/3]
    >=5 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]

no. of clumped variants : 19025
    2 alleles : 0 (-nan) [0/0] (-nan) [0/0]
    3 alleles : 18508 (0.16) [12152/75366] (0.00) [93/18653]
    4 alleles : 451 (0.15) [369/2390] (0.33) [201/609]
    >=5 alleles : 66 (0.09) [37/414] (1.19) [107/90]

====== Other useful categories =====

no. complex variants : 32093
    2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88]
    >=3 alleles (ts/tv) (ins/del) : 31681 (0.21) [20369/96289] (0.64) [12905/20270]

======= Structural variants ========

no. of structural variants : 41217
    2 alleles : 38079
        deletion : 13135
        insertion : 16451
            mobile element : 16253
                ALU : 12513
                LINE1 : 2911
                SVA : 829
            numt : 198
        duplication : 664
        inversion : 100
        copy number variation : 7729
    >=3 alleles : 3138
        copy number variation : 3138

========= General summary ==========

no. of reference : 0

no. of observed variants : 79449759
no. of unclassified variants : 0
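In the sample output, the figure in parentheses is the ratio of the two bracketed counts; for biallelic SNPs, 52287790/24723512 is about 2.11. The sketch below shows how such a ts/tv ratio could be computed, assuming the standard convention that A<->G and C<->T substitutions are transitions and all other substitutions are transversions. The function names are hypothetical and this is not vt's own code. An undefined ratio (0/0) is returned as nan, matching the "-nan" entries in the output.

```python
import math

# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def is_transition(ref, alt):
    """True if the single-base substitution ref>alt is a transition."""
    return (ref.upper(), alt.upper()) in TRANSITIONS


def ts_tv(snps):
    """Return (ts, tv, ratio) for an iterable of (ref, alt) single-base pairs."""
    snps = list(snps)
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    if tv == 0:
        # 0/0 is undefined; ts/0 with ts > 0 is reported as infinity here.
        ratio = math.nan if ts == 0 else math.inf
    else:
        ratio = ts / tv
    return ts, tv, ratio


print(ts_tv([("A", "G"), ("C", "T"), ("A", "C")]))
print(round(52287790 / 24723512, 2))
```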
Maintained by
This page is maintained by Adrian.