Changes

From Genome Analysis Wiki
3,979 bytes added ,  09:55, 1 August 2016
Line 1: Line 1:  
= Introduction =
 
= Introduction =
   −
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types ranging from SNPs, indels to copy number variations.  However, variant representation in VCF is non-unique for variants that have their reference and alternate sequence expressed explicitly. a failure to recognize this will frequently result in inaccurate analyses.
+
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, ranging from SNPs and indels to copy number variations.  However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences. A failure to recognize this will frequently result in inaccurate analyses.
    
On this wiki page, we describe a variant normalization procedure that is well defined for biallelic as well as multiallelic variants.  We then provide a formal proof of the procedure's correctness.
 
On this wiki page, we describe a variant normalization procedure that is well defined for biallelic as well as multiallelic variants.  We then provide a formal proof of the procedure's correctness.
Line 37: Line 37:  
Left aligning a variant means shifting the start position of that variant to the left till it is no longer possible to do so.  It is a concept associated with insertion and deletion variants and describes specifically the nature of a position of a variant as opposed to its length.  In order to further differentiate left alignment from simply left padding a variant, the definition is as follows:
 
Left aligning a variant means shifting the start position of that variant to the left till it is no longer possible to do so.  It is a concept associated with insertion and deletion variants and describes specifically the nature of a position of a variant as opposed to its length.  In order to further differentiate left alignment from simply left padding a variant, the definition is as follows:
   −
     A variant is left aligned if it is no longer possible to shift its position  
+
     A variant is left aligned if and only if it is no longer possible to shift its position  
 
     to the left while keeping the length of all its alleles constant.   
 
     to the left while keeping the length of all its alleles constant.   
   Line 88: Line 88:  
:&rArr; A variant is normalized if and only if it has no superfluous nucleotides on the left side and is right parsimonious and left aligned.  (definition of left parsimony) <br>
 
:&rArr; A variant is normalized if and only if it has no superfluous nucleotides on the left side and is right parsimonious and left aligned.  (definition of left parsimony) <br>
 
:&rArr; A variant is normalized if and only if it has no superfluous nucleotides on the left side and its alleles do not all end with the same nucleotide. (lemma 1) <br>
 
:&rArr; A variant is normalized if and only if it has no superfluous nucleotides on the left side and its alleles do not all end with the same nucleotide. (lemma 1) <br>
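The final condition above can be tested from the allele strings alone. The following is a minimal sketch of such a check, not the vt implementation. The function name <code>is_normalized</code> and the convention of passing the alleles as a list with the reference allele first are assumptions made for illustration.

```python
def is_normalized(alleles):
    """Check the two conditions above for a list of allele strings (REF first).

    Hypothetical helper for illustration; it assumes upper-case nucleotide
    strings and at least two alleles.
    """
    if len(alleles) < 2 or any(len(a) == 0 for a in alleles):
        return False
    # Right side: the alleles must not all end with the same nucleotide.
    ends_differ = len({a[-1] for a in alleles}) > 1
    # Left side: a shared first nucleotide is superfluous only if every
    # allele would still be non-empty after removing it.
    shared_start = len({a[0] for a in alleles}) == 1
    left_superfluous = shared_start and all(len(a) >= 2 for a in alleles)
    return ends_differ and not left_superfluous


print(is_normalized(["GCA", "G"]))     # True
print(is_normalized(["CAC", "C"]))     # False: both alleles end with C
print(is_normalized(["GGCA", "GG"]))   # False: leading G is superfluous
```

Note that the check needs no access to the reference genome, which is what makes the lemma convenient in practice.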
 +
 +
== Uniqueness ==
 +
 +
It is important that normalization results in a unique representation of the variant.  Before we begin the proof, accept intuitively that any representation of a variant can be
 +
transformed to another representation by adding nucleotides from the reference sequence to either end of all the alleles at the same time, or by removing equivalent nucleotides from the ends of
 +
all the alleles at the same time.
 +
 +
 +
Now suppose there are two normalized variants A and B that represent the same variant.  Suppose A is at a different position from B and, without loss of generality, B is to the right of A.  This is not possible: by definition, a normalized variant is left aligned,
 +
but if A and B were at different positions, B could be shifted left to the position of A since they represent the same variant, which contradicts B being left aligned.  So A and B must be at the same position.
 +
 +
 +
Now suppose that A and B are at the same position but have different lengths, where B is longer than A (without loss of generality).  This is not possible either: B could then be trimmed to the same length as A, so B is not parsimonious, which contradicts the assumption that B is normalized.
 +
 +
 +
Thus A and B must be at the same position and have the same length, so the normalized representation of a variant is unique.
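The uniqueness argument can be illustrated by normalizing several equivalent representations of one deletion and observing that they converge. The sketch below trims shared trailing nucleotides, extends to the left whenever an allele becomes empty, and finally trims superfluous leading nucleotides. It is not the vt source code. The function name <code>normalize</code>, the 0-based position, the in-memory reference string and the example sequence are all assumptions for illustration.

```python
def normalize(reference, pos, alleles):
    """Return (pos, alleles) left aligned and parsimonious.

    Sketch only: `pos` is a 0-based offset into `reference`, `alleles` is a
    list of distinct, non-empty strings with the reference allele first.
    """
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        # Remove a trailing nucleotide shared by every allele.
        if len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, extend all alleles one base to the left.
        if any(len(a) == 0 for a in alleles):
            if pos == 0:
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            alleles = [reference[pos] + a for a in alleles]
            changed = True
    # Remove superfluous leading nucleotides while every allele stays non-empty.
    while len({a[0] for a in alleles}) == 1 and all(len(a) >= 2 for a in alleles):
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


reference = "GGGCACACACAGGG"
# Three representations of the same deletion of one CA unit.
for pos, alleles in [(7, ["CAC", "C"]), (9, ["CAG", "G"]), (1, ["GGCA", "GG"])]:
    print(normalize(reference, pos, alleles))  # each prints (2, ['GCA', 'G'])
```

All three inputs describe the same alternate haplotype, and all three reduce to the same position and the same alleles, as the proof requires.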
    
= Implementation =
 
= Implementation =
Line 180: Line 196:  
|-
 
|-
 
| #normalized after gatk
 
| #normalized after gatk
| 0
+
| 57
 
| 0
 
| 0
 
| 57
 
| 57
| GATK's algorithm is documented to work only for biallelic simple indels.  The 57 variants were either multiallelic or mixed variants (SNP adjacent to indel).  
+
| GATK's algorithm is documented to work only for biallelic simple indels.  The 57 variants were either multiallelic or mixed variants (SNP adjacent to indel). However, 6 biallelic mixed variants that were not left aligned turned out to be simple indels after left alignment.
 
|-
 
|-
 
| #normalized after vt
 
| #normalized after vt
Line 201: Line 217:  
*GATK v3.1-1-g07a4bf8
 
*GATK v3.1-1-g07a4bf8
 
*vt normalize v0.5
 
*vt normalize v0.5
 +
 +
= Here is an example where this normalization algorithm fails =
 +
 +
We distinguish the concepts of normalization and decomposition/reconstruction of variants as follows:
 +
 +
 +
Normalization involves reducing representations of a variant to a canonical representation. Normalization can be applied to biallelic variants or multiallelic variants. The problem of normalization is solvable and there exists a unique representation that is left aligned and parsimonious. A mathematical proof has been published. [http://bioinformatics.oxfordjournals.org/content/suppl/2015/02/19/btv112.DC1/VtNormApplicationNote_supp_20141113_1346.pdf]
 +
 +
 +
Decomposition of variants involves the breaking down of a variant record into multiple records. It may be done vertically, as in multiallelics becoming biallelics, or horizontally, as in a cluster of indels and SNPs represented as a complex variant being split up into several records. Horizontal decompositions in general do not have a unique solution.  Similarly, reconstruction combines several variant records into a single record and can be done vertically and horizontally too. Vertical decomposition of a multiallelic variant to a set of biallelic records is a many-to-one function.  Construction of a set of biallelic variants into a multiallelic record is not unique, as you need to consider all possible permutations of the haplotypes containing your alleles.
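As a small illustration of vertical decomposition, the sketch below splits a multiallelic site into biallelic records and then checks which of the resulting records still satisfy the end-nucleotide condition for normalization. It handles site-level alleles only and ignores genotypes. The function names and the example record are hypothetical and do not reflect how vt decompose is implemented.

```python
def decompose_vertically(pos, ref, alts):
    """Split one multiallelic record into one biallelic record per ALT allele.

    Hypothetical helper: genotypes and INFO fields are deliberately ignored.
    """
    return [(pos, ref, alt) for alt in alts]


def shares_last_nucleotide(ref, alt):
    """True if REF and ALT end with the same nucleotide, i.e. the biallelic
    record is no longer normalized and could be trimmed or shifted."""
    return ref[-1] == alt[-1]


# A multiallelic record whose three alleles do not all end with the same base.
records = decompose_vertically(100, "CAA", ["C", "CA"])
for pos, ref, alt in records:
    print(pos, ref, alt, shares_last_nucleotide(ref, alt))
# 100 CAA C False
# 100 CAA CA True
```

The second record, CAA to CA, shares a trailing A and would be rewritten by a subsequent normalization step, so a record that was normalized as a multiallelic site need not stay normalized after decomposition.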
 +
 +
 +
If your example contains the decomposition or reconstruction of variants, then it is probable that you can find inconsistencies. 
 +
 +
 +
It is important to distinguish between normalization and decomposition/reconstruction.  The notion of normalization implies that a variant can be reduced to a standardized form.  If you were to include decomposition and reconstruction in your notion of normalization, you are bound to find inconsistencies simply due to the inherent issues of identifiability.
 +
 +
 +
When performing decomposition and construction, I think the following factors should be considered:
 +
 +
* Are your variants describing just a single individual or a population?
 +
* Are the genotypes (if any) in your individual(s) phased?
 +
 +
Depending on the context, you will obtain different answers.
 +
 +
[https://github.com/atks/vt/issues/16 An example of inconsistent variant representation due to using vt normalize]
 +
 +
= Citation =
 +
 +
[http://bioinformatics.oxfordjournals.org/content/31/13/2202 Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. (2015)  Unified Representation of Genetic Variants. Bioinformatics.]
 +
 +
= Translations =
 +
 +
A Mandarin translation can be found [http://www.lyon0804.com/fan-yi-variant-normalization.html here].
    
= Maintained by =
 
= Maintained by =
    
This page is maintained by  [mailto:atks@umich.edu Adrian].
 
This page is maintained by  [mailto:atks@umich.edu Adrian].