Difference between revisions of "Variant Normalization"

From Genome Analysis Wiki
Revision as of 15:43, 19 May 2014

Introduction

The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, ranging from SNPs and Indels to Copy Number Variations. However, variant representation in VCF is non-unique for Indels. Failing to recognize this and handle it appropriately will often result in inaccurate analyses.

Here, we describe a variant normalization procedure that is well defined for biallelic as well as multiallelic variants. We provide a formal proof of correctness of the procedure.

Normalization

The normalization procedure of a variant representation in VCF can be considered in two parts: parsimony and left alignment. We first describe parsimony followed by left alignment.

Parsimony

Parsimony means doing something in the simplest and most economical way. In the context of variant representation, this means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0. Taking the example below, the Multi Nucleotide Polymorphism (MNP) is represented superfluously for the first 3 representations and parsimoniously for the 4th representation. When a variant has superfluous nucleotides on the left side, it is defined as not being left parsimonious as there is a need to left trim. The concept is symmetric for right parsimony and trimming. Parsimony applies to Indels too.

This figure shows multiple representations of an MNP. The left shows 4 possible representations differentiated by color. The right shows the corresponding representation in VCF. The last representation is the parsimonious representation of the MNP.
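As an illustration of trimming, here is a minimal Python sketch. The function name `trim_parsimonious` and the example alleles are hypothetical and are not taken from the figure or from vt. It only trims shared nucleotides while every allele keeps at least one base, so it is sufficient for MNPs but does not left align Indels.

```python
def trim_parsimonious(pos, alleles):
    """Trim shared nucleotides from both ends of all alleles.

    pos     -- 1-based position of the first nucleotide of REF (assumed convention)
    alleles -- list of allele strings, REF first, then the ALT alleles
    Returns the adjusted (pos, alleles). No allele is ever reduced to length 0.
    """
    alleles = list(alleles)
    # Right trim: drop the last nucleotide while all alleles share it.
    while all(len(a) >= 2 for a in alleles) and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Left trim: drop the first nucleotide while all alleles share it,
    # moving the position one base to the right each time.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# A hypothetical MNP with one superfluous base on the left and two on the right.
print(trim_parsimonious(10, ["GCATG", "GTGTG"]))
```

Under these assumptions, the padded representation at position 10 reduces to the two-nucleotide MNP `CA` to `TG` at position 11.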


Left alignment

A variant is left aligned if it is no longer possible to shift its position to the left while keeping the length of all alleles constant.

  • The representation of variants in a VCF file requires that no alleles in the REF and ALT field are represented with an empty string (empty allele). The red indel is an illegal VCF representation.
  • The green variant is not left aligned, as you can copy an A nucleotide on the left side of the variant and truncate the A on the right side of the variant.
  • The orange variant is left aligned but is not right parsimonious.
  • The blue variant is not left parsimonious.
  • The maroon variant is left aligned and parsimonious.
This figure shows multiple representations of a CA tandem repeat. The left shows five possible representations differentiated by color. The right shows the corresponding representation in VCF. The last representation represents the left aligned and parsimonious representation of the STR.


How to observe that a variant is not left aligned or parsimonious on the right side?

   If all alleles end with the same nucleotide, the variant is either not left aligned or not right parsimonious.

We prove the contrapositive:

  • If an Indel is left aligned and right parsimonious, then its alleles do not all end with the same nucleotide.

Assume an Indel is left aligned and right parsimonious. Suppose first that all alleles have a length greater than 1. Since the Indel is right parsimonious, the alleles clearly do not all end with the same nucleotide, because otherwise that nucleotide could be trimmed. Now suppose that there exists an allele of length 1 and that all the alleles end with the same nucleotide, say 'A'. This representation is still right parsimonious, as there are no superfluous nucleotides to remove without resulting in an empty allele. However, it is possible to extend all the alleles one position to the left by copying the preceding nucleotide from the reference genome. There is then a superfluous nucleotide on the right side, and removing it gives a new representation that shifts the Indel to the left by one position, in which one of the alleles again has length 1. This left aligns the Indel, which contradicts the assumption that it was already left aligned, so the alleles cannot all end with the same nucleotide. This completes the proof.
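The observation above can be turned into a quick check. The following is a small sketch with a hypothetical helper name, `needs_right_side_work`. It only reports that a representation is not yet normalized on the right side, without saying whether left alignment or right trimming is the cause.

```python
def needs_right_side_work(alleles):
    """Return True if all alleles end with the same nucleotide.

    By the statement proved above, such a variant is either not left
    aligned or not right parsimonious. Alleles are assumed to be
    non-empty strings, as VCF requires.
    """
    if any(len(a) == 0 for a in alleles):
        raise ValueError("empty alleles are not legal in VCF")
    return len({a[-1] for a in alleles}) == 1


print(needs_right_side_work(["ACA", "A"]))  # all alleles end with A
print(needs_right_side_work(["GCA", "G"]))  # alleles end with A and G
```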

Algorithm for Normalization

The algorithm to normalize a variant, whether biallelic or multiallelic, is as follows:

Lines 1 to 8 perform the left alignment and ensure parsimonious representation on the right side. Lines 9 to 11 ensure parsimonious representation on the left side. In the case of STRs, we might prefer the repeat units be retained.
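The two phases described above can be sketched in Python as follows. This is a reconstruction from the description on this page, not the vt source code, and the function name `normalize`, its parameters, and the example reference sequence are all assumptions made for illustration. The first loop covers left alignment and right trimming, and the second loop covers left trimming.

```python
def normalize(pos, alleles, ref_seq):
    """Left align and trim a variant (sketch, not the vt implementation).

    pos     -- 1-based position of the first nucleotide of REF (assumed convention)
    alleles -- list of non-empty allele strings, REF first, then ALT alleles
    ref_seq -- reference sequence of the chromosome as a string
    """
    alleles = list(alleles)
    if any(len(a) == 0 for a in alleles) or len(set(alleles)) < 2:
        raise ValueError("need non-empty alleles with at least two distinct values")

    changed = True
    while changed:
        changed = False
        # Right trim when every allele ends with the same nucleotide.
        if len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, extend all alleles one base to the left.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                # Simplification: the start of the sequence is not handled here.
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = ref_seq[pos - 1]  # convert 1-based position to 0-based index
            alleles = [base + a for a in alleles]
            changed = True

    # Left trim while every allele starts with the same nucleotide
    # and no allele would be reduced to an empty string.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# Hypothetical CA repeat: a CA deletion written at the right end of the repeat.
ref = "GGGCACACACAGGG"
print(normalize(9, ["ACA", "A"], ref))
```

With this made-up reference, the deletion written as `ACA` to `A` at position 9 shifts to `GCA` to `G` at position 3, which is left aligned and parsimonious. The sketch does not verify that REF matches the reference sequence, and it always trims repeat units rather than retaining them.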



Implementation

This is implemented in vt.

Maintained by

This page is maintained by Adrian.