Difference between revisions of "Variant Normalization"

From Genome Analysis Wiki

Revision as of 09:28, 13 July 2013

Introduction

Variant representation in the Variant Call Format (VCF) is not unique. We describe here a variant normalization that is parsimonious and left aligned.

Normalization

Normalization of a variant representation is divided into two parts: parsimony and left alignment.

Parsimony

This figure shows multiple representations of an MNP. The left shows four possible representations differentiated by color. The right shows the corresponding representations in VCF. The last one is the parsimonious representation of the MNP.

We would like to represent a variant in as few nucleotides as possible. Taking the example above, the MNP is represented superfluously in the first three representations and parsimoniously in the fourth. When a variant has superfluous nucleotides on the left side, we say that it needs to be left trimmed; likewise, superfluous nucleotides on the right side mean that it needs to be right trimmed.
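As an illustration of trimming, the following is a minimal Python sketch, not taken from the page itself. The function name `trim`, the 1-based `pos` convention, and the example alleles are assumptions made for this example. It removes nucleotides shared by all alleles on the right and then on the left, while never letting an allele become empty. It is meant for substitutions such as MNPs; indels additionally need the left alignment described in the next section.

```python
def trim(pos, alleles):
    """Trim nucleotides shared by all alleles, keeping every allele non-empty.

    pos is the 1-based position of the first nucleotide of the alleles
    (an assumed convention for this sketch); alleles lists REF first.
    """
    alleles = list(alleles)
    # Right trim: drop the last nucleotide while every allele shares it.
    while min(len(a) for a in alleles) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Left trim: drop the first nucleotide while every allele shares it,
    # moving the position one step to the right each time.
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# Hypothetical MNP with one superfluous nucleotide on each side.
print(trim(100, ["GCAT", "GTGT"]))  # (101, ['CA', 'TG'])
```

In this example the shared trailing `T` is right trimmed and the shared leading `G` is left trimmed, which moves the position from 100 to 101.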

Left alignment

This figure shows multiple representations of a CA tandem repeat. The left shows five possible representations differentiated by color. The right shows the corresponding representations in VCF. The last one is the left aligned and parsimonious representation of the STR.

Left alignment is usually a concept associated with indels. We define an indel to be left aligned when the variant cannot be shifted any further to the left while keeping the represented indel consistent and without representing any allele as an empty string (empty allele).

Similarly, an indel representation can be non-parsimonious, as shown in the figure, with the caveat that the leftmost nucleotide may be the same in all alleles, because no empty alleles are allowed.
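To make "consistent" concrete, here is a small Python sketch that applies several representations of the same deletion to a reference and checks that they yield the same haplotype. The reference string, the positions, and the helper name `apply_variant` are hypothetical and are not read off the figure; positions are assumed to be 1-based.

```python
def apply_variant(reference, pos, ref, alt):
    """Return the haplotype obtained by replacing ref with alt at 1-based pos."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    return reference[:start] + alt + reference[start + len(ref):]


# Hypothetical reference containing four CA repeat units.
reference = "GGGCACACACAGGG"

# Four ways of writing the deletion of one CA unit: (pos, REF, ALT).
representations = [
    (9, "ACA", "A"),    # shifted to the right
    (5, "ACA", "A"),    # inside the repeat
    (3, "GCAC", "GC"),  # left aligned but not right trimmed
    (3, "GCA", "G"),    # left aligned and parsimonious
]

haplotypes = {apply_variant(reference, p, r, a) for p, r, a in representations}
print(haplotypes)  # {'GGGCACACAGGG'}
```

All four records describe the same alternate sequence, which is why a single normalized form is useful when comparing variant calls.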

How to observe that a variant is not left aligned or parsimonious on the right side?

If the rightmost nucleotide of every allele is the same, the variant is not left aligned or not parsimonious on the right side.

Proof: Suppose an indel is already left aligned, and consider shifting it one position to the right. To do so without any loss of information (i.e. so that the original alleles can be reconstructed from the right-shifted variant given the reference genome), we first extend the right end of every allele by the next base observed on the reference sequence and then truncate the leftmost nucleotide of every allele simultaneously. The truncation is lossless only if the leftmost nucleotide is the same in every allele (in other words, the same as the reference nucleotide), and extending first guarantees that the truncation leaves no allele empty. After the shift, every allele ends with the base that was appended from the reference, so the rightmost nucleotide is the same in all alleles. Conversely, whenever all alleles end with the same nucleotide, that nucleotide can be removed from every allele, which either trims the variant or, once an emptied allele is restored by extending one base to the left, shifts it to the left. Thus, an indel is not left aligned or not right parsimonious if the rightmost nucleotide of each allele is the same.
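The right shift used in the proof can be sketched in Python as follows. The helper names `shift_right` and `ends_match`, the 1-based `pos` convention, and the reference string are assumptions made for this illustration.

```python
def ends_match(alleles):
    """True when every allele ends with the same nucleotide."""
    return len({a[-1] for a in alleles}) == 1


def shift_right(reference, pos, alleles):
    """Shift a variant one position to the right.

    pos is the 1-based position of the first nucleotide; alleles[0] is REF.
    """
    if len({a[0] for a in alleles}) != 1:
        raise ValueError("leftmost nucleotides differ; cannot shift right")
    next_index = pos - 1 + len(alleles[0])  # 0-based index just past REF
    if next_index >= len(reference):
        raise ValueError("no reference base left to extend with")
    next_base = reference[next_index]
    # Extend every allele on the right, then truncate on the left.
    return pos + 1, [(a + next_base)[1:] for a in alleles]


reference = "GGGCACACACAGGG"  # hypothetical reference with four CA units
left_aligned = (3, ["GCA", "G"])
shifted = shift_right(reference, *left_aligned)

print(shifted)                       # (4, ['CAC', 'C'])
print(ends_match(left_aligned[1]))   # False
print(ends_match(shifted[1]))        # True
```

The left aligned form has differing rightmost nucleotides, while the shifted form ends with the same nucleotide in both alleles, as the argument predicts.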

Algorithm for Normalization

The algorithm to normalize a variant, whether biallelic or multiallelic, is as follows:

Lines 1 to 8 perform the left alignment and ensure parsimonious representation on the right side. Lines 9 to 11 ensure parsimonious representation on the left side. In the case of STRs, we might prefer that the repeat units be retained.
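A minimal Python sketch of a procedure consistent with this description is given below. It is an assumption about the steps rather than the page's own pseudocode, so its line numbers do not correspond to the lines cited above. The function name `normalize`, the in-memory `reference` string, and the 1-based `pos` convention are hypothetical choices for this example.

```python
def normalize(reference, pos, alleles):
    """Left align and trim a variant (biallelic or multiallelic).

    reference: reference sequence as a string (assumed to fit in memory)
    pos: 1-based position of the first nucleotide of the alleles
    alleles: allele strings, REF first
    """
    alleles = list(alleles)
    if len(set(alleles)) < 2:
        raise ValueError("need at least two distinct alleles")
    changed = True
    while changed:
        changed = False
        # Right trim when every allele ends with the same nucleotide.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If trimming emptied an allele, extend all alleles one base leftwards.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                # Not handled in this sketch: variant at the reference start.
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            alleles = [reference[pos - 1] + a for a in alleles]
            changed = True
    # Left trim while all alleles share their first nucleotide
    # and every allele keeps at least one nucleotide afterwards.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


reference = "GGGCACACACAGGG"  # hypothetical reference with four CA units
print(normalize(reference, 9, ["ACA", "A"]))    # (3, ['GCA', 'G'])
print(normalize(reference, 3, ["GCAC", "GC"]))  # (3, ['GCA', 'G'])
print(normalize(reference, 2, ["GGCA", "GG"]))  # (3, ['GCA', 'G'])
```

The loop corresponds to the left alignment and right trimming part of the description, and the final `while` corresponds to the left trimming part. This sketch always trims fully; retaining whole repeat units for STRs would need an extra rule that is not shown here.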


Maintained by

This page is maintained by Adrian (atks@umich.edu).