Variant Normalization

From Genome Analysis Wiki

Introduction

The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types together with common accompanying statistics, such as allele frequency, and the genotypes of individuals. However, variant representation in VCF is not unique for variants other than SNPs. Failing to recognize this and to handle it appropriately will often result in inaccurate analyses.

Here, we describe a variant normalization procedure that is well defined for biallelic as well as multiallelic variants. We provide a formal proof of correctness of the procedure.

Normalization

The normalization procedure of a variant representation in VCF can be considered in two parts: parsimony and left alignment. We first describe parsimony followed by left alignment.

Parsimony

Parsimony means doing something in the simplest and most economical way. In the context of variant representation, this means representing a variant in as few nucleotides as possible. In the example below, the first three representations of the multi-nucleotide polymorphism (MNP) carry superfluous nucleotides and the fourth is parsimonious. When a variant has superfluous nucleotides on the left side, it needs to be left trimmed. The concept is symmetric for right trimming.

This figure shows multiple representations of an MNP. The left shows four possible representations differentiated by color. The right shows the corresponding representations in VCF. The last is the parsimonious representation of the MNP.
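As a minimal sketch of the trimming idea (not the implementation in vt), the following Python function removes shared trailing and leading nucleotides while keeping at least one nucleotide in every allele. The function name `trim` and the example coordinates are assumptions made for illustration. Trimming alone is only sufficient for SNPs and MNPs; indels additionally need left alignment, which requires the reference sequence.

```python
def trim(pos, alleles):
    """Trim shared flanking nucleotides from a list of alleles.

    pos     -- 1-based position of the first nucleotide of the alleles
    alleles -- [REF, ALT1, ALT2, ...] as non-empty strings
    Returns the (possibly shifted) position and the trimmed alleles.
    """
    alleles = list(alleles)

    # Right trim: drop the last nucleotide while all alleles share it
    # and every allele would still keep at least one nucleotide.
    while min(len(a) for a in alleles) >= 2 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]

    # Left trim: same idea on the left; the position moves right by one
    # for every nucleotide removed.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles


# Hypothetical MNP AT>GA padded with one superfluous base on each side.
print(trim(100, ["CATG", "CGAG"]))  # (101, ['AT', 'GA'])
```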

Left alignment

Left alignment applies to indels and is a slightly tricky concept to understand. We explain it in the following bullet points, which are color-coded to match the examples in the graphic that follows.

  • VCF requires that no allele in the REF or ALT field is an empty string (an empty allele). The red indel is therefore an illegal VCF representation.
  • An indel is left aligned when it can no longer be shifted to the left without producing an empty allele. The green indel is not left aligned.
  • The orange indel is left aligned but is not right trimmed.
  • The blue indel is not left trimmed.
  • The maroon indel has no superfluous nucleotides and is left aligned.

This figure shows multiple representations of a CA tandem repeat. The left shows five possible representations differentiated by color. The right shows the corresponding representations in VCF. The last is the left aligned and parsimonious representation of the short tandem repeat (STR).
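To make the non-uniqueness concrete, the sketch below applies several representations of the same CA deletion to a toy reference and checks that they all produce the same haplotype. The reference string, the positions, and the helper name `apply_variant` are assumptions chosen for illustration and are not taken from the figure.

```python
def apply_variant(ref_seq, pos, ref, alt):
    """Return the haplotype obtained by replacing ref with alt at a 1-based pos."""
    start = pos - 1
    if ref_seq[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    return ref_seq[:start] + alt + ref_seq[start + len(ref):]


# Toy reference containing a CA tandem repeat (an assumption for illustration).
REFERENCE = "GGGCACACACAGGG"

# Four ways of writing the deletion of a single CA unit.
representations = [
    (10, "CAG", "G"),    # not left aligned
    (8, "CAC", "C"),     # not left aligned
    (3, "GCACA", "GCA"), # left aligned but not right trimmed
    (3, "GCA", "G"),     # left aligned and parsimonious
]

haplotypes = {apply_variant(REFERENCE, *rep) for rep in representations}
print(haplotypes)  # {'GGGCACACAGGG'}
```

Because every representation yields the same sequence, tools that compare variants by position and alleles alone would treat these records as different variants unless they are normalized first.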

How can we tell that a variant is not left aligned or not parsimonious on the right side?

   If all alleles end with the same nucleotide, the variant is not left aligned or right parsimonious.

Outline: We break the statement into two parts and prove each individually.

a) If all alleles end with the same nucleotide, the variant is not right parsimonious, and
b) If all alleles end with the same nucleotide, the variant is not left aligned.

Proof of a)

  • True by definition of parsimonious.

Proof of b)

We prove the contrapositive:

  • If an indel is left aligned, then its alleles do not all end with the same nucleotide.

Outline: Suppose an indel is left aligned. We show that if all of its alleles ended with the same nucleotide, the indel could not be left aligned, which contradicts the assumption. Hence the alleles of a left aligned indel cannot all end with the same nucleotide.

To shift a left aligned variant to the right, we must be able to truncate the leftmost nucleotide of each allele without any loss of information (i.e. we can reconstruct the original alleles from the shifted version of the variant given the reference genome). For this to hold, the leftmost nucleotide of each allele must be the same as the reference nucleotide at that position. The truncation may leave an empty allele, in which case we extend the right end of every allele by the next base observed on the reference sequence, so all alleles of the shifted variant end with the same nucleotide. Conversely, when all alleles end with the same nucleotide, that nucleotide can be truncated and, if an allele becomes empty, every allele can be extended to the left by the preceding reference base, which shortens the variant or moves it to the left. Thus, a variant is not left aligned or right parsimonious if all of its alleles end with the same nucleotide.
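The right shift used in the proof can be sketched in a few lines of Python. This is an illustrative assumption of how one shift step might be coded, with a hypothetical function name and a toy reference sequence; it shows that after the shift every allele ends with the same nucleotide.

```python
def shift_right_once(ref_seq, pos, alleles):
    """Shift a variant one base to the right, as described in the proof.

    ref_seq -- reference sequence as a string
    pos     -- 1-based position of the first nucleotide of the alleles
    alleles -- [REF, ALT1, ...]; all must start with the same nucleotide
    """
    if len({a[0] for a in alleles}) != 1:
        raise ValueError("alleles do not share their leftmost nucleotide")

    # Truncate the leftmost nucleotide of each allele.
    shifted = [a[1:] for a in alleles]

    # If an allele became empty, extend every allele on the right with the
    # reference base that follows the original REF allele.
    if any(not a for a in shifted):
        next_base = ref_seq[pos - 1 + len(alleles[0])]
        shifted = [a + next_base for a in shifted]

    return pos + 1, shifted


# Toy reference with a CA repeat (an assumption for illustration).
new_pos, new_alleles = shift_right_once("GGGCACACACAGGG", 3, ["GCA", "G"])
print(new_pos, new_alleles)  # 4 ['CAC', 'C']
```

In this example both shifted alleles end in C, which is exactly the signature the proof says a left aligned indel cannot have.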

Algorithm for Normalization

The algorithm to normalize a variant, whether biallelic or multiallelic, is as follows:

Lines 1 to 8 perform the left alignment and ensure parsimonious representation on the right side. Lines 9 to 11 ensure parsimonious representation on the left side. In the case of STRs, we might prefer that the repeat units be retained.
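The two phases described above can be sketched in Python as follows. This is a hedged illustration based on the prose description, not the code used in vt, and the line numbers mentioned in the text refer to the original pseudocode listing rather than to this sketch. The function name `normalize`, the in-memory reference string, and the error raised at the start of the sequence are assumptions.

```python
def normalize(ref_seq, pos, alleles):
    """Left align and trim a biallelic or multiallelic variant.

    ref_seq -- reference sequence of the chromosome as a string
    pos     -- 1-based position of the first nucleotide of the alleles
    alleles -- [REF, ALT1, ALT2, ...] as non-empty strings
    """
    alleles = list(alleles)
    if any(not a for a in alleles) or len(set(alleles)) < 2:
        raise ValueError("need non-empty alleles with at least two distinct values")
    if ref_seq[pos - 1:pos - 1 + len(alleles[0])] != alleles[0]:
        raise ValueError("REF allele does not match the reference sequence")

    # Phase 1: left alignment and right parsimony.
    changed = True
    while changed:
        changed = False
        # All alleles end with the same nucleotide: truncate it.
        if len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # An allele is empty: extend all alleles one base to the left.
        if any(not a for a in alleles):
            if pos == 1:
                # Simplification: this sketch does not handle variants
                # that reach the first base of the sequence.
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            alleles = [ref_seq[pos - 1] + a for a in alleles]
            changed = True

    # Phase 2: left parsimony, keeping at least one nucleotide per allele.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles


# Toy reference with a CA repeat (an assumption for illustration).
REF_SEQ = "GGGCACACACAGGG"
print(normalize(REF_SEQ, 10, ["CAG", "G"]))       # (3, ['GCA', 'G'])
print(normalize("TTACGTT", 2, ["TACGT", "TAGGT"]))  # (4, ['C', 'G'])
```

The first call left aligns a CA deletion written at the right end of the repeat, and the second reduces a padded substitution to a single-base SNP. Retaining whole repeat units for STRs, as mentioned above, would need extra logic that this sketch does not attempt.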

Implementation

This is implemented in vt.

Maintained by

This page is maintained by Adrian.