Variant Normalization

From Genome Analysis Wiki

Introduction

The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, together with common accompanying statistics such as allele frequency and the genotypes of individuals. However, variant representation in VCF is not unique for variants other than SNPs. Failing to recognize this and handle it appropriately will often result in inaccurate analyses.

Here, we describe a variant normalization procedure that is well defined for biallelic as well as multiallelic variants. We provide a formal proof of correctness of the procedure.

Normalization

The normalization procedure of a variant representation in VCF can be considered in two parts: parsimony and left alignment. We first describe parsimony followed by left alignment.

Parsimony

Parsimony means doing something in the simplest and most economical way. In the context of variant representation, this means representing a variant in as few nucleotides as possible. Taking the example below, the Multi Nucleotide Polymorphism (MNP) is represented superfluously for the first 3 representations and parsimoniously for the 4th representation. When a variant has superfluous nucleotides on the left side, there is a need to left trim. The concept is symmetric for right trimming.

This figure shows multiple representations of an MNP. The left shows 4 possible representations differentiated by color. The right shows the corresponding representation in VCF. The last one is the parsimonious representation of the MNP.
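Trimming for parsimony can be sketched in a few lines of Python. This is a minimal illustration for a biallelic MNP, not the vt implementation; the function name `trim_parsimonious` and the example alleles are hypothetical and are not taken from the figure. It only trims shared nucleotides and does not perform left alignment.

```python
def trim_parsimonious(pos, ref, alt):
    """Trim nucleotides shared by REF and ALT, keeping at least one per allele."""
    # Right trim: drop a shared trailing nucleotide; POS is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Left trim: drop a shared leading nucleotide and advance POS by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical MNP padded with one shared base on the left and two on the right.
print(trim_parsimonious(100, "GCATG", "GTGTG"))  # (101, 'CA', 'TG')
```

The length guards keep every allele at least one nucleotide long, which is enough for SNPs and MNPs. Indels additionally need the reference sequence, because trimming can empty an allele.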


Left alignment

Left alignment is associated with Indels and is a slightly tricky concept to understand. We attempt to explain it in the following bullet points that are colored to represent the examples in the following graphic.

  • The representation of indels in a VCF file requires that no allele in the REF and ALT fields is an empty string (empty allele). The red indel is an illegal VCF representation.
  • An indel is left aligned when the variant can no longer be shifted to the left while ensuring that no empty allele is present. The green indel is not left aligned.
  • The orange indel is left aligned but is not right trimmed.
  • The blue indel is not left trimmed.
  • The maroon indel has no superfluous nucleotides and is left aligned.
This figure shows multiple representations of a CA tandem repeat. The left shows five possible representations differentiated by color. The right shows the corresponding representation in VCF. The last representation represents the left aligned and parsimonious representation of the STR.
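That these representations describe the same variant can be checked by applying each one to the reference and comparing the resulting sequences. The sketch below assumes a made-up reference containing a CA repeat and four hypothetical VCF-legal representations of a single CA deletion; the helper `apply_variant` and the coordinates are illustrative and do not come from the figure.

```python
REFERENCE = "GGGCACACACAGGG"  # hypothetical reference with a (CA)4 repeat


def apply_variant(reference, pos, ref, alt):
    """Replace REF with ALT at a 1-based position and return the new sequence."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference")
    return reference[:start] + alt + reference[start + len(ref):]


representations = [
    (5, "ACA", "A"),    # not left aligned
    (3, "GCAC", "GC"),  # left aligned but not right trimmed
    (2, "GGCA", "GG"),  # not left trimmed
    (3, "GCA", "G"),    # left aligned and parsimonious
]

haplotypes = {apply_variant(REFERENCE, *rep) for rep in representations}
print(haplotypes)  # {'GGGCACACAGGG'}
```

All four records yield the same alternate haplotype, which is why comparing unnormalized records by POS, REF and ALT alone can miss identical variants.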


How to observe that a variant is not left aligned or parsimonious on the right side?

   If every allele ends with the same nucleotide, the variant is not left aligned or not parsimonious on the right side.

Proof: Suppose an indel is already left aligned. To shift the variant to the right, we must be able to truncate the leftmost nucleotide of each allele without any loss of information (i.e. we can reconstruct the original alleles from the right-aligned version of the variant given the reference genome). To guarantee this, the leftmost nucleotide of every allele must be the same nucleotide (in other words, the same as the reference nucleotide). The truncation must also not produce an empty allele. To achieve this, we first extend the rightmost end of each allele by the next base observed on the reference sequence and then truncate the alleles simultaneously on the leftmost end. The argument is symmetric for shifting to the left: the rightmost nucleotide of every allele must be the same so that it can be truncated, with the leftmost end extended by the preceding reference base whenever an allele would otherwise become empty. Thus, an indel is not left aligned or not right parsimonious if the rightmost nucleotide of every allele is the same.
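The observation translates into a one-line check. The function name `share_last_nucleotide` is hypothetical, and the sketch assumes every allele is a non-empty string, as VCF requires.

```python
def share_last_nucleotide(alleles):
    """Return True if every allele ends with the same nucleotide."""
    return len({allele[-1] for allele in alleles}) == 1


print(share_last_nucleotide(["ACA", "A"]))    # True: can still be shifted left
print(share_last_nucleotide(["GCAC", "GC"]))  # True: can still be right trimmed
print(share_last_nucleotide(["GCA", "G"]))    # False
```

The check works on a list of alleles, so it applies unchanged to multiallelic variants.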

Algorithm for Normalization

The algorithm to normalize a variant, whether biallelic or multiallelic, is as follows:

Lines 1 to 8 perform the left alignment and ensure parsimonious representation on the right side. Lines 9 to 11 ensure parsimonious representation on the left side. In the case of STRs, we might prefer that the repeat units be retained.
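The two phases described here can be sketched in Python as follows. This is a reconstruction from the prose on this page, not the code in vt, and its line numbering does not correspond to the line numbers quoted above. The function name `normalize`, the in-memory reference string, and the decision to stop at the first reference position (rather than pad on the right) are assumptions made for illustration.

```python
def normalize(reference, pos, alleles):
    """Left align and trim a variant; alleles[0] is REF, pos is 1-based."""
    alleles = list(alleles)
    if len(set(alleles)) < 2:
        raise ValueError("need at least two distinct alleles")

    # Phase 1: left alignment and right trimming.
    while len({a[-1] for a in alleles}) == 1:
        if min(len(a) for a in alleles) == 1 and pos == 1:
            break  # assumption: do not extend past the start of the reference
        # Truncate the shared rightmost nucleotide of every allele.
        alleles = [a[:-1] for a in alleles]
        if any(a == "" for a in alleles):
            # Extend every allele by one reference base on the left.
            pos -= 1
            alleles = [reference[pos - 1] + a for a in alleles]

    # Phase 2: left trimming while every allele keeps at least one nucleotide.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles


reference = "GGGCACACACAGGG"  # hypothetical reference with a (CA)4 repeat
print(normalize(reference, 5, ["ACA", "A"]))           # (3, ['GCA', 'G'])
print(normalize(reference, 2, ["GGCA", "GG"]))         # (3, ['GCA', 'G'])
print(normalize(reference, 5, ["ACA", "A", "ACACA"]))  # (3, ['GCA', 'G', 'GCACA'])
```

Because the alleles are handled as a list, the same loop covers biallelic and multiallelic variants. A version that retains STR repeat units would need a different stopping rule in the second phase.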



Implementation

This is implemented in vt.

Maintained by

This page is maintained by Adrian.