Variant Normalization

From Genome Analysis Wiki

Introduction

The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, together with common accompanying statistics such as allele frequency and the genotypes of individuals. However, variant representation in VCF is not unique for variants other than SNPs. Failing to recognize this and handle it appropriately will often result in inaccurate analyses.

Here, we describe a variant normalization procedure that is well defined for biallelic as well as multiallelic variants. We provide a formal proof of correctness of the procedure.

Normalization

The normalization procedure of a variant representation in VCF can be considered in two parts: parsimony and left alignment. We first describe parsimony followed by left alignment.

Parsimony

Parsimony means doing something in the simplest and most economical way. In the context of variant representation, this means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0. Taking the example below, the multi-nucleotide polymorphism (MNP) is represented superfluously in the first 3 representations and parsimoniously in the 4th representation. When a variant has superfluous nucleotides on the left side, it is defined as not being left parsimonious, and there is a need to left trim. The concept is symmetric for right parsimony and right trimming.

This figure shows multiple representations of a MNP. The left shows 4 possible representations differentiated by color. The right shows the corresponding representation in VCF. The last representation represents the parsimonious representation of the MNP.
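
As a rough illustration of trimming, the Python sketch below removes superfluous nucleotides from both sides of a substitution such as an MNP. The function name `trim`, the position and the example alleles are assumptions made for illustration; they are not taken from the figure. Trimming alone is only sufficient for variants that do not need to be shifted, so indels are deliberately not handled here.

```python
def trim(pos, alleles):
    """Trim nucleotides shared by all alleles from the right, then from the left.

    pos is the 1-based position of the first nucleotide of the alleles;
    alleles is a list with REF first, followed by the ALT allele(s).
    """
    alleles = list(alleles)
    # Right trim: drop the last nucleotide while every allele shares it
    # and no allele would become empty.
    while min(len(a) for a in alleles) >= 2 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Left trim: same idea on the left side; the position moves right by one
    # for every nucleotide removed.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# Hypothetical MNP written with one extra nucleotide on the left and two on the right.
print(trim(100, ["GCATG", "GTGTG"]))  # (101, ['CA', 'TG'])
```

A representation that is already parsimonious is returned unchanged, which makes the function safe to apply repeatedly.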

Left alignment

Left alignment applies to indels and is a slightly tricky concept to understand. We attempt to explain it in the following bullet points, which are color-coded to match the examples in the following graphic.

  • The representation of indels in a VCF file requires that no allele in the REF or ALT field is an empty string (an empty allele). The red indel is an illegal VCF representation.
  • An indel is left aligned when it can no longer be shifted to the left while ensuring that no empty allele is present and that left parsimony holds (meaning you cannot simply add nucleotides to the left side of a variant and call it left alignment). The green indel is not left aligned.
  • The orange indel is left aligned but is not right trimmed.
  • The blue indel is not left trimmed.
  • The maroon indel has no superfluous nucleotides and is left aligned.
This figure shows multiple representations of a CA tandem repeat. The left shows five possible representations differentiated by color. The right shows the corresponding representation in VCF. The last representation represents the left aligned and parsimonious representation of the STR.
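
One way to convince yourself that different representations describe the same indel is to apply each of them to the reference and compare the resulting sequences. The sketch below does this for a deletion of one CA unit in a hypothetical reference; the sequence, the positions and the helper `apply_variant` are assumptions chosen for illustration and do not reproduce the figure.

```python
REFERENCE = "GGGCACACAGGG"  # hypothetical reference with a CA repeat at positions 4-9


def apply_variant(reference, pos, ref_allele, alt_allele):
    """Return the sequence obtained by replacing REF with ALT at 1-based pos."""
    start = pos - 1
    end = start + len(ref_allele)
    if reference[start:end] != ref_allele:
        raise ValueError("REF allele does not match the reference at this position")
    return reference[:start] + alt_allele + reference[end:]


# Four legal ways of writing the deletion of one CA unit: (pos, REF, ALT).
representations = [
    (7, "ACA", "A"),    # not left aligned
    (3, "GCAC", "GC"),  # left aligned, but not right trimmed
    (2, "GGCA", "GG"),  # not left trimmed
    (3, "GCA", "G"),    # left aligned and parsimonious
]

haplotypes = {apply_variant(REFERENCE, p, r, a) for p, r, a in representations}
print(haplotypes)  # {'GGGCACAGGG'}
```

All four entries produce the same alternate sequence, which is why a single normalized form is needed before variants from different callers can be compared.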

How can we observe that a variant is not left aligned or not parsimonious on the right side?

   If all alleles end with the same nucleotide, the variant is not left aligned or not right parsimonious.

We prove the contrapositive:

  • If an indel is left aligned and right parsimonious, then its alleles do not all end with the same nucleotide.

We first assume that an indel is already left aligned and right parsimonious. Suppose all alleles have a length greater than 1. Since the indel is right parsimonious, the alleles clearly do not all end with the same nucleotide. Now suppose that there exists an allele of length 1 and that all the alleles end with a particular nucleotide, say 'A'. This is still considered right parsimonious, as there is no superfluous nucleotide that can be removed without resulting in an empty allele. However, it is possible to extend all the alleles one position to the left by copying a nucleotide from the reference genome. We then have a superfluous nucleotide on the right side and can remove it, resulting in a new representation that shifts the indel to the left by one position and in which one of the alleles again has length 1. The indel could therefore be shifted to the left, which contradicts the assumption that it is left aligned, so the alleles cannot all end with the same nucleotide. This completes the proof.
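
The observation gives a cheap check that needs no reference sequence. The sketch below is a minimal illustration; the function name and the example alleles are assumptions. Note that the check only works in one direction: a result of True means the variant is not normalized, while False says nothing about superfluous nucleotides on the left side.

```python
def all_end_with_same_nucleotide(alleles):
    """True if every allele is non-empty and all share the same last nucleotide."""
    return all(alleles) and len({a[-1] for a in alleles}) == 1


# REF first, then ALT; hypothetical representations of a CA deletion.
print(all_end_with_same_nucleotide(["ACA", "A"]))    # True: not left aligned
print(all_end_with_same_nucleotide(["GCAC", "GC"]))  # True: not right trimmed
print(all_end_with_same_nucleotide(["GCA", "G"]))    # False
print(all_end_with_same_nucleotide(["GGCA", "GG"]))  # False, yet not left trimmed
```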

Algorithm for Normalization

The algorithm to normalize a variant, whether biallelic or multiallelic, is as follows:

Lines 1 to 8 perform the left alignment and ensure a parsimonious representation on the right side. Lines 9 to 11 ensure a parsimonious representation on the left side. In the case of STRs, we might prefer that the repeat units be retained.
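
The two phases described above can be sketched in Python as follows. This is an illustrative reading of the description, not the vt implementation: the function name `normalize`, the in-memory `reference` string and the example variants are assumptions, and the line numbers mentioned above do not refer to this code. The first loop right trims and extends to the left until nothing changes; the second loop left trims.

```python
def normalize(reference, pos, alleles):
    """Left align and trim a variant.

    reference is the chromosome sequence as a string, pos is the 1-based
    position of the first nucleotide of the alleles, and alleles lists REF
    first, followed by one or more ALT alleles that differ from REF.
    """
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        # Right trim when every allele ends with the same nucleotide.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, extend all alleles one nucleotide to the left.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                # Assumption: variants at the very start of a sequence are not handled.
                raise ValueError("cannot extend to the left of position 1")
            pos -= 1
            alleles = [reference[pos - 1] + a for a in alleles]
            changed = True
    # Left trim while every allele starts with the same nucleotide
    # and none would become empty.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


reference = "GGGCACACAGGG"  # hypothetical reference with a CA repeat
print(normalize(reference, 7, ["ACA", "A"]))    # (3, ['GCA', 'G'])
print(normalize(reference, 3, ["GCAC", "GC"]))  # (3, ['GCA', 'G'])
print(normalize(reference, 2, ["GGCA", "GG"]))  # (3, ['GCA', 'G'])
```

The three inputs are different representations of the same CA deletion and all reduce to one normalized form. The function accepts any number of ALT alleles, so the same code covers the multiallelic case.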


Implementation

This is implemented in vt.

Maintained by

This page is maintained by Adrian.