Changes

2,526 bytes added , 09:55, 1 August 2016

Line 92: Line 92:

It is important that the normalization results in a unique representation of the variant. Before we begin the proof, intuitively, accept that any representation of a variant can be

−

transformed to another representation by ~~removing or~~ adding nucleotides from the reference sequence.

+

transformed to another representation by adding nucleotides from the reference sequence to either end of all the alleles at the same time or removing equivalent nucleotides from the ends of

+

all the alleles at the same time.
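The transformation described above can be illustrated with a small Python sketch. The reference string, the 0-based coordinates and the helper names (`apply_variant`, `extend_left`, `extend_right`) are assumptions made for this illustration only and are not part of vt. The sketch shows that padding every allele with the same reference bases gives a different representation of the same haplotype.

```python
def apply_variant(ref, pos, ref_allele, alt_allele):
    """Replace ref_allele at 0-based pos in ref with alt_allele."""
    assert ref[pos:pos + len(ref_allele)] == ref_allele, "REF must match the reference"
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]


def extend_left(ref, pos, alleles, n=1):
    """Prepend the n reference bases before pos to every allele at once."""
    prefix = ref[pos - n:pos]
    return pos - n, [prefix + a for a in alleles]


def extend_right(ref, pos, alleles, n=1):
    """Append the n reference bases after the REF allele to every allele at once."""
    end = pos + len(alleles[0])  # alleles[0] is the REF allele
    suffix = ref[end:end + n]
    return pos, [a + suffix for a in alleles]


reference = "GGGCACACAGGG"      # hypothetical reference sequence
pos, alleles = 3, ["CAC", "C"]  # deletion of one CA unit; alleles[0] is REF

haplotype = apply_variant(reference, pos, alleles[0], alleles[1])

# Extending all alleles to the left or right changes the representation,
# but not the haplotype that the variant describes.
lpos, lalleles = extend_left(reference, pos, alleles, n=1)
rpos, ralleles = extend_right(reference, pos, alleles, n=2)
same_left = apply_variant(reference, lpos, lalleles[0], lalleles[1]) == haplotype
same_right = apply_variant(reference, rpos, ralleles[0], ralleles[1]) == haplotype
```

Removing shared nucleotides from the ends of all alleles is the reverse operation, subject to no allele becoming empty.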

+

Now suppose there are two normalized representations, A and B, of the same variant. Suppose A is at a different position from B, with B to the right of A (without loss of generality). This is not possible: by definition, a normalized variant is left aligned,

but if they were at different positions, B could be shifted left to the position of A since they represent the same variant, which contradicts B being left aligned. So A and B must be at the same position.

−

Now, suppose that A and B are of different lengths where B is longer than A, ~~then~~ this is not possible as B is then not parsimonious, so B can be trimmed to the same length as A.

+

Now suppose that A and B are at the same position but have different lengths, with B longer than A (without loss of generality). This is not possible, as B would then not be parsimonious: B could be trimmed to the same length as A.

+

Thus A and B have to be at the same position and have the same length and variant normalization is unique.
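The uniqueness argument can be checked on a toy example. The following is a minimal Python sketch of a left-align-and-trim procedure, not the vt implementation itself. It assumes 0-based positions, that `alleles[0]` is the REF allele, and that at least one ALT differs from REF. The function name `normalize` and the reference string are hypothetical. For simplicity it raises an error if an allele becomes empty at the very start of the reference.

```python
def normalize(ref, pos, alleles):
    """Left-align and trim a variant; alleles[0] is REF, pos is 0-based."""
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        # Right-trim: drop the last base while all alleles end with the same base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, extend every allele one base to the left.
        if any(len(a) == 0 for a in alleles):
            if pos == 0:
                raise ValueError("cannot extend left of the reference start")
            pos -= 1
            alleles = [ref[pos] + a for a in alleles]
            changed = True
    # Left-trim: drop the first base while all alleles share it and have length >= 2.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


reference = "GGGCACACAGGG"  # hypothetical reference sequence

# Three representations of the same CA deletion in the CACACA repeat.
rep_a = normalize(reference, 3, ["CAC", "C"])
rep_b = normalize(reference, 5, ["CACAG", "CAG"])
rep_c = normalize(reference, 1, ["GGCA", "GG"])
```

In this example all three inputs reduce to the same position and the same alleles, which is the behaviour the proof predicts.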

Line 213: Line 217:

*GATK v3.1-1-g07a4bf8

*vt normalize v0.5

+

= Here is an example where this normalization algorithm fails =

+

We distinguish the concepts of normalization and decomposition/reconstruction of variants as follows:

+

Normalization involves reducing the representations of a variant to a canonical representation. Normalization can be applied to biallelic or multiallelic variants. The problem of normalization is solvable: there exists a unique representation that is left aligned and parsimonious. A mathematical proof has been published. [http://bioinformatics.oxfordjournals.org/content/suppl/2015/02/19/btv112.DC1/VtNormApplicationNote_supp_20141113_1346.pdf]

+

Decomposition of variants involves breaking down a variant record into multiple records. It may be done vertically, as in multiallelics becoming biallelics, or horizontally, as in a cluster of indels and SNPs represented as a complex variant being split up into several records. Horizontal decompositions in general do not have a unique solution. Similarly, reconstruction combines several variant records into a single record and can also be done vertically or horizontally. Vertical decomposition of a multiallelic variant into a set of biallelic records is a many-to-one function. Reconstruction of a set of biallelic variants into a multiallelic record is not unique, as you need to consider all possible permutations of the haplotypes containing your alleles.
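As a hedged illustration of why vertical decomposition loses information, the Python sketch below splits a multiallelic record into biallelic records. The record layout `(chrom, pos, ref, alts)` and the function name `decompose_vertically` are assumptions for this example. Genotype fields and any re-normalization of the resulting records are deliberately ignored. Two multiallelic records that differ only in the order of their ALT alleles give the same set of biallelic records, so the original record cannot be recovered from that set alone.

```python
def decompose_vertically(chrom, pos, ref, alts):
    """Split one multiallelic record into one biallelic record per ALT allele."""
    return [(chrom, pos, ref, alt) for alt in alts]


# Two hypothetical multiallelic records with the same ALT alleles in a different order.
record_1 = ("20", 3, "GCA", ["G", "GCACA"])
record_2 = ("20", 3, "GCA", ["GCACA", "G"])

biallelic_1 = decompose_vertically(*record_1)
biallelic_2 = decompose_vertically(*record_2)

# As unordered sets of records the two decompositions are indistinguishable.
indistinguishable = set(biallelic_1) == set(biallelic_2)
```

With genotypes present the ambiguity grows further, since the ALT indices in each sample's genotype depend on the allele order that was discarded.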

+

If your example involves the decomposition or reconstruction of variants, you will probably find inconsistencies.

+

It is important to distinguish between normalization and decomposition/reconstruction. The notion of normalization implies that a variant can be reduced to a standardized form. If you include decomposition and reconstruction in your notion of normalization, you are bound to find inconsistencies simply due to the inherent issues of identifiability.

+

When performing decomposition and reconstruction, I think the following factors should be considered:

+

* Are your variants describing just a single individual or a population?

+

* Are the genotypes (if any) in your individual(s) phased?

+

Depending on the context, you will obtain different answers.

+

[https://github.com/atks/vt/issues/16 An example of inconsistent variant representation due to using vt normalize]

= Citation =

−

[http://bioinformatics.oxfordjournals.org/content/~~early~~/~~2015/02/19~~/~~bioinformatics.btv112.abstract?keytype=ref&ijkey=2kB1TkBGzkoP1gd~~ Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. (2015) Unified Representation of Genetic Variants. Bioinformatics. ~~doi~~: 10.~~1093~~/~~bioinformatics/btv112~~ ]

+

[http://bioinformatics.oxfordjournals.org/content/31/13/2202 Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. (2015) Unified Representation of Genetic Variants. Bioinformatics.]

+

= Translations =

+

A Mandarin translation can be found [http://www.lyon0804.com/fan-yi-variant-normalization.html here].

= Maintained by =

This page is maintained by [mailto:atks@umich.edu Adrian].

Atks

1,102

edits


Variant Normalization (view source)

Revision as of 09:55, 1 August 2016
