Latest revision as of 21:07, 18 October 2017

Introduction

The VCF format encodes genotypes by their index in the enumeration of all genotypes for a given ploidy and set of alleles. This allows direct access to the value associated with a genotype within an array, for example when working with genotype likelihoods.

Motivation

Plant species exhibit a wide range of ploidy; for example, the strawberry is an octoploid and the pear is a triploid.

Copy number variation in somatic variant calling also leads to variable ploidy that must be considered when genotyping a locus.

While explicit functions for the haploid and diploid cases are easy to find online, closed forms for the general case seem difficult to find. This wiki fills that need.

The number of genotypes given a ploidy and alleles

${\begin{aligned}F(P,A)={\binom {P+A-1}{A-1}}\\\end{aligned}}$

where P is the ploidy number and A is the number of alleles.
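
The count above can be sketched in code as follows. This is a minimal illustration, not code from this page: the names `choose` and `no_genotypes` are hypothetical, and `choose` is assumed to return 0 outside the usual range of the binomial coefficient.

```cpp
#include <cassert>
#include <cstdint>

// Binomial coefficient C(n, r); returns 0 when n < 0, r < 0 or r > n.
// Hypothetical helper written for this sketch.
int64_t choose(int64_t n, int64_t r)
{
    if (n < 0 || r < 0 || r > n) return 0;
    if (r > n - r) r = n - r;              // use symmetry to shorten the loop
    int64_t result = 1;
    for (int64_t i = 1; i <= r; ++i)
    {
        // before this step result == C(n-r+i-1, i-1), so the division is exact
        result = result * (n - r + i) / i;
    }
    return result;
}

// F(P, A): number of genotypes for ploidy P and A alleles.
int64_t no_genotypes(int64_t P, int64_t A)
{
    assert(P >= 0 && A >= 0);
    return choose(P + A - 1, A - 1);
}
```

For example, `no_genotypes(3, 4)` evaluates to 20 and `no_genotypes(2, 3)` to 6. The intermediate products can overflow 64-bit integers for very large ploidy or allele counts, which this sketch does not guard against.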

Getting the index of a genotype in an enumerated list given a ploidy and alleles

${\begin{aligned}G(a_{1},..,a_{P})=\sum _{k=1}^{P}{\binom {k+a_{k}-1}{a_{k}-1}}\end{aligned}}$

where P is the ploidy, and $a_{1}$ , $a_{2}$ .. $a_{P}$ are the alleles in numeric encoding (0 to A-1), listed in non-decreasing order (e.g. AB and ABCCCC are ordered but ACB is not).

This is well defined because:

${\begin{aligned}{\binom {n}{r}}={\begin{cases}{\frac {n!}{(n-r)!r!}}&,r\leq n,r\geq 0,n\geq 0\\0&{\text{otherwise}}\end{cases}}\end{aligned}}$

Because $a_{k}$ may be 0, terms of the form ${\binom {k-1}{-1}}$ appear when $a_{k}=0$ . This is fine, because the definition of ${\binom {n}{r}}$ above assigns 0 to this case. To make this more natural, the function can be written equivalently, using ${\binom {n}{r}}={\binom {n}{n-r}}$ , as:

${\begin{aligned}G(a_{1},..,a_{P})=\sum _{k=1}^{P}{\binom {k+a_{k}-1}{k}}\end{aligned}}$

So when $a_{k}=0$ , the binomial coefficient reads as ${\binom {k-1}{k}}$ which equals 0 since there are 0 ways to choose k items from k-1 items.
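
A minimal sketch of the index function in its second form is given here. The names `choose` and `genotype_index` are hypothetical and chosen for illustration; the alleles are assumed to be passed in numeric encoding and in non-decreasing order, as the formula requires.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Binomial coefficient C(n, r); returns 0 when n < 0, r < 0 or r > n.
// Hypothetical helper written for this sketch.
int64_t choose(int64_t n, int64_t r)
{
    if (n < 0 || r < 0 || r > n) return 0;
    if (r > n - r) r = n - r;
    int64_t result = 1;
    for (int64_t i = 1; i <= r; ++i)
        result = result * (n - r + i) / i;   // exact at every step
    return result;
}

// G(a_1, .., a_P) = sum over k = 1..P of C(k + a_k - 1, k)
int64_t genotype_index(const std::vector<int32_t>& alleles)
{
    int64_t index = 0;
    for (std::size_t k = 1; k <= alleles.size(); ++k)
    {
        int64_t a = alleles[k - 1];
        // the formula assumes non-negative alleles in non-decreasing order
        assert(a >= 0 && (k == 1 || alleles[k - 2] <= a));
        index += choose(static_cast<int64_t>(k) + a - 1, static_cast<int64_t>(k));
    }
    return index;
}
```

With A=0, B=1, C=2, D=3, the call `genotype_index({2, 2, 3})` (genotype CCD) gives 2 + 3 + 10 = 15, and `genotype_index({0, 2})` (genotype AC) gives 0 + 3 = 3.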

Getting the genotypes from a genotype index and a given ploidy

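  The genotype index is a sum of P terms, one per allele position, and for an
  ordered genotype the terms do not increase going from position P down to
  position 1.  This allows the inverse function, from index to ordered genotype,
  to be computed with a "water rapids algorithm": each position in turn takes the
  largest allele whose term still fits in the remaining index, like a series of
  mini waterfalls of decreasing height.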

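 #include <cstdint>
 #include <vector>

 // choose(n, r) is assumed to return the binomial coefficient, and 0 when r > n.
 std::vector<int32_t> bcf_ip2g(int32_t genotype_index, uint32_t no_ploidy)
 {
   std::vector<int32_t> genotype(no_ploidy, 0);
   int32_t pth = no_ploidy;                           // position being resolved, from P down to 1
   int32_t max_allele_index = genotype_index;         // upper bound on the allele at this position
   int32_t leftover_genotype_index = genotype_index;  // part of the index not yet accounted for
   while (pth>0)
   {
       for (int32_t allele_index=0; allele_index <= max_allele_index; ++allele_index)
       {
           // term contributed by allele_index at position pth
           int32_t i = choose(pth+allele_index-1, pth);
           // stop at the first allele whose term reaches the leftover index,
           // or at the upper bound
           if (i>=leftover_genotype_index || allele_index==max_allele_index)
           {
               // overshot: step back to the largest allele whose term fits
               if (i>leftover_genotype_index) --allele_index;
               leftover_genotype_index -= choose(pth+allele_index-1, pth);
               --pth;
               // alleles are ordered, so earlier positions cannot exceed this one
               max_allele_index = allele_index;
               genotype[pth] = allele_index;
               break;                
           }
       }
   }
   return genotype;
 }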

 todo:: describe in a human understandable fashion.
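
As one possible reading of the routine above, the inversion can be written as a plain greedy search. This is a sketch under assumptions, not the author's code: `choose` and `index_to_genotype` are hypothetical names, and the index is assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Binomial coefficient C(n, r); returns 0 when n < 0, r < 0 or r > n.
// Hypothetical helper written for this sketch.
int64_t choose(int64_t n, int64_t r)
{
    if (n < 0 || r < 0 || r > n) return 0;
    if (r > n - r) r = n - r;
    int64_t result = 1;
    for (int64_t i = 1; i <= r; ++i)
        result = result * (n - r + i) / i;   // exact at every step
    return result;
}

// Greedy inverse of the index formula: working from position P down to 1,
// pick the largest allele a with C(k + a - 1, k) <= remaining index.
std::vector<int32_t> index_to_genotype(int64_t index, int32_t ploidy)
{
    assert(index >= 0 && ploidy >= 0);
    std::vector<int32_t> genotype(static_cast<std::size_t>(ploidy), 0);
    int64_t remaining = index;
    for (int32_t k = ploidy; k >= 1; --k)
    {
        int32_t a = 0;
        // choose(k + a, k) is the term of allele a + 1; advance while it still fits
        while (choose(k + a, k) <= remaining) ++a;
        genotype[static_cast<std::size_t>(k - 1)] = a;
        remaining -= choose(k + a - 1, k);
    }
    return genotype;
}
```

For index 15 and ploidy 3, position 3 takes allele 3 (term 10, leaving 5), position 2 takes allele 2 (term 3, leaving 2) and position 1 takes allele 2 (term 2, leaving 0), which is the genotype CCD. The greedy choice never picks an allele larger than the one chosen at the position before it, so no explicit upper bound is needed in this form.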

Simple cases

Ploidy	Alleles	No. of Genotypes	Index
1	A	$F(1,A)={\binom {1+A-1}{A-1}}=A$	$G(a_{1})=F(1,a_{1})=a_{1}$
2	A	$F(2,A)={\binom {2+A-1}{A-1}}={\binom {A+1}{2}}$	$G(a_{1},a_{2})=F(1,a_{1})+F(2,a_{2})=a_{1}+{\binom {a_{2}+1}{2}}$

Derivation for counting the number of genotypes

A genotype always contains exactly P allele copies, drawn from at most A distinct alleles. This can be modeled by a row of P+A-1 points, of which A-1 are chosen as dividers that separate the copies into A groups, one per allele. Thus the number of possible genotypes is ${\binom {P+A-1}{A-1}}$ which is equivalent to ${\binom {P+A-1}{P}}$ .
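
As a small worked illustration of this argument: with P=3 and A=4, the genotype AAC corresponds to the arrangement `**||*|` (two copies of A, none of B, one of C, none of D), which uses 3 copies and 3 dividers. The formula then gives

${\begin{aligned}F(3,4)={\binom {3+4-1}{4-1}}={\binom {6}{3}}=20\end{aligned}}$

which agrees with the 20 genotypes AAA to DDD listed for A=4, P=3 on this page.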

Derivation for getting the index of a genotype in an enumerated list

Observation of nested patterns

An important observation here is that, for a given ploidy P, the enumeration for A-1 alleles is the initial segment of the enumeration for A alleles.

Index	A=4,P=3	A=3,P=3	A=2,P=3	A=1,P=3
0	AAA	AAA	AAA	AAA
1	AAB	AAB	AAB
2	ABB	ABB	ABB
3	BBB	BBB	BBB
4	AAC	AAC
5	ABC	ABC
6	BBC	BBC
7	ACC	ACC
8	BCC	BCC
9	CCC	CCC
10	AAD
11	ABD
12	BBD
13	ACD
14	BCD
15	CCD
16	ADD
17	BDD
18	CDD
19	DDD

Another important observation here is that, for a genotype $(a_{1},a_{2},..a_{P})$ , the genotypes from index $F(P,a_{P})$ up to $(a_{1},a_{2},..a_{P})$ all end with $a_{P}$ . The last allele therefore does not help distinguish their order, so we need only examine the sub-genotype $(a_{1},...,a_{P-1})$ . The sub-genotype $(a_{1},...,a_{P-1})$ is also ordered, since otherwise $(a_{1},...,a_{P})$ would not be ordered.

The nested genotype sequence is marked with square brackets (red in the wiki markup). The sequence of genotypes enumerated without involving $a_{P}$ is marked with parentheses (blue in the wiki markup).

Index	A=4,P=3	A=4,P=2	A=4,P=1
0	[AAA]	(AAA)	AAA
1	[AAB]	(AAB)	AAB
2	[ABB]	(ABB)	ABB
3	[BBB]	(BBB)	BBB
4	[AAC]	(AAC)	AAC
5	[ABC]	(ABC)	ABC
6	[BBC]	(BBC)	BBC
7	[ACC]	(ACC)	ACC
8	[BCC]	(BCC)	BCC
9	[CCC]	(CCC)	CCC
10	[AAD]	[AA]D	(AA)D
11	[ABD]	[AB]D	(AB)D
12	[BBD]	[BB]D	(BB)D
13	[ACD]	[AC]D	(AC)D
14	[BCD]	[BC]D	(BC)D
15	[CCD]	[CC]D	(CC)D
16	[ADD]	[AD]D	[A]DD
17	[BDD]	[BD]D	[B]DD
18	[CDD]	[CD]D	[C]DD
19	[DDD]	[DD]D	[D]DD

The above 2 observations are the key to breaking down the enumeration recursively.

Derivation

$a_{1},...a_{P}$ are ordered and take values from 0 to A-1.

${\begin{aligned}G(a_{1},..,a_{P})&=\|\{{\text{genotypes of }}a_{P}{\text{ alleles for ploidy P}}\}\|+G(a_{1},..,a_{P-1})\\&=F(P,a_{P})+G(a_{1},..,a_{P-1})\\&=F(P,a_{P})+F(P-1,a_{P-1})+G(a_{1},..,a_{P-2})\\&=F(P,a_{P})+F(P-1,a_{P-1})+...+F(1,a_{1})\\&=\sum _{k=1}^{P}F(k,a_{k})\\&=\sum _{k=1}^{P}{\binom {k+a_{k}-1}{a_{k}-1}}\end{aligned}}$

This algorithm is demonstrated in the following table to obtain the index of the genotype C/C/D.

Index	Iteration 0		Iteration 1		Iteration 2
0	AAA	AAA
1	AAB	AAB
2	ABB	ABB
3	BBB	BBB
4	AAC	AAC
5	ABC	ABC
6	BBC	BBC
7	ACC	ACC
8	BCC	BCC
9	CCC	CCC
10	AAD		AA	AA
11	ABD		AB	AB
12	BBD		BB	BB
13	ACD		AC		A
14	BCD		BC		B
15	CCD		CC		C
Function call	G(CCD)	F(3,3)	G(CC)	F(2,2)	G(C) = F(1, 2)
value returned		10		3	2

index = 10 + 3 + 2 = 15 (QED!)

Algorithm for enumerating the genotypes given ploidy and alleles

The code below enumerates the genotypes in index order and can be used to test the above equations.

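   #include <cstdint>
   #include <iostream>
   #include <string>

   uint32_t no = 0; // running genotype index (global counter)

   // Prints every ordered genotype of ploidy P over the first A alleles (A, B, C, ...),
   // in index order. Call with an empty string, e.g. print_genotypes(4, 3, "").
   void print_genotypes(uint32_t A, uint32_t P, std::string genotype)
   {
       if (genotype.size()==P)
       {
           std::cerr << no << ") " << genotype << "\n";
           ++no;
       }
       else
       {
           for (uint32_t a=0; a<A; ++a)
           {
               // prepend allele a, then restrict the positions to its left to alleles <= a
               std::string s(1,(char)(a+65));
               s.append(genotype);
               print_genotypes(a+1, P, s);
           }
       }
   }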
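
One way such a test could look is sketched here. It enumerates the genotypes in the same order as the routine above, but as numeric vectors, and checks that the closed-form index of each genotype equals its position in the list. The names `choose`, `genotype_index`, `enumerate` and `formula_matches_enumeration` are hypothetical and exist only in this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Binomial coefficient C(n, r); returns 0 when n < 0, r < 0 or r > n.
int64_t choose(int64_t n, int64_t r)
{
    if (n < 0 || r < 0 || r > n) return 0;
    if (r > n - r) r = n - r;
    int64_t result = 1;
    for (int64_t i = 1; i <= r; ++i)
        result = result * (n - r + i) / i;   // exact at every step
    return result;
}

// Closed-form index: sum over k = 1..P of C(k + a_k - 1, k).
int64_t genotype_index(const std::vector<int32_t>& alleles)
{
    int64_t index = 0;
    for (std::size_t k = 1; k <= alleles.size(); ++k)
        index += choose(static_cast<int64_t>(k) + alleles[k - 1] - 1,
                        static_cast<int64_t>(k));
    return index;
}

// Fills positions from the last to the first; each position may not exceed
// the allele chosen to its right, which yields the genotypes in index order.
void enumerate(int32_t max_allele, int32_t pos, std::vector<int32_t>& g,
               std::vector<std::vector<int32_t>>& out)
{
    if (pos < 0) { out.push_back(g); return; }
    for (int32_t a = 0; a <= max_allele; ++a)
    {
        g[static_cast<std::size_t>(pos)] = a;
        enumerate(a, pos - 1, g, out);
    }
}

// True if the count matches F(P, A) and every genotype's closed-form index
// equals its position in the enumeration.
bool formula_matches_enumeration(int32_t A, int32_t P)
{
    assert(A >= 1 && P >= 1);
    std::vector<int32_t> g(static_cast<std::size_t>(P), 0);
    std::vector<std::vector<int32_t>> all;
    enumerate(A - 1, P - 1, g, all);
    if (static_cast<int64_t>(all.size()) != choose(P + A - 1, A - 1)) return false;
    for (std::size_t i = 0; i < all.size(); ++i)
        if (genotype_index(all[i]) != static_cast<int64_t>(i)) return false;
    return true;
}
```

Calling `formula_matches_enumeration(4, 3)` checks all 20 genotypes of the A=4, P=3 example. The enumeration grows quickly with A and P, so this kind of check is only practical for small cases.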

Acknowledgement

To Petr Danecek for double checking this.

Maintained by

This page is maintained by Adrian.

Difference between revisions of "Relationship between Ploidy, Alleles and Genotypes"

@@ Line 1: / Line 1: @@
 = Introduction =
-The VCF format encodes genotypes by the index of the enumeration of genotypes give a ploidy number and alleles.
+The VCF format encodes genotypes by the index of the enumeration of genotypes given ploidy number and alleles.
-Ploidy and alles are independent of one another while genotypes are a function of them.
+This allows for direct access to a value associated with a genotype within an array when one works with genotype likelihoods.
 = Motivation =
-While there are explicit functions that could be googled for handling haploid and diploidy cases.  It seems to be difficult to find the closed forms for the general case.
+Plants species exhibit a diverse number of ploidy, for example, the strawberry is an octoploid and the pear is a triploid.
-This wiki fills in that need.  The cases where one requires such extensions is when pooled samples are studied or when plant species that exhibit a diverse number of
-ploidy.
+Copy number variations in somatic variant calling also leads to variable ploidy to consider when genotyping a locus.
+While there are explicit functions that could be googled for handling haploid and diploid cases.  It seems difficult to find the closed forms for the general case.
+This wiki fills in that need.
 = The number of genotypes given a ploidy and alleles =
@@ Line 14: / Line 17: @@
 <math>
    \begin{align}
-F(P,A) =    \begin{cases}
+F(P,A) =  \binom{P+A-1}{A-1}    \\
-x + 5y +     , A<P= 1 \\
+    \end{align}
-x - 2y + 4z  , A>=P
+</math>
-\end{cases}
+where P is the ploidy number and A is the number of alleles.
+= Getting the index of a genotype  in an enumerated list given a ploidy and alleles =
+<math>
+  \begin{align}
+G(a_1,.. , a_P) =  \sum_{k=1}^P \binom{k+a_k-1}{a_k-1}
     \end{align}
 </math>
-= The indexing of genotypes given a ploidy and alleles =
+where  P is the number of ploidy, <math>a_1</math>, <math>a_2</math> ..  <math>a_P</math> are the alleles in numeric encoding (0 to A-1)  and are ordered (e.g. AB and ABCCCC are ordered but ACB is not ordered).
+This is well defined because:
+<math>
+  \begin{align}
+\binom{n}{r} =  \begin{cases}
+ \frac{n!}{(n-r)!r!}  &, r \le n, r\ge0, n\ge0 \\
+& \text{otherwise}
+\end{cases}
+   \end{align}
+</math>
+Because <math>a_k</math> may be 0, we will see cases of  <math>\binom{k-1}{-1}</math> when  <math>a_k=0</math>.  This is alright because of the definition of  <math>\binom{n}{r}</math> which defines this case as 0.
+But to make it more sensible, we can define the function equivalently as:
+<math>
+  \begin{align}
+G(a_1,.. , a_P) =  \sum_{k=1}^P \binom{k+a_k-1}{k}
+   \end{align}
+</math>
+So when  <math>a_k=0</math>, the binomial coefficient reads as   <math>\binom{k-1}{k}</math> which equals 0 since there are 0 ways to choose k items from k-1 items.
+= Getting the genotypes from a genotype index and a given ploidy =
+   The genotype index is computed by a summation of a series which is
+   monotonically decreasing.  This allows you to compute the inverse function
+   from index to the ordered genotypes by using a "water rapids algorithm" with
+   decreasing height of each mini water fall.
+  std::vector<int32_t> bcf_ip2g(int32_t genotype_index, uint32_t no_ploidy)
+  {
+    std::vector<int32_t> genotype(no_ploidy, 0);
+    int32_t pth = no_ploidy;
+    int32_t max_allele_index = genotype_index;
+    int32_t leftover_genotype_index = genotype_index;
+    while (pth>0)
+    {
+        for (int32_t allele_index=0; allele_index <= max_allele_index; ++allele_index)
+        {
+            int32_t i = choose(pth+allele_index-1, pth);
+            if (i>=leftover_genotype_index || allele_index==max_allele_index)
+            {
+                if (i>leftover_genotype_index) --allele_index;
+                leftover_genotype_index -= choose(pth+allele_index-1, pth);
+                --pth;
+                max_allele_index = allele_index;
+                genotype[pth] = allele_index;
+                break;
+            }
+        }
+    }
+    return genotype;
+  }
+  todo:: describe in a human understandable fashion.
+= Simple cases =
 {| class="wikitable"
 |-
-! scope="col"| Case
+! scope="col"| Ploidy
 ! scope="col"| Alleles
-! scope="col"| Genotypes
+! scope="col"| No. of Genotypes
 ! scope="col"| Index
-! scope="col"| comments
 |-
-| ploidy \le  alleles
+| 1
-| ploidy  alleles
+| A
+| <math>
+F(1, A) =  \binom{1+A-1}{A-1} = A
+</math>
+| <math>
+G(a_1) =   F(1, a_1) =  a_1
+</math>
+|-
+| 2
+| A
+| <math>
+F(2,A) =  \binom{2+A-1}{A-1} =  \binom{A+1}{2}
+</math>
+| <math>
+G(a_1,a_2) =   F(1, a_1)  + F(2, a_2) =  a_1 + \binom{a_2+1}{2}
+</math>
+|}
+= Derivation for counting the number of genotypes =
+There must always be P observed alleles and there can only be at most A alleles.  This can be modeled by P+A-1 points where you choose A-1 points to be dividers that separate the alleles.
+Thus the number of ways you can observe this is <math> \binom{P+A-1}{A-1}  </math> which is equivalent to <math> \binom{P+A-1}{P}  </math>.
+= Derivation for getting the index of a genotype  in an enumerated list =
+== Observation of nested patterns ==
+An important observation here is that for the enumeration of  A alleles for a given P ploidy, the enumeration of A-1 alleles for P ploidy is a subsequence.
+{| class="wikitable"
+|-
+! scope="col"| Index
+! scope="col"| A=4,P=3
+! scope="col"| A=3,P=3
+! scope="col"| A=2,P=3
+! scope="col"| A=1,P=3
+|-
+|0 <br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+| AAA <br>
+AAB <br>
+ABB <br>
+BBB <br>
+AAC <br>
+ABC <br>
+BBC <br>
+ACC <br>
+BCC <br>
+CCC <br>
+AAD <br>
+ABD <br>
+BBD <br>
+ACD <br>
+BCD <br>
+CCD <br>
+ADD <br>
+BDD <br>
+CDD <br>
+DDD <br>
+| AAA <br>
+AAB <br>
+ABB <br>
+BBB <br>
+AAC <br>
+ABC <br>
+BBC <br>
+ACC <br>
+BCC <br>
+CCC <br><br><br><br><br><br><br><br><br><br><br>
+| AAA <br>
+AAB <br>
+ABB <br>
+BBB <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
+| AAA <br>
+<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
+|-
+|}
+Another important observation here is that for a genotype <math>(a_1, a_2, .. a_P)</math>, The <math>F(P, a_p)</math>th genotype to  <math>(a_1, a_2, .. a_P)</math> all end with <math>a_p</math>, this does not help distinguish the order, so we need to only examine the genotype <math>(a_1,..., a_{P-1})</math>. The sub genotype <math>(a_1,..., a_{P-1})</math> is also ordered else <math>(a_1,..., a_{P})</math> is not ordered.
+The nested genotype sequence is in red.  The blue sequence shows the sequence of genotypes enumerated without involving <math>a_P</math>.
+{| class="wikitable"
+|-
+! scope="col"| Index
+! scope="col"| A=4,P=3
+! scope="col"| A=4,P=2
+! scope="col"| A=4,P=1
+|-
+|0 <br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+| <span style="color:#FF0000">AAA</span><br>
+<span style="color:#FF0000">AAB</span> <br>
+<span style="color:#FF0000">ABB</span> <br>
+<span style="color:#FF0000">BBB</span> <br>
+<span style="color:#FF0000">AAC</span> <br>
+<span style="color:#FF0000">ABC</span> <br>
+<span style="color:#FF0000">BBC</span> <br>
+<span style="color:#FF0000">ACC</span> <br>
+<span style="color:#FF0000">BCC</span> <br>
+<span style="color:#FF0000">CCC</span> <br>
+<span style="color:#FF0000">AAD</span> <br>
+<span style="color:#FF0000">ABD</span> <br>
+<span style="color:#FF0000">BBD</span> <br>
+<span style="color:#FF0000">ACD</span> <br>
+<span style="color:#FF0000">BCD</span> <br>
+<span style="color:#FF0000">CCD</span> <br>
+<span style="color:#FF0000">ADD</span> <br>
+<span style="color:#FF0000">BDD</span> <br>
+<span style="color:#FF0000">CDD</span> <br>
+<span style="color:#FF0000">DDD</span> <br>
+|<span style="color:#0000FF">AAA</span><br>
+<span style="color:#0000FF">AAB</span> <br>
+<span style="color:#0000FF">ABB</span> <br>
+<span style="color:#0000FF">BBB</span> <br>
+<span style="color:#0000FF">AAC</span> <br>
+<span style="color:#0000FF">ABC</span> <br>
+<span style="color:#0000FF">BBC</span> <br>
+<span style="color:#0000FF">ACC</span> <br>
+<span style="color:#0000FF">BCC</span> <br>
+<span style="color:#0000FF">CCC</span> <br>
+<span style="color:#FF0000">AA</span>D <br>
+<span style="color:#FF0000">AB</span>D <br>
+<span style="color:#FF0000">BB</span>D <br>
+<span style="color:#FF0000">AC</span>D <br>
+<span style="color:#FF0000">BC</span>D <br>
+<span style="color:#FF0000">CC</span>D <br>
+<span style="color:#FF0000">AD</span>D <br>
+<span style="color:#FF0000">BD</span>D <br>
+<span style="color:#FF0000">CD</span>D <br>
+<span style="color:#FF0000">DD</span>D <br>
+| AAA<br>
+AAB <br>
+ABB <br>
+BBB <br>
+AAC <br>
+ABC <br>
+BBC <br>
+ACC <br>
+BCC <br>
+CCC <br>
+<span style="color:#0000FF">AA</span>D <br>
+<span style="color:#0000FF">AB</span>D <br>
+<span style="color:#0000FF">BB</span>D <br>
+<span style="color:#0000FF">AC</span>D <br>
+<span style="color:#0000FF">BC</span>D <br>
+<span style="color:#0000FF">CC</span>D <br>
+<span style="color:#FF0000">A</span>DD <br>
+<span style="color:#FF0000">B</span>DD <br>
+<span style="color:#FF0000">C</span>DD <br>
+<span style="color:#FF0000">D</span>DD <br>
+|-
+|}
+The above 2 observations are the key to breaking down the enumeration recursively.
+== Derivation ==
+<math>a_1, ... a_P</math> is ordered and indexed 0 to A-1.  <br> <br>
 <math>
    \begin{align}
-P(G_{i,j}|R_{k})^{(l)}=\frac{P(R_{k}|G_{i,j})P(G_{i,j})^{(l-1)}}{\sum_{(i,j)}{P(R_{k}|G_{i,j})P(G_{i,j})^{(l-1)}}}
+G(a_1,.. , a_P) &= \| \{\text{genotypes of } a_P \text{ alleles for ploidy P}\} \| +  G(a_1,.. , a_{P-1})\\
+                       &= F(P, a_P) + G(a_1,..,a_{P-1})  \\
+                       &= F(P, a_P) + F(P-1, a_{P-1}) + G(a_1,..,a_{P-2}) \\
+                       &= F(P, a_P) + F(P-1, a_{P-1}) +  ... + F(1,a_1)  \\
+                       &= \sum_{k=1}^P F(k, a_k) \\
+                       &= \sum_{k=1}^P  \binom{k+a_k-1}{a_k-1}
     \end{align}
 </math>
-| 18794
-| 18849
+<br>
-| bcftools's normalization is buggy, variants were truncated despite having differing prefix.
+This algorithm is demonstrated in the following table to obtain the index of the genotype C/C/D.
+<br>
+{| class="wikitable"
+|-
+! scope="col"| Index
+! scope="col"| Iteration 0
+! scope="col"|
+! scope="col"| Iteration 1
+! scope="col"|
+! scope="col"| Iteration 2
 |-
-| ploidy gt alleles
+|0 <br>
-| -
+<br>
-| -
+<br>
-| -
+<br>
-| -
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<br>
+<span style="color:#FF0000">15</span><br>
+| AAA <br>
+AAB <br>
+ABB <br>
+BBB <br>
+AAC <br>
+ABC <br>
+BBC <br>
+ACC <br>
+BCC <br>
+CCC <br>
+AAD <br>
+ABD <br>
+BBD <br>
+ACD <br>
+BCD <br>
+CCD
+| AAA <br>
+AAB <br>
+ABB <br>
+BBB <br>
+AAC <br>
+ABC <br>
+BBC <br>
+ACC <br>
+BCC <br>
+CCC <br><br><br><br><br><br><br>
+| <br><br><br><br><br><br> <br><br><br><br>
+AA <br>
+AB <br>
+BB <br>
+AC <br>
+BC <br>
+CC <br>
+| <br><br><br><br><br><br> <br><br><br><br>
+AA <br>
+AB <br>
+BB <br>
+<br><br><br>
+| <br><br><br><br><br><br> <br><br><br><br><br><br><br>
+A <br>
+B <br>
+C <br>
 |-
-| #normalized after gatk
+|Function call
-| -
+|G(CCD)
-| 0
+|F(3,3)
-| 57
+|G(CC)
-| 57 variants from GATK's normalization were left aligned by vt.  6 were biallelic and 51 were multiallelic. Note that 2 variants were changed by GATK but were not completely normalized.
+|F(2,2)
+|G(C) = F(1, 2)
+|-
+|value returned
+|
+|10
+|
+|3
+|2
+|index=10+3+2=<span style="color:#FF0000">15</span> (QED!)
 |-
-| #normalized after vt
-| -
-| 0
-| 0
-| no variants processed by vt were further normalized.
 |}
+= Algorithm for enumerating the genotypes given ploidy and alleles =
+The below code is for enumerating genotypes and can be used to test the above equations.
+    uint32_t no = 0 // some global variable
+    void print_genotypes(uint32_t A, uint32_t P, std::string genotype)
+    {
+        if (genotype.size()==P)
+        {
+            std::cerr << no << ") " << genotype << "\n";
+            ++no;
+        }
+        else
+        {
+            for (uint32_t a=0; a<A; ++a)
+            {
+                std::string s(1,(char)(a+65));
+                s.append(genotype);
+                print_genotypes(a+1, P, s);
+            }
+        }
+    }
+= Acknowledgement =
+To [mailto:pd3@sanger.ac.uk Petr Danecek] for double checking this.
+= Maintained by =
+This page is maintained by  [mailto:atks@umich.edu Adrian].