Difference between revisions of "Relationship between Ploidy, Alleles and Genotypes"

From Genome Analysis Wiki
Jump to navigationJump to search
 
(13 intermediate revisions by the same user not shown)
Line 7: Line 7:
  
 
Plants species exhibit a diverse number of ploidy, for example, the strawberry is an octoploid and the pear is a triploid.  
 
Plants species exhibit a diverse number of ploidy, for example, the strawberry is an octoploid and the pear is a triploid.  
 +
 +
Copy number variations in somatic variant calling also leads to variable ploidy to consider when genotyping a locus.
  
 
While there are explicit functions that could be googled for handling haploid and diploid cases.  It seems difficult to find the closed forms for the general case.
 
While there are explicit functions that could be googled for handling haploid and diploid cases.  It seems difficult to find the closed forms for the general case.
Line 52: Line 54:
  
 
So when  <math>a_k=0</math>, the binomial coefficient reads as  <math>\binom{k-1}{k}</math> which equals 0 since there are 0 ways to choose k items from k-1 items.
 
So when  <math>a_k=0</math>, the binomial coefficient reads as  <math>\binom{k-1}{k}</math> which equals 0 since there are 0 ways to choose k items from k-1 items.
 +
 +
= Getting the genotypes from a genotype index and a given ploidy =
 +
 +
  The genotype index is computed by a summation of a series which is
 +
  monotonically decreasing.  This allows you to compute the inverse function
 +
  from index to the ordered genotypes by using a "water rapids algorithm" with
 +
  decreasing height of each mini water fall.
 +
 +
  std::vector<int32_t> bcf_ip2g(int32_t genotype_index, uint32_t no_ploidy)
 +
  {
 +
    std::vector<int32_t> genotype(no_ploidy, 0);
 +
    int32_t pth = no_ploidy;
 +
    int32_t max_allele_index = genotype_index;
 +
    int32_t leftover_genotype_index = genotype_index;
 +
    while (pth>0)
 +
    {
 +
        for (int32_t allele_index=0; allele_index <= max_allele_index; ++allele_index)
 +
        {
 +
            int32_t i = choose(pth+allele_index-1, pth);
 +
            if (i>=leftover_genotype_index || allele_index==max_allele_index)
 +
            {
 +
                if (i>leftover_genotype_index) --allele_index;
 +
                leftover_genotype_index -= choose(pth+allele_index-1, pth);
 +
                --pth;
 +
                max_allele_index = allele_index;
 +
                genotype[pth] = allele_index;
 +
                break;               
 +
            }
 +
        }
 +
    }
 +
    return genotype;
 +
  }
 +
 +
  todo:: describe in a human understandable fashion.
  
 
= Simple cases =
 
= Simple cases =
Line 256: Line 292:
 
== Derivation ==
 
== Derivation ==
  
<math>a_1, ... a_P</math> is ordered and indexed 0 to A-1.   
+
<math>a_1, ... a_P</math> is ordered and indexed 0 to A-1.  <br> <br>
 
<math>
 
<math>
 
   \begin{align}
 
   \begin{align}
Line 267: Line 303:
 
   \end{align}
 
   \end{align}
 
</math>
 
</math>
 +
 +
<br>
 +
 +
This algorithm is demonstrated in the following table to obtain the index of the genotype C/C/D.
 +
 +
<br>
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 292: Line 334:
 
13 <br>
 
13 <br>
 
14 <br>
 
14 <br>
15 <br>
+
<span style="color:#FF0000">15</span><br>  
 
| AAA <br>
 
| AAA <br>
 
AAB <br>
 
AAB <br>
Line 349: Line 391:
 
|3
 
|3
 
|2
 
|2
 +
|index=10+3+2=<span style="color:#FF0000">15</span> (QED!)
 
|-
 
|-
 
|}
 
|}

Latest revision as of 21:07, 18 October 2017

Introduction

The VCF format encodes genotypes by the index of the enumeration of genotypes given ploidy number and alleles. This allows for direct access to a value associated with a genotype within an array when one works with genotype likelihoods.

Motivation

Plants species exhibit a diverse number of ploidy, for example, the strawberry is an octoploid and the pear is a triploid.

Copy number variations in somatic variant calling also leads to variable ploidy to consider when genotyping a locus.

While there are explicit functions that could be googled for handling haploid and diploid cases. It seems difficult to find the closed forms for the general case. This wiki fills in that need.

The number of genotypes given a ploidy and alleles

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{align} F(P,A) = \binom{P+A-1}{A-1} \\ \end{align} }

where P is the ploidy number and A is the number of alleles.

Getting the index of a genotype in an enumerated list given a ploidy and alleles

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{align} G(a_1,.. , a_P) = \sum_{k=1}^P \binom{k+a_k-1}{a_k-1} \end{align} }

where P is the number of ploidy, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_1} , Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_2} .. Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_P} are the alleles in numeric encoding (0 to A-1) and are ordered (e.g. AB and ABCCCC are ordered but ACB is not ordered).

This is well defined because:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{align} \binom{n}{r} = \begin{cases} \frac{n!}{(n-r)!r!} &, r \le n, r\ge0, n\ge0 \\ 0 & \text{otherwise} \end{cases} \end{align} }

Because Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_k} may be 0, we will see cases of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \binom{k-1}{-1}} when Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_k=0} . This is alright because of the definition of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \binom{n}{r}} which defines this case as 0. But to make it more sensible, we can define the function equivalently as:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{align} G(a_1,.. , a_P) = \sum_{k=1}^P \binom{k+a_k-1}{k} \end{align} }

So when , the binomial coefficient reads as Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \binom{k-1}{k}} which equals 0 since there are 0 ways to choose k items from k-1 items.

Getting the genotypes from a genotype index and a given ploidy

  The genotype index is computed by a summation of a series which is 
  monotonically decreasing.  This allows you to compute the inverse function
  from index to the ordered genotypes by using a "water rapids algorithm" with
  decreasing height of each mini water fall.

 std::vector<int32_t> bcf_ip2g(int32_t genotype_index, uint32_t no_ploidy)
 {
   std::vector<int32_t> genotype(no_ploidy, 0);
   int32_t pth = no_ploidy;
   int32_t max_allele_index = genotype_index;
   int32_t leftover_genotype_index = genotype_index;
   while (pth>0)
   {
       for (int32_t allele_index=0; allele_index <= max_allele_index; ++allele_index)
       {
           int32_t i = choose(pth+allele_index-1, pth);
           if (i>=leftover_genotype_index || allele_index==max_allele_index)
           {
               if (i>leftover_genotype_index) --allele_index;
               leftover_genotype_index -= choose(pth+allele_index-1, pth);
               --pth;
               max_allele_index = allele_index;
               genotype[pth] = allele_index;
               break;                
           }
       }
   }
   return genotype;
 }
 todo:: describe in a human understandable fashion.

Simple cases

Ploidy Alleles No. of Genotypes Index
1 A Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle F(1, A) = \binom{1+A-1}{A-1} = A } Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle G(a_1) = F(1, a_1) = a_1 }
2 A Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle F(2,A) = \binom{2+A-1}{A-1} = \binom{A+1}{2} } Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle G(a_1,a_2) = F(1, a_1) + F(2, a_2) = a_1 + \binom{a_2+1}{2} }

Derivation for counting the number of genotypes

There must always be P observed alleles and there can only be at most A alleles. This can be modeled by P+A-1 points where you choose A-1 points to be dividers that separate the alleles. Thus the number of ways you can observe this is Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \binom{P+A-1}{A-1} } which is equivalent to Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \binom{P+A-1}{P} } .

Derivation for getting the index of a genotype in an enumerated list

Observation of nested patterns

An important observation here is that for the enumeration of A alleles for a given P ploidy, the enumeration of A-1 alleles for P ploidy is a subsequence.

Index A=4,P=3 A=3,P=3 A=2,P=3 A=1,P=3
0

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
AAD
ABD
BBD
ACD
BCD
CCD
ADD
BDD
CDD
DDD

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC










AAA

AAB
ABB
BBB
















AAA




















Another important observation here is that for a genotype Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (a_1, a_2, .. a_P)} , The Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle F(P, a_p)} th genotype to Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (a_1, a_2, .. a_P)} all end with Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_p} , this does not help distinguish the order, so we need to only examine the genotype Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (a_1,..., a_{P-1})} . The sub genotype Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (a_1,..., a_{P-1})} is also ordered else Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (a_1,..., a_{P})} is not ordered.

The nested genotype sequence is in red. The blue sequence shows the sequence of genotypes enumerated without involving Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_P} .

Index A=4,P=3 A=4,P=2 A=4,P=1
0

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
AAD
ABD
BBD
ACD
BCD
CCD
ADD
BDD
CDD
DDD

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
AAD
ABD
BBD
ACD
BCD
CCD
ADD
BDD
CDD
DDD

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
AAD
ABD
BBD
ACD
BCD
CCD
ADD
BDD
CDD
DDD

The above 2 observations are the key to breaking down the enumeration recursively.

Derivation

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a_1, ... a_P} is ordered and indexed 0 to A-1.

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{align} G(a_1,.. , a_P) &= \| \{\text{genotypes of } a_P \text{ alleles for ploidy P}\} \| + G(a_1,.. , a_{P-1})\\ &= F(P, a_P) + G(a_1,..,a_{P-1}) \\ &= F(P, a_P) + F(P-1, a_{P-1}) + G(a_1,..,a_{P-2}) \\ &= F(P, a_P) + F(P-1, a_{P-1}) + ... + F(1,a_1) \\ &= \sum_{k=1}^P F(k, a_k) \\ &= \sum_{k=1}^P \binom{k+a_k-1}{a_k-1} \end{align} }


This algorithm is demonstrated in the following table to obtain the index of the genotype C/C/D.


Index Iteration 0 Iteration 1 Iteration 2
0

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
AAD
ABD
BBD
ACD
BCD
CCD

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
















AA
AB
BB
AC
BC
CC











AA
AB
BB
















A
B
C

Function call G(CCD) F(3,3) G(CC) F(2,2) G(C) = F(1, 2)
value returned 10 3 2 index=10+3+2=15 (QED!)

Algorithm for enumerating the genotypes given ploidy and alleles

The below code is for enumerating genotypes and can be used to test the above equations.

   uint32_t no = 0 // some global variable
   void print_genotypes(uint32_t A, uint32_t P, std::string genotype)
   {
       if (genotype.size()==P)
       {
           std::cerr << no << ") " << genotype << "\n";
           ++no;
       }
       else
       {
           for (uint32_t a=0; a<A; ++a)
           {
               std::string s(1,(char)(a+65));
               s.append(genotype);
               print_genotypes(a+1, P, s);
           }
       }
   }

Acknowledgement

To Petr Danecek for double checking this.

Maintained by

This page is maintained by Adrian.