# Introduction

The VCF format encodes genotypes by the index of the enumeration of genotypes given ploidy number and alleles. This allows for direct access to a genotype value within an array particularly when one works with genotype likelihoods.

# Motivation

Plants species exhibit a diverse number of ploidy, for example, the strawberry is an octoploid and the pear is a triploid.

While there are explicit functions that could be googled for handling haploid and diploidy cases. It seems to be difficult to find the closed forms for the general case. This wiki fills in that need.

# The number of genotypes given a ploidy and alleles

{\displaystyle {\begin{aligned}F(P,A)={\binom {P+A-1}{A-1}}\\\end{aligned}}}

where P is the ploidy number and A is the number of alleles.

# Getting the index of a genotype in an enumerated list given a ploidy and alleles

{\displaystyle {\begin{aligned}G(a_{1},..,a_{P})=\sum _{k=1}^{P}{\binom {k+a_{k}-1}{a_{k}-1}}\end{aligned}}}

where P is the number of ploidy, ${\displaystyle a_{1}}$, ${\displaystyle a_{2}}$ .. ${\displaystyle a_{P}}$ are the alleles in numeric encoding (0 to A-1) and are ordered (AB, ABCCCC). For example ACB is not ordered.

This is well defined because:

{\displaystyle {\begin{aligned}{\binom {n}{r}}={\begin{cases}{\frac {n!}{(n-r)!r!}}&,r\leq n,r\geq 0,n\geq 0\\0&{\text{otherwise}}\end{cases}}\end{aligned}}}

Because ${\displaystyle a_{k}}$ may be 0, we will see cases of ${\displaystyle {\binom {k-1}{-1}}}$ when ${\displaystyle a_{k}=0}$. This is alright because of the definition of ${\displaystyle {\binom {n}{r}}}$ which defines this case as 0. But to make it more sensible, we can define the function equivalently as:

{\displaystyle {\begin{aligned}G(a_{1},..,a_{P})=\sum _{k=1}^{P}{\binom {k+a_{k}-1}{k}}\end{aligned}}}

So when ${\displaystyle a_{k}=0}$, the binomial coefficient reads as ${\displaystyle {\binom {k-1}{k}}}$ which equals 0 since there are 0 ways to choose k items from k-1 items.

# Simple cases

Ploidy Alleles Genotypes Index
1 A ${\displaystyle F(1,A)={\binom {1+A-1}{A-1}}=A}$ ${\displaystyle G(a_{1})=F(1,a_{1})=a_{1}}$
2 A ${\displaystyle F(2,A)={\binom {2+A-1}{A-1}}={\binom {A+1}{2}}}$ ${\displaystyle G(a_{1},a_{2})=F(1,a_{1})+F(2,a_{2})=a_{1}+{\binom {a_{2}+1}{2}}}$

# Derivation for counting the number of genotypes

There must always be P observed alleles and there can only be at most A alleles. This can be modeled by P+A-1 points where you choose A-1 points to be dividers that separate the alleles. Thus the number of ways you can observe this is ${\displaystyle {\binom {P+A-1}{A-1}}}$ which is equivalent to ${\displaystyle {\binom {P+A-1}{P}}}$.

# Derivation for getting the index of a genotype in an enumerated list

## Observation of nested patterns

An important observation here is that for the enumeration of A alleles for a given P ploidy, the enumeration of A-1 alleles for P ploidy is a subsequence.

Index A=4,P=3 A=3,P=3 A=2,P=3 A=1,P=3
0

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
ABD
BBD
ACD
BCD
CCD
BDD
CDD
DDD

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC

AAA

AAB
ABB
BBB

AAA

Another important observation here is that for a genotype ${\displaystyle (a_{1},a_{2},..a_{P})}$, The ${\displaystyle F(P,a_{p})}$th genotype to ${\displaystyle (a_{1},a_{2},..a_{P})}$ all end with ${\displaystyle a_{p}}$, this does not help distinguish the order, so we need to only examine the genotype ${\displaystyle (a_{1},...,a_{P-1})}$. The sub genotype ${\displaystyle (a_{1},...,a_{P-1})}$ is also ordered else ${\displaystyle (a_{1},...,a_{P})}$ is not ordered.

The nested genotype sequence is in red. The blue sequence shows the sequence of genotypes enumerated without involving ${\displaystyle a_{P}}$.

Index A=4,P=3 A=4,P=2 A=4,P=1
0

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
ABD
BBD
ACD
BCD
CCD
BDD
CDD
DDD

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
ABD
BBD
ACD
BCD
CCD
BDD
CDD
DDD

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
ABD
BBD
ACD
BCD
CCD
BDD
CDD
DDD

## Derivation

${\displaystyle a_{1},...a_{P}}$ is ordered and indexed 0 to A-1. Clearly when P = 1, the enumeration of the genotypes is trivially the same as the allele.

when P > 1,

{\displaystyle {\begin{aligned}G(a_{1},..,a_{P})&=\|\{{\text{genotypes of }}a_{P}{\text{ alleles for ploidy P}}\}\|+G(a_{1},..,a_{P-1})\\&=F(P,a_{P})+G(a_{1},..,a_{P-1})\\&=F(P,a_{P})+F(P-1,a_{P-1})+G(a_{1},..,a_{P-2})\\&=F(P,a_{P})+F(P-1,a_{P-1})+...+F(1,a_{1})\\&=\sum _{k=1}^{P}F(k,a_{k})\\&=\sum _{k=1}^{P}{\binom {k+a_{k}-1}{a_{k}-1}}\end{aligned}}}

Index Iteration 0 Iteration 1 Iteration 2
0

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC
ABD
BBD
ACD
BCD
CCD

AAA

AAB
ABB
BBB
AAC
ABC
BBC
ACC
BCC
CCC

AA
AB
BB
AC
BC
CC

AA
AB
BB

A
B
C

Function call G(CCD) F(3,3) G(CC) F(2,2) G(C) = F(1, 2)
value returned 10 3 2

# Algorithm for enumerating the genotypes given ploidy and alleles

The below code is for enumerating genotypes and can be used to test the above equations.

   uint32_t no = 0 // some global variable
void print_genotypes(uint32_t A, uint32_t P, std::string genotype)
{
if (genotype.size()==P)
{
std::cerr << no << ") " << genotype << "\n";
++no;
}
else
{
for (uint32_t a=0; a<A; ++a)
{
std::string s(1,(char)(a+65));
s.append(genotype);
print_genotypes(a+1, P, s);
}
}
}


# Acknowledgement

To Petr Danecek for double checking this.