Genotype Likelihood based Allele Frequency


Introduction

Allele frequency is an important statistic in the study of genetic variants. This page details expectation-maximization (EM) algorithms for estimating genotype and allele frequencies from genotype likelihoods in next-generation sequencing (NGS) data.

Estimation of Genotype Frequencies without assuming HWE

This EM algorithm estimates the genotype frequencies without assuming Hardy-Weinberg equilibrium (HWE). The posterior probability of a genotype given the reads $R_{k}$ of individual $k$ at the $l$-th iteration is given by:

${\begin{aligned}P(G_{i,j}|R_{k})^{(l)}={\frac {P(R_{k}|G_{i,j})P(G_{i,j})^{(l-1)}}{\sum _{(i,j)}{P(R_{k}|G_{i,j})P(G_{i,j})^{(l-1)}}}}\end{aligned}}$

where $G_{i,j}$ denotes the genotype composed of alleles $i$ and $j$, $P(R_{k}|G_{i,j})$ is the genotype likelihood for individual $k$, and $k$ indexes the individuals from $1$ to $N$. The initial genotype probability is given by:

${\begin{aligned}P(G_{i,j})^{(0)}=f_{i,j}^{(0)}={\frac {2}{n(n+1)}}\end{aligned}}$

where $n$ is the number of alleles and ${\frac {n(n+1)}{2}}$ is the number of possible genotypes for $n$ alleles. In other words, the algorithm starts from the same frequency for every genotype.
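As a minimal sketch of the initialization and of a single posterior computation, the following Python uses only the standard library. The function names and the example likelihood values are hypothetical and chosen for illustration; genotypes are represented as unordered pairs `(i, j)` with `i <= j`.

```python
from itertools import combinations_with_replacement


def genotypes(n_alleles):
    """All unordered genotypes (i, j) with i <= j for n_alleles alleles."""
    return list(combinations_with_replacement(range(n_alleles), 2))


def uniform_genotype_freqs(n_alleles):
    """Initial guess: every genotype gets frequency 2 / (n (n + 1))."""
    gts = genotypes(n_alleles)
    return {gt: 1.0 / len(gts) for gt in gts}


def genotype_posterior(likelihoods, freqs):
    """Bayes' rule: P(G | R_k) from P(R_k | G) and the current P(G)."""
    joint = {gt: likelihoods[gt] * freqs[gt] for gt in freqs}
    total = sum(joint.values())  # assumed non-zero in this sketch
    return {gt: value / total for gt, value in joint.items()}


# Hypothetical genotype likelihoods for one individual at a biallelic site.
gl = {(0, 0): 0.01, (0, 1): 0.90, (1, 1): 0.09}
prior = uniform_genotype_freqs(2)
post = genotype_posterior(gl, prior)
```

With the uniform starting frequencies, the first posterior is simply the genotype likelihoods rescaled to sum to one.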

The E step takes the expected count of genotype $G_{i,j}$ for individual $k$ to be its posterior probability:

${\begin{aligned}E[G_{i,j}|R_{k}]^{(l)}=P(G_{i,j}|R_{k})^{(l)}\end{aligned}}$

The M step estimates the genotype frequency using the individual expected genotype counts:

${\begin{aligned}P(G_{i,j})^{(l)}=f_{i,j}^{(l)}={\frac {1}{N}}\sum _{k}{E[G_{i,j}|R_{k}]}^{(l)}\end{aligned}}$

The E and M steps are repeated until an appropriate convergence criterion is met.
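The full iteration without HWE can be sketched as follows. This is an illustrative implementation, not the one used by any particular tool: the function name, the `tol` and `max_iter` parameters, and the stopping rule (largest absolute change in any genotype frequency) are assumptions, since the page does not specify a convergence criterion. Each individual is given as a list of genotype likelihoods in a fixed genotype order.

```python
def em_genotype_freqs(gls, n_genotypes, tol=1e-8, max_iter=1000):
    """Estimate genotype frequencies from genotype likelihoods by EM.

    gls: one list per individual holding P(R_k | G) for each genotype,
         in the same genotype order for every individual.
    """
    n_ind = len(gls)
    # Start from equal frequencies for all genotypes.
    freqs = [1.0 / n_genotypes] * n_genotypes
    for _ in range(max_iter):
        expected = [0.0] * n_genotypes
        # E step: accumulate each individual's posterior genotype probabilities.
        for gl in gls:
            joint = [lik * f for lik, f in zip(gl, freqs)]
            total = sum(joint)  # assumed non-zero in this sketch
            for g in range(n_genotypes):
                expected[g] += joint[g] / total
        # M step: average the expected genotype counts over individuals.
        new_freqs = [e / n_ind for e in expected]
        delta = max(abs(a - b) for a, b in zip(new_freqs, freqs))
        freqs = new_freqs
        if delta < tol:  # assumed stopping rule
            break
    return freqs


# Hypothetical data: four individuals whose genotypes are known with certainty.
example = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
estimate = em_genotype_freqs(example, 3)
```

When the likelihoods identify each genotype with certainty, as in the example data, the estimate reduces to the observed genotype proportions.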

Estimation of Allele Frequencies assuming HWE

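To estimate allele frequencies under the HWE assumption, the E step computes, for each individual $k$, the expected proportion of allele $i$ (written $I$) given the reads. A homozygote $G_{i,i}$ contributes fully, and each heterozygote $G_{i,j}$ with $j\neq i$ contributes one half; with two alleles the sum has a single term.

${\begin{aligned}E[I|R_{k}]^{(l)}=P(G_{i,i}|R_{k})^{(l)}+0.5\sum _{j\neq i}{P(G_{i,j}|R_{k})^{(l)}}\end{aligned}}$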

In the M step, the allele frequencies are estimated by averaging the expectations from the E step over individuals, and the genotype frequencies are then derived from these allele frequencies assuming HWE.

${\begin{aligned}P(I)^{(l)}={\frac {1}{N}}\sum _{k}{E[I|R_{k}]}^{(l)}\end{aligned}}$

$P(G_{i,j})^{(l)}={\begin{cases}(P(I)^{(l)})^{2},&{\text{if }}i=j\\2P(I)^{(l)}P(J)^{(l)},&{\text{if }}i\neq j\end{cases}}$

The E and M steps are repeated until an appropriate convergence criterion is met.
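The HWE variant can be sketched in the same way. The names, the `tol` and `max_iter` parameters, and the stopping rule are again assumptions. The page does not state a starting point for this variant, so the sketch starts from equal allele frequencies. Genotype likelihoods are supplied as dictionaries keyed by unordered allele pairs `(i, j)` with `i <= j`.

```python
from itertools import combinations_with_replacement


def hwe_genotype_freqs(p):
    """Genotype frequencies under HWE: p_i^2 if i == j, else 2 p_i p_j."""
    pairs = combinations_with_replacement(range(len(p)), 2)
    return {(i, j): p[i] * p[i] if i == j else 2.0 * p[i] * p[j]
            for i, j in pairs}


def em_allele_freqs_hwe(gls, n_alleles, tol=1e-8, max_iter=1000):
    """Estimate allele frequencies from genotype likelihoods, assuming HWE."""
    # Assumed starting point: equal allele frequencies.
    p = [1.0 / n_alleles] * n_alleles
    for _ in range(max_iter):
        prior = hwe_genotype_freqs(p)
        expected = [0.0] * n_alleles
        # E step: expected allele proportion for each individual.
        for gl in gls:
            joint = {gt: gl[gt] * prior[gt] for gt in prior}
            total = sum(joint.values())  # assumed non-zero in this sketch
            for (i, j), value in joint.items():
                post = value / total
                # Each of the two alleles carries half of the genotype's weight,
                # so a homozygote adds its full posterior to one allele.
                expected[i] += 0.5 * post
                expected[j] += 0.5 * post
        # M step: average over individuals.
        new_p = [e / len(gls) for e in expected]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:  # assumed stopping rule
            break
    return p


# Hypothetical data: three certain reference homozygotes and one certain heterozygote.
hom_ref = {(0, 0): 1.0, (0, 1): 0.0, (1, 1): 0.0}
het = {(0, 0): 0.0, (0, 1): 1.0, (1, 1): 0.0}
allele_freqs = em_allele_freqs_hwe([hom_ref, hom_ref, hom_ref, het], 2)
```

For the example data the result matches simple allele counting, since every genotype is known with certainty: seven of the eight alleles are the reference allele.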

Used in

Hardy-Weinberg Likelihood Test statistic and Inbreeding Coefficient

Derivation

Adrian with much help from Hyun.

Maintained by

This page is maintained by Adrian.
