Minimac3 Info File

From Genome Analysis Wiki

For versions earlier than 0.1.13 (downloaded before Oct 15, 2015, or from the Imputation Server), please see the older version of the Minimac3 Info page.

Introduction

Minimac3 is a lower-memory, more computationally efficient implementation of minimac2. It is an algorithm for genotype imputation that works on phased genotypes (for example, from MaCH) and is designed to handle very large reference panels with no loss of accuracy.

This wiki page gives users a detailed explanation of the info file output by Minimac3.

Info File Descriptors

The available column descriptors for typical Minimac3 output are as follows.

SNP

The SNP identifier for the variant. This is usually in the form chr:position, but it can be the rsid of the variant if the user selected --rsid during the Minimac3 run (provided the input reference panel has the rsid in the INFO column).

REF(0), ALT(1)

These are the reference and alternate alleles for the variant as imported from the reference panel file (either VCF or M3VCF). The dosage value (see Dosage) in the .dose file is the alternate allele dosage and NOT the major allele dosage as in earlier versions of minimac. Specifically, the dosage is the expected number of alternate alleles, P(REF,ALT) + 2*P(ALT,ALT).

ALT_Frq

This is the frequency of the alternate (ALT) allele in the imputed dosage data (see Dosage).

MAF

This is the minor allele frequency of the variant in the imputed dosage data. Comparing the MAF to ALT_Frq identifies the minor allele: if the two values are equal, ALT is the minor allele; otherwise REF is.
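The relationship between ALT_Frq and MAF can be sketched as follows. The function name is hypothetical, and treating a frequency of exactly 0.5 as "ALT is minor" is an assumption made only to keep the example deterministic.

```python
def maf_and_minor_allele(alt_frq):
    """Return (MAF, minor allele label) for a given ALT allele frequency.

    Assumption: at alt_frq == 0.5 the ALT allele is reported as minor.
    """
    if alt_frq <= 0.5:
        return alt_frq, "ALT"
    return 1.0 - alt_frq, "REF"


common_alt = maf_and_minor_allele(0.75)  # ALT is the major allele here
rare_alt = maf_and_minor_allele(0.2)     # ALT is the minor allele here
```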

AvgCall

This is the average probability (certainty) of observing the most likely allele for each haplotype. While the R-square is a measure of confidence in the imputed dosages, the Average Call is a measure of confidence in the most likely genotypes.
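As an illustration of the description above, the sketch below averages the probability of the most likely allele over haplotypes. It assumes each haplotype has a single alternate allele probability d, so the most likely allele has probability max(d, 1 - d). The function name is hypothetical and this is not taken from the Minimac3 source code.

```python
def avg_call(hap_dosages):
    """Mean probability of the most likely allele across haplotypes.

    hap_dosages: alternate allele probabilities, one per haplotype.
    """
    return sum(max(d, 1.0 - d) for d in hap_dosages) / len(hap_dosages)


# Four haplotypes: the most likely alleles have probability 0.9, 0.8, 1.0, 0.6.
example_avg_call = avg_call([0.9, 0.2, 1.0, 0.4])
```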

Rsq

This is the estimated value of the squared correlation between imputed genotypes and true, unobserved genotypes. Since true genotypes are not available, this calculation is based on the idea that poorly imputed genotype counts will shrink towards their expectations based on population allele frequencies alone; specifically 2p where p is the frequency of the allele being imputed.

Currently, Minimac3 uses the following definition, where \hat{p} is the alternate allele frequency, D_i is the imputed alternate allele probability at the i^{th} haplotype (see Dosage), and n is the number of GWAS samples:

\hat{r}^2 = {{ {{1}\over{2n}} \times \sum_{i=1}^{2n} {(D_i - \hat{p})^2} }\over{\hat{p}(1-\hat{p})}}
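The formula can be evaluated directly from the haplotype-level dosages. The sketch below assumes that \hat{p} is the mean of the D_i (the ALT allele frequency in the imputed data) and returns 0 for monomorphic sites, where the denominator vanishes. Both choices are assumptions of this example, and the function name is hypothetical.

```python
def estimated_rsq(hap_dosages):
    """Estimated r^2: variance of haplotype dosages over p_hat * (1 - p_hat).

    hap_dosages: the 2n imputed alternate allele probabilities D_i.
    """
    m = len(hap_dosages)            # m = 2n haplotypes
    p_hat = sum(hap_dosages) / m    # assumed: ALT frequency is the mean dosage
    if p_hat <= 0.0 or p_hat >= 1.0:
        return 0.0                  # assumed convention for monomorphic sites
    variance = sum((d - p_hat) ** 2 for d in hap_dosages) / m
    return variance / (p_hat * (1.0 - p_hat))


confident = estimated_rsq([0.0, 1.0, 0.0, 1.0])      # dosages are all 0 or 1
uninformative = estimated_rsq([0.5, 0.5, 0.5, 0.5])  # every dosage equals p_hat
```

When every dosage is 0 or 1, the numerator equals \hat{p}(1-\hat{p}) and the ratio is 1. When every dosage has shrunk to the allele frequency, the numerator is 0, which matches the shrinkage argument above.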

Genotyped

This column is an indicator of whether the variant was "Genotyped", "Imputed" or "Genotyped_Only".
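A minimal sketch of reading these columns and tallying the Genotyped indicator is shown below. The column names follow this page. The sample rows, the whitespace-delimited layout with a header line, and the helper name read_info are assumptions made for illustration, so check them against an actual info file.

```python
import io
from collections import Counter

# Hypothetical excerpt: rows and layout are illustrative, not real output.
SAMPLE_INFO = """\
SNP REF(0) ALT(1) ALT_Frq MAF AvgCall Rsq Genotyped
20:1000 A G 0.25 0.25 0.98 0.95 Genotyped
20:2000 C T 0.60 0.40 0.90 0.70 Imputed
20:3000 G A 0.10 0.10 0.99 0.97 Imputed
"""


def read_info(handle):
    """Parse a whitespace-delimited table with a header into a list of dicts."""
    header = handle.readline().split()
    return [dict(zip(header, line.split())) for line in handle if line.strip()]


rows = read_info(io.StringIO(SAMPLE_INFO))
status_counts = Counter(row["Genotyped"] for row in rows)
```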

LooRsq

This statistic can only be provided for genotyped sites. It is similar to the estimated Rsq above, but the imputed dosages used in the calculation are obtained by hiding all known genotypes for the given SNP (see LooDosage).

EmpR, EmpRsq

While the LooRsq statistic completely ignores experimental genotypes, EmpR is the correlation between the true genotyped values and the imputed dosages that were calculated by hiding all known genotypes for the given SNP (see LooDosage). A negative correlation between imputed and experimental genotypes can indicate allele flips. This statistic can also only be provided for genotyped sites. EmpRsq is the square of this correlation.
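The sketch below computes a Pearson correlation between known genotypes and leave-one-out dosages, and shows how an allele flip produces a negative value. The numbers are made up, the function name is hypothetical, and working at the genotype level (values 0, 1, 2) rather than the haplotype level is an assumption of this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


true_genotypes = [0, 1, 2, 1]        # illustrative experimental genotypes
loo_dosages = [0.1, 0.9, 1.9, 1.1]   # illustrative leave-one-out dosages

emp_r = pearson_r(true_genotypes, loo_dosages)
emp_rsq = emp_r ** 2

# Swapping REF and ALT turns each dosage d into 2 - d and flips the sign of r.
flipped_r = pearson_r(true_genotypes, [2.0 - d for d in loo_dosages])
```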

Dose1

Average LooDosage at haplotypes with the alternate allele at this site. A value of 0.97 denotes that, of all the haplotypes with an alternate allele at this site, 97% would be imputed accurately to the alternate allele if this site were assumed not to be genotyped. The closer the value is to 1.0, the more accurately the site has been imputed.

Dose2

One minus the average LooDosage at haplotypes with the reference allele at this site. A value of 0.03 denotes that, of all the haplotypes with a reference allele at this site, 3% would be imputed inaccurately to the alternate allele if this site were assumed not to be genotyped. The closer the value is to 0.0, the more accurately the site has been imputed.

Dosage

Minimac3 estimates the imputed dosage at the haplotype level by finding the posterior probability of the alternate allele at that site. The genotype dosage is then evaluated as the sum of the two haplotype dosages. For example, if the estimated posterior probability of the alternate allele is 0.98 and 0.96 in the two haplotypes, the genotype dosage is output as 0.98 + 0.96 = 1.94.
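The sketch below sums two haplotype dosages and checks that the result agrees with the expression P(REF,ALT) + 2*P(ALT,ALT) given under REF(0), ALT(1). It assumes the two haplotypes are treated as independent when forming genotype probabilities. The function names are hypothetical.

```python
def genotype_dosage(h1, h2):
    """Genotype dosage as the sum of the two haplotype ALT probabilities."""
    return h1 + h2


def dosage_from_genotype_probs(h1, h2):
    """Same quantity via genotype probabilities (haplotypes assumed independent)."""
    p_het = h1 * (1.0 - h2) + (1.0 - h1) * h2   # P(REF,ALT), either phase
    p_hom_alt = h1 * h2                         # P(ALT,ALT)
    return p_het + 2.0 * p_hom_alt


by_sum = genotype_dosage(0.90, 0.75)
by_probs = dosage_from_genotype_probs(0.90, 0.75)
```

Expanding the second function gives h1 + h2 - 2*h1*h2 + 2*h1*h2 = h1 + h2, so the two computations agree for any pair of haplotype probabilities.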

Hard Genotype

Minimac3 uses a maximum likelihood estimator for hard-call genotypes. For each haplotype, the allele with the maximum posterior probability is assigned, and the final genotype call is obtained from the hard-call haplotypes. For example, if the estimated posterior probability of the alternate allele is 0.56 and 0.60 in the two haplotypes, then the alternate allele is assigned to each haplotype and the final hard-call genotype is output as 1|1. Note that the hard-call genotype is NOT the MLE from the estimated genotype probabilities but the MLE from the estimated haplotype probabilities. In this example, the heterozygous genotype has the highest posterior probability (0.56*0.40 + 0.44*0.60 = 0.488, compared with 0.336 for the homozygous alternate genotype), but the output hard genotype is 1|1 rather than heterozygous, because each haplotype individually had more than 50% probability of carrying the alternate allele. Such cases arise only when a site is not imputed well.
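The difference between a haplotype-level and a genotype-level call can be reproduced with the numbers from the example above. This sketch assumes the two haplotypes are independent when forming genotype probabilities, and it resolves a probability of exactly 0.5 to the reference allele. Both are assumptions of the example, and the function names are hypothetical.

```python
def hard_call(h1, h2):
    """Phased hard call: assign each haplotype its most likely allele."""
    a1 = 1 if h1 > 0.5 else 0   # assumed: a tie at 0.5 goes to the reference
    a2 = 1 if h2 > 0.5 else 0
    return f"{a1}|{a2}"


def genotype_probs(h1, h2):
    """Unphased genotype probabilities, assuming independent haplotypes."""
    return {
        "hom_ref": (1.0 - h1) * (1.0 - h2),
        "het": h1 * (1.0 - h2) + (1.0 - h1) * h2,
        "hom_alt": h1 * h2,
    }


call = hard_call(0.56, 0.60)
probs = genotype_probs(0.56, 0.60)
most_likely_genotype = max(probs, key=probs.get)
```

Here `call` is 1|1 while `most_likely_genotype` is the heterozygous class, which is the discrepancy described above.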

LooDosage

Minimac3 uses an ad hoc method to estimate imputation accuracy at sites that were genotyped in the study sample. For each such genotyped site, Minimac3 hides all known genotypes for that site and calculates an imputed dosage (in addition to the usual alternate allele dosage calculated assuming the genotypes are known at the site). This special imputed value is called the Leave-One-Out dosage (LooDosage) and is only available for genotyped sites. LooDosage is used to calculate the Empirical R-square (EmpR, EmpRsq) by directly calculating the Pearson correlation coefficient between LooDosage and the known genotypes. It is also used to estimate the LooRsq.

Download

Minimac3 is available as an undocumented release version. The source files (and binary executable) are available for download in Source Files, and commonly used reference panels in VCF and M3VCF formats are available for download in Reference Panels.

Useful Wiki Pages

There are a few pages in this Wiki that may be useful for Minimac3 users. Here are links to a few:

Contact

For questions or bug reports, please contact Sayantan Das.