Difference between revisions of "Minimac3 Info File"

From Genome Analysis Wiki
(Created page with " = Introduction = [http://genome.sph.umich.edu/wiki/Minimac3 '''Minimac3 '''] is a lower memory and more computationally efficient implementation of [http://genome.sph.umich....")
 
 
(31 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
'''For versions earlier than 0.1.13 (downloaded before Oct 15, 2015 or from Imputation Server) please see [[Minimac3 Info (Older Version)| older version of Minimac3 Info page]]''' !!!
  
 
= Introduction =
 
= Introduction =
Line 4: Line 5:
 
[http://genome.sph.umich.edu/wiki/Minimac3 '''Minimac3 '''] is a lower memory and more computationally efficient implementation of [http://genome.sph.umich.edu/wiki/Minimac2 minimac2]. It is an algorithm for genotypic imputation that works on phased genotypes (say from [http://genome.sph.umich.edu/wiki/MaCH MaCH]) and is designed to handle very large reference panels in a more computationally efficient way with no loss of accuracy.  
 
[http://genome.sph.umich.edu/wiki/Minimac3 '''Minimac3 '''] is a lower memory and more computationally efficient implementation of [http://genome.sph.umich.edu/wiki/Minimac2 minimac2]. It is an algorithm for genotypic imputation that works on phased genotypes (say from [http://genome.sph.umich.edu/wiki/MaCH MaCH]) and is designed to handle very large reference panels in a more computationally efficient way with no loss of accuracy.  
  
This wiki page is designed to give users '''a detailed explanation on Minimac3 Usage'''.
+
This wiki page is designed to give users '''a detailed explanation of the info file outputted by Minimac3'''.
  
= Command Line Options =
+
= Info File Descriptors =
  
A typical Minimac3 command line would have the following parameter options:
+
The available column descriptors for typical Minimac3 output are as follows.
  
Command Line Options:
+
==='''SNP'''===
    Reference Haplotypes : --refHaps [], --passOnly
 
      Target Haplotypes : --haps []
 
      Output Parameters : --processReference, --prefix [Minimac3.Output],
 
                          --updateModel, --nobgzip, --doseOutput, --hapOutput,
 
                          --format [GT,DS]
 
      Subset Parameters : --chr [], --start, --end, --window
 
    Starting Parameters : --rec [], --err []
 
  Estimation Parameters : --rounds [5], --states [200]
 
        Other Parameters : --help, --cpus [1], --params
 
              PhoneHome : --noPhoneHome, --phoneHomeThinning [50]
 
  
= Detailed Usage =
+
The SNP identifier for the variant. This is usually in the form '''chr:position''', but can be the rsid of the variant if the user specified <code>--rsid</code> during the Minimac3 run (provided the input reference panel has the rsid in the INFO column).
  
The available options of Minimac3 are explained in detail below. See wiki page on [[Minimac3 Examples|Examples]] and [[Minimac3 - Full List of Options |Full list of Options]] for more details. There is also a wiki-page on [[Minimac3 Imputation Cookbook]] which is recommended for new users !
+
==='''REF(0)''', '''ALT(1)'''===
  
==Reference Haplotypes==
+
These are the reference and alternate alleles for the variant as imported from the reference panel file (either VCF or M3VCF). The dosage value (see [[ #Dosage| '''Dosage''']]) in the <code>.dose</code> file is the alternate allele dosage and NOT the major allele dosage as in earlier versions of '''minimac'''. Specifically, the dosage is the expected number of alternate alleles, <code>P(REF,ALT) + 2*P(ALT,ALT)</code>.
  
<font face=Courier>"--refHaps"</font> denotes the main input reference file could either be a VCF file or <font face=Courier>M3VCF</font> file. No handle is necessary for denoting type of file, program will detect it itself.
+
==='''ALT_Frq'''===
  
Minimac3 can handle both VCF files or <font face=Courier>M3VCF</font> files as input for the reference panel. The program can itself identify the type of file, and no handle is necessary for that.  <font face=Courier>M3VCF</font> files are customized files created by Minimac3 (possibly in some previous run) that stores large reference panels in a compact form so as to save memory and computation time involved in reading large files. See wiki page on [[M3VCF Files| <font face=Courier>M3VCF</font> files]] for further details. Users can download commonly used reference panels in both VCF and <font face=Courier>M3VCF</font> format from [[Minimac3#Reference Panels for Download |Reference Panels]].
+
This is the allele frequency of alternate (ALT) allele in the imputed dosage data (see [[ #Dosage| '''Dosage''']]).
  
==Target Haplotypes==
+
==='''MAF'''===
  
<font face=Courier>"--haps"</font> denotes the main input GWAS file which has to be a VCF file (<font face=Courier>.vcf</font> or <font face=Courier>.vcf.gz</font>). The extensions are not mandatory.  
+
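This is the minor allele frequency of the variant in the imputed dosage data. Comparing MAF with ALT_Frq identifies the minor allele: if the two values are equal, the alternate allele is the minor allele; otherwise the reference allele is.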
  
Minimac3 can handle only VCF files as input for the GWAS data (see page on [[Minimac3 Cookbook : Converting Files to VCF|Converting Files to VCF]]). Note that input VCF files would be automatically assumed to be pre-phased (see page on [[Minimac3 Cookbook : Pre-Phasing|Pre-Phasing]]). Markers which are in the target panel and NOT in the reference panel would be excluded from the output files. User must merge these extra markers back to the original data in order to analyze them.
+
==='''AvgCall'''===
  
==Output Files==
+
This is the average probability (certainty) of observing the most likely allele for each haplotype. While the R-square is a measure of confidence in the imputed dosages, the Average Call is a measure of confidence in the most likely genotypes.
  
<font face=Courier>"--prefix"</font> denotes the prefix for the output files (By default: <font face=Courier>Minimac3.Output</font>)
+
==='''Rsq'''===
  
Minimac3 can output files in both <font face=Courier>VCF</font> format and <font face=Courier>.dose</font> format (usual [http://genome.sph.umich.edu/wiki/Minimac minimac] output format). By default, Minimac3 will only output in <font face=Courier>VCF</font> format and users must use the handle <font face=Courier>--doseOutput</font> to output in <font face=Courier>.dose</font> format or the handle <font face=Courier>--hapOutput</font> to output dosage data in phased format. Output VCF files can store dosage data only in the following formats and is managed by the handle <font face=Courier>--format</font> (by default : <font face=Courier>--format DS,GT</font>) :
+
This is the estimated value of the squared correlation between imputed genotypes and true, unobserved genotypes. Since true genotypes are not available, this calculation is based on the idea that poorly imputed genotype counts will shrink towards their expectations based on population allele frequencies alone; specifically <math>2p</math> where <math>p</math> is the frequency of the allele being imputed.
  
* '''DS''' : Estimated alternate allele dosage (default).
+
Currently, Minimac3 uses the following definition (where <math>\hat{p}</math> is the alternate allele frequency and <math>D_i</math> is the imputed alternate allele probability at the <math>i^{th}</math> haplotype (see [[ #Dosage| '''Dosage''']]) and <math>n</math> is the number of GWAS samples) :
* '''GT''' : Estimated most likely genotype (default).
 
* '''GP''' : Estimated posterior genotype probabilities (use handle <font face=Courier>--format GP</font>).
 
  
The handle <font face=Courier>--processReference</font> is used to ONLY convert reference panels from <font face=Courier>VCF</font> format to [[M3VCF Files|<font face=Courier>M3VCF</font>]] format (and save parameter estimates). NO imputation will be performed and thus NO target/gwas haplotypes are required. However, by default, parameter estimation will be done using the reference panel and the estimates will be saved in the <font face=Courier>M3VCF</font> files. Users should use <font face=Courier>--rounds  0</font> in order to opt out of parameter estimation and only compress the reference panel and save it as a <font face=Courier>M3VCF</font> file. See wiki page on [[Minimac3 Examples|Examples]] for further details.
+
:<math>\hat{r}^2 = {{ {{1}\over{2n}} \times \sum_{i=1}^{2n} {(D_i - \hat{p})^2} }\over{\hat{p}(1-\hat{p})}}</math>
  
[NOTE: While doing imputation, if parameter estimates are found in <font face=Courier>M3VCF</font> files, Minimac3 will automatically use them for imputation.  Users should use handle <font face=Courier>--updateModel</font> in order to update the parameter estimates using the target/gwas panel as well. However, this is NOT necessary in most cases, unless the user has strong reasons to believe that this might increase the imputation accuracy.]
+
==='''Genotyped'''===
  
== Remaining Parameters ==
+
This column is an indicator of whether the variant was "<code>Genotyped</code>", "<code>Imputed</code>" or "<code>Genotyped_Only</code>".
  
This sub-section explains the remaining parameters available.
+
==='''LooRsq'''===
  
* '''Subset Parameters:''' The subset parameters are required if the user wishes to impute into a particular region of the chromosome rather than the whole chromosome (typically used when running imputation in chunks). For example, to analyze chromosome 6 from position 1000000 to position 2000000 with 500000 base positions on either side as a buffer, one must use <font face=Courier>--chr 6 --from 1000000  --to 2000000 --window 500000 </font>. If using the subset parameters, a default window of 1Mbp is applied on either side, unless otherwise specified by the user. Variants from the buffer region are only used for imputation and not reported in the final output.
+
This statistic can only be provided for genotyped sites. It is similar to the estimated '''Rsq''' above, but the imputed dosages used in the calculation are obtained by hiding all known genotypes for the given SNP (see [[ #LooDosage | '''LooDosage''']]).
  
* '''Starting Parameters:''' The starting parameters are used if the users wishes to use some previously created parameter estimate files to save time on parameter estimation (<font face=Courier>.recom</font> and <font face=Courier>.erate</font> files can be used with <font face=Courier>--rec</font> and <font face=Courier>--err</font> respectively).
+
==='''EmpR''', '''EmpRsq''' ===
  
* '''Estimation Parameters:''' The estimation parameters specify the number of iterations (<font face=Courier>--rounds [5]</font>) and number of states (<font face=Courier>--states [200]</font>) to consider while implementing the Hidden Markov Model for parameter estimation. Default values of 5 and 200 are used (these would generally give accurate enough estimates and need not be increased unless the user has strong reasons to do so).
+
While the '''LooRsq''' statistic completely ignores experimental genotypes, '''EmpR''' is the correlation between the true genotyped values and the imputed dosages calculated by hiding all known genotypes for the given SNP (see [[ #LooDosage | '''LooDosage''']]). A negative correlation between imputed and experimental genotypes can indicate allele flips. Like '''LooRsq''', this statistic can only be provided for genotyped sites. '''EmpRsq''' is the square of this correlation.
  
* '''Other Parameters:''' These parameters have varying usage. <font face=Courier>--help</font> would print out a brief documentation of Minimac3 and its usage, <font face=Courier>--cpus [5]</font> would allow the user to use multiple processors when running in parallel (this option is only available when running Minimac3-omp), <font face=Courier>--params</font> is used to print out the current values for the usage parameters.
+
==='''Dose0'''===
  
* '''PhoneHome:''' This option (by default) sends a message to a University of Michigan database about the success/failure of the analysis run (and as to what kind of failure had occurred, if so). No information about the data, file or file-name is sent back. User should use the handle <font face=Courier>--noPhoneHome</font> to opt out from this option or should use <font face=Courier>--phoneHomeThinning 50</font> to send back a message with 50% chance (typically used when running lots of command lines).
+
Average [[ #LooDosage | '''LooDosage''']] at haplotypes with the alternate allele at this site. A value of 0.97 denotes that, of all the haplotypes carrying an alternate allele at this site, 97% would be imputed accurately to the alternate allele if the site were treated as not genotyped. The closer the value is to 1.0, the more accurately the site has been imputed.
 +
 
 +
==='''Dose1'''===
 +
 
 +
One minus the average [[ #LooDosage | '''LooDosage''']] at haplotypes with the reference allele at this site. A value of 0.03 denotes that, of all the haplotypes carrying a reference allele at this site, 3% would be imputed inaccurately to the alternate allele if the site were treated as not genotyped. The closer the value is to 0.0, the more accurately the site has been imputed.
 +
 
 +
= '''Dosage''' =
 +
 
 +
Minimac3 estimates the imputed dosage at the haplotype level as the posterior probability of the alternate allele at that site. The genotype dosage is then the sum of the two haplotype dosages. For example, if the estimated posterior probabilities of the alternate allele are 0.98 and 0.97 on the two haplotypes, the genotype dosage is output as 0.98 + 0.97 = 1.95.
 +
 
 +
= '''Hard Genotype''' =
 +
 
 +
Minimac3 uses a maximum likelihood estimator for hard-call genotypes. For each haplotype, the allele with the maximum posterior probability is assigned, and the final genotype call is obtained from the hard-call haplotypes. For example, if the estimated posterior probabilities of the alternate allele are 0.56 and 0.60 on the two haplotypes, the alternate allele is assigned to each haplotype and the final hard-call genotype is output as 1|1. Note that the hard-call genotype is NOT the MLE from the estimated genotype probabilities but from the estimated haplotype probabilities. In this example, the heterozygous genotype has the highest posterior probability, 0.56 × 0.40 + 0.44 × 0.60 = 0.488, compared with 0.56 × 0.60 = 0.336 for 1|1; the output hard genotype is nevertheless 1|1, because each haplotype has more than 50% probability of carrying the alternate allele. Such cases arise only when a site is not imputed well.
 +
 
 +
= '''LooDosage''' =
 +
 
 +
Minimac3 uses an ad-hoc method to estimate imputation accuracy at sites that were genotyped in the study sample. For each such genotyped site, Minimac3 hides all known genotypes for that site and calculates an imputed dosage (in addition to the usual alternate allele dosage calculated assuming the genotypes are known at the site). This special imputed value is called the Leave-One-Out dosage ('''LooDosage''') and is only available for genotyped sites. '''LooDosage''' is used to calculate the Empirical-Rsquare statistics ('''EmpR''', '''EmpRsq''') by computing the Pearson correlation coefficient between '''LooDosage''' and the known genotypes. It is also used to estimate '''LooRsq'''.
  
 
= Download =
 
= Download =

Latest revision as of 17:48, 14 December 2018

For versions earlier than 0.1.13 (downloaded before Oct 15, 2015 or from Imputation Server), please see the older version of the Minimac3 Info page.

Introduction

Minimac3 is a lower memory and more computationally efficient implementation of minimac2. It is an algorithm for genotypic imputation that works on phased genotypes (say from MaCH) and is designed to handle very large reference panels in a more computationally efficient way with no loss of accuracy.

This wiki page is designed to give users a detailed explanation of the info file produced by Minimac3.

Info File Descriptors

The available column descriptors for typical Minimac3 output are as follows.

SNP

The SNP identifier for the variant. This is usually in the form chr:position, but can be the rsid of the variant if the user specified --rsid during the Minimac3 run (provided the input reference panel has the rsid in the INFO column).

REF(0), ALT(1)

These are the reference and alternate alleles for the variant as imported from the reference panel file (either VCF or M3VCF). The dosage value (see Dosage) in the .dose file is the alternate allele dosage and NOT the major allele dosage as in earlier versions of minimac. Specifically, the dosage is the expected number of alternate alleles, P(REF,ALT) + 2*P(ALT,ALT).
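As a minimal sketch of the expression above, the following Python function computes the alternate allele dosage from three genotype probabilities. The function and argument names are illustrative only and are not part of Minimac3.

```python
def alt_allele_dosage(p_ref_ref, p_ref_alt, p_alt_alt):
    """Expected number of ALT alleles: P(REF,ALT) + 2*P(ALT,ALT)."""
    total = p_ref_ref + p_ref_alt + p_alt_alt
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    # The REF/REF genotype contributes zero ALT alleles, so it drops out.
    return p_ref_alt + 2.0 * p_alt_alt


# Hypothetical genotype probabilities for one sample at one site.
dosage = alt_allele_dosage(0.01, 0.08, 0.91)
print(round(dosage, 3))
```

The result always lies between 0 (certainly REF/REF) and 2 (certainly ALT/ALT).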

ALT_Frq

This is the frequency of the alternate (ALT) allele in the imputed dosage data (see Dosage).

MAF

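This is the minor allele frequency of the variant in the imputed dosage data. Comparing MAF with ALT_Frq identifies the minor allele: if the two values are equal, the alternate allele is the minor allele; otherwise the reference allele is.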
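The comparison between MAF and ALT_Frq can be sketched in Python as follows. The function name is hypothetical, and assigning a tie at exactly 0.5 to the alternate allele is an arbitrary choice made for this illustration.

```python
def minor_allele(ref, alt, alt_frq):
    """Return (minor allele, MAF) given the ALT allele frequency."""
    maf = min(alt_frq, 1.0 - alt_frq)
    # ALT is the minor allele when its frequency does not exceed 0.5
    # (a tie at exactly 0.5 is assigned to ALT here, arbitrarily).
    minor = alt if alt_frq <= 0.5 else ref
    return minor, maf


# Hypothetical variant: REF=A, ALT=G, ALT_Frq=0.8, so A is the minor allele.
allele, maf = minor_allele("A", "G", 0.8)
print(allele, round(maf, 3))
```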

AvgCall

This is the average probability (certainty) of observing the most likely allele for each haplotype. While the R-square is a measure of confidence in the imputed dosages, the Average Call is a measure of confidence in the most likely genotypes.
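One way to read this description is as the mean, over haplotypes, of the larger of the two allele probabilities. The sketch below follows that reading; it is an interpretation of the text above and is not taken from the Minimac3 source code, and the function name is hypothetical.

```python
def average_call(hap_dosages):
    """Mean probability of the most likely allele across haplotypes.

    hap_dosages: posterior probabilities of the ALT allele, one per haplotype.
    """
    if not hap_dosages:
        raise ValueError("at least one haplotype dosage is required")
    # For each haplotype the most likely allele has probability max(d, 1 - d).
    return sum(max(d, 1.0 - d) for d in hap_dosages) / len(hap_dosages)


# Hypothetical haplotype dosages at one site.
print(round(average_call([0.9, 0.2, 0.5, 1.0]), 3))
```

Under this reading the value ranges from 0.5 (every haplotype completely uncertain) to 1.0 (every haplotype called with certainty).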

Rsq

This is the estimated value of the squared correlation between imputed genotypes and true, unobserved genotypes. Since true genotypes are not available, this calculation is based on the idea that poorly imputed genotype counts will shrink towards their expectations based on population allele frequencies alone; specifically <math>2p</math>, where <math>p</math> is the frequency of the allele being imputed.

Currently, Minimac3 uses the following definition, where <math>\hat{p}</math> is the alternate allele frequency, <math>D_i</math> is the imputed alternate allele probability at the <math>i^{th}</math> haplotype (see Dosage), and <math>n</math> is the number of GWAS samples:

<math>\hat{r}^2 = \frac{\frac{1}{2n} \sum_{i=1}^{2n} (D_i - \hat{p})^2}{\hat{p}(1-\hat{p})}</math>
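The definition can be sketched in Python as the variance of the 2n haplotype dosages around the alternate allele frequency, divided by p(1 - p). This sketch assumes the allele frequency is the mean of the haplotype dosages, and returning 0.0 for a monomorphic site is a choice made for the illustration; neither detail is confirmed against the Minimac3 source code.

```python
def estimated_rsq(hap_dosages):
    """Estimated Rsq from the 2n haplotype-level ALT allele probabilities."""
    m = len(hap_dosages)  # m = 2n haplotypes
    if m == 0:
        raise ValueError("at least one haplotype dosage is required")
    # Assumption: the ALT allele frequency is the mean haplotype dosage.
    p_hat = sum(hap_dosages) / m
    if p_hat <= 0.0 or p_hat >= 1.0:
        return 0.0  # monomorphic site; illustrative convention only
    observed_var = sum((d - p_hat) ** 2 for d in hap_dosages) / m
    return observed_var / (p_hat * (1.0 - p_hat))


# Perfectly certain dosages give 1; dosages equal to the frequency give 0.
print(estimated_rsq([0.0, 0.0, 1.0, 1.0]))
print(estimated_rsq([0.5, 0.5, 0.5, 0.5]))
```

The two calls illustrate the shrinkage idea: when every dosage collapses to the allele frequency, the numerator vanishes and the estimate is 0.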

Genotyped

This column is an indicator of whether the variant was "Genotyped", "Imputed" or "Genotyped_Only".

LooRsq

This statistic can only be provided for genotyped sites. It is similar to the estimated Rsq above, but the imputed dosages used in the calculation are obtained by hiding all known genotypes for the given SNP (see LooDosage).

EmpR, EmpRsq

While the LooRsq statistic completely ignores experimental genotypes, EmpR is the correlation between the true genotyped values and the imputed dosages calculated by hiding all known genotypes for the given SNP (see LooDosage). A negative correlation between imputed and experimental genotypes can indicate allele flips. Like LooRsq, this statistic can only be provided for genotyped sites. EmpRsq is the square of this correlation.
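The following Python sketch computes a Pearson correlation between observed alleles and leave-one-out dosages, and shows how an allele flip produces a negative value. The data are made up for illustration, the function name is hypothetical, and returning NaN when either input has no variance is a convention chosen here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")  # correlation is undefined without variance
    return sxy / math.sqrt(sxx * syy)


# Made-up observed alleles (0 = REF, 1 = ALT) and leave-one-out dosages.
true_alleles = [0, 0, 1, 1, 0, 1]
loo_dosages = [0.10, 0.20, 0.90, 0.70, 0.05, 0.95]

emp_r = pearson_r(loo_dosages, true_alleles)
emp_rsq = emp_r ** 2

# Swapping REF and ALT in the imputed data flips the sign of the correlation.
flipped = [1.0 - d for d in loo_dosages]
emp_r_flipped = pearson_r(flipped, true_alleles)
print(emp_r > 0, emp_r_flipped < 0)
```

Because squaring discards the sign, a flipped site and a correctly oriented site give the same EmpRsq, which is why the sign of EmpR is worth checking separately.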

Dose0

Average LooDosage at haplotypes with the alternate allele at this site. A value of 0.97 denotes that, of all the haplotypes carrying an alternate allele at this site, 97% would be imputed accurately to the alternate allele if the site were treated as not genotyped. The closer the value is to 1.0, the more accurately the site has been imputed.

Dose1

One minus the average LooDosage at haplotypes with the reference allele at this site. A value of 0.03 denotes that, of all the haplotypes carrying a reference allele at this site, 3% would be imputed inaccurately to the alternate allele if the site were treated as not genotyped. The closer the value is to 0.0, the more accurately the site has been imputed.
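To show how the columns described above might be used together, here is a minimal Python sketch that reads an info file and keeps variants above an Rsq cutoff. It rests on several assumptions that are not stated on this page: that the file is whitespace-delimited with a header row using the column names listed above, that statistics unavailable for imputed-only sites appear as a non-numeric placeholder such as "-", and that 0.3 is a reasonable cutoff. The rows in the sample are made-up values.

```python
import io


def read_info(handle):
    """Yield one dict per variant from a whitespace-delimited info file."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if fields:
            yield dict(zip(header, fields))


def to_float(text):
    """Convert a field to float, or None for placeholders such as '-'."""
    try:
        return float(text)
    except ValueError:
        return None


# Made-up example content; a real run would open the info file instead.
sample = io.StringIO(
    "SNP\tREF(0)\tALT(1)\tALT_Frq\tMAF\tAvgCall\tRsq\tGenotyped\t"
    "LooRsq\tEmpR\tEmpRsq\tDose0\tDose1\n"
    "20:61098\tC\tT\t0.250\t0.250\t0.95\t0.82\tImputed\t-\t-\t-\t-\t-\n"
    "20:61270\tA\tC\t0.010\t0.010\t0.99\t0.12\tImputed\t-\t-\t-\t-\t-\n"
    "20:61795\tG\tT\t0.700\t0.300\t0.99\t0.99\tGenotyped\t"
    "0.93\t0.96\t0.92\t0.97\t0.03\n"
)

MIN_RSQ = 0.3  # hypothetical cutoff, chosen only for this illustration
kept = []
for row in read_info(sample):
    rsq = to_float(row["Rsq"])
    if rsq is not None and rsq >= MIN_RSQ:
        kept.append(row["SNP"])
print(kept)
```

Reading rows into dictionaries keyed by the header keeps the filter independent of column order, which helps if different Minimac3 versions emit different column sets.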

Dosage

Minimac3 estimates the imputed dosage at the haplotype level as the posterior probability of the alternate allele at that site. The genotype dosage is then the sum of the two haplotype dosages. For example, if the estimated posterior probabilities of the alternate allele are 0.98 and 0.97 on the two haplotypes, the genotype dosage is output as 0.98 + 0.97 = 1.95.

Hard Genotype

Minimac3 uses a maximum likelihood estimator for hard-call genotypes. For each haplotype, the allele with the maximum posterior probability is assigned, and the final genotype call is obtained from the hard-call haplotypes. For example, if the estimated posterior probabilities of the alternate allele are 0.56 and 0.60 on the two haplotypes, the alternate allele is assigned to each haplotype and the final hard-call genotype is output as 1|1. Note that the hard-call genotype is NOT the MLE from the estimated genotype probabilities but from the estimated haplotype probabilities. In this example, the heterozygous genotype has the highest posterior probability, 0.56 × 0.40 + 0.44 × 0.60 = 0.488, compared with 0.56 × 0.60 = 0.336 for 1|1; the output hard genotype is nevertheless 1|1, because each haplotype has more than 50% probability of carrying the alternate allele. Such cases arise only when a site is not imputed well.
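The difference between calling at the haplotype level and at the genotype level can be sketched in Python. The function names are hypothetical, the genotype probabilities assume the two haplotypes are independent given the data, and assigning a tie at exactly 0.5 to the reference allele is an arbitrary choice for the illustration.

```python
def hard_call(p_alt_hap1, p_alt_hap2):
    """Phased hard call from per-haplotype ALT allele probabilities."""
    # Each haplotype gets its most likely allele (a tie at 0.5 goes to REF).
    a1 = 1 if p_alt_hap1 > 0.5 else 0
    a2 = 1 if p_alt_hap2 > 0.5 else 0
    return f"{a1}|{a2}"


def genotype_probabilities(p_alt_hap1, p_alt_hap2):
    """Unphased genotype probabilities, assuming independent haplotypes."""
    return {
        "hom_ref": (1.0 - p_alt_hap1) * (1.0 - p_alt_hap2),
        "het": p_alt_hap1 * (1.0 - p_alt_hap2)
        + (1.0 - p_alt_hap1) * p_alt_hap2,
        "hom_alt": p_alt_hap1 * p_alt_hap2,
    }


call = hard_call(0.56, 0.60)
probs = genotype_probabilities(0.56, 0.60)
most_likely = max(probs, key=probs.get)
print(call, most_likely)
```

For the example in the text, the haplotype-level call is 1|1 while the most probable unphased genotype is the heterozygote, which is the discrepancy the paragraph describes.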

LooDosage

Minimac3 uses an ad-hoc method to estimate imputation accuracy at sites that were genotyped in the study sample. For each such genotyped site, Minimac3 hides all known genotypes for that site and calculates an imputed dosage (in addition to the usual alternate allele dosage calculated assuming the genotypes are known at the site). This special imputed value is called the Leave-One-Out dosage (LooDosage) and is only available for genotyped sites. LooDosage is used to calculate the Empirical-Rsquare statistics (EmpR, EmpRsq) by computing the Pearson correlation coefficient between LooDosage and the known genotypes. It is also used to estimate LooRsq.

Download

Minimac3 is available as an undocumented release version. The source files (and binary executable) are available for download in Source Files and commonly used reference panels in VCF and M3VCF formats are available for download in Reference Panels.

Useful Wiki Pages

There are a few pages in this Wiki that may be useful for Minimac3 users. Here are links to a few:

Contact

For any queries or bug reports, please contact Sayantan Das.