Difference between revisions of "Minimac Diagnostics"
Latest revision as of 11:35, 8 June 2017
minimac is a tool for imputation of missing genotypes into phased haplotypes. At the end of each run, minimac generates summaries of imputation quality and stores those in a .info file.
Basic Descriptors
Marker and Allele Labels
The first three columns in the .info file list marker name and alleles for each marker. Typically, the most common allele will be listed first, but this is not guaranteed.
Estimated Allele Frequency
The next column in the .info file lists the estimated frequency of allele 1. This corresponds to the average number of imputed copies of allele 1 per individual, divided by two.
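As a minimal sketch of the calculation described above, the snippet below averages per-individual imputed dosages of allele 1 and divides by two. The function name and the example dosages are hypothetical and are not taken from minimac itself.

```python
def estimated_allele_frequency(dosages):
    """Average imputed copies of allele 1 per individual, divided by two.

    `dosages` holds one value per individual, each between 0 and 2.
    (Hypothetical helper for illustration; not part of minimac.)
    """
    if not dosages:
        raise ValueError("need at least one individual")
    return sum(dosages) / len(dosages) / 2.0


# Made-up dosages for four individuals.
example_dosages = [2.0, 1.0, 0.0, 1.0]
example_freq = estimated_allele_frequency(example_dosages)
print(example_freq)
```

Dividing by two converts the average allele count per diploid individual into a per-chromosome frequency.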
Estimated Imputation Accuracy (ImpRsq)
Frequency information is followed by an estimate of the squared correlation between imputed genotypes and true, unobserved genotypes. Since true genotypes are not available, this calculation is based on the idea that poorly imputed genotype counts will shrink towards their expectations based on population allele frequencies alone; specifically 2p, where p is the frequency of the allele being imputed.
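The shrinkage idea can be illustrated with a simple variance ratio: if imputed dosages collapse towards 2p, their variance falls below the variance expected for true genotypes. The sketch below assumes Hardy-Weinberg proportions, so that true genotype counts have variance 2p(1-p), and works on per-individual dosages. It is an illustration of the principle only and is not necessarily the exact definition minimac uses; the function name is hypothetical.

```python
def variance_ratio_rsq(dosages):
    """Observed dosage variance divided by the variance expected for
    true genotypes under Hardy-Weinberg proportions, 2p(1-p).

    Assumption: this is one plausible estimator of the shrinkage idea,
    not a confirmed copy of minimac's formula.
    """
    n = len(dosages)
    if n == 0:
        raise ValueError("need at least one individual")
    p = sum(dosages) / (2.0 * n)          # estimated frequency of the allele
    expected_var = 2.0 * p * (1.0 - p)    # variance of true genotype counts
    if expected_var == 0.0:
        return 0.0                        # monomorphic marker: nothing to estimate
    mean = 2.0 * p                        # expected genotype count
    observed_var = sum((d - mean) ** 2 for d in dosages) / n
    return observed_var / expected_var


# Made-up examples with p = 0.5.
confident = [0.0, 1.0, 1.0, 2.0]    # dosages look like real genotypes
shrunken = [0.5, 1.0, 1.0, 1.5]     # dosages pulled halfway towards 2p = 1
uninformative = [1.0, 1.0, 1.0, 1.0]  # every dosage equals 2p

print(variance_ratio_rsq(confident))
print(variance_ratio_rsq(shrunken))
print(variance_ratio_rsq(uninformative))
```

In this toy example the ratio drops as the dosages move towards 2p, which is the behaviour the estimate relies on.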
Currently, minimac uses the following definition:
Leave One Out Statistics
To evaluate imputation quality, minimac hides the data for each genotyped SNP in turn and calculates three statistics, described below.
looRsq : Estimated R-squared in Leave-One-Out Analysis
This first statistic is calculated by hiding all known genotypes for the SNP, imputing the SNP, and then estimating imputation accuracy. It does not use the known genotypes for the SNP at all.
empR : Correlation Between Imputed and True Genotypes
Whereas the looRsq statistic completely ignores experimental genotypes, empR is based on a comparison of imputed and experimental genotypes. A negative correlation between imputed and experimental genotypes can indicate allele flips.
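To see why a negative correlation points to an allele flip, the sketch below computes a plain Pearson correlation between imputed dosages and experimental genotypes, then repeats the calculation after recoding the experimental genotypes against the other allele. The helper name and all values are hypothetical, and the use of Pearson correlation on genotype counts is an assumption about how such a comparison could be made.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")  # correlation is undefined without variation
    return sxy / math.sqrt(sxx * syy)


# Made-up data: imputed dosages and experimental genotype counts.
imputed = [0.1, 1.0, 1.9, 0.9]
experimental = [0, 1, 2, 1]

# Counting the other allele turns each genotype g into 2 - g.
flipped = [2 - g for g in experimental]

r_matched = pearson_r(imputed, experimental)
r_flipped = pearson_r(imputed, flipped)
print(r_matched, r_flipped)
```

Recoding against the other allele reverses the sign of the correlation without changing its magnitude, so a strongly negative value suggests mismatched allele labels rather than poor imputation.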
empRsq : Squared Correlation Between Imputed and True Genotypes
Whereas the looRsq statistic reports the estimated imputation accuracy, empRsq reports the actual imputation accuracy, as estimated by comparing genotypes generated using imputation (after hiding any known genotypes for the marker) with the previously hidden known genotypes. By comparing empRsq and looRsq it should be possible to tell whether estimates of imputation accuracy are well calibrated.
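One simple way to compare the two statistics is to summarize their per-marker differences. The sketch below is an assumption about how such a check could be done, not a procedure prescribed by minimac; the marker names and values are invented, and the dictionary keys merely reuse the statistic names from this page.

```python
def calibration_summary(markers):
    """Summarize looRsq - empRsq across markers.

    A positive mean difference means looRsq is, on average, higher than
    empRsq (accuracy is overestimated); a negative one means the opposite.
    """
    if not markers:
        raise ValueError("need at least one marker")
    diffs = [m["looRsq"] - m["empRsq"] for m in markers]
    n = len(diffs)
    return {
        "mean_diff": sum(diffs) / n,
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
    }


# Hypothetical per-marker values, made up for illustration only.
markers = [
    {"marker": "snp_a", "looRsq": 0.90, "empRsq": 0.80},
    {"marker": "snp_b", "looRsq": 0.50, "empRsq": 0.60},
]

summary = calibration_summary(markers)
print(summary)
```

A mean difference near zero together with a small mean absolute difference would suggest that the estimated accuracy tracks the empirical accuracy well for the genotyped markers.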