Minimac relies on a two-step approach. First, the samples to be analyzed must be phased into a series of estimated haplotypes. Second, imputation is carried out directly into these phased haplotypes. As newer reference panels become available, only the second step must be repeated.
Pre-phasing - MaCH
A convenient way to haplotype your sample is to use MaCH. A typical MaCH command line to estimate phased haplotypes might look like this:
./mach1 -d sample.dat -p sample.ped --rounds 20 --states 50 --phase --interim 5 --sample 5 --prefix sample.pp | tee mach.log
This will request that MaCH estimate haplotypes for your sample, using 20 iterations of its Markov sampler and conditioning each update on up to 50 haplotypes. A summary description of these parameters follows; for a more complete description, see the MaCH website:
|-d sample.dat|Data file in Merlin format. Markers should be listed according to their order along the chromosome.|
|-p sample.ped|Pedigree file in Merlin format. Alleles should be labeled on the forward strand.|
|--states 50|Number of haplotypes to consider during each update. Increasing this value will typically lead to better haplotypes, but can dramatically increase computing time and memory use. A value of 200 - 400 is typical. If computing time is a constraint, you may consider decreasing this parameter to 100. If you have substantial computing resources, consider increasing this value to 600 or even 800.|
|--rounds 20|Iterations of the Markov sampler to use for haplotyping. Typically, 20-30 rounds should give good results. To obtain better results, it is usually better to increase the |
|--interim 5|Request that intermediate results be saved to disk periodically. These will facilitate analyses in case a run doesn't complete.|
|--sample 5|Request that random (but plausible) sets of haplotypes for each individual be drawn every 5 iterations. This parameter is optional, but for some rare variant analyses, these alternative haplotypes can be very useful. If you are not short of disk space, you should consider enabling this parameter.|
|--phase|Tell MaCH to estimate phased haplotypes for each individual.|
||Reduce memory use at the cost of approximately doubling runtime.|
Imputation into Phased Haplotypes - minimac
Imputing genotypes using minimac is a straightforward process. After you select a set of reference haplotypes, plug in the target haplotypes from the previous step, and set the number of rounds to use for estimating model parameters (which describe the length and conservation of haplotype stretches shared between the reference panel and your study samples), imputation should proceed rapidly. Because marker names can change between dbSNP versions, it is usually a good idea to include an aliases file that provides mappings between earlier marker names and the current preferred name for each polymorphism.
The minimac command line would look like this:
./minimac --refHaps hapmap.hap --refSnps hapmap.snps --haps sample.pp.gz --snps sample.snps --prefix sample.imp | tee minimac.log
A detailed description of all minimac options is available elsewhere. Here is a brief description of the above parameters:
|--refHaps hapmap.hap|Reference haplotypes (e.g. from HapMap or the 1000 Genomes Project).|
|--refSnps hapmap.snps|List of sites in the reference haplotypes; needed unless the reference haplotypes are in VCF format.|
||Specifies that the --refHaps file is in VCF format, in which case no --refSnps file is needed.|
|--snps sample.snps|SNPs in phased haplotypes. These should largely be a subset of the SNPs in the reference panel.|
|--haps sample.pp.gz|Phased haplotypes where missing genotypes will be imputed.|
|--prefix sample.imp|Optionally, a string that is used to help generate output file names.|
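Since the study SNPs should largely be a subset of the reference panel SNPs, it can be worth checking the overlap before starting a run. The following is only a sketch, not part of minimac: the helper name snp_overlap is hypothetical, and it assumes both SNP lists hold one marker name per line in the first column.

```shell
#!/bin/sh
# Hypothetical helper (not part of minimac): report how many study SNPs
# also appear in the reference SNP list.
# Assumption: both files hold one marker name per line (first column).
# Output: "<shared> <total study SNPs> <percent shared>"
snp_overlap() {
    study=$1
    reference=$2
    awk '
        # First file (reference): remember every marker name.
        FILENAME == ARGV[1] { if (NF > 0) ref[$1] = 1; next }
        # Second file (study): count markers and those found in the reference.
        NF > 0 { total++; if ($1 in ref) shared++ }
        END {
            if (total == 0) print "0 0 0.0"
            else printf "%d %d %.1f\n", shared, total, 100 * shared / total
        }
    ' "$reference" "$study"
}

# Example call, using the file names from the command line above:
#   snp_overlap sample.snps hapmap.snps
```

A low percentage may point to marker names that differ between dbSNP versions, which is the situation an aliases file is meant to address.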
You can speed things up by running minimac in parallel using the minimac-omp version. On our cluster, 4 CPUs per minimac run is optimal (--cpus 4).
./minimac-omp --cpus 4 --refHaps hapmap.hap --refSnps hapmap.snps --haps sample.pp.gz --snps sample.snps --prefix sample.imp | tee minimac-omp.log
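The two steps can be tied together in a small wrapper that phases only once and repeats just the imputation step for each reference panel. This is a sketch under stated assumptions rather than a recommended script: the function names, the DRY_RUN switch, and the convention that a panel is stored as PREFIX.hap and PREFIX.snps are all hypothetical, and logging through tee is left out for brevity. The commands themselves are the ones shown above.

```shell
#!/bin/sh
# Sketch of the two-step workflow: phase once with MaCH, then impute with
# minimac for each reference panel.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

phase_once() {
    # Skip pre-phasing when the phased haplotypes already exist.
    if [ -s sample.pp.gz ]; then
        echo "sample.pp.gz found, skipping pre-phasing"
    else
        run ./mach1 -d sample.dat -p sample.ped --rounds 20 --states 50 \
            --phase --interim 5 --sample 5 --prefix sample.pp
    fi
}

impute() {
    # $1 = reference panel prefix (assumed layout: $1.hap and $1.snps)
    # $2 = output prefix
    run ./minimac --refHaps "$1.hap" --refSnps "$1.snps" \
        --haps sample.pp.gz --snps sample.snps --prefix "$2"
}

phase_once
impute hapmap sample.imp
```

When a newer panel arrives, only a call such as impute newpanel sample.new (with a hypothetical panel prefix) would need to be added; the phased haplotypes in sample.pp.gz are reused.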
Imputation quality evaluation
Minimac hides each of the genotyped SNPs in turn and then calculates three statistics:
- looRSQ - this is the estimated rsq for that SNP (as if the SNP weren't typed).
- empR - this is the empirical correlation between true and imputed genotypes for the SNP. If this is negative, the SNP alleles are probably flipped.
- empRSQ - this is the actual R2 value, comparing imputed and true genotypes.
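To make the last two statistics concrete, here is a minimal sketch of a correlation between true and imputed genotypes for one SNP. It is not minimac's own code: the helper emp_stats and its input layout (one individual per line, true allele count followed by imputed dosage) are assumptions for illustration, and the Pearson correlation is used as a stand-in for empR, with its square standing in for empRSQ.

```shell
#!/bin/sh
# Hypothetical helper: Pearson correlation (r) and r squared between
# column 1 (true allele count) and column 2 (imputed dosage).
# Prints "r r2" rounded to three decimals, or "NA NA" if undefined.
emp_stats() {
    awk '
        NF >= 2 {
            n++
            sx += $1; sy += $2
            sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
        }
        END {
            if (n < 2) { print "NA NA"; exit }
            cov = sxy - sx * sy / n     # n times the covariance
            vx = sxx - sx * sx / n      # n times the variance of column 1
            vy = syy - sy * sy / n      # n times the variance of column 2
            if (vx <= 0 || vy <= 0) print "NA NA"
            else {
                r = cov / sqrt(vx * vy)
                printf "%.3f %.3f\n", r, r * r
            }
        }
    ' "$@"
}

# Well-imputed SNP: prints "0.981 0.963"
printf '0 0.1\n1 0.9\n2 1.8\n1 1.2\n' | emp_stats

# Same SNP with the alleles swapped in the imputed data: prints "-0.981 0.963"
printf '0 1.9\n1 1.1\n2 0.2\n1 0.8\n' | emp_stats
```

The second call shows why a negative correlation suggests flipped alleles: the squared value is unchanged, but the sign of the correlation reverses.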
These statistics can be found in the *.info file.
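A quick scan of the info file can list SNPs whose empirical correlation is negative and whose alleles may therefore be flipped. The sketch below rests on assumptions that should be checked against your own output: that the file is whitespace-delimited with a header row, and that the header contains columns named SNP and EmpR. The helper name flag_flipped is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the names of SNPs with a negative EmpR value.
# Assumptions: whitespace-delimited file, header row on line 1,
# header columns named "SNP" and "EmpR".
flag_flipped() {
    awk '
        NR == 1 {
            # Locate the columns by header name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("SNP" in col) || !("EmpR" in col)) {
                print "expected SNP and EmpR columns in header" > "/dev/stderr"
                exit 1
            }
            snp = col["SNP"]; r = col["EmpR"]
            next
        }
        # Ignore non-numeric entries, then keep rows with a negative value.
        $r ~ /^-?[0-9.]+([eE][-+]?[0-9]+)?$/ && $r + 0 < 0 { print $snp }
    ' "$1"
}

# Example call, assuming the output prefix used above:
#   flag_flipped sample.imp.info
```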
Be aware that, unfortunately, imputation quality statistics are not directly comparable between different imputation programs (MaCH/minimac vs. Impute vs. Beagle etc.).
If you use minimac, please cite:
Howie B, Fuchsberger C, Stephens M, Marchini J, and Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 2012