Minimac: Tutorial

From Genome Analysis Wiki

Latest revision as of 12:34, 25 January 2017

Before reading this tutorial, you might find it useful to spend a few minutes reading through the main Minimac and Minimac2 documentation.

Getting Started

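Download MaCH (http://csg.sph.umich.edu/abecasis/MaCH/download/) and Minimac (http://genome.sph.umich.edu/wiki/Minimac#Download) or Minimac2 (http://genome.sph.umich.edu/wiki/Minimac2#Download). Example data used in this tutorial can be found at http://csg.sph.umich.edu/cfuchsb/minimac2_example.tgz.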

Minimac and Minimac2 Imputation

Minimac and Minimac2 rely on a two-step approach. First, the samples to be analyzed must be phased into a series of estimated haplotypes. Second, imputation is carried out directly into these phased haplotypes. As newer reference panels become available, only the second step must be repeated.
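The two-step structure can be sketched as a dry run that only prints the commands. The command lines and file names are the ones used later in this tutorial; the function names and the "newpanel" reference prefix are hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# phase_cmd, impute_cmd and "newpanel" are illustrative names, not part of MaCH or minimac.

phase_cmd() {
    # Step 1: pre-phasing with MaCH (done once for the study sample)
    echo "./mach1 -d sample.dat -p sample.ped --rounds 20 --states 50 --phase --interim 5 --sample 5 --prefix sample.pp"
}

impute_cmd() {
    # Step 2: imputation with minimac; $1 = reference panel prefix, $2 = output prefix
    echo "./minimac --refHaps $1.hap --refSnps $1.snps --haps sample.pp.gz --snps sample.snps --prefix $2"
}

phase_cmd
impute_cmd hapmap sample.imp

# When a newer reference panel becomes available, only step 2 is repeated.
impute_cmd newpanel sample.newpanel.imp
```

Keeping the phased haplotypes (sample.pp.gz) on disk is what makes the second call cheap: the phasing step is not run again.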

Pre-phasing - MaCH

A convenient way to haplotype your sample is to use MaCH. A typical MaCH command line to estimate phased haplotypes might look like this:

./mach1 -d sample.dat -p sample.ped --rounds 20 --states 50 --phase --interim 5 --sample 5  --prefix sample.pp | tee mach.log

This requests that MaCH estimate haplotypes for your sample, using 20 iterations of its Markov sampler and conditioning each update on up to 50 haplotypes. A summary of these parameters follows; for a more complete description, see the MaCH website (http://csg.sph.umich.edu/abecasis/MaCH/):

Parameter | Description
-d sample.dat | Data file in Merlin format (http://csg.sph.umich.edu/abecasis/Merlin/tour/input_files.html). Markers should be listed according to their order along the chromosome.
-p sample.ped | Pedigree file in Merlin format (http://csg.sph.umich.edu/abecasis/Merlin/tour/input_files.html). Alleles should be labeled on the forward strand.
--states 200 | Number of haplotypes to consider during each update. Increasing this value will typically lead to better haplotypes, but can dramatically increase computing time and memory use. A value of 200-400 is typical. If computing time is a constraint, you may consider decreasing this parameter to 100. If you have substantial computing resources, consider increasing this value to 600 or even 800.
--rounds 20 | Iterations of the Markov sampler to use for haplotyping. Typically, 20-30 rounds should give good results. To obtain better results, it is usually more effective to increase the --states parameter than to add rounds.
--interim 5 | Request that intermediate results be saved to disk periodically. These will facilitate analyses in case a run doesn't complete.
--sample 5 | Request that random (but plausible) sets of haplotypes for each individual be drawn every 5 iterations. This parameter is optional, but for some rare variant analyses these alternative haplotypes can be very useful. If you are not short of disk space, you should consider enabling this parameter.
--phase | Tell MaCH to estimate phased haplotypes for each individual.
--compact | Reduce memory use at the cost of approximately doubling runtime.

Imputation into Phased Haplotypes - minimac(2)

Imputing genotypes using minimac(2) is a straightforward process: after selecting a set of reference haplotypes, plugging in the target haplotypes from the previous step, and setting the number of rounds to use for estimating model parameters (which describe the length and conservation of haplotype stretches shared between the reference panel and your study samples), imputation should proceed rapidly.

Minimac needs a file listing the variants in your sample. If your directory already includes a "sample.snps" file, you can skip this step. If it doesn't, you can generate one from "sample.dat" with the following command:

 cut -f 2 -d " " sample.dat > sample.snps
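As a small illustration of what this command does, the sketch below runs it on a made-up data file. It assumes the Merlin convention that each line of the data file holds an item type followed by a name, with "M" marking a marker row; the marker and trait names are invented. The awk variant is an alternative for data files that also list non-marker items or use tabs, not something the tutorial itself requires.

```shell
# Build a tiny Merlin-style data file in a scratch directory (all names are made up).
workdir=$(mktemp -d)
printf 'A affection\nM rs0001\nM rs0002\n' > "$workdir/sample.dat"

# The command from the tutorial: keep the second space-delimited field of every line.
cut -f 2 -d " " "$workdir/sample.dat" > "$workdir/sample.snps"
cat "$workdir/sample.snps"          # also lists "affection", which is not a marker

# Alternative sketch: keep only marker rows ("M"); awk also accepts tab-separated fields.
awk '$1 == "M" { print $2 }' "$workdir/sample.dat" > "$workdir/sample.markers.snps"
cat "$workdir/sample.markers.snps"
```

If your data file contains only marker rows separated by single spaces, both commands produce the same list.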

The minimac command line would look like this:

./minimac --refHaps hapmap.hap --refSnps hapmap.snps --haps sample.pp.gz --snps sample.snps --prefix sample.imp | tee minimac.log

or

./minimac2 --refHaps hapmap.hap --refSnps hapmap.snps --haps sample.pp.gz --snps sample.snps --prefix sample2.imp | tee minimac2.log


A detailed description of all minimac(2) options is available elsewhere. Here is a brief description of the above parameters:

Parameter | Description
--refHaps hapmap.hap | Reference haplotypes (e.g. from HapMap or the 1000 Genomes Project).
--refSnps hapmap.snps | List of sites in the reference haplotypes; needed unless the reference haplotypes are in VCF format.
--vcfReference | Specifies that the --refHaps file is in VCF format; no --refSnps file is needed.
--snps chr.snps | SNPs in phased haplotypes. These should largely be a subset of the SNPs in the reference panel.
--haps chr.haps.gz | Phased haplotypes where missing genotypes will be imputed.
--prefix imputed | Optionally, a string that is used to help generate output file names.
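Since the markers in the --snps file should largely be a subset of the reference panel, a quick overlap count before imputation can catch naming mismatches. The following is a minimal sketch, assuming both files hold one marker name per line with identical naming; the marker names and the tiny files are made up for illustration.

```shell
# Toy marker lists in a scratch directory (names are invented).
workdir=$(mktemp -d)
printf 'rs0001\nrs0002\nrs0003\nrs0004\n' > "$workdir/hapmap.snps"
printf 'rs0001\nrs0003\nrs9999\n' > "$workdir/sample.snps"

# Count sample markers that also appear in the reference list
# (-F fixed strings, -x whole-line match, -f patterns read from a file).
total=$(awk 'END { print NR }' "$workdir/sample.snps")
shared=$(grep -c -F -x -f "$workdir/hapmap.snps" "$workdir/sample.snps")
echo "$shared of $total sample markers are in the reference panel"

# List the sample markers that are missing from the reference panel.
grep -v -F -x -f "$workdir/hapmap.snps" "$workdir/sample.snps"
```

A low overlap usually points to a naming or build mismatch between the two lists rather than to genuinely novel markers, so it is worth checking before starting a long run.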


You can speed things up by running minimac in parallel with the minimac-omp or minimac2-omp version. On our cluster, 4 CPUs per minimac(2) run is optimal (--cpus 4).

./minimac-omp --cpus 4 --refHaps hapmap.hap --refSnps hapmap.snps --haps sample.pp.gz --snps sample.snps --prefix sample.imp | tee minimac-omp.log

or

./minimac2-omp --cpus 4 --refHaps hapmap.hap --refSnps hapmap.snps --haps sample.pp.gz --snps sample.snps --prefix sample2.imp | tee minimac-omp.log

Imputation quality evaluation

Minimac hides each of the genotyped SNPs in turn and then calculates 3 statistics:

  • looRSQ - this is the estimated rsq for that SNP (as if the SNP weren't typed).
  • empR - this is the empirical correlation between true and imputed genotypes for the SNP. If this is negative, the SNP alleles are probably flipped.
  • empRSQ - this is the actual R² value, comparing imputed and true genotypes.
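To make the relationship between empR and empRSQ concrete, the sketch below computes an ordinary Pearson correlation and its square for one SNP. It assumes that "empirical correlation" means Pearson correlation between the true genotype (coded as an allele count of 0, 1 or 2) and the imputed dosage; it is an illustration, not minimac's own code, and the data are invented so that the imputed allele is counted on the opposite allele.

```shell
workdir=$(mktemp -d)
# Column 1: true genotype (allele count); column 2: imputed dosage.
# The toy dosages equal 2 - genotype, i.e. the alleles are flipped.
printf '0 2\n1 1\n2 0\n1 1\n0 2\n' > "$workdir/snp.txt"

pearson() {
    # Pearson correlation from running sums; $1 = two-column input file
    awk '{ n++; sx += $1; sy += $2; sxx += $1*$1; syy += $2*$2; sxy += $1*$2 }
         END {
             r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
             printf "empR=%.3f empRSQ=%.3f\n", r, r*r
         }' "$1"
}

pearson "$workdir/snp.txt"   # prints: empR=-1.000 empRSQ=1.000
```

The squared value looks perfect even though the alleles are swapped, which is why the sign of empR is the statistic to check for flips.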

These statistics can be found in the *.info file.
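A quick way to list SNPs whose empirical correlation is negative is to look the column up by its header name. The sketch below makes several assumptions that you should check against your own output: that the info file has a header row, that the SNP name is in the first column, that the correlation column is called "EmpR", and that markers without a value carry a non-numeric placeholder. The excerpt is hypothetical and heavily trimmed.

```shell
workdir=$(mktemp -d)
# Hypothetical two-column excerpt of an info file; real files have more columns.
printf 'SNP\tEmpR\nrs0001\t0.982\nrs0002\t-0.957\nrs0003\t-\n' > "$workdir/sample.imp.info"

flag_flipped() {
    # $1 = info file, $2 = header name of the empirical correlation column (assumed)
    awk -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c ~ /^-[0-9.]/ { print $1 }   # negative numeric values only
    ' "$1"
}

flag_flipped "$workdir/sample.imp.info" EmpR   # prints: rs0002
```

Passing the column name as an argument keeps the sketch usable if your header spells it differently.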

Be aware that, unfortunately, imputation quality statistics are not directly comparable between different imputation programs (MaCH/minimac vs. Impute vs. Beagle etc.).

Reference

If you use minimac or minimac2, please cite:

Howie B, Fuchsberger C, Stephens M, Marchini J, and Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 2012.

Questions and Comments

Please contact Goncalo Abecasis, Christian Fuchsberger (minimac) or Yun Li (MaCH).