Changes

From Genome Analysis Wiki
7,426 bytes added ,  17:01, 30 January 2013
Before reading this tutorial, you might find it useful to spend a few minutes reading through the main [[Minimac]] documentation.

== Getting Started ==

=== Example Data ===

# GWAS data
# Reference haplotypes


== Minimac Imputation ==

[[Minimac]] relies on a two-step approach. First, the samples to be analyzed are phased into a series of estimated haplotypes. Second, imputation is carried out directly into these phased haplotypes. As newer reference panels become available, only the second step needs to be repeated.

=== Pre-phasing - MaCH ===

A convenient way to haplotype your sample is to use MaCH. A typical MaCH command line to estimate phased haplotypes might look like this:

 mach1 -d chr1.dat -p chr1.ped --rounds 20 --states 200 --phase --interim 5 --sample 5 --prefix chr1.haps

This requests that MaCH estimate haplotypes for your sample, using 20 iterations of its Markov sampler and conditioning each update on up to 200 haplotypes. The parameters are summarized below; for a more complete description, see the [http://www.sph.umich.edu/csg/abecasis/MaCH/ MaCH website]:

{| class="wikitable" border="1" cellpadding="2"
|- bgcolor="lightgray"
! Parameter
! Description
|-
|style=white-space:nowrap|<code>-d sample.dat</code>
| Data file in [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin format]. Markers should be listed according to their order along the chromosome.
|-
| <code>-p sample.ped</code>
| Pedigree file in [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin format]. Alleles should be labeled on the forward strand.
|-
| <code>--states 200</code>
| Number of haplotypes to consider during each update. Increasing this value will typically lead to better haplotypes, but can dramatically increase computing time and memory use. A value of 200-400 is typical. If computing time is a constraint, you may consider decreasing this parameter to 100. If you have substantial computing resources, consider increasing this value to 600 or even 800.
|-
| <code>--rounds 20</code>
| Iterations of the Markov sampler to use for haplotyping. Typically, 20-30 rounds should give good results. To improve results further, it is usually more effective to increase the <code>--states</code> parameter than to add more rounds.
|-
| <code>--interim 5</code>
| Request that intermediate results should be saved to disk periodically. These will facilitate analyses in case a run doesn't complete.
|-
| <code>--sample 5</code>
| Request that random (but plausible) sets of haplotypes for each individual should be drawn every 5 iterations. This parameter is optional, but for some rare variant analyses, these alternative haplotypes can be very useful. If you are not short of disk space, you should consider enabling this parameter.
|-
| <code>--phase</code>
| Tell [[MaCH]] to estimate phased haplotypes for each individual.
|-
| <code>--compact</code>
| Reduce memory use at the cost of approximately doubling runtime.
|}


This step can be run in parallel, one job per chromosome. On our cluster we would use:

<source lang="text">
foreach chr (`seq 1 22`)

runon -m 4096 mach1 -d chr$chr.dat -p chr$chr.ped --rounds 20 --states 200 --phase --interim 5 --sample 5 --prefix chr$chr.haps

end
</source>
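
If your cluster has no equivalent of our <code>runon</code> job launcher, the same per-chromosome loop can be sketched in Python. This is only a minimal sketch and not part of MaCH: the helper names <code>mach_command</code> and <code>run_all</code>, the worker count, and the use of <code>mach1</code> as the executable name are assumptions made for illustration, and the file names simply follow the loop above. By default it prints the commands without running them.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def mach_command(chrom, states=200, rounds=20):
    """Build the MaCH pre-phasing command for one chromosome (hypothetical helper)."""
    prefix = f"chr{chrom}"
    return [
        "mach1",  # assumed executable name; adjust to your installation
        "-d", f"{prefix}.dat",
        "-p", f"{prefix}.ped",
        "--rounds", str(rounds),
        "--states", str(states),
        "--phase",
        "--interim", "5",
        "--sample", "5",
        "--prefix", f"{prefix}.haps",
    ]


def run_all(chromosomes, workers=4, dry_run=True):
    """Print the commands (dry run) or run them, at most `workers` at a time."""
    commands = [mach_command(c) for c in chromosomes]
    if dry_run:
        for command in commands:
            print(shlex.join(command))
        return commands
    # Each thread only waits on an external process, so threads are sufficient.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return commands


if __name__ == "__main__":
    run_all(range(1, 23))
```

Calling <code>run_all(range(1, 23), dry_run=False)</code> would launch the jobs. Choose <code>workers</code> with the memory needs of <code>--states</code> in mind, since each job holds its own copy of the data.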

=== Imputation into Phased Haplotypes - minimac ===

Imputing genotypes using '''minimac''' is a straightforward process. You select a set of reference haplotypes, plug in the target haplotypes from the previous step, and set the number of rounds to use for estimating model parameters (which describe the length and conservation of haplotype stretches shared between the reference panel and your study samples). Imputation should then proceed rapidly. Because marker names can change between dbSNP versions, it is usually a good idea to include an ''aliases'' file that maps earlier marker names to the current preferred name for each polymorphism.

A typical minimac command line might look like this:

==== Using a VCF reference panel ====

 minimac --vcfReference --refHaps ref.vcf.gz --haps target.hap.gz --snps target.snps.gz --rounds 5 --states 200 --prefix results

Note: GWAS SNPs (the file given by <code>--snps target.snps.gz</code>) are by default expected to be named in chr:pos format (e.g. 1:1000) and to be on build 37/hg19. Otherwise, set the <code>--rs</code> flag and supply an aliases file with <code>--snpAliases</code>, e.g. [http://www.sph.umich.edu/csg/abecasis/downloads/dbsnp134-merges.txt.gz dbsnp134-merges.txt.gz].
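
Before launching minimac, it can help to check which naming scheme your SNP list actually uses. The following is a minimal sketch under stated assumptions: the function names and the two regular expressions are illustrative and not part of minimac, and the aliases option is spelled <code>--snpAliases</code> here, which you should confirm against the usage message of your minimac version.

```python
import re

CHR_POS = re.compile(r"^[0-9A-Za-z]+:[0-9]+$")  # e.g. 1:1000
RS_ID = re.compile(r"^rs[0-9]+$")               # e.g. rs12345


def classify_snp_names(names):
    """Count how many marker names look like chr:pos, rs IDs, or neither."""
    counts = {"chr:pos": 0, "rs": 0, "other": 0}
    for name in names:
        name = name.strip()
        if not name:
            continue  # ignore blank lines
        if CHR_POS.match(name):
            counts["chr:pos"] += 1
        elif RS_ID.match(name):
            counts["rs"] += 1
        else:
            counts["other"] += 1
    return counts


def extra_minimac_flags(counts, aliases_file="dbsnp134-merges.txt.gz"):
    """Suggest extra flags: none for chr:pos names, --rs plus aliases for rs IDs."""
    if counts["rs"] == 0 and counts["other"] == 0:
        return []
    if counts["chr:pos"] == 0 and counts["other"] == 0:
        return ["--rs", "--snpAliases", aliases_file]
    raise ValueError(f"mixed or unrecognised marker names: {counts}")


names = ["1:1000", "1:2500", "1:9000"]
print(extra_minimac_flags(classify_snp_names(names)))
```

A mixed file raises an error on purpose, because silently guessing the naming scheme would hide markers that minimac cannot match to the reference panel.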


A detailed description of all minimac options is available [[Minimac Command Reference|elsewhere]]. Here is a brief description of the above parameters:

{| class="wikitable" border="1" cellpadding="2"
|- bgcolor="lightgray"
! Parameter
! Description
|-
| <code>--refHaps ref.hap.gz </code>
| Reference haplotypes (e.g. from [http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G-2010-06.html MaCH download page])
|-
| <code>--vcfReference </code>
| Specifies that the <code>--refHaps</code> file is in VCF format; no <code>--refSNPs</code> file is needed.
|-
| <code>--snps chr.snps </code>
| SNPs in phased haplotypes. These should largely be a subset of the SNPs in the reference panel.
|-
| <code>--haps chr.haps.gz </code>
| Phased haplotypes where missing genotypes will be imputed.
<!-- |-
| <code>--rounds 5</code>
| Rounds of optimization for model parameters, which describe population recombination rates and per SNP error rates.
|-
| <code>--states 200</code>
| Maximum number of reference (or target) haplotypes to be examined during parameter optimization.
!-->
|-
| <code>--prefix imputed</code>
| Optionally, a string that is used to help generate output file names.
|}
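
Since the SNPs in the phased haplotypes should largely be a subset of the reference panel, a quick overlap check can catch naming or build mismatches early. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that the target SNP list has one chr:pos name per line and that the VCF uses the same chromosome labels (for example <code>1</code> rather than <code>chr1</code>).

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def vcf_positions(lines):
    """Yield chr:pos names from VCF lines, skipping header lines that start with '#'."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]  # CHROM and POS are the first two columns
        yield f"{chrom}:{pos}"


def overlap_fraction(target_names, reference_names):
    """Fraction of target SNP names that also occur in the reference panel."""
    target = {name.strip() for name in target_names if name.strip()}
    if not target:
        raise ValueError("no target SNPs given")
    reference = set(reference_names)
    return len(target & reference) / len(target)


# Example use with the file names from the command line above:
# with open_text("target.snps.gz") as snps, open_text("ref.vcf.gz") as vcf:
#     print(overlap_fraction(snps, vcf_positions(vcf)))
```

A fraction well below 1 suggests that the marker names or the genome build of the two files do not agree.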


You can speed things up by running the multiprocessor version, [http://genome.sph.umich.edu/wiki/Minimac#Multiprocessor_Version minimac-omp]. On our cluster, 4 CPUs per minimac job (<code>--cpus 4</code>) works best.

<source lang="text">
foreach chr (`seq 1 22`)

runon -m 1024 minimac-omp --cpus 4 --refHaps ref.hap.$chr.gz --vcfReference \
--haps chr$chr.haps.gz --snps chr$chr.snps --prefix chr$chr.imputed
end
</source>


== Imputation quality evaluation ==
Minimac hides each of the genotyped SNPs in turn, imputes it as if it had not been typed, and then calculates three statistics:
* looRSQ - the estimated R<sup>2</sup> for that SNP (as if the SNP had not been genotyped).
* empR - the empirical correlation between true and imputed genotypes for the SNP. A negative value suggests that the SNP alleles are flipped.
* empRSQ - the actual R<sup>2</sup> value, comparing imputed and true genotypes.

These statistics can be found in the *.info file.

Be aware that, unfortunately, imputation quality statistics are not directly comparable between different imputation programs (MaCH/minimac vs. Impute vs. Beagle etc.).
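
To make the meaning of empR and empRSQ concrete, here is a small worked sketch on toy data. It assumes that empR is a Pearson correlation between observed allele counts and imputed dosages and that empRSQ is its square, which is how the list above reads; minimac's exact computation may differ, and the function name and data are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data: observed allele counts (0, 1 or 2) and imputed dosages for one SNP.
true_genotypes = [0, 1, 2, 1, 0, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.2, 0.2, 1.9]

emp_r = pearson_r(true_genotypes, imputed_dosages)
emp_rsq = emp_r ** 2

# Counting the other allele instead (2 - dosage) flips the sign of the correlation.
flipped_r = pearson_r(true_genotypes, [2 - d for d in imputed_dosages])
print(emp_r, emp_rsq, flipped_r)
```

The last comparison shows why a strongly negative empR points to flipped alleles: the imputation itself is accurate, but the two data sets count opposite alleles.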

== Reference ==

If you use minimac, please cite:

Howie B, Fuchsberger C, Stephens M, Marchini J, and Abecasis GR.
Fast and accurate genotype imputation in genome-wide association studies
through pre-phasing. Nature Genetics 2012 [http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.2354.html]

== Questions and Comments ==

Please contact [mailto:goncalo@umich.edu Goncalo Abecasis], [mailto:cfuchsb@umich.edu Christian Fuchsberger (minimac)] or [mailto:yunli@med.unc.edu Yun Li (MaCH)].