MaCH FAQ

How to speed up?

minimac

This is the new 2-step procedure we are recommending, particularly considering people that are performing imputation multiple times (using HapMap as reference, or using updated releases of the 1000 Genomes data as reference).
The first step is a pre-phasing step using MaCH. This step does not need external reference. This is a time-consuming step BUT is a one-time investment. For computational reason, we recommend breaking the genome into small overlapping segments ( Divide-and-Conquer) for this step. In general, we recommend >500Kb overlapping region on each side. For example, for Affymetrix 6.0 panel, if we use core region of 10Mb and flanking/overlapping region of 1Mb on each side, it will correspond to ~3500 SNps in the core region and ~350 SNPs on each side. For 2000 individuals, one job with ~4,200 SNPs running with --states 200 and -r 50, this would take ~40 hours. For other combinations, using the following link to estimate computing time runtime estimate.
The second step is the actual imputation step using minimac. This step can run on whole chromosomes. Regarding computing time, one million markers for 1000 individuals using 100 reference haplotypes takes ~ 1 hour; and computing time increases linearly with all the above three parameters. See minimac for details.

MaCH-Admix

If you are doing imputation only a few (<5) times (think twice if this is really true) or under an immediate time pressure, you can use MaCH-Admix, which does not require pre-phased data and takes ~1/7 of the computing time of that typically needed for pre-phasing. For large dataset, we recommend breaking the genome into small overlapping segments ( Divide-and-Conquer). Details see MaCH-Admix.

Divide and Conquer

See MaCH Divide and Conquer for details.

2-step imputation

See 2-step imputation for details.

Why and how to perform a 2-step imputation?

When one has a large number of individuals (>1000), we recommend a 2-step imputation to speed up.

A 2-step imputation contains the following 2 steps:

(step 1) a representative subset of >= 200 unrelated individuals are used to calibrate model parameters; and
(step 2) actual genotype imputation is performed for every person using parameters inferred in step 1.

Example command lines for a 2-step imputation:

# step 1:
mach1 -d sample.dat -p subset.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip -r 100 -o par_infer > mach.infer.log

# step 2:
mach1 -d sample.dat -p sample.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip --errorMap par_infer.erate --crossoverMap par_infer.rec --mle --mldetails > mach.imp.log

In step1, one can use --greedy in combination with --states XX in MaCH versions 16.b and above. We have found that using 1/3 of the reference haplotypes (with 1/9 computational time) results in almost no power loss for the current HapMap and 1000G reference panels.

In step2, each individual is imputed independently and can therefore be split into as many as n (sample size) jobs for each chromosome for parallelism.

For other approaches to speed up, see [how to speed up].

Can MaCH perform imputation for chromosome X?

Yes. See MaCH X Chromosome for details.

Where can I find combined HapMap reference files?

You can find them at http://csg.sph.umich.edu//yli/mach/download/HapMap-r21.html or on the HapMap Project website.

Where can I find HapMap III / 1000 Genomes reference files?

You can find these at the MaCH download page, which is at http://csg.sph.umich.edu//yli/mach/download/

Does --mle overwrite input genotypes?

Yes, but not often. The --mle option outputs the most likely genotype configuration taking into account observed genotypes and integration over the most similar reference haplotypes. The original genotypes will be changed only if the underlying reference haplotypes strongly contradict the input genotype.

How do I get imputation quality estimates?

A simple approach is to use --mask option (in the second step alone if using two-step imputation). For example, --mask 0.02 masks 2% of the genotypes at random, impute them and compare with the masked original to estimate genotypic and allelic error rates. Messages like the following will be generated to stdout:

 Comparing 948352 masked genotypes with MLE estimates ...
 Estimated per genotype error rate is 0.0568
 Estimated per allele error rate is 0.0293

A better approach is to mask a small proportion of SNPs (vs. genotypes in the above simple approach). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2 without duplicating the .ped file. Post-imputation, one can use CalcMatch and doseR2.pl to estimate genotypic/allelic error rate and correlation respectively. Both programs can be downloaded from http://csg.sph.umich.edu//ylwtx/software.html.

Warning: Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.

How do I interpret the imputation quality estimates?

In the simple approach, you will only get concordance/error estimates. There are two aspects to check. (1) the ratio between the genotypic error and allelic error. We expect that only a small proportion of errors where one homozygote is imputed as the other homozygote. Therefore, a ~2:1 ratio is expected. (2) the absolute error rate. There are several factors influencing imputation quality including the population to be imputed, the reference population and the genotyping panel used. Typically, we expect <2% allelic error rate among Caucasians and East Asians; 3-5% among Africans and African Americans. Figure below show imputation quality from the Human Genome Diversity Project (HGDP) for 52 populations across the world and by different HapMap reference panel.

http://csg.sph.umich.edu//yli/figure3.gif

Table 3 in the MaCH 1.0 paper tabulates imputation quality by commercial panel in CEU, YRI, and CHB+JPT.

Shall I apply QC before or after imputation? If so, how?

We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters including HWE, MAF (recommended cutoff is 1% for genotyping-based GWAS), completeness, Mendelian inconsistency etc. Post-imputation, we recommend Rsq 0.3 (which removes >70% of poorly-imputed SNPs at the cost of <0.5% well-imputed SNPs) and MAF of 1%.

How do I get reference files for an region of interest?

Note that you do not need to extract regional pedigree files for your own samples because SNPs in pedigree but not in reference will be automatically discarded.
1. For HapMapII format, download haplotypes from http://csg.sph.umich.edu//ylwtx/HapMapForMach.tgz
2. For MACH format, you can do the following:

First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position.
Then, under csh:

@ first = `grep -nw rsFIRST orig.snps | cut -f1 -d ':'`
@ last = `grep -nw rsLAST orig.snps | cut -f1 -d ':'`

under bash:

first=`grep -nw rsFIRST orig.snps | cut -f1 -d ':'`
last=`grep -nw rsLAST orig.snps | cut -f1 -d ':'`

Then find out the field that contains the actual haplotypes, where alleles are separated by whitespace

 head -1 orig.hap | wc -w

Note: if the haplotypes are gz compressed, do:

 zcat orig.hap.gz | head -1 | wc -w

Finally (say you got 3 from the above wc -w command. If you got other numbers, replace the 3 in bold below with the number you got):

 awk '{print $3}' orig.hap | cut -c${first}-${last} > region.hap

Note: if the haplotypes are gz compressed, do:

 zcat orig.hap.gz | awk '{print $3}' | cut -c${first}-${last} > region.hap

The created reference files are in MaCH format. You do NOT need to turn on --hapmapFormat option.

Do I always have to sort the pedigree file by marker position?

If you use a reference set of haplotypes, you do not have to as long as the external reference is in correct order.

What if I specify --states R where R exceeds the maximum possible (2*number diploid individuals - 2 + number_haplotypes)?

Mach caps the number of states at the maximum possible value.

How is AL1 defined? Which allele dosage is .dose/.mldose counting?

AL1 is an arbitrary allele. Typically, it is the first allele read in the reference haplotypes. The earliest versions (prior to April 2007) of mach counted the expected number copies of AL2 and more recent versions count the number of AL1. One can find out which allele is counted following the steps below.

. First, find the two alleles for one of the markers in your data

 prompt> head -2 mlinfo/chr21.mlinfo 
 SNP      Al1 Al2 Freq1   MAF    Quality  Rsq 
 rs885550 2   4   0.9840  0.0160  0.9682  0.992

. Second, check the dosage for a few individuals at this SNP.

 prompt> head -3 mldose/chr21.mldose | cut -f3 -d ' ' 
 1.962 
 1.000
 0.078

. Finally, compare these dosages to genotypes.

 prompt> head -1 mlgeno/chr21.mlgeno | cut -f3 -d ' ' 
 2/2 
 2/4
 4/4

In this example, you can see that the first individual has a high dosage count (1.962) and most likely genotype 2/2. The last individual has a low dosage count and most likely genotype 4/4. Thus, the output corresponds to version of Mach released after April 2007, which should tally allele 1 counts.

Note that, on the example above, .mldose could be replaced with .dose and .mlgeno could be replaced with .geno.

Based on the three files above, we've confirmed that dosage is the number of AL1 copies: you will only to check for one informative case (i.e, dosage values close to 0 or 2) since it's consistent across all individuals and all SNPs.

Can I used an unphased reference?

Yes. You could create pedigree (.ped) and data files (.dat) that include both reference panel and sample genotypes or request that MaCH merge apppropriate files on the fly.

For example, if you have:

reference.dat

M SNP1
M SNP2
M SNP3
M SNP4
M SNP5

reference.ped

REF1 REF1 0 0 1 A/C C/C G/G G/A A/A

sample.dat

M SNP1
M SNP4

sample.ped

1 1 0 0 1 A/A G/G

Your could create a combined data set as:

comb.dat

M SNP1
M SNP2
M SNP3
M SNP4
M SNP5

comb.ped

REF1 REF1 0 0 1 A/C C/C G/G G/A A/A

  1    1 0 0 1 A/A ./. ./. G/G ./.

Equivalently, you could write -d reference.dat,sample.dat -p reference.ped,sample.ped on the command line and MACH would merge both files on-the-fly.

How big are the imputation output file?

For 1,000 individuals with 8 million SNPs, gz compressed geno/dose/prob files take ~5Gb/10Gb/15Gb.

How long does imputation take?

The following factors/parameters affect computational time:

m, # of genotyped markers (number of markers in .dat file)
n, # of individuals
h, # of reference haplotypes (determined by --greedy or states, by default, h = 2*number diploid individuals - 2 + number_haplotypes)
r, # of rounds (-r or --rounds, --mle corresponds to 1-2 rounds)

Computational time increases linearly with m, n, r and quadratically with h. On our Xeon 3.0GHz machine, imputation with m=25K, n=250, h=120, and r=100 takes ~20 hours (25000*250*120^2*100/4.5/10^11).

If you have a larger number of individuals to impute (e.g., > 1,000), we recommend a 2-step imputation manner http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F.

undefined symbol: gzopen64

If you see this message, you will need to re-compile the program. Type the following commands:

 make clear
 make all

New executables mach1 and thunder will then be generated under folder executables/

Install MaCH

We have source codes available through the MaCH download page: http://csg.sph.umich.edu//yli/mach/download/