Difference between revisions of "MaCH"

From Genome Analysis Wiki
(41 intermediate revisions by 3 users not shown)
[http://www.sph.umich.edu/csg/yli/mach/ '''MaCH'''] (MArkov Chain Haplotyping), best known as software for genotype imputation, is a Hidden Markov Model (HMM) based haplotyper that can reconstruct haplotypes from the genotypes of unrelated individuals. The three primary uses of MaCH are to (1) resolve haplotypes from diploid genotypes; (2) impute missing genotypes; and (3) perform disease mapping analysis.  
'''MaCH''' is a tool for haplotyping, genotype imputation and disease association analysis developed by Goncalo Abecasis and Yun Li. MaCH was first used to impute missing genotypes in our FUSION genomewide association study ([http://www.sph.umich.edu/csg/abecasis/publications/17463248.html Scott et al., ''Science'', 2007]) and has since been used in the analysis of many other GWAS.  
== Input Files  ==
This page includes links to several useful MaCH related resources.
MaCH takes unphased genotypes of unrelated individuals as input. Two input files are mandatory: a pedigree file and a marker information file. The pedigree file stores five key pieces of information and the genotypes for each individual; missing genotypes are accepted and additional phenotypes are allowed. The marker information file provides the list of marker names. Note that this list must be ordered according to the physical positions of the markers along the chromosomes. For more details, refer to http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html. <br>
* The main MaCH webpage at http://www.sph.umich.edu/csg/abecasis/MaCH/
* The MaCH download page, with source code, executables and reference haplotype files at http://www.sph.umich.edu/csg/abecasis/MaCH/download/
* The MaCH tutorial at http://www.sph.umich.edu/csg/abecasis/MaCH/tour/
* [[MaCH FAQ|The MaCH FAQ]]
* [[MaCH Options|MaCH Options]]
* [[MaCH: Input Files|Information on MaCH input formats]]
* [[MaCH: 1000 Genomes Imputation Cookbook|1000 Genomes Imputation Cookbook]]
* [[MaCH: machX|Chromosome X Imputation]]
* [[Mach2dat: Association with MACH output]]
=== '''Pedigree File (mandatory)'''<br>  ===
Currently, it also includes random notes on input file formats, but these probably need to be cleaned up!
Each person contributes one line to a pedigree file. Required fields are (1) the first five fixed fields, corresponding to five key pieces of information (namely: family person father mother sex), and (2) genotype fields. Phenotype fields are allowed in between but will not be used by the program. <br>
&lt;sample.ped&gt; <br>
  fam1 indiv1 0 0 1 0/0 2/3 ./.
  fam2 indiv2 0 0 2 1/2 2/2 1/4
&lt;EOF sample.ped&gt; <br>
This sample.ped contains 2 individuals. The first individual is from family fam1 with person ID indiv1 and no parental information available (father = 0, mother = 0). This person is a male (sex = 1). His genotypes are missing at the first and third markers (0/0 and ./.), and his genotype is 2/3 (C/G) at the second marker. Similarly, the second individual is from family fam2 with person ID indiv2 and no parental information available (father = 0, mother = 0). This person is a female (sex = 2). Her genotypes are 1/2 (A/C) at the first locus, 2/2 (homozygous for C) at the second locus and 1/4 (A/T) at the third locus.
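The field layout described above can be illustrated with a minimal Python sketch. The function name `parse_ped_line` and the dictionary keys are hypothetical, and the sketch assumes that no phenotype fields sit between the five fixed fields and the genotypes, and that each genotype is written as a single `a/b` token as in sample.ped.

```python
def parse_ped_line(line):
    """Split one pedigree line into its five fixed fields and its genotypes."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected five fixed fields plus at least one genotype")
    family, person, father, mother, sex = fields[:5]
    genotypes = []
    for token in fields[5:]:
        allele1, allele2 = token.split("/")
        # In the example above, both 0/0 and ./. denote a missing genotype.
        missing = allele1 in ("0", ".") or allele2 in ("0", ".")
        genotypes.append(None if missing else (allele1, allele2))
    return {
        "family": family,
        "person": person,
        "father": father,
        "mother": mother,
        "sex": sex,
        "genotypes": genotypes,
    }


# First line of sample.ped: genotypes are missing at markers 1 and 3.
record = parse_ped_line("fam1 indiv1 0 0 1 0/0 2/3 ./.")
print(record["person"], record["genotypes"])
```

Missing genotypes are returned as `None` so that downstream code can tell them apart from observed allele pairs.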
=== '''Marker Information File (mandatory)'''<br>  ===
&lt;sample.dat&gt; <br>
  M SNP1
  M SNP2
  M SNP3
&lt;EOF sample.dat&gt;<br>
This file tells us that fields 6-8 in the pedigree file store the genotypes for SNP1 through SNP3, respectively. Note again that the list of SNPs must be in their physical order along the chromosomes.
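The correspondence between marker lines and pedigree fields can be sketched in Python as follows. The function name `marker_columns` is hypothetical, and the sketch assumes that every line of the marker information file describes exactly one whitespace-delimited field of the pedigree file, starting at field 6.

```python
def marker_columns(dat_text, first_data_field=6):
    """Map each marker (M) name in a .dat file to its 1-based field in the .ped file."""
    columns = {}
    field = first_data_field
    for line in dat_text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) < 2:
            raise ValueError("expected a type flag and a name: %r" % line)
        flag, name = parts[0], parts[1]
        if flag == "M":
            columns[name] = field
        # Assumption: every .dat line, marker or not, occupies one .ped field.
        field += 1
    return columns


sample_dat = "M SNP1\nM SNP2\nM SNP3\n"
print(marker_columns(sample_dat))
```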
=== '''Optional Input files'''<br>  ===
==== External/reference files  ====
External/reference (e.g., HapMap) input files (SNP and haplotype files) are optional. MaCH 1.0 accepts two different formats: MACH format or HapMap format. <br>
===== MACH format SNP File  =====
One line per SNP, containing a single field (the marker name).
For example:
===== MACH format Haplotype File  =====
One line per haplotype. <br> Heading identification fields are optional. <br> Heading (non-haplotype) fields must not start with a numeric digit. <br>
For example:
  H_0001-&gt;H_0001 HAPLO1 2332323244332
  H_0001-&gt;H_0001 HAPLO2 2332323422132
  H_0002-&gt;H_0002 HAPLO1 3332323244332
  H_0002-&gt;H_0002 HAPLO2 3311321242332
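A short Python sketch of how such a haplotype file might be read and sanity-checked is given here. The function name `read_haplotypes` is hypothetical; the sketch assumes that the haplotype is the last whitespace-delimited field on each line and that alleles are not separated by spaces, as in the example above.

```python
def read_haplotypes(hap_text, n_snps=None):
    """Return the allele string of each haplotype line and check that lengths agree."""
    haplotypes = []
    for line in hap_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Heading fields are optional, so the haplotype is taken as the last field.
        haplotypes.append(fields[-1])
    lengths = {len(hap) for hap in haplotypes}
    if len(lengths) > 1:
        raise ValueError("haplotypes differ in length: %s" % sorted(lengths))
    if n_snps is not None and haplotypes and lengths != {n_snps}:
        raise ValueError("haplotype length does not match the number of SNPs")
    return haplotypes


example = """H_0001->H_0001 HAPLO1 2332323244332
H_0001->H_0001 HAPLO2 2332323422132
H_0002->H_0002 HAPLO1 3332323244332
H_0002->H_0002 HAPLO2 3311321242332
"""
haps = read_haplotypes(example, n_snps=13)
print(len(haps), "haplotypes of length", len(haps[0]))
```

Passing the number of lines in the SNP file as `n_snps` catches a mismatch between the SNP list and the haplotypes before imputation is run.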
===== HapMap format reference files  =====
HapMap format files can be downloaded from http://hapmap.org/downloads/phasing/2006-07_phaseII/phased/ or http://hapmap.org/downloads/phasing/2007-08_rel22/phased/
HapMap format SNP File: legend file downloaded from HapMap website <br> HapMap format Haplotype File: phase file downloaded from HapMap website <br>
<br> When using HapMap format files, turn on the --hapmapFormat option.
  mach1 -d sample.dat -p sample.ped -s genotypes_chr14_CEU_r21_nr_fwd_legend.txt -h genotypes_chr14_CEU_r21_nr_fwd_phased.gz --hapmapFormat ...
==== Physical position file  ====
==== Parameter files  ====
== Options  ==
Input Files:
* --datfile: Marker information file for subjects under study.
* --pedfile: Pedigree file for subjects under study.
== FAQ  ==
=== '''Why and how to perform a 2-step imputation?'''  ===
A: When one has a large number of individuals (&gt;1000), we recommend a 2-step imputation to speed up the analysis. <br>
&nbsp;&nbsp;&nbsp;&nbsp; A 2-step imputation consists of the following 2 steps:<br>
&nbsp;&nbsp;&nbsp; (step 1) a representative subset of &gt;= 200 unrelated individuals is used to calibrate model parameters; and<br>&nbsp;&nbsp;&nbsp; (step 2) actual genotype imputation is performed for every person using the parameters inferred in step 1. <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Example command lines for a 2-step imputation:<br>
# step 1:
mach1 -d sample.dat -p subset.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip -r 100 -o par_infer &gt; mach.infer.log
# step 2:
mach1 -d sample.dat -p sample.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip --errorMap par_infer.erate --crossoverMap par_infer.rec --mle --mldetails &gt; mach.imp.log
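One way to produce the subset pedigree file used in step 1 is sketched below in Python. The function name `pick_subset` is hypothetical, and uniform random sampling is only an assumption about what makes a subset "representative"; the sketch does not check that the chosen individuals are unrelated.

```python
import random


def pick_subset(ped_lines, size=200, seed=1):
    """Randomly choose `size` pedigree lines for parameter calibration (step 1)."""
    lines = [line for line in ped_lines if line.strip()]
    if len(lines) <= size:
        return lines  # nothing to subsample
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = sorted(rng.sample(range(len(lines)), size))
    # Keep the original file order of the selected individuals.
    return [lines[i] for i in chosen]


# Illustrative input: 1000 made-up pedigree lines with a single genotype each.
all_lines = ["fam%d indiv%d 0 0 1 1/2" % (i, i) for i in range(1000)]
subset = pick_subset(all_lines)
print(len(subset))
```

Writing the returned lines to subset.ped gives a file that can be passed to -p in step 1, while step 2 uses the full sample.ped.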
=== '''Where can I find combined HapMap reference files? '''  ===
A: http://www.sph.umich.edu/csg/yli/mach/download/HapMap-r21.html <br><br>
=== '''Where can I find HapMap III / 1000 Genomes reference files? '''  ===
A: http://www.sph.umich.edu/csg/yli/mach/download/ <br>
=== '''Does --mle overwrite fed-in genotypes?'''  ===
A: Yes, but rarely. --mle outputs the most likely genotype guesses by integrating over the probabilities of all possible configurations based on the reference haplotypes. Overwriting happens when the most likely guess differs from the experimentally observed genotype.<br><br>
=== '''How do I get imputation quality estimates?'''  ===
A: A simple approach is to use the --mask option. For example, --mask 0.02 masks 2% of the genotypes at random, imputes them, and compares the imputed genotypes with the masked originals to estimate genotypic and allelic error rates. Messages like the following will be written to stdout:
  Comparing 948352 masked genotypes with MLE estimates ...
  Estimated per genotype error rate is 0.0568
  Estimated per allele error rate is 0.0293
&nbsp;&nbsp;&nbsp; A better approach is to mask a small proportion of SNPs (rather than individual genotypes, as in the simple approach above). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2, without duplicating the .ped file. After imputation, one can use [http://www.sph.umich.edu/csg/ylwtx/CalcMatch.1.0.5.tgz CalcMatch] and [http://www.sph.umich.edu/csg/ylwtx/doseR2.tgz doseR2.pl] to estimate the genotypic/allelic error rate and the correlation, respectively. Both programs can be downloaded from [http://www.sph.umich.edu/csg/ylwtx/software.html http://www.sph.umich.edu/csg/ylwtx/software.html].<br>
&nbsp;&nbsp;&nbsp; '''Warning''': Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.<br>
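Generating mask.dat by re-flagging a subset of markers from M to S2 can be sketched in Python as follows. The function name `mask_markers` and the 5% default are assumptions made for illustration; the text above only asks for "a small proportion" of SNPs.

```python
import random


def mask_markers(dat_lines, fraction=0.05, seed=1):
    """Return .dat lines with a random subset of M markers re-flagged as S2."""
    marker_idx = [i for i, line in enumerate(dat_lines)
                  if line.split()[:1] == ["M"]]
    n_mask = int(round(fraction * len(marker_idx)))
    rng = random.Random(seed)  # fixed seed so the masked set is reproducible
    masked = set(rng.sample(marker_idx, n_mask))
    out = []
    for i, line in enumerate(dat_lines):
        if i in masked:
            name = line.split()[1]
            out.append("S2 " + name)  # S2 tells MaCH to skip this marker's genotypes
        else:
            out.append(line)
    return out


# Illustrative input: 100 made-up marker lines.
original = ["M SNP%d" % i for i in range(1, 101)]
mask_dat = mask_markers(original)
print(sum(line.startswith("S2 ") for line in mask_dat), "markers masked")
```

The marker order is left untouched, so the same .ped file can be used with both the original .dat and mask.dat.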
=== '''Shall I apply QC before or after imputation? If so, how? '''  ===
A: Yes. We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters, including HWE, MAF (the recommended cutoff is 1% for genotyping-based GWAS), completeness, and Mendelian inconsistency. After imputation, we recommend an Rsq cutoff of 0.3 (which removes &gt;70% of poorly imputed SNPs at the cost of &lt;0.5% of well-imputed SNPs) and a MAF cutoff of 1%.
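The post-imputation filter can be sketched in Python as follows, using the .mlinfo column layout (SNP Al1 Al2 Freq1 MAF Quality Rsq) shown elsewhere on this page. The function name `filter_info` is hypothetical, the second data row is invented for illustration, and treating the cutoffs as inclusive (>=) is an assumption.

```python
def filter_info(info_text, min_rsq=0.3, min_maf=0.01):
    """Return the names of SNPs whose Rsq and MAF meet the post-imputation cutoffs."""
    lines = [line for line in info_text.splitlines() if line.strip()]
    header = lines[0].split()
    maf_col = header.index("MAF")
    rsq_col = header.index("Rsq")
    kept = []
    for line in lines[1:]:
        fields = line.split()
        rsq = float(fields[rsq_col])
        maf = float(fields[maf_col])
        if rsq >= min_rsq and maf >= min_maf:
            kept.append(fields[0])
    return kept


# The rsEXAMPLE row is hypothetical; rs885550 is the row shown on this page.
example_info = """SNP Al1 Al2 Freq1 MAF Quality Rsq
rs885550 2 4 0.9840 0.0160 0.9682 0.0021
rsEXAMPLE 1 3 0.7000 0.3000 0.9500 0.9000
"""
print(filter_info(example_info))
```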
=== '''How do I get reference files for a region of interest? '''  ===
A: (1) For HapMapII format, download http://www.sph.umich.edu/csg/ylwtx/HapMapForMach.tgz <br>
&nbsp;&nbsp;&nbsp; (2) For MACH format, you can do the following:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (2-1) First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position.
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (2-2) Then:
  @ first = `grep -n rsFIRST orig.snps | cut -f1 -d ':'`
  @ last = `grep -n rsLAST orig.snps | cut -f1 -d ':'`
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (2-3) Finally (assuming the third field contains the actual haplotypes, where alleles are separated by nothing):
  awk '{print $3}' orig.hap | cut -c${first}-${last} &gt; region.hap
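The same region extraction can be sketched in Python, under the same assumption that alleles within a haplotype are not separated by anything. The function name `extract_region` is hypothetical. Unlike grep, the lookup matches SNP names exactly, so a name that is a prefix of another name cannot match the wrong line.

```python
def extract_region(snp_names, haplotypes, first_snp, last_snp):
    """Cut the SNP list and every haplotype string down to [first_snp, last_snp]."""
    start = snp_names.index(first_snp)  # raises ValueError if the SNP is absent
    end = snp_names.index(last_snp)
    if start > end:
        raise ValueError("first_snp must come before last_snp")
    region_snps = snp_names[start:end + 1]
    region_haps = [hap[start:end + 1] for hap in haplotypes]
    return region_snps, region_haps


# Illustrative data: five made-up SNP names and two short haplotypes.
snps = ["rs1", "rs2", "rs3", "rs4", "rs5"]
haps = ["ACGGA", "CCGAA"]
print(extract_region(snps, haps, "rs2", "rs4"))
```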
=== '''Do I have to sort the pedigree file by physical positions? '''  ===
A: If you use an external reference, you do not have to, as long as the external reference is in the correct order. '''However''', we strongly recommend sorting the pedigree files. <br><br>
=== '''What if I specified --states R where R exceeds the maximum possible (2*number diploid individuals - 2 + number_haplotypes)? '''  ===
A: MaCH automatically resets it to the maximum possible value.
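The arithmetic in the question can be written out as a small Python sketch. The function name `effective_states` is hypothetical, and the sketch only mirrors the formula quoted above; it does not reproduce MaCH's internal code.

```python
def effective_states(requested, n_diploid, n_haplotypes):
    """Clamp a requested --states value to the maximum given in the question above."""
    max_states = 2 * n_diploid - 2 + n_haplotypes
    return min(requested, max_states)


# With 100 diploid individuals and 120 haplotypes, the maximum is
# 2 * 100 - 2 + 120 = 318, so a request for 500 states is reduced to 318.
print(effective_states(500, n_diploid=100, n_haplotypes=120))
```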
=== '''How is AL1 defined? Which allele dosage is .dose/.mldose counting?'''  ===
A: AL1 is an arbitrary allele. To be specific, it is the first allele read in the reference haplotypes (the file fed to -h or --haps). The earliest versions of mach1 counted the number of AL2 copies, and the latest versions count the number of AL1 copies. One can find out which allele is counted by following the steps below. <br>
Take your dosage, geno, and info output (.dose, .geno and .info, or .mldose, .mlgeno and .mlinfo, depending on whether you ran in non-mle or mle mode) and check whether the dosage is the number of AL1 copies or AL2 copies. An example is given below:
  '''head -1 mldose/chr21.mldose | cut -f3 -d ' ' '''
'''head -2 mlinfo/chr21.mlinfo '''
SNP Al1 Al2 Freq1 MAF Quality Rsq
rs885550 2 4 0.9840 0.0160 0.9682 0.0021
''' head -1 mlgeno/chr21.mlgeno | cut -f3 -d ' ' '''
Based on the three files above, we've confirmed that the dosage is the number of AL1 copies. You only need to check one informative case (i.e., a dosage value close to 0 or 2), since the convention is consistent across all individuals and all SNPs.
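The check described above can be expressed as a short Python sketch. The function name `counted_allele`, the tolerance of 0.5 and the dosage values in the usage lines are all hypothetical; only the allele labels (2 and 4) are taken from the rs885550 line above.

```python
def counted_allele(al1, al2, genotype, dosage, tol=0.5):
    """Infer from one informative case whether a dosage counts AL1 or AL2 copies."""
    first, second = genotype.split("/")
    if first != second:
        # Heterozygotes are uninformative: either convention gives a dosage near 1.
        return None
    if first == al1:
        al1_copies = 2
    elif first == al2:
        al1_copies = 0
    else:
        raise ValueError("genotype allele is neither AL1 nor AL2")
    if abs(dosage - al1_copies) <= tol:
        return "AL1"
    if abs(dosage - (2 - al1_copies)) <= tol:
        return "AL2"
    return None  # dosage too far from 0 or 2 to decide


# A 2/2 homozygote with a dosage near 2 means the dosage counts AL1 (= 2) copies.
print(counted_allele("2", "4", "2/2", 1.98))
```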
=== '''Can I use an unphased reference?'''  ===
A: Yes. You simply need a combined pedigree (.ped) and marker information file (.dat). <br>
&nbsp;&nbsp;&nbsp;&nbsp; For example, if you have:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '''ref.hap'''<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ACGGA<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CCGAA
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '''ref.snps'''<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SNP1<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SNP2<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SNP3<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SNP4<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SNP5
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '''sample.ped'''<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 1 0 0 1 A/A G/G&nbsp; <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '''sample.dat'''<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; M SNP1<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; M SNP4
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The combined files should look like:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '''comb.ped'''<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; r1 r1 0 0 1 A/C C/C G/G G/A A/A<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 1 0 0 1 A/A ./. ./. G/G ./.
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '''comb.dat'''<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; M SNP1<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; M SNP2<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; M SNP3<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; M SNP4<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; M SNP5<br>
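The construction of the combined genotype columns in this example can be sketched in Python. The function names are hypothetical, and the sketch assumes that consecutive haplotypes in ref.hap belong to the same reference individual and that every sample SNP also appears in ref.snps.

```python
def haplotypes_to_genotypes(hap_a, hap_b):
    """Pair two haplotypes of one reference individual into unphased genotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same length")
    return ["%s/%s" % (a, b) for a, b in zip(hap_a, hap_b)]


def expand_sample(sample_genotypes, sample_snps, all_snps):
    """Place sample genotypes on the full SNP list, filling gaps with ./."""
    by_name = dict(zip(sample_snps, sample_genotypes))
    return [by_name.get(snp, "./.") for snp in all_snps]


ref_snps = ["SNP1", "SNP2", "SNP3", "SNP4", "SNP5"]
ref_row = ["r1", "r1", "0", "0", "1"] + haplotypes_to_genotypes("ACGGA", "CCGAA")
sample_row = ["1", "1", "0", "0", "1"] + expand_sample(
    ["A/A", "G/G"], ["SNP1", "SNP4"], ref_snps)

# These two lines correspond to the rows of comb.ped shown above.
print(" ".join(ref_row))
print(" ".join(sample_row))
```

The matching comb.dat is simply one `M` line per entry of `ref_snps`, in the same order.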
=== '''More questions?'''  ===
A: Email Yun Li (yunli@med.unc.edu) or Goncalo Abecasis (goncalo@umich.edu).
== Examples  ==
  mach1 -d sample.dat -p sample.ped -s hapmap.snps -h hapmap.hap -r 100 -o phase

Latest revision as of 08:34, 26 November 2010

MaCH is a tool for haplotyping, genotype imputation and disease association analysis developed by Goncalo Abecasis and Yun Li. MaCH was first used to impute missing genotypes in our FUSION genomewide association study (Scott et al., Science, 2007) and has since been used in the analysis of many other GWAS.

This page includes links to several useful MaCH related resources.

Currently, it also includes random notes on input file formats, but these probably need to be cleaned up!