Difference between revisions of "MaCH"

From Genome Analysis Wiki
Line 1: Line 1:
'''MaCH''' (MArkov Chain Haplotyping), mostly known as a software for genotype imputation, is a Hidden Markov Model (HMM) based haplotyper that reconstructs haplotypes from genotypes of unrelated individuals. Three primary utilities of MaCH are (1) to resolve haplotypes from diploid genotypes; (2) impute missing genotypes; and (3) perform disease mapping analysis.
+
'''MaCH''' (MArkov Chain Haplotyping), mostly known as a software for genotype imputation, is a Hidden Markov Model (HMM) based haplotyper that reconstructs haplotypes from genotypes of unrelated individuals. Three primary utilities of MaCH are (1) to resolve haplotypes from diploid genotypes; (2) impute missing genotypes; and (3) perform disease mapping analysis.  
  
 +
<br>
  
== Input Files ==
+
== Input Files ==
  
Mach takes unphased genotypes of unrelated individuals as input. Two input files are mandatory: a pedigree file and a marker information file. The pedigree file stores five key pieces of information and genotypes for each individual, with missing genotypes accepted and additional phenotypes allowed. The marker information file provides the list of marker names. Note that the list must be in order according to physical positions of the markers along the chromosomes. For more details, refer to  
+
Mach takes unphased genotypes of unrelated individuals as input. Two input files are mandatory: a pedigree file and a marker information file. The pedigree file stores five key pieces of information and genotypes for each individual, with missing genotypes accepted and additional phenotypes allowed. The marker information file provides the list of marker names. Note that the list must be in order according to physical positions of the markers along the chromosomes. For more details, refer to http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html. <br>  
http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html. <br>
 
  
=== <strong>Pedigree File (mandatory)</strong><br> ===
+
=== '''Pedigree File (mandatory)'''<br> ===
Each person contributes to one line in a pedigree file. Required fields are (1) the first five fixed fields corresponding to five key pieces of information (namely: family person father mother sex), and (2) genotype fields. Phenotype fields are allowed inbetween but will not be used by the program. <br>
+
 
 +
Each person contributes to one line in a pedigree file. Required fields are (1) the first five fixed fields corresponding to five key pieces of information (namely: family person father mother sex), and (2) genotype fields. Phenotype fields are allowed inbetween but will not be used by the program. <br>
 +
 
 +
&lt;sample.ped&gt; <br>  
  
<sample.ped> <br>
 
 
   fam1 indiv1 0 0 1 0/0 2/3 ./.  
 
   fam1 indiv1 0 0 1 0/0 2/3 ./.  
 
   fam2 indiv2 0 0 2 1/2 2/2 1/4  
 
   fam2 indiv2 0 0 2 1/2 2/2 1/4  
<EOF sample.ped> <br>
+
 
 +
&lt;EOF sample.ped&gt; <br>  
  
 
This sample.ped contains 2 individuals. The first individual is from family fam1 with person ID indiv1 and no parental information available (father = 0, mother = 0). This person is a male (sex = 1). His genotypes are missing at the first and third markers (0/0 and ./.), and is 2/3 (C/G) at the second marker. Similarly, the second individual is from family fam2 with person ID indiv2 and no parental information available (father = 0, mother = 0). This person is a female (sex = 2). Her genotypes are 1/2 (A/C) at the first locus, 2/2 (Homozygous for C) at the second locus and 1/4 (A/T) at the third locus.  
 
This sample.ped contains 2 individuals. The first individual is from family fam1 with person ID indiv1 and no parental information available (father = 0, mother = 0). This person is a male (sex = 1). His genotypes are missing at the first and third markers (0/0 and ./.), and is 2/3 (C/G) at the second marker. Similarly, the second individual is from family fam2 with person ID indiv2 and no parental information available (father = 0, mother = 0). This person is a female (sex = 2). Her genotypes are 1/2 (A/C) at the first locus, 2/2 (Homozygous for C) at the second locus and 1/4 (A/T) at the third locus.  
  
=== <strong>Marker Information File (mandatory)</strong><br> ===
+
=== '''Marker Information File (mandatory)'''<br> ===
 +
 
 +
&lt;sample.dat&gt;<br>
  
<sample.dat><br>
 
 
   M SNP1
 
   M SNP1
 
   M SNP2
 
   M SNP2
 
   M SNP3
 
   M SNP3
<EOF sample.dat><br>
+
 
 +
&lt;EOF sample.dat&gt;<br>  
  
 
This file tells us that fields 6-8 in the pedigree file store genotypes for SNP1-3 correspondingly. Note again that the list of SNPs must be in their physical order along the chromosomes.  
 
This file tells us that fields 6-8 in the pedigree file store genotypes for SNP1-3 correspondingly. Note again that the list of SNPs must be in their physical order along the chromosomes.  
  
 +
<br>
  
=== <strong>Optional Input files</strong><br> ===
+
=== '''Optional Input files'''<br> ===
  
==== External/reference files ====
+
==== External/reference files ====
External/reference (e.g., HapMap) input files (snp and haplotype files) are optional. Mach 1.0 accepts two different formats: MACH format or HapMap format. <br>
 
  
===== MACH format SNP File =====
+
External/reference (e.g., HapMap) input files (snp and haplotype files) are optional. Mach 1.0 accepts two different formats: MACH format or HapMap format. <br>
---------------------------
 
One line per SNP and one field (marker name) only.
 
  
For example:
+
===== MACH format SNP File  =====
 +
 
 +
----
 +
 
 +
One line per SNP and one field (marker name) only.
 +
 
 +
For example:  
  
 
  marker1
 
  marker1
Line 43: Line 52:
 
  ...
 
  ...
  
===== MACH format Haplotype File =====
+
===== MACH format Haplotype File =====
---------------------------------
+
 
One line per haplotype. <br>
+
----
Heading identification fields are optional. <br>
+
 
Each non-haplotype/heading field shall not start with a numeric digit. <br>
+
One line per haplotype. <br> Heading identification fields are optional. <br> Each non-haplotype/heading field shall not start with a numeric digit. <br>  
  
For example:
+
For example:  
  
   H_0001->H_0001 HAPLO1 2332323244332
+
   H_0001-&gt;H_0001 HAPLO1 2332323244332
   H_0001->H_0001 HAPLO2 2332323422132
+
   H_0001-&gt;H_0001 HAPLO2 2332323422132
   H_0002->H_0002 HAPLO1 3332323244332
+
   H_0002-&gt;H_0002 HAPLO1 3332323244332
   H_0002->H_0002 HAPLO2 3311321242332
+
   H_0002-&gt;H_0002 HAPLO2 3311321242332
 
   ...
 
   ...
  
===== HapMap format reference files =====
+
===== HapMap format reference files =====
HapMap format files can be downloaded from
 
http://hapmap.org/downloads/phasing/2006-07_phaseII/phased/
 
or
 
http://hapmap.org/downloads/phasing/2007-08_rel22/phased/
 
  
HapMap format SNP File: legend file downloaded from HapMap website <br>
+
HapMap format files can be downloaded from http://hapmap.org/downloads/phasing/2006-07_phaseII/phased/ or http://hapmap.org/downloads/phasing/2007-08_rel22/phased/
HapMap format Haplotype File: phase file downloaded from HapMap website <br>
 
  
 +
HapMap format SNP File: legend file downloaded from HapMap website <br> HapMap format Haplotype File: phase file downloaded from HapMap website <br>
  
When using HapMap format files, turn on --hapmapFormat option.
+
<br> When using HapMap format files, turn on --hapmapFormat option.  
  
 
   mach1 -d sample.dat -p sample.ped -s genotypes_chr14_CEU_r21_nr_fwd_legend.txt -h genotypes_chr14_CEU_r21_nr_fwd_phased.gz --hapmapFormat ...
 
   mach1 -d sample.dat -p sample.ped -s genotypes_chr14_CEU_r21_nr_fwd_legend.txt -h genotypes_chr14_CEU_r21_nr_fwd_phased.gz --hapmapFormat ...
  
==== Physical position file ====
+
==== Physical position file ====
 +
 
 +
==== Parameter files  ====
  
==== Parameter files ====
+
== Options  ==
  
== Options ==
+
Input Files: --datfile Marker information file for subjects under study. --pedfile Pedigree file for subjects under study.
  
Input Files:
+
<br>
--datfile
 
Marker information file for subjects under study.
 
--pedfile
 
Pedigree file for subjects under study.
 
  
 +
== FAQ  ==
  
 +
'''Q''': Where can I find combined HapMap reference files? <br> A: http://www.sph.umich.edu/csg/yli/mach/download/HapMap-r21.html <br><br>
  
== FAQ ==
+
Q: Where can I find HapMap III reference files? <br> A: http://www.sph.umich.edu/csg/yli/mach/download/ <br><br>
  
<strong>Q</strong>: Where can I find combined HapMap reference files? <br>
+
Q: Does --mle overwrite fed-in genotypes?<br> A: Yes. But rarely. --mle outputs the most likely genotype guesses by integrating over the probabilities of all possible configurations based on the reference haplotypes. The overwriting happens when the most likely guess differs from the experimental counterpart.<br><br>  
A: http://www.sph.umich.edu/csg/yli/mach/download/HapMap-r21.html <br><br>
 
  
Q: Where can I find HapMap III reference files? <br>
+
Q: How do I get reference files for an region of interest? <br> A: For HapMapII format, download http://www.sph.umich.edu/csg/ylwtx/HapMapForMach.tgz <br>  
A: http://www.sph.umich.edu/csg/yli/mach/download/ <br><br>
 
  
Q: Does --mle overwrite fed-in genotypes?<br>
+
For MACH format, you can do the following:
A: Yes. But rarely. --mle outputs the most likely genotype guesses by integrating over the probabilities of all possible configurations based on the reference haplotypes. The overwriting happens when the most likely guess differs from the experimental counterpart.<br><br>
 
  
Q: How do I get reference files for an region of interest? <br>
+
First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position.  
A: For HapMapII format, download http://www.sph.umich.edu/csg/ylwtx/HapMapForMach.tgz <br>
 
  For MACH format, you can do the following:
 
First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position.
 
  
 
Then:  
 
Then:  
 +
 
   @ first = `grep -n rsFIRST orig.snps | cut -f1 -d ':'`
 
   @ first = `grep -n rsFIRST orig.snps | cut -f1 -d ':'`
 
   @ last = `grep -n rsLAST orig.snps | cut -f1 -d ':'`
 
   @ last = `grep -n rsLAST orig.snps | cut -f1 -d ':'`
  
Finally (assuming the third field contains the actual haplotypes, where alleles are separated by nothing):
+
Finally (assuming the third field contains the actual haplotypes, where alleles are separated by nothing):  
 +
 
 +
  awk '{print $3}' orig.hap | cut -c${first}-${last} &gt; region.hap
  
  awk '{print $3}' orig.hap | cut -c${first}-${last} > region.hap
+
== Examples  ==
  
== Examples ==
+
Imputation
  
Imputation
 
 
   mach1 -d sample.dat -p sample.ped -s hapmap.snps -h hapmap.hap -r 100 -o phase
 
   mach1 -d sample.dat -p sample.ped -s hapmap.snps -h hapmap.hap -r 100 -o phase

Revision as of 15:10, 30 April 2010

MaCH (MArkov Chain Haplotyping), best known as software for genotype imputation, is a Hidden Markov Model (HMM) based haplotyper that reconstructs haplotypes from the genotypes of unrelated individuals. MaCH has three primary uses: (1) resolving haplotypes from diploid genotypes; (2) imputing missing genotypes; and (3) performing disease mapping analysis.


Input Files

MaCH takes unphased genotypes of unrelated individuals as input. Two input files are mandatory: a pedigree file and a marker information file. The pedigree file stores five identifying fields (family, person, father, mother, sex) and the genotypes for each individual; missing genotypes are accepted and additional phenotypes are allowed. The marker information file lists the marker names. Note that the list must be ordered by the physical positions of the markers along the chromosome. For more details, refer to http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html.

Pedigree File (mandatory)

Each person occupies one line in the pedigree file. Required fields are (1) the first five fixed fields (family, person, father, mother, sex), and (2) the genotype fields. Phenotype fields are allowed in between but are not used by the program.

<sample.ped>

 fam1 indiv1 0 0 1 0/0 2/3 ./. 
 fam2 indiv2 0 0 2 1/2 2/2 1/4 

<EOF sample.ped>

This sample.ped contains 2 individuals. The first individual is from family fam1 with person ID indiv1 and no parental information available (father = 0, mother = 0). This person is male (sex = 1). His genotypes are missing at the first and third markers (0/0 and ./.), and his genotype at the second marker is 2/3 (C/G). Similarly, the second individual is from family fam2 with person ID indiv2 and no parental information available (father = 0, mother = 0). This person is female (sex = 2). Her genotypes are 1/2 (A/C) at the first locus, 2/2 (homozygous for C) at the second locus and 1/4 (A/T) at the third locus.

Marker Information File (mandatory)

<sample.dat>

 M SNP1
 M SNP2
 M SNP3

<EOF sample.dat>

This file tells us that fields 6-8 in the pedigree file store the genotypes for SNP1-3, respectively. Note again that the SNPs must be listed in their physical order along the chromosome.
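
The relationship between the two mandatory files can be sanity-checked before running MaCH. The following is a minimal sketch, not part of MaCH itself; it assumes the pedigree file has no phenotype columns (as in the sample files above), so each line should contain 5 fixed fields plus one genotype field per "M" line. The temporary directory and variable names are illustrative only.

```shell
# Illustrative check (not a MaCH feature): compare pedigree field counts
# against the number of markers declared in the data file.
tmp=$(mktemp -d)

cat > "$tmp/sample.dat" <<'EOF'
M SNP1
M SNP2
M SNP3
EOF

cat > "$tmp/sample.ped" <<'EOF'
fam1 indiv1 0 0 1 0/0 2/3 ./.
fam2 indiv2 0 0 2 1/2 2/2 1/4
EOF

# Count the marker ("M") lines in the data file.
n_markers=$(awk '$1 == "M" { n++ } END { print n + 0 }' "$tmp/sample.dat")

# Five fixed fields (family person father mother sex) plus one field per marker.
expected=$((n_markers + 5))

# Collect the line numbers of pedigree lines with a different field count.
bad_lines=$(awk -v want="$expected" 'NF != want { print NR }' "$tmp/sample.ped")

if [ -z "$bad_lines" ]; then
    echo "OK: every pedigree line has $expected fields"
else
    echo "Unexpected field count on line(s): $bad_lines"
fi
```

If the pedigree file does carry phenotype columns, the expected count would need to be increased accordingly.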


Optional Input files

External/reference files

External/reference (e.g., HapMap) input files (SNP and haplotype files) are optional. MaCH 1.0 accepts two different formats: MACH format or HapMap format.

MACH format SNP File

One line per SNP and one field (marker name) only.

For example:

 marker1
 marker2
 ...

MACH format Haplotype File

One line per haplotype.
Heading identification fields are optional.
Each non-haplotype (heading) field must not start with a numeric digit.

For example:

 H_0001->H_0001 HAPLO1 2332323244332
 H_0001->H_0001 HAPLO2 2332323422132
 H_0002->H_0002 HAPLO1 3332323244332
 H_0002->H_0002 HAPLO2 3311321242332
 ...
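
Because the SNP file has one line per marker and the haplotype file has one allele character per marker, the two files can be cross-checked. The sketch below is an illustration under stated assumptions, not a MaCH feature: it assumes the haplotype string is the last whitespace-separated field on each line, and the marker names marker1..marker13 are hypothetical, chosen only to match the 13-allele haplotypes shown above.

```shell
# Illustrative consistency check between a MACH format SNP file and
# a MACH format haplotype file (file names are hypothetical).
tmp=$(mktemp -d)

# Build a 13-marker SNP file, one marker name per line.
i=1
while [ "$i" -le 13 ]; do
    echo "marker$i"
    i=$((i + 1))
done > "$tmp/ref.snps"

cat > "$tmp/ref.hap" <<'EOF'
H_0001->H_0001 HAPLO1 2332323244332
H_0001->H_0001 HAPLO2 2332323422132
H_0002->H_0002 HAPLO1 3332323244332
H_0002->H_0002 HAPLO2 3311321242332
EOF

# Number of non-empty lines in the SNP file.
n_snps=$(awk 'NF { n++ } END { print n + 0 }' "$tmp/ref.snps")

# Line numbers of haplotypes whose length differs from the SNP count.
# Assumption: the haplotype is the last field on the line.
mismatches=$(awk -v want="$n_snps" 'NF && length($NF) != want { print NR }' "$tmp/ref.hap")

if [ -z "$mismatches" ]; then
    echo "OK: all haplotypes have $n_snps alleles"
else
    echo "Length mismatch on haplotype line(s): $mismatches"
fi
```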

HapMap format reference files

HapMap format files can be downloaded from http://hapmap.org/downloads/phasing/2006-07_phaseII/phased/ or http://hapmap.org/downloads/phasing/2007-08_rel22/phased/

HapMap format SNP File: legend file downloaded from HapMap website
HapMap format Haplotype File: phase file downloaded from HapMap website


When using HapMap format files, turn on the --hapmapFormat option.

 mach1 -d sample.dat -p sample.ped -s genotypes_chr14_CEU_r21_nr_fwd_legend.txt -h genotypes_chr14_CEU_r21_nr_fwd_phased.gz --hapmapFormat ...

Physical position file

Parameter files

Options

Input Files:

 --datfile   Marker information file for subjects under study.
 --pedfile   Pedigree file for subjects under study.


FAQ

Q: Where can I find combined HapMap reference files?
A: http://www.sph.umich.edu/csg/yli/mach/download/HapMap-r21.html

Q: Where can I find HapMap III reference files?
A: http://www.sph.umich.edu/csg/yli/mach/download/

Q: Does --mle overwrite fed-in genotypes?
A: Yes, but rarely. --mle outputs the most likely genotype guesses by integrating over the probabilities of all possible configurations based on the reference haplotypes. An input genotype is overwritten only when the most likely guess differs from its experimental counterpart.

Q: How do I get reference files for a region of interest?
A: For HapMapII format, download http://www.sph.umich.edu/csg/ylwtx/HapMapForMach.tgz

For MACH format, you can do the following:

First, find the first and last SNPs (by physical position) in the region you are interested in, say "rsFIRST" and "rsLAST".

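Then (csh/tcsh syntax; -w matches whole words, so that rsFIRST does not also match longer IDs that merely begin with it):

 @ first = `grep -nw rsFIRST orig.snps | cut -f1 -d ':'`
 @ last = `grep -nw rsLAST orig.snps | cut -f1 -d ':'`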

Finally (assuming the third field contains the actual haplotypes, with no separator between alleles):

 awk '{print $3}' orig.hap | cut -c${first}-${last} > region.hap
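
For readers not using csh, the same procedure can be sketched in POSIX sh. This is a minimal illustration under assumptions, not an official MaCH script: the SNP names rs10..rs15 and the two short haplotypes are made-up demo data, and the sketch additionally writes a matching region.snps file, which the steps above do not show.

```shell
# Illustrative POSIX sh version of the region extraction (demo data only).
tmp=$(mktemp -d)

# Hypothetical reference files: 6 SNPs, haplotypes in the third field.
printf '%s\n' rs10 rs11 rs12 rs13 rs14 rs15 > "$tmp/orig.snps"
cat > "$tmp/orig.hap" <<'EOF'
H_0001->H_0001 HAPLO1 233232
H_0001->H_0001 HAPLO2 411234
EOF

first_snp=rs11   # first SNP of the region (by position)
last_snp=rs13    # last SNP of the region (by position)

# Line numbers of the boundary SNPs; -w avoids matching longer IDs.
first=$(grep -nw "$first_snp" "$tmp/orig.snps" | cut -d ':' -f 1)
last=$(grep -nw "$last_snp" "$tmp/orig.snps" | cut -d ':' -f 1)

# Keep the SNP names in the region (lines first..last).
sed -n "${first},${last}p" "$tmp/orig.snps" > "$tmp/region.snps"

# Keep the matching allele columns (characters first..last) of each haplotype.
awk '{ print $3 }' "$tmp/orig.hap" | cut -c "${first}-${last}" > "$tmp/region.hap"

cat "$tmp/region.snps" "$tmp/region.hap"
```

Note that region.hap produced this way contains only the haplotype strings; the heading fields are dropped, which the MACH format permits because they are optional.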

Examples

Imputation

 mach1 -d sample.dat -p sample.ped -s hapmap.snps -h hapmap.hap -r 100 -o phase