MaCH: Input Files

From Genome Analysis Wiki
Revision as of 11:31, 2 February 2017 by Ppwhite (talk | contribs) (→‎Optional Phased Haplotypes)

An earlier version of this page is available at http://csg.sph.umich.edu//abecasis/MaCH/tour/input_files.html.

MACH input files include information on experimental genotypes for a set of individuals and, optionally, on a set of known haplotypes. MACH can use these to estimate haplotypes for each sampled individual (conditional on the observed genotypes) or to fill in missing genotypes (conditional on observed genotypes at flanking markers and on the observed genotypes in other individuals). Since an essential first step in any analysis is to make sure the data are formatted correctly, it is worthwhile to go over the input files MACH expects and their formats.

Observed Genotypes

The essential inputs for MACH are a set of observed genotypes for each individual being studied. Typically, MACH expects that all the markers being examined map to one chromosome and that they appear in map order in the input files. These requirements can be relaxed when using phased haplotypes as input (see below).

MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically linked: the data file describes the contents of the pedigree file (every pedigree file is slightly different), and the pedigree file itself can only be decoded with its companion data file. The two files can use either the more modern Merlin / QTDT format or the classic LINKAGE format. Detailed descriptions of each format are available elsewhere (for example, see details of Merlin input formats), and here we focus on providing an overview of the bare essentials required for using MACH.

Data files can describe a variety of fields, including disease status information, quantitative traits and covariates, and marker genotypes. A simple MACH data file just lists names for a series of genetic markers. Each marker name appears on its own line, prefaced by an " M " field code. Here is an example:

 <Example of a simple data file>
 M marker1
 M marker2
 ...
 <End of simple data file>
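
As an illustration, a marker-only data file like the one above could be generated with a few lines of Python. This is a minimal sketch: the function names format_dat_lines and write_dat_file are hypothetical helpers, not part of MACH, and the sketch assumes the data file contains nothing but marker entries.

```python
def format_dat_lines(marker_names):
    """Return one 'M <name>' line per marker, in the order given."""
    return [f"M {name}" for name in marker_names]


def write_dat_file(path, marker_names):
    """Write a marker-only data file (hypothetical helper, not a MACH tool)."""
    with open(path, "w") as handle:
        handle.write("\n".join(format_dat_lines(marker_names)) + "\n")


if __name__ == "__main__":
    # Print the lines rather than writing a file, to keep the demo side-effect free.
    for line in format_dat_lines(["marker1", "marker2"]):
        print(line)
```

The marker order matters here, since the data file determines how the genotype columns of the pedigree file are interpreted.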

The actual genotypes are stored in a pedigree file. The pedigree file encodes one individual per row. Each row should start with a family id and individual id, followed by a father and mother id (which typically are both set to 0, 'zero', for unrelated individuals), and sex. These initial columns are followed by a series of marker genotypes, each with two alleles. We recommend that alleles be coded as A, C, G, T. For compatibility with older analysis tools, it is also possible to encode alleles as 1 (for A), 2 (for C), 3 (for G) and 4 (for T). See below for an example:

 <Example of a pedigree file with base-pair coded alleles>
 FAM1001   ID1234  0   0   M   A A   A C   C C
 FAM1002   ID5678  0   0   F   A C   C C   G G
 ...
 <End of pedigree file>

Some people prefer to use a "/" to separate alleles, as it makes the pedigree easier to read. Thus, the following pedigree is equivalent:

 <Example of a pedigree file with base-pair coded alleles>
 FAM1001   ID1234  0   0   M   A/A   A/C   C/C
 FAM1002   ID5678  0   0   F   A/C   C/C   G/G
 ...
 <End of pedigree file>

Missing genotypes can be encoded with a '.', "dot", or a '0', "zero". For example, here are two individuals that are missing the first genotype:

 <Example of a pedigree file with base-pair coded alleles>
 FAM1003   ID1234  0   0   M   ./.   A/C   C/C
 FAM1004   ID5678  0   0   F   0/0   C/C   G/G
 ...
 <End of pedigree file>
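
To make the row layout concrete, the following Python sketch reads one pedigree row in either the space-separated or the "/"-separated style and reports missing genotypes as None. It only illustrates the layout described above and is not how MACH itself parses the file; the function name parse_pedigree_line and the returned dictionary keys are hypothetical.

```python
MISSING_CODES = {".", "0"}  # both '.' and '0' mark a missing allele


def parse_pedigree_line(line):
    """Split one pedigree row into its id columns and a list of genotypes."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected family, individual, father, mother and sex columns")
    family, individual, father, mother, sex = fields[:5]

    # "A/C" carries both alleles in one token; in the space-separated style
    # each token is a single allele, so flattening handles both layouts.
    alleles = []
    for token in fields[5:]:
        alleles.extend(token.split("/"))
    if len(alleles) % 2 != 0:
        raise ValueError("each genotype needs exactly two alleles")

    genotypes = []
    for first, second in zip(alleles[0::2], alleles[1::2]):
        if first in MISSING_CODES or second in MISSING_CODES:
            genotypes.append(None)  # treat the whole genotype as missing
        else:
            genotypes.append((first, second))

    return {
        "family": family,
        "individual": individual,
        "father": father,
        "mother": mother,
        "sex": sex,
        "genotypes": genotypes,
    }


if __name__ == "__main__":
    row = parse_pedigree_line("FAM1003   ID1234  0   0   M   ./.   A/C   C/C")
    print(row["individual"], row["genotypes"])
```

Treating a genotype as missing when either allele is missing is a simplifying assumption of this sketch.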


Although we don't recommend it, it is possible to use a pedigree file with numerically coded alleles. For an example, see obsolete input formats.
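
If you have numerically coded genotypes and would like to move to the recommended letter coding, the mapping given earlier (1 for A, 2 for C, 3 for G, 4 for T) can be applied mechanically. The Python sketch below assumes that your numeric codes really follow that mapping, which you should verify for your own data; the names recode_allele and recode_genotype are hypothetical.

```python
# Mapping described on this page: 1 (A), 2 (C), 3 (G), 4 (T).
NUMERIC_TO_BASE = {"1": "A", "2": "C", "3": "G", "4": "T"}


def recode_allele(allele):
    """Translate one numeric allele; anything else (e.g. '0', '.', 'A') is kept."""
    return NUMERIC_TO_BASE.get(allele, allele)


def recode_genotype(genotype):
    """Recode a genotype written as '1/2' or '1 2', preserving its separator."""
    separator = "/" if "/" in genotype else " "
    return separator.join(recode_allele(a) for a in genotype.split(separator))


if __name__ == "__main__":
    for genotype in ["1/2", "3 4", "0/0"]:
        print(genotype, "->", recode_genotype(genotype))
```

Missing-data codes are passed through unchanged, so a '0/0' genotype stays missing after recoding.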

In the MACH command line, the names of the data and pedigree files are indicated with the -d and -p options (in shorthand form) or the --datfile and --pedfile options (in long form), respectively.

For example:

 mach -d genotypes.dat -p genotypes.ped

Or:

 mach --datfile genotypes.dat --pedfile genotypes.ped
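
Because the data file describes the columns of the pedigree file, it can be worth checking that the two agree before running either command. The Python sketch below counts the "M" entries in the data file and flags pedigree rows whose number of alleles does not match. It assumes the data file contains only marker entries, as in the simple example above; the function names are hypothetical and the check is not part of MACH.

```python
def count_markers(dat_lines):
    """Count the lines whose field code is 'M' (marker entries)."""
    return sum(1 for line in dat_lines if line.split()[:1] == ["M"])


def mismatched_rows(dat_lines, ped_lines):
    """Return 1-based numbers of pedigree rows with an unexpected allele count."""
    expected_alleles = 2 * count_markers(dat_lines)
    bad_rows = []
    for number, line in enumerate(ped_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # The first five columns are ids and sex; "A/C" counts as two
        # alleles, and a bare "A" counts as one.
        n_alleles = sum(len(token.split("/")) for token in fields[5:])
        if n_alleles != expected_alleles:
            bad_rows.append(number)
    return bad_rows


if __name__ == "__main__":
    dat = ["M marker1", "M marker2", "M marker3"]
    ped = [
        "FAM1001   ID1234  0   0   M   A A   A C   C C",
        "FAM1002   ID5678  0   0   F   A/C   C/C",
    ]
    print("rows with the wrong number of genotypes:", mismatched_rows(dat, ped))
```

If your data file also lists traits or covariates, the expected number of columns would have to account for those as well.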

Optional Phased Haplotypes

For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can often be imputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as the International HapMap Project and, eventually, the 1000 Genomes Project.

You can retrieve a current set of phased HapMap format haplotypes from http://hapmap.org/downloads/phasing/2007-08_rel22/phased/.

HapMap III phased haplotypes are in a different format, so you will need to use our converted haplotypes, available at http://csg.sph.umich.edu//yli/mach/download/HapMap3.r2.b36.html

Additional reference files (e.g., those based on data from the 1000 Genomes Project; combined reference files) can be found through links at http://csg.sph.umich.edu//yli/mach/download/

Phased haplotype information is encoded in two files. The first file (which MACH calls the "snp file") lists the markers in the phased haplotype. The second file (which MACH calls the "haplotype file") lists one haplotype per line. If you retrieved these files from the HapMap website, simply combine the --hapmapFormat option with the --snps option to indicate the name of the HapMap legend file and the --haps option to indicate the name of the file with phased haplotypes. Here is an example:

 prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...

If you don't use the --hapmapFormat option, MACH expects the snp file (indicated with the --snps option) to simply list one marker name per line and the haplotype files (indicated with the --haps option) to list one haplotype per line. Haplotypes can be prefaced by one or two optional labels followed by a series of single character alleles, one for each marker. Within each haplotype, spaces are ignored. Here are two examples:

  <Example of a snp list file>
  marker1
  marker2
  ...
  marker13
  <End of snp list file>

In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the snp list file, MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of characters on each line).

 <Example of a phased haplotype file>
 FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC
 FAMILY1->PERSON1 HAPLO2 CGGCGCGTCCAGC
 FAMILY2->PERSON1 HAPLO1 GGGCGCGCTTGGC
 FAMILY2->PERSON1 HAPLO2 GGAAGCACTCGGC
 ...
 <End of phased haplotype file>
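
One way to picture how the optional labels are skipped is to work backwards from the end of each line until the expected number of alleles has been collected. The Python sketch below does this for a single line; it is only one interpretation of the rules described above, not MACH's actual parser, and the function name extract_haplotype is hypothetical.

```python
def extract_haplotype(line, n_markers):
    """Return the trailing n_markers alleles of a haplotype line.

    Tokens are taken from the right so that leading labels are skipped and
    spaces inside the haplotype are ignored.
    """
    tokens = line.split()
    alleles = ""
    while tokens and len(alleles) < n_markers:
        alleles = tokens.pop() + alleles
    if len(alleles) != n_markers:
        raise ValueError(
            f"could not find exactly {n_markers} alleles at the end of the line"
        )
    return alleles


if __name__ == "__main__":
    # 13 markers, matching the snp list file in the example above.
    print(extract_haplotype("FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC", 13))
    # Spaces inside the haplotype are ignored.
    print(extract_haplotype("FAMILY1->PERSON1 HAPLO2 CGGCG CGTCC AGC", 13))
```

The number of markers comes from the snp list file, which is why the two files always need to be supplied together.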

If you provide MACH with a set of reference haplotypes as input, the marker order in the phased haplotypes overrides any marker order that may be specified in the pedigree and data files that contain the genotype data. This means that one convenient way to re-order markers in your original pedigree and data files is to simply create an empty haplotype file and a companion snp file that lists markers in the desired order. When you provide these two as input, they will override the marker order specified in the data file.
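
The re-ordering trick can be prepared with a short script. The Python sketch below writes a snp file in the desired order together with an empty haplotype file; the file names and the helper names snp_file_text and write_reorder_files are hypothetical, and the duplicate check is an extra precaution rather than a documented MACH requirement.

```python
from pathlib import Path


def snp_file_text(marker_order):
    """Return the contents of a snp file: one marker name per line."""
    if len(set(marker_order)) != len(marker_order):
        raise ValueError("each marker should be listed only once")
    return "".join(f"{name}\n" for name in marker_order)


def write_reorder_files(marker_order, snp_path, hap_path):
    """Write the snp file and an empty companion haplotype file."""
    Path(snp_path).write_text(snp_file_text(marker_order))
    Path(hap_path).write_text("")  # deliberately empty: only the order matters


if __name__ == "__main__":
    # Show the snp file contents without touching the disk.
    print(snp_file_text(["marker2", "marker1", "marker3"]), end="")
```

The two files would then be passed to MACH with the --snps and --haps options described above.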

Saving Disk Space

Useful Tip: You can usually economize disk space by using gzip to compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically recognize gzipped files and decompress them on the fly.
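
The gzip command line tool is the usual way to do this. If you prepare your input files from a script, the same result can be obtained with Python's standard gzip module, as in the sketch below. The helper name gzip_file is hypothetical, and unlike the gzip command in its default mode the sketch leaves the original file in place.

```python
import gzip
import shutil


def gzip_file(path):
    """Compress path to path + '.gz' and return the new file name."""
    target = f"{path}.gz"
    with open(path, "rb") as source, gzip.open(target, "wb") as destination:
        shutil.copyfileobj(source, destination)
    return target


if __name__ == "__main__":
    import os
    import tempfile

    # Demonstrate on a throwaway pedigree file in a temporary directory.
    with tempfile.TemporaryDirectory() as folder:
        pedigree = os.path.join(folder, "genotypes.ped")
        with open(pedigree, "w") as handle:
            handle.write("FAM1001 ID1234 0 0 M A/A A/C C/C\n")
        print(os.path.basename(gzip_file(pedigree)))
```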

That is all you should need to get started!