Changes

1,204 bytes added , 11:31, 2 February 2017

Line 1: Line 1: +

An earlier version of this page is available at http://csg.sph.umich.edu//abecasis/MaCH/tour/input_files.html.

+

MACH input files include information on experimental genotypes for a set of individuals and, optionally, on a set of known haplotypes. MACH can use these to estimate haplotypes for each sampled individual (conditional on the observed genotypes) or to fill in missing genotypes (conditional on observed genotypes at flanking markers and on the observed genotypes in other individuals). Since an essential first step in any analysis is to make sure data are formatted correctly, it is worthwhile to go over the input files MACH expects and their formats.

Line 4: Line 6:

The essential inputs for MACH are a set of observed genotypes for each individual being studied. Typically, MACH expects that all the markers being examined map to one chromosome and that they appear in map order in the input files. These requirements can be relaxed when using phased haplotypes as input (see below).

−

MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically linked: the data file describes the contents of the pedigree file (every pedigree file is slightly different) and the pedigree file itself can only be decoded with its companion data file. The two files can use either the more modern [[Merlin]] / [[QTDT]] format or the classic [[LINKAGE]] format. Detailed descriptions of each format are available elsewhere (for example, see [http://~~www~~.sph.umich.edu/~~csg~~/abecasis/Merlin/tour/input_files.html details of Merlin input formats]), and here we focus on providing an overview of the bare essentials required for using MACH.

+

MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically linked: the data file describes the contents of the pedigree file (every pedigree file is slightly different) and the pedigree file itself can only be decoded with its companion data file. The two files can use either the more modern [[Merlin]] / [[QTDT]] format or the classic [[LINKAGE]] format. Detailed descriptions of each format are available elsewhere (for example, see [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html details of Merlin input formats]), and here we focus on providing an overview of the bare essentials required for using MACH.

Data files can describe a variety of fields, including disease status information, quantitative traits and covariates, and marker genotypes. A simple MACH data file lists names for a series of genetic markers. Each marker name appears on its own line, prefaced by an "M" field code. Here is an example:

Line 21: Line 23:

...

'''<End of pedigree file>'''

+

Some people prefer to use a "/" to separate alleles, as it makes the pedigree easier to read. Thus, the following pedigree is equivalent:

+

'''<Example of a pedigree file with base-pair coded alleles>'''

+

FAM1001 ID1234 0 0 M A/A A/C C/C

+

FAM1002 ID5678 0 0 F A/C C/C G/G

+

...

+

'''<End of pedigree file>'''

+

Missing genotypes can be encoded with a '.' (dot) or a '0' (zero). For example, here are two individuals who are each missing the first genotype:

+

'''<Example of a pedigree file with base-pair coded alleles>'''

+

FAM1003 ID1234 0 0 M ./. A/C C/C

+

FAM1004 ID5678 0 0 F 0/0 C/C G/G

+

...

+

'''<End of pedigree file>'''

+

Although we don't recommend it, it is possible to use a pedigree file with numerically coded alleles. For an example, see [[MaCH: Pedigree with Integer Allele Codes|obsolete input formats]].
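The pedigree layout shown in the examples above can be illustrated with a short Python sketch. It is not part of MACH: the function name `parse_pedigree_line` and the `MISSING_CODES` set are hypothetical, and the sketch assumes the Merlin-style column order (family, individual, father, mother, sex, then one genotype per marker) and the "/"-separated allele form only.

```python
# Hypothetical helper, not part of MACH: decode one pedigree row whose
# genotypes use "/" between alleles, as in the examples above.
MISSING_CODES = {".", "0"}  # the two missing-allele codes described above


def parse_pedigree_line(line, marker_names):
    """Split one pedigree row into ID fields and per-marker genotypes."""
    fields = line.split()
    # Assumed column order: family, individual, father, mother, sex.
    family, person, father, mother, sex = fields[:5]
    genotype_fields = fields[5:]
    if len(genotype_fields) != len(marker_names):
        raise ValueError(
            f"expected {len(marker_names)} genotypes, found {len(genotype_fields)}"
        )
    genotypes = {}
    for name, text in zip(marker_names, genotype_fields):
        # Unpacking fails with ValueError if the field is not "allele/allele".
        allele1, allele2 = text.split("/")
        if allele1 in MISSING_CODES or allele2 in MISSING_CODES:
            genotypes[name] = None  # treat the whole genotype as missing
        else:
            genotypes[name] = (allele1, allele2)
    return {
        "family": family,
        "person": person,
        "father": father,
        "mother": mother,
        "sex": sex,
        "genotypes": genotypes,
    }
```

The marker names would normally come from the "M" lines of the companion data file, which is why the pedigree row cannot be decoded on its own.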

Line 36: Line 55:

== Optional Phased Haplotypes ==

−

For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can, often, be imputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as the International HapMap Project and, eventually, the 1000 Genomes Project. You can retrieve a current set of phased HapMap format haplotypes from ~~www.~~hapmap.org/downloads/phasing/.

+

For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can, often, be imputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as the International HapMap Project and, eventually, the 1000 Genomes Project.

+

You can retrieve a current set of phased HapMap format haplotypes from http://hapmap.org/downloads/phasing/2007-08_rel22/phased/.

+

HapMap III phased haplotypes are in a different format, so you will need to use our converted haplotypes, available at http://csg.sph.umich.edu//yli/mach/download/HapMap3.r2.b36.html

+

Additional reference files (e.g., those based on data from the 1000 Genomes Project; combined reference files) can be found through links at http://csg.sph.umich.edu//yli/mach/download/

Phased haplotype information is encoded in two files. The first file (which MACH calls the "snp file") lists the markers in the phased haplotypes. The second file (which MACH calls the "haplotype file") lists one haplotype per line. If you retrieved these files from the HapMap website, simply combine the --hapmapFormat option with the --snps option to indicate the name of the HapMap legend file and the --haps option to indicate the name of the file with phased haplotypes. Here is an example:

−

prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...

+

prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...

If you don't use the --hapmapFormat option, MACH expects the snp file (indicated with the --snps option) to simply list one marker name per line and the haplotype file (indicated with the --haps option) to list one haplotype per line. Haplotypes can be prefaced by one or two optional labels followed by a series of single-character alleles, one for each marker. Within each haplotype, spaces are ignored. Here are two examples:

−

+

'''<Example of a snp list file>'''

−

marker1

+

marker1

−

marker2

+

marker2

−

...

+

...

−

+

marker13

−

In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the ~~legend~~ file, MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of ~~digits~~ on each line). ~~Also note that the alleles A, C, G, and T have been recoded as digits 1, 2, 3, and 4~~.

+

'''<End of snp list file>'''

+

In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the snp list file, MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of characters on each line).

+

'''<Example of a phased haplotype file>'''

+

FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC

+

FAMILY1->PERSON1 HAPLO2 CGGCGCGTCCAGC

+

FAMILY2->PERSON1 HAPLO1 GGGCGCGCTTGGC

+

FAMILY2->PERSON1 HAPLO2 GGAAGCACTCGGC

+

...

+

'''<End of phased haplotype file>'''

−

~~<Example of a phased haplotype file>~~

−

~~FAMILY1->PERSON1 HAPLO1 2332323244332~~

−

~~FAMILY1->PERSON1 HAPLO2 2332323422132~~

−

~~FAMILY2->PERSON1 HAPLO1 3332323244332~~

−

~~FAMILY2->PERSON1 HAPLO2 3311321242332~~

−

~~...~~

−

~~<End of phased haplotype file>~~
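The relationship between the snp list file and the haplotype file can be sketched in Python. This is only an illustration of the rules described above (optional leading labels, spaces ignored, one allele character per marker) and is not MACH's own parser; the function names are hypothetical, and the sketch assumes the alleles are always the last characters on the line.

```python
# Hypothetical helpers, not part of MACH.

def read_snp_list(lines):
    """Return marker names from a snp list: one name per non-empty line."""
    return [line.strip() for line in lines if line.strip()]


def parse_haplotype_line(line, n_markers):
    """Return the allele string for one haplotype line.

    Assumption: after removing all whitespace, the final n_markers
    characters are the alleles and anything before them is label text.
    """
    compact = "".join(line.split())  # drop spaces between and within fields
    if len(compact) < n_markers:
        raise ValueError(
            f"line holds {len(compact)} characters, need at least {n_markers}"
        )
    return compact[-n_markers:]
```

With a 13-marker snp list, a line such as `FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC` yields the final 13 characters as alleles, whether or not the two labels are present.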

If you provide MACH with a set of reference haplotypes as input, the marker order in the phased haplotypes overrides any marker order that may be specified in the pedigree and data files that contain the genotype data. This means that one convenient way to re-order markers in your original pedigree and data files is to simply create an empty haplotype file and a companion snp file that lists markers in the desired order. When you provide these two as input, they'll override the marker order specified in the data file.
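The re-ordering trick described above could be prepared with a few lines of Python. This is a minimal sketch, not a MACH utility: the function name and the file names `order.snps` and `order.haps` are hypothetical, and the demonstration writes into a temporary directory only so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def write_marker_order_files(marker_names, snp_path, haps_path):
    """Write a snp list in the desired order plus an empty haplotype file."""
    # One marker name per line, as the snp list format requires.
    Path(snp_path).write_text("".join(f"{name}\n" for name in marker_names))
    # The haplotype file is deliberately empty: only the order matters here.
    Path(haps_path).write_text("")


# Small demonstration using hypothetical file names.
with tempfile.TemporaryDirectory() as workdir:
    snp_file = Path(workdir) / "order.snps"
    haps_file = Path(workdir) / "order.haps"
    write_marker_order_files(["marker2", "marker1", "marker3"], snp_file, haps_file)
    snp_text = snp_file.read_text()
    haps_text = haps_file.read_text()
```

The two files would then be supplied through the --snps and --haps options, alongside the usual pedigree and data files.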

+

== Saving Disk Space ==

−

Useful Tip: You can usually economize disk space by using gzip to compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically recognize gzipped files and decompress them on the fly.

+

'''Useful Tip:''' You can usually economize disk space by using gzip to compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically recognize gzipped files and decompress them on the fly.
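Any gzip tool will do for this. As one possible approach, the sketch below compresses an input file with Python's standard `gzip` module; the helper name `gzip_file` and the file name `sample.ped` are hypothetical, and the demonstration runs in a temporary directory.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(path):
    """Write a gzip-compressed copy of path next to it and return its path."""
    source = Path(path)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the bytes, no full read in memory
    return target


# Small demonstration: compress a one-line pedigree file and read it back.
with tempfile.TemporaryDirectory() as workdir:
    ped = Path(workdir) / "sample.ped"
    ped.write_text("FAM1001 ID1234 0 0 M A/A A/C C/C\n")
    compressed = gzip_file(ped)
    compressed_name = compressed.name
    with gzip.open(compressed, "rt") as handle:
        round_trip = handle.read()
```

The helper leaves the uncompressed original in place, so it can be deleted once the compressed copy has been checked.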

That is all you should need to get started!

Ppwhite

96

edits


MaCH: Input Files (view source)

Revision as of 11:31, 2 February 2017
