Changes

From Genome Analysis Wiki
1,204 bytes added ,  11:31, 2 February 2017
Line 1: Line 1:  +
An earlier version of this page is available at http://csg.sph.umich.edu//abecasis/MaCH/tour/input_files.html.
 +
 
MACH input files include information on experimental genotypes for a set of individuals and, optionally, on a set of known haplotypes. MACH can use these to estimate haplotypes for each sampled individual (conditional on the observed genotypes) or to fill in missing genotypes (conditional on observed genotypes at flanking markers and on the observed genotypes in other individuals). Since an essential first step in any analysis is to make sure the data are formatted correctly, it is worthwhile to go over the input files MACH expects and their formats.
   Line 4: Line 6:  
The essential inputs for MACH are a set of observed genotypes for each individual being studied. Typically, MACH expects that all the markers being examined map to one chromosome and that they appear in map order in the input files. These requirements can be relaxed when using phased haplotypes as input (see below).
   −
MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically linked: the data file describes the contents of the pedigree file (every pedigree file is slightly different) and the pedigree file itself can only be decoded with its companion data file. The two files can use either the more modern [[Merlin]] / [[QTDT]] format or the classic [[LINKAGE]] format. Detailed descriptions of each format are available elsewhere (for example, see [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html details of Merlin input formats]), and here we focus on providing an overview of the bare essentials required for using MACH.
+
MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically linked: the data file describes the contents of the pedigree file (every pedigree file is slightly different) and the pedigree file itself can only be decoded with its companion data file. The two files can use either the more modern [[Merlin]] / [[QTDT]] format or the classic [[LINKAGE]] format. Detailed descriptions of each format are available elsewhere (for example, see [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html details of Merlin input formats]), and here we focus on providing an overview of the bare essentials required for using MACH.
    
Data files can describe a variety of fields, including disease status information, quantitative traits and covariates, and marker genotypes. A simple MACH data file just lists names for a series of genetic markers. Each marker name appears on its own line, prefaced by an "M" field code. Here is an example:
Line 21: Line 23:  
   ...
 
 
   '''<End of pedigree file>'''
 
 +
 +
Some people prefer to use a "/" to separate alleles, as it makes the pedigree easier to read. Thus, the following pedigree is equivalent:
 +
 +
  '''<Example of a pedigree file with base-pair coded alleles>'''
 +
  FAM1001  ID1234  0  0  M  A/A  A/C  C/C
 +
  FAM1002  ID5678  0  0  F  A/C  C/C  G/G
 +
  ...
 +
  '''<End of pedigree file>'''
 +
 +
Missing genotypes can be encoded with a '.' (dot) or a '0' (zero). For example, here are two individuals that are missing the first genotype:
 +
 +
  '''<Example of a pedigree file with base-pair coded alleles>'''
 +
  FAM1003  ID1234  0  0  M  ./.  A/C  C/C
 +
  FAM1004  ID5678  0  0  F  0/0  C/C  G/G
 +
  ...
 +
  '''<End of pedigree file>'''
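
As a rough illustration of the slash-separated layout and the two missing-genotype codes shown above, here is a minimal Python sketch that reads one pedigree line. It is not part of MACH: the function name <code>parse_pedigree_line</code> is hypothetical, it assumes the five leading columns seen in the examples (family, individual, father, mother, sex), and it assumes a genotype counts as missing if either allele is '.' or '0'.

```python
MISSING_CODES = {".", "0"}  # the two missing-allele codes described above


def parse_pedigree_line(line):
    """Split one whitespace-delimited pedigree line into fields and genotypes.

    Assumes "/"-separated alleles, as in the examples above.
    """
    fields = line.split()
    family, person, father, mother, sex = fields[:5]
    genotypes = []
    for token in fields[5:]:
        allele1, allele2 = token.split("/")
        # Assumption: a genotype is missing if either allele is missing.
        if allele1 in MISSING_CODES or allele2 in MISSING_CODES:
            genotypes.append(None)
        else:
            genotypes.append((allele1, allele2))
    return {
        "family": family,
        "person": person,
        "father": father,
        "mother": mother,
        "sex": sex,
        "genotypes": genotypes,
    }


record = parse_pedigree_line("FAM1003  ID1234  0  0  M  ./.  A/C  C/C")
print(record["genotypes"])
```

A check of this kind can be useful for catching malformed genotype columns before running an analysis.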
 +
    
Although we don't recommend it, it is possible to use a pedigree file with numerically coded alleles. For an example, see [[MaCH: Pedigree with Integer Allele Codes|obsolete input formats]].
 
Line 36: Line 55:  
== Optional Phased Haplotypes ==
 
   −
For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can often be imputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as the International HapMap Project and, eventually, the 1000 Genomes Project. You can retrieve a current set of phased HapMap format haplotypes from www.hapmap.org/downloads/phasing/.
+
For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can often be imputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as the International HapMap Project and, eventually, the 1000 Genomes Project.
 +
 
 +
You can retrieve a current set of phased HapMap format haplotypes from http://hapmap.org/downloads/phasing/2007-08_rel22/phased/.
 +
 
 +
HapMap III phased haplotypes are in a different format; you will need to use our converted haplotypes, available at http://csg.sph.umich.edu//yli/mach/download/HapMap3.r2.b36.html
 +
 
 +
Additional reference files (e.g., those based on data from the 1000 Genomes Project; combined reference files) can be found through links at http://csg.sph.umich.edu//yli/mach/download/
    
Phased haplotype information is encoded in two files. The first file (which MACH calls the "snp file") lists the markers in the phased haplotype. The second file (which MACH calls the "haplotype file") lists one haplotype per line. If you retrieved these files from the HapMap website, simply combine the --hapmapFormat option with the --snps option to indicate the name of the HapMap legend file and the --haps option to indicate the name of the file with phased haplotypes. Here is an example:
   −
prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...
+
  prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...
    
If you don't use the --hapmapFormat option, MACH expects the snp file (indicated with the --snps option) to simply list one marker name per line and the haplotype file (indicated with the --haps option) to list one haplotype per line. Haplotypes can be prefaced by one or two optional labels followed by a series of single-character alleles, one for each marker. Within each haplotype, spaces are ignored. Here are two examples:
   −
<Example of a snp list file>
+
  '''<Example of a snp list file>'''
marker1
+
  marker1
marker2
+
  marker2
...
+
  ...
<End of snp list file>
+
  marker13
In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the legend file, MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of digits on each line). Also note that the alleles A, C, G, and T have been recoded as digits 1, 2, 3, and 4.
+
  '''<End of snp list file>'''
 +
 
 +
In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the snp list file, MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of characters on each line).  
 +
 
 +
  '''<Example of a phased haplotype file>'''
 +
  FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC
 +
  FAMILY1->PERSON1 HAPLO2 CGGCGCGTCCAGC
 +
  FAMILY2->PERSON1 HAPLO1 GGGCGCGCTTGGC
 +
  FAMILY2->PERSON1 HAPLO2 GGAAGCACTCGGC
 +
  ...
 +
  '''<End of phased haplotype file>'''
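
To make the label-skipping rule concrete, here is a minimal Python sketch of one way a haplotype line could be read once the number of markers is known from the snp list file. It is only an illustration under stated assumptions, not MACH's actual parser: the function name <code>parse_haplotype_line</code> is hypothetical, and it assumes that zero, one, or two leading whitespace-delimited labels are tried in turn until the remaining characters (with spaces removed) match the marker count.

```python
def parse_haplotype_line(line, n_markers):
    """Return (labels, alleles) for one line of a phased haplotype file.

    n_markers is the number of marker names listed in the snp file.
    """
    tokens = line.split()
    # Try zero, then one, then two leading labels (an assumed strategy).
    for n_labels in range(3):
        # Spaces inside the haplotype are ignored, so join what is left.
        alleles = "".join(tokens[n_labels:])
        if len(alleles) == n_markers:
            return tokens[:n_labels], alleles
    raise ValueError(f"cannot find {n_markers} alleles in line: {line!r}")


labels, alleles = parse_haplotype_line(
    "FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC", n_markers=13
)
print(labels, alleles)
```

The same function accepts an unlabeled haplotype written with internal spaces, such as <code>CGGC GCGC TTGGC</code>, because the spaces are dropped before the length check.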
   −
<Example of a phased haplotype file>
  −
FAMILY1->PERSON1 HAPLO1 2332323244332
  −
FAMILY1->PERSON1 HAPLO2 2332323422132
  −
FAMILY2->PERSON1 HAPLO1 3332323244332
  −
FAMILY2->PERSON1 HAPLO2 3311321242332
  −
...
  −
<End of phased haplotype file>
   
If you provide MACH with a set of reference haplotypes as input, the marker order in the phased haplotypes overrides any marker order that may be specified in the pedigree and data files that contain the genotype data. This means that one convenient way to re-order markers in your original pedigree and data file is to simply create an empty haplotype file and a companion snp file that lists markers in the desired order. When you provide these two as input, they override the marker order specified in the data file.
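
The re-ordering trick described above can be sketched in a few lines of Python. This is a hedged illustration only: the function name <code>write_reorder_files</code> and the file names in the usage line are hypothetical, and the sketch simply writes one marker name per line to the snp file and leaves the haplotype file empty.

```python
from pathlib import Path


def write_reorder_files(marker_order, snp_path, haps_path):
    """Write a snp file in the desired order plus an empty haplotype file."""
    # One marker name per line, in the order MACH should use.
    Path(snp_path).write_text("".join(f"{name}\n" for name in marker_order))
    # The haplotype file is intentionally left empty.
    Path(haps_path).write_text("")


# Hypothetical file names, chosen for illustration only.
write_reorder_files(["marker2", "marker1", "marker3"], "reorder.snps", "reorder.haps")
```

The two files would then be passed to MACH with the --snps and --haps options alongside the usual pedigree and data files.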
    +
== Saving Disk Space ==
   −
Useful Tip: You can usually economize disk space by using gzip to compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically recognize gzipped files and decompress them on the fly.
+
'''Useful Tip:''' You can usually economize disk space by using gzip to compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically recognize gzipped files and decompress them on the fly.
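
If you would rather compress files from a script than from the command line, the following minimal Python sketch uses only the standard library. The helper name <code>gzip_file</code> and the example file name are hypothetical, and the sketch assumes that writing a standard <code>.gz</code> file next to the original is what you want.

```python
import gzip
import shutil


def gzip_file(path):
    """Compress path into path + '.gz' and return the new file name."""
    gz_path = f"{path}.gz"
    # Stream the copy so large genotype files are not loaded into memory.
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


# Example (hypothetical file name): gzip_file("sample.ped")
```

The original file is left in place, so you can delete it once you have confirmed that the compressed copy is readable.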
    
That is all you should need to get started!
 