Changes

From Genome Analysis Wiki
no edit summary
Line 1: Line 1: −
This page documents how to impute 1000 Genome SNPs using MaCH.  
+
This page documents how to impute 1000 Genome SNPs using MaCH. It will almost always be more efficient to use [[Minimac]] to carry out imputation using large reference panels, such as the 1000 Genomes Project data. So, you are probably better off looking at the [[Minimac: 1000 Genomes Imputation Cookbook]].
    
== Getting Started ==
 
== Getting Started ==
Line 9: Line 9:  
Within each file, markers should be stored by chromosome position. Alleles should be stored in the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele).  
 
Within each file, markers should be stored by chromosome position. Alleles should be stored in the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele).  
   −
The 1000 Genome pilot project genotypes use NCBI Build 36.
+
The 1000 Genome pilot project genotypes use NCBI Build 36; more recent releases use NCBI Build 37.
    
=== Reference Haplotypes ===
 
=== Reference Haplotypes ===
   −
Reference haplotypes generated by the 1000 Genomes project and formatted so that they are ready for analysis are available from the [http://www.sph.umich.edu/csg/abecasis/MaCH/download/1000G-2010-03.html MaCH download page]. The most recent set of haplotypes were generated in March 2010 by combining genotype calls generated at the Broad, Sanger and the University of Michigan. In our hands, this March 2010 release is substantially better than previous 1000 Genome Project genotype call sets.
+
Reference haplotypes generated by the 1000 Genomes project and formatted so that they are ready for analysis are available from the [http://www.sph.umich.edu/csg/abecasis/MaCH/download/ MaCH download page]. The most recent set of haplotypes is usually available from the MaCH download page. Improvements in genotype calling and haplotyping methods, together with generation of progressively larger amounts of sequence data mean that, typically, it is better to use the most recent set of 1000 Genomes Project Haplotypes.
    
== Estimating Model Parameters ==
 
== Estimating Model Parameters ==
Line 33: Line 33:  
Optionally, you can add the <code>--compact</code> option to the end of the command line to reduce memory use.  
 
Optionally, you can add the <code>--compact</code> option to the end of the command line to reduce memory use.  
   −
If you'd like to play it safe and sample 200 individuals at random, you could use [[PedScript] and a series of command similar to this one:
+
If you'd like to play it safe and sample 200 individuals at random, you could use [[PedScript]] and a series of commands similar to this one:
    
<source lang="text">
 
<source lang="text">
Line 46: Line 46:  
   SAMPLE 200 PERSONS TO chr1.rand200.ped
 
   SAMPLE 200 PERSONS TO chr1.rand200.ped
   −
    QUIT
+
  QUIT
    EOF
+
  EOF
 
</source>
 
</source>
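If PedScript is not at hand, the same random subset could be drawn with a few lines of Python. This is only a sketch and not part of MaCH or PedScript: it assumes the pedigree file stores one individual per line, and the function name <code>sample_individuals</code> and the input file name <code>chr1.ped</code> in the usage comment are hypothetical.

```python
import random


def sample_individuals(ped_lines, n=200, seed=None):
    """Return n randomly chosen pedigree lines, kept in their original order.

    Assumes each non-blank line describes exactly one individual.
    """
    rng = random.Random(seed)
    records = [line for line in ped_lines if line.strip()]
    if len(records) <= n:
        # Fewer individuals than requested: keep everyone.
        return records
    chosen = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in chosen]


# Example usage (file names are illustrative):
# with open("chr1.ped") as src, open("chr1.rand200.ped", "w") as dst:
#     dst.writelines(sample_individuals(src, n=200))
```

Sampling is done without replacement, so no individual appears twice, and passing a fixed <code>seed</code> makes the subset reproducible across chromosomes.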
   Line 99: Line 99:  
== Quality Filtering ==
 
== Quality Filtering ==
   −
For 1000 Genome SNPs, we currently recommend that any markers with estimated r<sup>2</sup> of <0.5 should be treated with caution. This is a bit more conservative than the threshold of 0.3 we recommend for HapMap; but 1000 Genome SNP genotypes are also (as of March 2010) of lower quality.
+
For the original 1000 Genome SNP sets, we typically recommended that any markers with estimated r<sup>2</sup> of <0.5 should be treated with caution. This is a bit more conservative than the threshold of 0.3 we recommended for HapMap; but early 1000 Genome Project haplotype sets were also (as of June 2010) of lower quality. In current iterations of the project data, it should be safe to use a less conservative r<sup>2</sup> cut-off.
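As a rough illustration of applying such a cut-off, the sketch below splits markers into those at or above an r<sup>2</sup> threshold and those below it. It assumes a whitespace-delimited marker information file whose header row contains columns named <code>SNP</code> and <code>Rsq</code>; check the header of your own output file, since those column names and the function name <code>filter_by_rsq</code> are assumptions made for this example.

```python
def filter_by_rsq(info_lines, threshold=0.5, snp_col="SNP", rsq_col="Rsq"):
    """Split marker names into (kept, flagged) by estimated r^2.

    Markers with r^2 below the threshold are flagged for caution.
    Column names are assumptions; adjust them to match the file header.
    """
    rows = iter(info_lines)
    header = next(rows).split()
    snp_idx = header.index(snp_col)
    rsq_idx = header.index(rsq_col)
    kept, flagged = [], []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rsq = float(fields[rsq_idx])
        (kept if rsq >= threshold else flagged).append(fields[snp_idx])
    return kept, flagged
```

The threshold is a parameter so that a less conservative value can be supplied for more recent reference panels.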
