Changes

457 bytes added , 03:08, 28 July 2011

no edit summary

Line 1: Line 1:

−

This page documents how to impute 1000 Genome SNPs using MaCH.

+

This page documents how to impute 1000 Genome SNPs using MaCH. It will almost always be more efficient to use [[Minimac]] to carry out imputation using large reference panels, such as the 1000 Genomes Project data. So, you are probably better off looking at the [[Minimac: 1000 Genomes Imputation Cookbook]].

== Getting Started ==

Line 9: Line 9:

Within each file, markers should be stored by chromosome position. Alleles should be stored in the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele).

−

The 1000 Genome pilot project genotypes use NCBI Build 36.

+

The 1000 Genome pilot project genotypes use NCBI Build 36; more recent releases use NCBI Build 37.
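The marker ordering and allele encoding requirements described above can be checked before running an analysis. The following is a minimal sketch, not part of MaCH: the function name <code>check_markers</code> and the tuple layout (name, position, allele 1, allele 2) are assumptions made for illustration, and the check cannot tell whether alleles are actually on the forward strand or which genome build the positions refer to.

```python
VALID_ALLELES = {"A", "C", "G", "T"}


def check_markers(markers):
    """Return a list of problems found in one chromosome's marker list.

    `markers` is a hypothetical list of (name, position, allele1, allele2)
    tuples; this layout is an assumption, not a MaCH file format.
    """
    problems = []
    previous_position = None
    for name, position, allele1, allele2 in markers:
        # Markers should be stored in order of chromosome position.
        if previous_position is not None and position < previous_position:
            problems.append(f"{name}: position {position} is out of order")
        # Alleles should be encoded as one of 'A', 'C', 'G' or 'T'.
        for allele in (allele1, allele2):
            if allele not in VALID_ALLELES:
                problems.append(f"{name}: unexpected allele {allele!r}")
        previous_position = position
    return problems
```

An empty result only means that the ordering and encoding look plausible; strand and build still need to be confirmed against the reference panel.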

=== Reference Haplotypes ===

−

Reference haplotypes generated by the 1000 Genomes project and formatted so that they are ready for analysis are available from the [http://www.sph.umich.edu/csg/abecasis/MaCH/download/1000G-2010-03.html MaCH download page]. The most recent set of haplotypes were generated in March 2010 by combining genotype calls generated at the Broad, Sanger and the University of Michigan. In our hands, this March 2010 release is substantially better than previous 1000 Genome Project genotype call sets.

+

Reference haplotypes generated by the 1000 Genomes project and formatted so that they are ready for analysis are available from the [http://www.sph.umich.edu/csg/abecasis/MaCH/download/ MaCH download page]. The most recent set of haplotypes is usually available from the MaCH download page. Improvements in genotype calling and haplotyping methods, together with the generation of progressively larger amounts of sequence data, mean that it is typically better to use the most recent set of 1000 Genomes Project haplotypes.

== Estimating Model Parameters ==

Line 33: Line 33:

Optionally, you can add the <code>--compact</code> option to the end of the command line to reduce memory use.

−

If you'd like to play it safe and sample 200 individuals at random, you could use [[PedScript] and a series of commands similar to this one:

+

If you'd like to play it safe and sample 200 individuals at random, you could use [[PedScript]] and a series of commands similar to this one:

Line 46: Line 46:

SAMPLE 200 PERSONS TO chr1.rand200.ped

−

QUIT

+

QUIT

−

EOF

+

EOF

</source>
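For readers without PedScript, the random sampling step can be sketched in a few lines of Python. This is an illustrative alternative, not the method used on this page: it assumes the pedigree file stores one individual per line, and the function name <code>sample_pedigree_lines</code> and its parameters are hypothetical.

```python
import random


def sample_pedigree_lines(lines, sample_size=200, seed=None):
    """Pick `sample_size` individuals at random, keeping file order.

    Assumes each element of `lines` describes exactly one individual
    (an assumption about the pedigree file layout).
    """
    rng = random.Random(seed)
    if len(lines) <= sample_size:
        # Fewer individuals than requested: keep everyone.
        return list(lines)
    # Sample indices without replacement, then restore the original order.
    chosen = sorted(rng.sample(range(len(lines)), sample_size))
    return [lines[i] for i in chosen]
```

Passing a fixed <code>seed</code> makes the selection reproducible, which helps if the same subset is needed for every chromosome.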

Line 99: Line 99:

== Quality Filtering ==

−

For 1000 Genome SNPs, we currently recommend that any markers with estimated r2 of <0.5 should be treated with caution. This is a bit more conservative than the threshold of 0.3 we recommend for HapMap; but 1000 Genome SNP genotypes are also (as of March 2010) of lower quality.

+

For the original 1000 Genome SNP sets, we typically recommended that any markers with estimated r2 of <0.5 should be treated with caution. This is a bit more conservative than the threshold of 0.3 we recommended for HapMap; but early 1000 Genome Project haplotype sets were also (as of June 2010) of lower quality. In current iterations of the project data, it should be safe to use a less conservative r2 cut-off.
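Applying an r2 cut-off is a simple filter over the per-marker quality estimates. The sketch below assumes those estimates are in a whitespace-delimited table whose header row contains columns named <code>SNP</code> and <code>Rsq</code>; both the column names and the function name <code>filter_by_rsq</code> are assumptions, so check them against the actual output files before use.

```python
def filter_by_rsq(info_lines, threshold=0.5):
    """Return the names of markers whose estimated r2 is >= `threshold`.

    `info_lines` is a list of text lines: a header row followed by one
    row per marker. The "SNP" and "Rsq" column names are assumptions.
    """
    header = info_lines[0].split()
    snp_column = header.index("SNP")
    rsq_column = header.index("Rsq")
    kept = []
    for line in info_lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if float(fields[rsq_column]) >= threshold:
            kept.append(fields[snp_column])
    return kept
```

Calling the function with <code>threshold=0.3</code> applies the less conservative cut-off mentioned above for HapMap.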

Goncalo

MaCH: 1000 Genomes Imputation Cookbook (view source)

Revision as of 03:08, 28 July 2011
