Changes

2,443 bytes added , 11:22, 2 February 2017

→‎How do I get imputation quality estimates?

Line 2: Line 2:

=== minimac ===

−

See [http://genome.sph.umich.edu/wiki/Minimac minimac] for details.

+

This is the new two-step procedure we recommend, particularly for people who perform imputation multiple times (using HapMap as reference, or using updated releases of the 1000 Genomes data as reference).

+

The first step is a pre-phasing step using MaCH. This step does not need an external reference. It is time-consuming, but it is a one-time investment. For computational reasons, we recommend breaking the genome into small overlapping segments ([http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]) for this step. In general, we recommend an overlapping region of >500Kb on each side. For example, for the Affymetrix 6.0 panel, a core region of 10Mb with a flanking/overlapping region of 1Mb on each side corresponds to ~3500 SNPs in the core region and ~350 SNPs on each side. For 2000 individuals, one such job with ~4,200 SNPs, run with --states 200 and -r 50, would take ~40 hours. For other combinations, use the following link to estimate computing time: [http://csg.sph.umich.edu//yli/MaCH-Admix/runtime.php#est runtime estimate].
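As a rough illustration of the segment layout described above, the following bash sketch lists core and flanking boundaries for one chromosome. The function name make_segments and the 25 Mb toy chromosome length are assumptions made for this example; they are not part of MaCH.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MaCH): print one line per segment as
#   core_start core_end segment_start segment_end
# Positions are 1-based, inclusive, in base pairs.
make_segments() {
    local chr_len=$1 core=$2 flank=$3
    local start=1 end seg_start seg_end
    while [ "$start" -le "$chr_len" ]; do
        # Core region: at most $core bases, clipped at the chromosome end
        end=$(( start + core - 1 ))
        if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi
        # Segment = core plus a flank on each side, clipped to the chromosome
        seg_start=$(( start - flank ))
        if [ "$seg_start" -lt 1 ]; then seg_start=1; fi
        seg_end=$(( end + flank ))
        if [ "$seg_end" -gt "$chr_len" ]; then seg_end=$chr_len; fi
        echo "$start $end $seg_start $seg_end"
        start=$(( end + 1 ))
    done
}

# Toy example: 25 Mb chromosome, 10 Mb core, 1 Mb flank on each side
make_segments 25000000 10000000 1000000
```

Each segment would be phased on its full range, while only results from the core range would be kept when segments are stitched back together.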

+

The second step is the actual imputation step using minimac. This step can run on whole chromosomes. Regarding computing time, one million markers for 1000 individuals using 100 reference haplotypes takes ~1 hour, and computing time increases linearly with each of these three parameters. See [http://genome.sph.umich.edu/wiki/Minimac minimac] for details.
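The linear scaling above can be turned into a back-of-the-envelope estimate. This sketch only rescales the ~1 hour baseline quoted above; the function name minimac_hours is hypothetical, and actual run time will depend on hardware and program version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough minimac run time in hours, assuming
# ~1 hour for 1,000,000 markers x 1000 individuals x 100 reference
# haplotypes, with linear scaling in each of the three quantities.
minimac_hours() {
    awk -v m="$1" -v n="$2" -v h="$3" \
        'BEGIN { printf "%.2f\n", (m / 1e6) * (n / 1000) * (h / 100) }'
}

# Example: 2,000,000 markers, 2000 individuals, 200 reference haplotypes
minimac_hours 2000000 2000 200
```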

+

=== MaCH-Admix ===

+

If you are doing imputation only a few (<5) times (think twice about whether this is really true) or are under immediate time pressure, you can use MaCH-Admix, which does not require pre-phased data and takes ~1/7 of the computing time typically needed for pre-phasing. For large datasets, we recommend breaking the genome into small overlapping segments ([http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]). For details, see [http://www.unc.edu/~yunmli/MaCH-Admix/ MaCH-Admix].

=== Divide and Conquer ===

Line 37: Line 44:

== Where can I find combined HapMap reference files? ==

−

You can find them at http://~~www~~.sph.umich.edu/~~csg~~/yli/mach/download/HapMap-r21.html or on the HapMap Project website.

+

You can find them at http://csg.sph.umich.edu//yli/mach/download/HapMap-r21.html or on the HapMap Project website.

== Where can I find HapMap III / 1000 Genomes reference files? ==

−

You can find these at the MaCH download page, which is at http://~~www~~.sph.umich.edu/~~csg~~/yli/mach/download/

+

You can find these at the MaCH download page, which is at http://csg.sph.umich.edu//yli/mach/download/

== Does --mle overwrite input genotypes? ==

Line 55: Line 62:

Estimated per allele error rate is 0.0293

−

A better approach is to mask a small proportion of SNPs (vs. genotypes in the above simple approach). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2 without duplicating the .ped file. Post-imputation, one can use   [http://~~www~~.sph.umich.edu/~~csg/ylwtx~~/CalcMatch~~.1.0.5.tgz~~ CalcMatch ]and [http://~~www~~.sph.umich.edu/~~csg~~/ylwtx/doseR2.tgz doseR2.pl ]to estimate genotypic/allelic error rate and correlation respectively. Both programs can be downloaded from [http://~~www~~.sph.umich.edu/~~csg~~/ylwtx/software.html http://~~www~~.sph.umich.edu/~~csg~~/ylwtx/software.html].

+

A better approach is to mask a small proportion of SNPs (vs. genotypes in the above simple approach). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2, without duplicating the .ped file. Post-imputation, one can use [http://genome.sph.umich.edu/wiki/CalcMatch CalcMatch] and [http://csg.sph.umich.edu//ylwtx/doseR2.tgz doseR2.pl] to estimate genotypic/allelic error rate and correlation, respectively. Both programs can be downloaded from [http://csg.sph.umich.edu//ylwtx/software.html http://csg.sph.umich.edu//ylwtx/software.html].

'''Warning''': Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.
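One possible way to build such a mask.dat is sketched below. It assumes a whitespace-delimited .dat file whose first column is the flag (M for markers), and it masks every n-th marker; the function name mask_every and the every-n-th selection rule are choices made for this example only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a .dat file on stdin and change the flag of
# every n-th marker line from M to S2. Non-marker lines pass through
# unchanged. Assumes the flag is the first whitespace-delimited column.
mask_every() {
    awk -v n="$1" '$1 == "M" { c++; if (c % n == 0) $1 = "S2" } { print }'
}

# Usage (mask 1 marker in 20):
#   mask_every 20 < orig.dat > mask.dat
```

The original .ped file is then used together with mask.dat, and the imputed genotypes at the S2 markers can be compared against the observed ones.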

Line 62: Line 69:

In the simple approach, you will only get concordance/error estimates. There are two aspects to check: (1) The ratio between the genotypic error and the allelic error. We expect only a small proportion of errors to be cases where one homozygote is imputed as the other homozygote, so a ~2:1 ratio is expected. (2) The absolute error rate. Several factors influence imputation quality, including the population to be imputed, the reference population, and the genotyping panel used. Typically, we expect <2% allelic error rate among Caucasians and East Asians, and 3-5% among Africans and African Americans. The figure below shows imputation quality from the Human Genome Diversity Project (HGDP) for 52 populations across the world, by HapMap reference panel.

−

http://~~www~~.sph.umich.edu/~~csg~~/yli/figure3.gif

+

http://csg.sph.umich.edu//yli/figure3.gif

Table 3 in the MaCH 1.0 paper tabulates imputation quality by commercial panel in CEU, YRI, and CHB+JPT.

Line 70: Line 77:

We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters, including HWE, MAF (recommended cutoff is 1% for genotyping-based GWAS), completeness, Mendelian inconsistency, etc. Post-imputation, we recommend an Rsq cutoff of 0.3 (which removes >70% of poorly-imputed SNPs at the cost of <0.5% of well-imputed SNPs) and a MAF cutoff of 1%.
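The post-imputation filter could be applied along these lines. This is only a sketch: it assumes a whitespace-delimited per-SNP info file with a header line containing columns named Rsq and MAF, and the function name filter_info is hypothetical. Check the header of your own output file before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the header plus SNPs with Rsq >= rsq_min and
# MAF >= maf_min. Columns are located by header name (assumed to be
# "Rsq" and "MAF"), so column order does not matter.
filter_info() {
    awk -v rsq_min="${1:-0.3}" -v maf_min="${2:-0.01}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Rsq" in col) || !("MAF" in col)) {
                print "missing Rsq or MAF column" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        $(col["Rsq"]) >= rsq_min && $(col["MAF"]) >= maf_min
    '
}

# Usage with the recommended cutoffs (file names are placeholders):
#   filter_info 0.3 0.01 < imputed.info > imputed.qc.info
```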

−

== How do I get reference files for an region of interest? ==

+

== How do I get reference files for a region of interest? ==

−

1. For HapMapII format, download haplotypes from http://~~www~~.sph.umich.edu/~~csg~~/ylwtx/HapMapForMach.tgz

+

Note that you do not need to extract regional pedigree files for your own samples because SNPs in pedigree but not in reference will be automatically discarded.

1. For HapMapII format, download haplotypes from http://csg.sph.umich.edu//ylwtx/HapMapForMach.tgz

2. For MACH format, you can do the following:

−

2. For MACH format, you can do the following:

*First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position.

−

*Then:

+

*Then, under csh:

+

@ first = `grep -nw rsFIRST orig.snps | cut -f1 -d ':'`

+

@ last = `grep -nw rsLAST orig.snps | cut -f1 -d ':'`

+

under bash:

+

first=`grep -nw rsFIRST orig.snps | cut -f1 -d ':'`

+

last=`grep -nw rsLAST orig.snps | cut -f1 -d ':'`

+

*Then find out the field that contains the actual haplotypes, where alleles are separated by whitespace

+

head -1 orig.hap | wc -w

+

Note: if the haplotypes are gz compressed, do:

+

zcat orig.hap.gz | head -1 | wc -w

−

~~@ first = `grep~~ -~~n rsFIRST orig~~.~~snps | cut -f1 -d ':'`~~

+

* Finally, say you got 3 from the above wc -w command (if you got a different number, replace the 3 in bold below with the number you got):

−

~~@ last = `grep -n rsLAST orig.snps | cut -f1 -d '~~:'`

−

*Finally (assuming the third field contains the actual haplotypes, where alleles are separated by whitespace):

+

awk '{print $'''3'''}' orig.hap | cut -c${first}-${last} > region.hap

−

awk '{print $3}' ~~orig.hap~~ | cut -c${first}-${last} > region.hap

+

Note: if the haplotypes are gz compressed, do:

+

zcat orig.hap.gz | awk '{print $'''3'''}' | cut -c${first}-${last} > region.hap

The created reference files are in MaCH format. You do NOT need to turn on --hapmapFormat option.
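For convenience, the bash steps above can be wrapped in one function. This is a sketch under assumptions: the function name extract_region and the output prefix are hypothetical, haplotypes are assumed to have one character per marker with no separators (as the cut -c step implies), gzip -dc is used in place of zcat, and writing a matching region.snps file with sed is an addition for this example.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the steps above.
# Usage: extract_region SNPS HAP FIRST_RS LAST_RS FIELD OUT_PREFIX
#   FIELD is the number reported by "head -1 orig.hap | wc -w".
extract_region() {
    local snps=$1 hap=$2 first_rs=$3 last_rs=$4 field=$5 out=$6
    local first last
    # Line numbers of the first and last SNP in the marker list
    first=$(grep -nw "$first_rs" "$snps" | head -1 | cut -f1 -d ':')
    last=$(grep -nw "$last_rs" "$snps" | head -1 | cut -f1 -d ':')
    if [ -z "$first" ] || [ -z "$last" ] || [ "$first" -gt "$last" ]; then
        echo "could not locate $first_rs..$last_rs in $snps" >&2
        return 1
    fi
    # Read plain or gzip-compressed haplotypes, keep the haplotype field,
    # then cut out the characters for markers first..last
    case "$hap" in
        *.gz) gzip -dc "$hap" ;;
        *)    cat "$hap" ;;
    esac | awk -v f="$field" '{ print $f }' | cut -c"${first}-${last}" > "$out.hap"
    # Matching marker list for the region (assumption: one SNP per line)
    sed -n "${first},${last}p" "$snps" > "$out.snps"
}

# Example call (file names as in the text above):
#   extract_region orig.snps orig.hap rsFIRST rsLAST 3 region
```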

Line 89: Line 105:

== Do I always have to sort the pedigree file by marker position? ==

−

If you use a reference set of haplotypes, you do not have to as long as the external reference is in correct ~~order. **HOWEVER**, you will probably avoid problems by including markers in the pedigree file sorted in chromosome~~ order.

+

If you use a reference set of haplotypes, you do not have to as long as the external reference is in correct order.

== What if I specify ''--states R'' where ''R'' exceeds the maximum possible (2*number diploid individuals - 2 + number_haplotypes)? ==

Line 199: Line 215:

New executables mach1 and thunder will then be generated under folder executables/

−

== Install ~~MaCh~~ ==

+

== Install MaCH ==

−

We have source codes available through the MaCH download page: http://~~www~~.sph.umich.edu/~~csg~~/yli/mach/download/

+

Source code is available through the MaCH download page: http://csg.sph.umich.edu//yli/mach/download/

== More questions? ==

Email [mailto:yunli@med.unc.edu Yun Li] or [mailto:goncalo@umich.edu Goncalo Abecasis].

Ppwhite

96

edits


MaCH FAQ (view source)

Revision as of 11:22, 2 February 2017
