Changes

IMPUTE2: 1000 Genomes Imputation Cookbook (view source)

Revision as of 06:35, 5 August 2011

1,210 bytes removed, 06:35, 5 August 2011

→ Pre-Phasing using IMPUTE2

Line 97: Line 97:

-Ne $NE \

-int $CHUNK_START $CHUNK_END \

−

-o $OUTPUT_FILE \

+

-o $OUTPUT_FILE \

−

-allow_large_regions

+

-allow_large_regions

</source>

−

where the red-coloured text were modified by us to allow large regions and to use build 37 map. Then you can submit batch jobs using ~~SGE (sun~~ grid engine) for parallel analysis. The ~~batch file is~~ something like

+

where the red-coloured text was modified by us to allow large regions and to use the build 37 map. You can then submit batch jobs using a grid engine for parallel analysis. The series of commands to run would look something like:

Line 109: Line 109:

</source>

−

where sge ~~is SGE~~ command for submitting jobs (~~it may~~ be qsub ~~in your system), the~~ three numbers ~~are chr~~ and ~~the pair of~~ chunk ~~region~~.

+

where <code>sge</code> should be replaced with the appropriate command for submitting jobs to your cluster (<code>sge</code> applies to Sun Grid Engine; other common choices might be <code>qsub</code> and <code>mosrun</code>). The three numbers correspond to the chromosome and the chunk start and end positions.
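The three arguments described above could be generated with a small helper rather than typed by hand. The following is only a sketch: the function name <code>print_chunk_jobs</code>, the 5 Mb default chunk size, the script name <code>./phasing_job.sh</code> and the 12 Mb chromosome length in the example are all illustrative assumptions, not part of the cookbook. It prints the submission commands instead of running them, so they can be inspected first.

```shell
#!/bin/bash
# Print one job-submission line per chunk of a chromosome.
# Usage: print_chunk_jobs SUBMIT_CMD JOB_SCRIPT CHR CHR_LENGTH [CHUNK_SIZE]
# All names here are hypothetical; substitute your own submit command and script.
print_chunk_jobs() {
    local submit_cmd=$1 job_script=$2 chr=$3 chr_length=$4
    local chunk_size=${5:-5000000}   # assumed default: 5 Mb chunks
    local start=1 end
    while [ "$start" -le "$chr_length" ]; do
        end=$(( start + chunk_size - 1 ))
        # Clip the final chunk so it does not run past the chromosome end.
        if [ "$end" -gt "$chr_length" ]; then
            end=$chr_length
        fi
        echo "$submit_cmd $job_script $chr $start $end"
        start=$(( end + 1 ))
    done
}

# Example: a made-up 12 Mb chromosome 22 split into 5 Mb chunks.
print_chunk_jobs sge ./phasing_job.sh 22 12000000
```

Piping the printed lines to <code>bash</code> would actually submit them; replace <code>sge</code> with whatever submission command your cluster uses.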

−

~~Timing: with~~ 30 ~~nodes~~ cluster, ~~it took 7~~ hours ~~for a 1400 samples data~~.

+

On a 30-node cluster, phasing should take approximately 5 hours per 1000 individuals.
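For planning purposes, the figure above can be turned into a rough estimate for other sample sizes. The sketch below assumes that run time scales linearly with the number of individuals on a comparable 30-node cluster; that linearity is our assumption, and the function name <code>estimate_phasing_hours</code> is hypothetical.

```shell
#!/bin/bash
# Rough phasing-time estimate in hours for N individuals.
# Assumption (ours): time scales linearly at 5 hours per 1000 individuals.
estimate_phasing_hours() {
    local n_individuals=$1
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v n="$n_individuals" 'BEGIN { printf "%.1f\n", 5 * n / 1000 }'
}

# Example: 1400 individuals.
estimate_phasing_hours 1400
```

Under this assumption, 1400 individuals work out to 7.0 hours. Treat the result as an order-of-magnitude guide, since actual run time depends on hardware and marker density.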

−

6. Imputation:

+

= Imputation =

−

There are two choices for imputation step: imputing from best-guess haplotypes~~, or~~ imputing from ~~posterior haplotypes~~.

+

There are two choices for the imputation step: imputing from best-guess haplotypes, or imputing from a sample of alternate haplotype configurations.

−

Imputing from best-guess haplotypes uses the best-guess haplotypes ~~output from the phasing step. It~~ is ~~simple~~ and ~~faster~~. ~~Refer to the~~ script “prototype_imputation_job_best_guess_haps.sh”:

+

Imputing from best-guess haplotypes is much faster and we recommend it. Below is a lightly modified version of the script “prototype_imputation_job_best_guess_haps.sh” that accomplishes imputation. The script has been modified to reference the most recent set of 1000 Genomes haplotypes (currently, the interim Phase I haplotypes [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_interim.html available from the IMPUTE2 website]) and to include the <code>-seed</code> option, which ensures results can be reproduced by running the script again.

+

#!/bin/bash

#$ -cwd

Line 138: Line 139:

NE=11500

+

## MODIFY THE FOLLOWING THREE LINES TO ACCOMMODATE OTHER PANELS

# reference data files

GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt

−

HAPS_FILE=${DATA_DIR}~~EUR.chr~~${CHR}.~~impute~~.~~hap~~

+

HAPS_FILE=${DATA_DIR}ALL_1000G_phase1interim_jun2011_chr${CHR}_impute.hap.gz

−

LEGEND_FILE=${DATA_DIR}~~EUR.chr~~${CHR}~~.impute~~.legend

+

LEGEND_FILE=${DATA_DIR}ALL_1000G_phase1interim_jun2011_chr${CHR}_impute.legend

+

## THESE HAPLOTYPES WOULD BE GENERATED BY THE PREVIOUS SCRIPT

# best-guess haplotypes from phasing run

GWAS_BG_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps

Line 151: Line 154:

## impute genotypes from best-guess GWAS haplotypes

$IMPUTE2_EXEC \

−

-m $GENMAP_FILE \

+

-m $GENMAP_FILE \

−

-known_haps_g $GWAS_BG_HAPS_FILE \

+

-known_haps_g $GWAS_BG_HAPS_FILE \

−

-h $HAPS_FILE \

+

-h $HAPS_FILE \

−

-l $LEGEND_FILE \

+

-l $LEGEND_FILE \

−

-Ne $NE \

+

-Ne $NE \

−

-int $CHUNK_START $CHUNK_END \

+

-int $CHUNK_START $CHUNK_END \

−

-o $OUTPUT_FILE \

+

-o $OUTPUT_FILE \

−

-allow_large_regions \

+

-allow_large_regions \

−

-seed 367946

+

-seed 367946

−

+

</source>

−

~~where option –seed is used to make sure the imputation results can be duplicated.~~

−

Imputing from posterior haplotypes uses the sampled haplotypes from phasing step to impute the untyped SNPs and then average across these imputations. It would take, for example, N times longer than it would take by imputing from best-guess haplotypes, where N = No. iterations – No.burnin if you set option “-thin 1”. Refer to the script “prototype_imputation_job_posterior_sampled_haps.sh”:

−

~~#!/bin/bash~~

−

~~#$ -cwd~~

−

~~CHR=$1~~

−

~~CHUNK_START=`printf "%.0f" $2`~~

−

~~CHUNK_END=`printf "%.0f" $3`~~

−

~~# directories~~

−

~~ROOT_DIR=.~~/

−

~~DATA_DIR=${ROOT_DIR}data_files/~~

−

~~RESULTS_DIR=${ROOT_DIR}results/~~

−

~~HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/~~

−

~~# executable~~

−

~~IMPUTE2_EXEC=${ROOT_DIR}impute2~~

−

~~# parameters~~

−

~~ITER=30~~

−

~~BURNIN=10~~

−

~~THIN=1~~

−

~~NE=11500~~

−

~~# reference data files~~

−

~~GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt~~

−

~~HAPS_FILE=${DATA_DIR}EUR.chr${CHR}.impute.hap~~

−

~~LEGEND_FILE=${DATA_DIR}EUR.chr${CHR}.impute.legend~~

−

~~# GWAS data files~~

−

~~GWAS_GTYPE_FILE=${DATA_DIR}gwas_data_chr${CHR}.gen~~

−

~~# main output file~~

−

~~OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.posterior_sampled_haps_imputation.impute2~~

−

~~## impute genotypes from posterior-sampled GWAS haplotypes~~

−

~~$IMPUTE2_EXEC \~~

−

~~-m $GENMAP_FILE \~~

−

~~-g $GWAS_GTYPE_FILE \~~

−

~~-stage_two \~~

−

~~-hap_samp_dir $HAP_SAMP_DIR \~~

−

~~-h $HAPS_FILE \~~

−

~~-l $LEGEND_FILE \~~

−

~~-iter $ITER \~~

−

~~-burnin $BURNIN \~~

−

~~-thin $THIN \~~

−

~~-Ne $NE \~~

−

~~-int $CHUNK_START $CHUNK_END \~~

−

~~-o $OUTPUT_FILE \~~

−

~~-allow_large_regions \~~

−

~~-seed 367946~~

−

~~Again~~, submit ~~batch~~ jobs ~~using SGE for~~ parallel ~~analysis~~.

+

The syntax for starting each of these jobs is similar to that for the phasing jobs and, again, you should use a suitable grid or cluster engine to submit multiple jobs in parallel.
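Because each imputation job reads the best-guess haplotypes written by the matching phasing job, it can be useful to confirm that file exists before submitting. The sketch below rebuilds the file name from the <code>GWAS_BG_HAPS_FILE</code> pattern in the script and prints the submission command only when the file is present and non-empty. The function names <code>phased_haps_path</code> and <code>submit_if_phased</code> are hypothetical helpers, not part of IMPUTE2 or the cookbook.

```shell
#!/bin/bash
# Build the path of the best-guess haplotype file from a phasing run,
# following the GWAS_BG_HAPS_FILE naming pattern in the imputation script.
phased_haps_path() {
    local results_dir=$1 chr=$2 chunk_start=$3 chunk_end=$4
    echo "${results_dir}gwas_data_chr${chr}.pos${chunk_start}-${chunk_end}.phasing.impute2_haps"
}

# Print the submission command only if the phasing output exists and is
# non-empty; otherwise report the missing chunk on stderr and return 1.
# Usage: submit_if_phased SUBMIT_CMD RESULTS_DIR CHR CHUNK_START CHUNK_END
submit_if_phased() {
    local submit_cmd=$1 results_dir=$2 chr=$3 chunk_start=$4 chunk_end=$5
    local haps
    haps=$(phased_haps_path "$results_dir" "$chr" "$chunk_start" "$chunk_end")
    if [ -s "$haps" ]; then
        echo "$submit_cmd ./prototype_imputation_job_best_guess_haps.sh $chr $chunk_start $chunk_end"
    else
        echo "missing phasing output: $haps" >&2
        return 1
    fi
}
```

For example, <code>submit_if_phased sge ./results/ 22 1 5000000</code> would print the submission line for that chunk if the phasing job finished, and a warning otherwise.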

−

~~Timing: it took 46 hours for~~ a ~~1400 samples data~~ by ~~imputing from posterior haplotypes with setting –iter 30, -burnin 10 and –thin 1.~~ ~~It would take only about 2~~.~~3 hours (46/20) if imputing from best-guess haplotypes~~.

+

Imputing from a sample of alternate haplotype configurations could be achieved by modifying the “prototype_imputation_job_posterior_sampled_haps.sh” script.

−

~~If using the suggested parameters for IMPUTE2, ie, K~~=~~80, ITER~~=~~30, BURNIN~~=10 and NE=11500, the results using best guess haplotypes and results using posterior haplotypes are extremely similar if well phased haplotypes were generated (see Goncalo’s email on 1st April 2011). Therefore, if timing is a concern, imputing from best guess haplotypes may be your choice.

+

= Association Analysis =

−

~~7. Final~~ step is to ~~combine~~ the ~~imputed~~ results ~~into the shape for SNPTEST~~. (In order to feed our 30 nodes cluster and do parallel SNPTEST analysis, we combined our imputed results by chromosome and then split each chromosome into 30 partitions. Each partition contains about the same number of SNPs within a chromosome. It takes less than an hour to do a SNPTEST for 1400 samples)

+

If you got this far, the final step is to run SNPTEST or another appropriate tool using the imputation results as input.
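Since imputation is run chunk by chunk, you may want to concatenate the per-chunk results for each chromosome before association testing. The following is a minimal sketch, assuming the output files follow the <code>gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}</code> naming used in the scripts; the function name <code>combine_chunks</code> and the file suffix passed to it are assumptions you should adapt to your own output names. It orders chunks by numeric start position, because plain alphabetical globbing would place <code>pos10000001</code> before <code>pos5000001</code>.

```shell
#!/bin/bash
# Concatenate per-chunk imputation outputs for one chromosome in
# positional order, writing the combined result to stdout.
# Usage: combine_chunks RESULTS_DIR CHR SUFFIX > combined_file
combine_chunks() {
    local results_dir=$1 chr=$2 suffix=$3
    local prefix="${results_dir}gwas_data_chr${chr}.pos"
    local f range
    for f in "${prefix}"*"${suffix}"; do
        [ -e "$f" ] || continue        # skip if the glob matched nothing
        range=${f#"$prefix"}           # e.g. 5000001-10000000<suffix>
        # Emit "start<TAB>path" so chunks can be sorted numerically by start.
        printf '%s\t%s\n' "${range%%-*}" "$f"
    done | sort -n -k1,1 | cut -f2- | while IFS= read -r f; do
        cat "$f"
    done
}
```

A call such as <code>combine_chunks ./results/ 22 .impute2 > chr22_combined.impute2</code> would then produce one file per chromosome; the suffix and output name here are placeholders.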

Goncalo

Bureaucrats, Administrators

1,555

edits
