IMPUTE2: 1000 Genomes Imputation Cookbook

Introduction

Authors

This page is based on a document prepared by Jian'an Luan, Alexander Teumer, Jing-Hua Zhao, Christian Fuchsberger and Cristen Willer for the GIANT Consortium.

Content

This page documents how to carry out imputation using IMPUTE2 software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes.

Before Imputation

Quality Control of Genotype Data

Before you start, you should apply appropriate quality control to your genotype data. This typically includes sample level quality control (examining call rate, heterozygosity, relatedness between genotyped individuals, and correspondence between sex chromosome genotypes and reported gender) and marker level quality control (examining call rates and deviations from Hardy-Weinberg Equilibrium and, for older genotyping platforms, excluding low frequency SNPs).
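
As a concrete illustration, the sketch below shows how some of these checks might be run in PLINK, assuming binary files with the placeholder prefix mystudy; the thresholds are only examples and should be tuned to your own data and genotyping platform.

# per-sample and per-marker call rates
plink --bfile mystudy --missing --out qc_missing

# per-sample heterozygosity (inbreeding coefficient F)
plink --bfile mystudy --het --out qc_het

# pairwise IBD estimates to flag related or duplicated samples
plink --bfile mystudy --genome --out qc_ibd

# compare reported sex with X-chromosome genotypes
plink --bfile mystudy --check-sex --out qc_sexcheck

# marker-level filters: call rate, Hardy-Weinberg equilibrium and low-frequency SNPs
plink --bfile mystudy --geno 0.05 --hwe 1e-6 --maf 0.01 --make-bed --out mystudy_qc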

A good source of information on quality control checks for genomewide association data is:

Weale M (2010) Quality Control for Genome-Wide Association Studies. Methods Mol. Biol. 628:341–372 (in Barnes MB & Breen G (eds) Genetic Variation-Methods and Protocols, Chapter 19, Humana Press 2010) with code available from http://sites.google.com/site/mikeweale/software/gwascode

Convert Genotype Data to Build 37

Current releases of 1000 Genomes Project data use NCBI genome build 37 (hg19) and, before you start imputation, you need to ensure that all your genotypes are reported using build 37 coordinates and on the forward strand of the reference genome.

The online LiftOver tool can convert data from earlier genome builds to build 37. This tool re-maps coordinates only, not SNP identifiers. Before using it, you may have to consult the dbSNP merge table (described on the NCBI website) to account for any changes in SNP rs# between builds. It is normal for a few SNPs to fail LiftOver. Some of these fail because they cannot be mapped unambiguously by NCBI (for example rs1131012); these should be dropped from imputation. A few failed SNPs can sometimes be rescued by manually looking up their coordinates but, because the number of affected SNPs is typically very small, this manual rescue step is not recommended.
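
If you prefer the stand-alone liftOver utility to the web interface, the conversion might look something like the command below, assuming your SNP positions are stored in a UCSC BED file on build 36 (hg18); the file names are placeholders.

# re-map SNP coordinates from hg18 to hg19; SNPs that cannot be mapped go to the last file
liftOver mystudy_hg18.bed hg18ToHg19.over.chain mystudy_hg19.bed mystudy_unmapped.bed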

After LiftOver, it is important to ensure that all genotypes are reported on the forward strand. This strand flipping can be facilitated by tools such as PLINK and GTOOL.
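
For example, if you have compiled a list of SNPs whose alleles are reported on the reverse strand (one SNP identifier per line), PLINK can flip them in a single pass; the file names below are placeholders.

# flip the listed SNPs onto the forward strand
plink --bfile mystudy_hg19 --flip snps_to_flip.txt --make-bed --out mystudy_forward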

Convert Genotype Files into IMPUTE Format

After LiftOver to build 37 and ensuring that alleles are reported on the forward strand, you will have to convert your input files into IMPUTE format, one file per chromosome. Before conversion, remember to sort SNPs by position because LiftOver from earlier genome builds can sometimes change marker order.

GTOOL can be used to convert data from PLINK PED format to IMPUTE format.
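
A typical conversion, assuming one PED/MAP file pair per chromosome (file names are placeholders), might look like this:

# convert chromosome 1 from PLINK PED format to IMPUTE GEN/sample format
gtool -P --ped mystudy_chr1.ped --map mystudy_chr1.map \
      --og mystudy_chr1.gen --os mystudy_chr1.sample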

Pre-Phasing

With large reference panels, such as those now produced by the 1000 Genomes Project, it is recommended that imputation be carried out in two steps. In the first, pre-phasing step, haplotypes are estimated for all available samples. In the second step, missing alleles are imputed directly onto these phased haplotypes. This process is computationally efficient and, if the 1000 Genomes Project haplotypes are updated, only the second step must be repeated.

Drs. Bryan Howie and Jonathan Marchini provide examples of this two-step imputation process on their website.

Sliding Window Analyses

The IMPUTE2 authors recommend that chromosomes be analyzed in blocks of a few megabases each. In a cluster environment, these blocks can conveniently be analyzed at the same time. You won't need to physically split the genotype data - instead, you can define the region to be analyzed using the -int start stop command line option. For example, to analyze a 5-Mb region starting 10 Mb along the chromosome, you would specify -int 10000001 15000000.

There are two small wrinkles to consider. First, if the chunks are all of the same size (in Mb), a few chunks may have too many (or too few) SNPs to analyze. Second, in intervals that cross the centromere (which typically corresponds to a large gap in 1000 Genomes SNPs), it may happen that only a few SNPs from one arm are chunked together with SNPs from the other arm. To overcome these two concerns, we suggest that you merge adjacent chunks when one of them contains very few (say, fewer than 200) SNPs and that you avoid creating region splits that span the centromere. If a merged chunk extends beyond 7 Mb, you will have to use the -allow_large_regions setting when you run IMPUTE2.

Centromere locations are available here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
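
In that file the centromeric bands are labelled acen, so the centromere boundaries for a given chromosome can be extracted with a one-liner such as:

# print the two centromeric (acen) bands of chromosome 1: chrom, start, end, band, stain
zcat cytoBand.txt.gz | awk '$1 == "chr1" && $5 == "acen"'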

Pre-Phasing using IMPUTE2

You should now be ready to run the first analysis step, which is to estimate phased haplotypes for each sample. The script below, named prototype_phasing_job.sh (based on the example script on the IMPUTE2 website), illustrates how to do this. It requires three parameters (chromosome number, interval start and interval end) and assumes that input data files will be in a directory called data_files, with results written to a results directory and estimated haplotypes to a sampled_haps directory. It has been modified to specify that the build 37 recombination map should be used and to include the -allow_large_regions parameter.

#!/bin/bash
#$ -cwd

CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`

# directories
ROOT_DIR=./
DATA_DIR=${ROOT_DIR}data_files/
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/

# executable
IMPUTE2_EXEC=${ROOT_DIR}impute2

# parameters
K=80
ITER=30
BURNIN=10
NE=11500

# reference data files
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt

# GWAS data files
GWAS_GTYPE_FILE=${DATA_DIR}gwas_data_chr${CHR}.gen

# main output file
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2

## phase GWAS genotypes
$IMPUTE2_EXEC \
    -m $GENMAP_FILE \
    -g $GWAS_GTYPE_FILE \
    -phase \
    -include_buffer_in_output \
    -stage_one \
    -hap_samp_dir $HAP_SAMP_DIR \
    -k $K \
    -iter $ITER \
    -burnin $BURNIN \
    -Ne $NE \
    -int $CHUNK_START $CHUNK_END \
    -o $OUTPUT_FILE \
    -allow_large_regions

Relative to the original example script, the modifications are the use of the build 37 recombination map and the -allow_large_regions option. You can then submit batch jobs using a grid engine for parallel analysis. The series of commands to run would look something like:

sge ./prototype_phasing_job.sh 1 1 5000000
sge ./prototype_phasing_job.sh 1 5000001 10000000
sge ./prototype_phasing_job.sh 1 10000001 15000000

where sge should be replaced with the appropriate command for submitting jobs to your cluster (sge applies to the Sun Grid Engine; other common choices might be qsub or mosrun). The three numbers correspond to the chromosome and the chunk start and end positions.
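
To cover a whole chromosome, you would typically generate one such command per chunk. A minimal sketch, assuming non-overlapping 5-Mb chunks and a placeholder chromosome length, is:

#!/bin/bash
# submit one phasing job per 5-Mb chunk of chromosome 1
CHR=1
CHR_LEN=249250621   # chromosome length in bp (placeholder; use the build 37 length for your chromosome)
CHUNK=5000000

START=1
while [ $START -le $CHR_LEN ]; do
    END=$((START + CHUNK - 1))
    sge ./prototype_phasing_job.sh $CHR $START $END
    START=$((END + 1))
done

Remember to merge or adjust any chunks that are nearly empty or that span the centromere, as discussed above.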

On a 30-node cluster, phasing should take approximately 5 hours per 1000 individuals.

Imputation

There are two choices for the imputation step: imputing from best-guess haplotypes or imputing from a sample of alternate haplotype configurations.

Imputing from best-guess haplotypes is much faster, and we recommend it. Below is a lightly modified version of the script “prototype_imputation_job_best_guess_haps.sh” that accomplishes the imputation. The script has been modified to reference the most recent set of 1000 Genomes haplotypes (currently, the interim Phase I haplotypes available from the IMPUTE2 website) and to include the -seed option, which ensures results can be reproduced by running the script again.

#!/bin/bash
#$ -cwd

CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`

# directories
ROOT_DIR=./
DATA_DIR=${ROOT_DIR}data_files/
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/

# executable
IMPUTE2_EXEC=${ROOT_DIR}impute2

# parameters
NE=11500

## MODIFY THE FOLLOWING THREE LINES TO ACCOMMODATE OTHER PANELS
# reference data files
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
HAPS_FILE=${DATA_DIR}ALL_1000G_phase1interim_jun2011_chr${CHR}_impute.hap.gz
LEGEND_FILE=${DATA_DIR}ALL_1000G_phase1interim_jun2011_chr${CHR}_impute.legend

## THESE HAPLOTYPES WOULD BE GENERATED BY THE PREVIOUS SCRIPT
# best-guess haplotypes from phasing run
GWAS_BG_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps

# main output file
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.best_guess_haps_imputation.impute2

## impute genotypes from best-guess GWAS haplotypes
$IMPUTE2_EXEC \
   -m $GENMAP_FILE \
   -known_haps_g $GWAS_BG_HAPS_FILE \
   -h $HAPS_FILE \
   -l $LEGEND_FILE \
   -Ne $NE \
   -int $CHUNK_START $CHUNK_END \
   -o $OUTPUT_FILE \
   -allow_large_regions \
   -seed 367946

The syntax for starting each of these jobs is similar to that for the phasing jobs and, again, you should use a suitable grid or cluster engine to submit multiple jobs in parallel.
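
For example, using the same chunk boundaries as in the phasing step:

sge ./prototype_imputation_job_best_guess_haps.sh 1 1 5000000
sge ./prototype_imputation_job_best_guess_haps.sh 1 5000001 10000000
sge ./prototype_imputation_job_best_guess_haps.sh 1 10000001 15000000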

Imputing from a sample of alternate haplotype configurations could be achieved by modifying the “prototype_imputation_job_posterior_sampled_haps.sh” script.

Association Analysis

If you got this far, the final step is to run SNPTEST or another appropriate tool using the imputation results as input.
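
A minimal SNPTEST v2 invocation for one imputed chunk might look like the sketch below; the sample file name, phenotype column and analysis options are placeholders that you will need to adapt to your own study.

snptest \
    -data results/gwas_data_chr1.pos10000001-15000000.best_guess_haps_imputation.impute2 gwas_data.sample \
    -o results/gwas_data_chr1.pos10000001-15000000.snptest.out \
    -frequentist 1 \
    -method expected \
    -pheno pheno1

Here -frequentist 1 requests an additive model and -method expected bases the test on the expected allele counts (dosages) derived from the imputation probabilities.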