IMPUTE2: 1000 Genomes Imputation Cookbook
Introduction
Authors
This page is based on a document prepared by Jian'an Luan, Alexander Teumer, Jing-Hua Zhao, Christian Fuchsberger and Cristen Willer for the GIANT Consortium.
Content
This page documents how to carry out imputation using IMPUTE2 software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes. For information on how to carry out 1000 Genomes Imputation using Minimac, see Minimac:_1000_Genomes_Imputation_Cookbook.
As of now, the most recent version of the 1000 Genome Project haplotypes are the Phase I Integrated release haplotypes (available in IMPUTE2 format from this link). This haplotype set is more complete and more accurate than prior 1000 Genome Project haplotypes and is recommended for all imputation based analyses.
Before Imputation
Quality Control of Genotype Data
Before you start, you should apply appropriate quality control to your genotype data. This typically includes sample level quality control (examining call rate, heterozygosity, relatedness between genotyped individuals, and correspondence between sex chromosome genotypes and reported gender) and marker level quality control (examining call rates and deviations from Hardy-Weinberg Equilibrium and, for older genotyping platforms, excluding low frequency SNPs).
A good source of information on quality control checks for genomewide association data is:
Weale M (2010) Quality Control for Genome-Wide Association Studies. Methods Mol. Biol. 628:341–372 (in Barnes MB & Breen G (eds) Genetic Variation-Methods and Protocols, Chapter 19, Humana Press 2010) with code available from http://sites.google.com/site/mikeweale/software/gwascode
Convert Genotype Data to Build 37
Current releases of 1000 Genome Project data use NCBI genome build 37 (hg19) and, before you start imputation, you need to ensure that all your genotypes are reported using build 37 coordinates and on the forward strand of the reference genome.
The online LiftOver tool can convert data from earlier genome builds to build 37. This tool re-maps only coordinates, but not SNP identifiers. Before using the tool, you may have to look-up a dbSNP merge table (table description on the NCBI website) to account for any changes in SNP rs# between builds. It is normal for a few SNPs to fail LiftOver. Some of these fail because they cannot be mapped unambiguously by NCBI (for example rs1131012); these should be dropped from imputation. Sometimes, a few of these failed SNPs can be rescued by manually looking up their coordinates but, because the number of affected SNPs is typically very small, this step manual rescue step is not recommended.
After LiftOver, it is important to ensure that all genotypes are reported on the forward strand. This strand flipping can be facilitated by tools such as PLINK and GTOOL.
Convert Genotype Files Into IMPUTE format
After LiftOver to build 37 and ensuring that alleles are reported on the forward strand, you will have convert input files into IMPUTE format, one per chromosome. Before conversion, remember to sort SNPs by position because LiftOver from earlier genome builds can sometimes change marker order.
GTOOL can be used to convert data from PLINK PED format to IMPUTE format.
If you use SHAPEIT for pre-phasing (see next section), you do not have to convert autosomal chromosomes to IMPUTE format because SHAPEIT can take input data in either PLINK PED or BED format, or IMPUTE format.
Pre-Phasing (autosomal chromosomes only)
With large reference panels, such as those now produced by the 1000 Genome Project, it is recommended that imputation should be carried out in two steps. In a first pre-phasing step, haplotypes are estimated for all available samples. In a second step, missing alleles are imputed directly onto these phased haplotypes. The process is computationally efficient and, if 1000 Genome Project haplotypes are updated, only the second step must be repeated.
Drs. Bryan Howie and Jonathan Marchini provide descriptive and examples of this two step imputation process on their website.
There are two programs that can be used for pre-phasing: IMPUTE2 or SHAPEIT. Drs. Bryan Howie and Jonathan Marchini recommend SHAPEIT which uses an accurate phasing method. Output from SHAPEIT is in IMPUTE2 format for next imputation step.
Sliding Window Analyses
The IMPUTE2 authors recommend that chromosomes should analyzed in blocks of a few megabases each. In a cluster environment, these blocks can be conveniently analyzed at the same time. You won't need to physically split the genotype data - instead, you can define the region to be analyzed using the -int start stop
command line option. For example, to analyze a 5-Mb region starting at 10-Mb along the chromosome, you would specify -int 10000001 15000000
.
There are two small wrinkles to consider. First, if the chunks are all of the same size (in Mb), it is possible that a few chunks will have too many (or too few) SNPs to analyze. Second, in intervals that cross the centromere (which typically corresponds to large gap in 1000 Genome SNPs), it may happen that only a few SNPs from one arm are chunked together with SNPs from a different chromosome arm. To overcome these two concerns, we suggest that you merge adjacent chunks when one has very few (say, less than 200) SNPs in a chunk and that you try to avoid creating region splits that span the centromere. If a merged chunk extends further than 7-Mb, you will have to use the setting -allow_large_regions
when you run IMPUTE2.
Centromere locations are available here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
Pre-Phasing using IMPUTE2
You should now be ready to run the first analysis step, which is to estimate phased haplotypes for each sample. The script below, named prototype_phasing_job.sh(based on the example script and an outdated document on the IMPUTE2 website), illustrates how to do this. It requires three parameters (chromosome number, interval start and interval end) and assumes that input data files will in a directory called data_files
, with results in results
directory. It has been modified to specify the build 37 recombination map should be used and to include the -allow_large_regions
parameter and to set the effective population size NE=20,000 (as recommended in the IMPUTE2 website for the more recent 1000 Genome Project builds).
#!/bin/bash
#$ -cwd
CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`
# directories
ROOT_DIR=./
DATA_DIR=${ROOT_DIR}data_files/
RESULTS_DIR=${ROOT_DIR}results/
# executable (must be IMPUTE version 2.2.0 or later)
IMPUTE2_EXEC=${ROOT_DIR}impute2
# parameters
NE=20000
# reference data files
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
# GWAS data files
GWAS_GTYPE_FILE=${DATA_DIR}gwas_data_chr${CHR}.gen
# main output file
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2
## phase GWAS genotypes
$IMPUTE2_EXEC \
-prephase_g \
-m $GENMAP_FILE \
-g $GWAS_GTYPE_FILE \
-Ne $NE \
-int $CHUNK_START $CHUNK_END \
-o $OUTPUT_FILE \
-allow_large_regions
Then you can submit batch jobs using a grid engine for parallel analysis. The series of commands to run would look something like:
sge ./prototype_phasing_job.sh 1 1 5000000
sge ./prototype_phasing_job.sh 1 5000001 10000000
sge ./prototype_phasing_job.sh 1 10000001 15000000
where sge
should be replaced with the appropriate command for submitting jobs to your cluster (sge
applies to sun grid engine, other common choices might be qsub
and mosrun
). The three numbers correspond to chromosome and chunk start and end positions.
Pre-Phasing using SHAPEIT (recommended)
It is recommended to phase a whole chromosome using SHAPEIT. The script below, named shapeit_phasing_job.sh(based on the example script on the SHAPEIT website), illustrates how to do this. The input GWAS data in the script below is in PLINK BED format, while data in PLINK PED format and IMPUTE format are also accepted. It requires one parameter (chromosome number). The option parameters --effective-size 11418 and --thread 8 reflect an example European population and 8-core computer.
#!/bin/bash
#$ -cwd
CHR=$1
# directories
ROOT_DIR=./
DATA_DIR=${ROOT_DIR}data_files/
RESULTS_DIR=${ROOT_DIR}results/
# executable
SHAPEIT_EXEC=${ROOT_DIR}shapeit
# parameters
THREAT=4
EFFECTIVESIZE=11418
# reference data files
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
# GWAS data files in PLINK BED format
GWASDATA_BED=${DATA_DIR}gwas_data_chr${CHR}.bed
GWASDATA_BIM=${DATA_DIR}gwas_data_chr${CHR}.bim
GWASDATA_FAM=${DATA_DIR}gwas_data_chr${CHR}.fam
# main output file
OUTPUT_HAPS=${RESULTS_DIR}chr${CHR}.haps
OUTPUT_SAMPLE=${RESULTS_DIR}chr${CHR}.sample
OUTPUT_LOG=${RESULTS_DIR}chr${CHR}.log
## phase GWAS genotypes
$SHAPEIT_EXEC \
--input-bed $GWASDATA_BED $GWASDATA_BIM $GWASDATA_FAM \
--input-map $GENMAP_FILE \
--thread $THREAT \
--effective-size $EFFECTIVESIZE \
--output-max $OUTPUT_HAPS $OUTPUT_SAMPLE \
--output-log $OUTPUT_LOG
Similarly, you can submit batch jobs. The series of commands to run would look something like:
sge ./shapeit_phasing_job.sh 1
sge ./shapeit_phasing_job.sh 2
sge ./shapeit_phasing_job.sh 3
Imputation
Below is a lightly modified version of script “prototype_imputation_job_best_guess_haps.sh” that accomplishes imputation. The original script is included in the package of IMPUTE2 examples. This version has been modified to reference the most recent set of 1000 Genome Haplotypes (currently, the Phase I integrated haplotypes available from the IMPUTE2 website). We recommend to use the "ALL (macGT1)" reference haplotypes and to include the -seed
parameter, which ensures results can be reproduced by running the script again.
#!/bin/bash
#$ -cwd
CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`
# directories
ROOT_DIR=./
DATA_DIR=${ROOT_DIR}data_files/
RESULTS_DIR=${ROOT_DIR}results/
# executable (must be IMPUTE version 2.2.0 or later)
IMPUTE2_EXEC=${ROOT_DIR}impute2
# parameters
NE=20000
## MODIFY THE FOLLOWING THREE LINES TO ACCOMODATE OTHER PANELS
# reference data files
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
HAPS_FILE=${DATA_DIR}ALL_1000G_phase1integrated_v3_chr${CHR}_impute.hap.gz
LEGEND_FILE=${DATA_DIR}ALL_1000G_phase1integrated_v3_chr${CHR}_impute.legend.gz
## THESE HAPLOTYPES WOULD BE GENERATED BY THE PREVIOUS SCRIPT
## SELECT ONE FROM METHOD-A AND METHOD-B BELOW
# METHOD-A: haplotypes from IMPUTE2 phasing run
GWAS_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps
# METHOD-B: haplotypes from SHAPEIT phasing run
GWAS_HAPS_FILE=${RESULTS_DIR}chr${CHR}.haps
# main output file
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.impute2
## impute genotypes from GWAS haplotypes
$IMPUTE2_EXEC \
-m $GENMAP_FILE \
-known_haps_g $GWAS_HAPS_FILE \
-h $HAPS_FILE \
-l $LEGEND_FILE \
-Ne $NE \
-int $CHUNK_START $CHUNK_END \
-o $OUTPUT_FILE \
-allow_large_regions \
-seed 367946
Filtering Option -filt_rules_l is used in this example to remove reference SNPs with MAF=0 in European reference panel (for example, a European cohort).
The syntax for starting each of these jobs would be similar for the phasing jobs and, again, you should use a suitable grid or cluster engine to submit multiple jobs in parallel.
X-Chromosome Imputation
Please refer to IMPUTE2 example for chromosome X imputation and IMPUTE2 chromosome X options. Please note the difference on the pseudoautosomal and non-pseudoautosomal regions.
Association Analysis
If you got this far, the final step is to run SNPTEST or another appropriate tool using the imputation results as input.
If you are running SNP test, two useful pieces of advice are to:
- Concatenate the dosage files for each chromosome (so that you have a 22 files to manage rather than several hundred).
- Remember that the per SNP imputation accuracy score (IMPUTE's INFO field) is calculate by SNPTEST and will be present in the SNPTEST output.