IMPUTE2: 1000 Genomes Imputation Cookbook

From Genome Analysis Wiki

Introduction

Authors

This page is based on a document prepared by Jian'an Luan, Alexander Teumer, Jing-Hua Zhao, Christian Fuchsberger and Cristen Willer for the GIANT Consortium.

Content

This page documents how to carry out imputation using IMPUTE2 software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes.

Before Imputation

Quality Control of Genotype Data

Before you start, you should apply appropriate quality control to your genotype data. This typically includes sample level quality control (examining call rate, heterozygosity, relatedness between genotyped individuals, and correspondence between sex chromosome genotypes and reported gender) and marker level quality control (examining call rates and deviations from Hardy-Weinberg Equilibrium and, for older genotyping platforms, excluding low frequency SNPs).
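Two of the marker level checks named above, call rate and deviation from Hardy-Weinberg equilibrium, reduce to simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, missing calls are assumed to be coded as None, and a plain Pearson chi-square statistic is used where dedicated QC software often applies an exact test. Exclusion thresholds are study specific and are not given here.

```python
def call_rate(genotypes):
    """Fraction of non-missing calls; None marks a missing genotype (assumed coding)."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square statistic (1 d.f.) for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic markers) to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

A marker whose genotype counts match Hardy-Weinberg proportions exactly (for example 25, 50 and 25) gives a statistic of zero, while a complete absence of heterozygotes gives a large value.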

A good source of information on quality control checks for genome-wide association data is:

Weale M (2010) Quality Control for Genome-Wide Association Studies. Methods Mol. Biol. 628:341–372 (in Barnes MB & Breen G (eds) Genetic Variation-Methods and Protocols, Chapter 19, Humana Press 2010) with code available from http://sites.google.com/site/mikeweale/software/gwascode

Convert Genotype Data to Build 37

Current releases of 1000 Genomes Project data use NCBI genome build 37 (hg19). Before you start imputation, you need to ensure that all your genotypes are reported using build 37 coordinates and on the forward strand of the reference genome.

The online LiftOver tool can convert data from earlier genome builds to build 37. This tool re-maps coordinates only, not SNP identifiers. Before using the tool, you may have to look up a dbSNP merge table (table description on the NCBI website) to account for any changes in SNP rs# between builds. It is normal for a few SNPs to fail LiftOver. Some of these fail because they cannot be mapped unambiguously by NCBI (for example rs1131012); these should be dropped from imputation. A few of the failed SNPs can sometimes be rescued by manually looking up their coordinates but, because the number of affected SNPs is typically very small, this manual rescue step is not recommended.
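LiftOver accepts positions in BED format, which uses a 0-based start and an exclusive end, whereas a PLINK MAP file stores a single 1-based position per SNP. The sketch below shows one way to prepare LiftOver input from MAP lines. The function name and the chromosome code table are assumptions made for illustration; check the table against the coding used in your own files.

```python
# Assumed translation of PLINK chromosome codes to UCSC-style names.
PLINK_TO_UCSC = {"23": "X", "24": "Y", "25": "X", "26": "M", "XY": "X", "MT": "M"}


def map_line_to_bed(line):
    """Turn one PLINK MAP line (chr, SNP id, cM, bp) into a BED line for LiftOver."""
    chrom, snp_id, _cm, pos = line.split()[:4]
    chrom = "chr" + PLINK_TO_UCSC.get(chrom, chrom)
    pos = int(pos)
    # BED start is 0-based and the end is exclusive, so a SNP at 1-based
    # position p covers the interval [p - 1, p).
    return f"{chrom}\t{pos - 1}\t{pos}\t{snp_id}"
```

After LiftOver, the new 1-based position of each SNP is the end column of the lifted BED line, and the name column carries the SNP identifier through unchanged.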

After LiftOver, it is important to ensure that all genotypes are reported on the forward strand. Alleles reported on the reverse strand must be flipped to their complements; tools such as PLINK and GTOOL can facilitate this strand flipping.
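Flipping a SNP to the other strand means replacing each allele with its complement (A with T, C with G). The sketch below illustrates the operation itself, not how PLINK or GTOOL implement it; the function names are hypothetical. It also flags A/T and C/G SNPs, whose alleles read the same on both strands, so their strand cannot be determined from the allele labels alone.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_alleles(alleles):
    """Return the complementary alleles; unknown codes (e.g. '0' for missing) are kept."""
    return tuple(COMPLEMENT.get(a, a) for a in alleles)


def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose two alleles are each other's complement."""
    return COMPLEMENT.get(a1) == a2
```

For example, a SNP typed as A/G on the reverse strand becomes T/C on the forward strand.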

Convert Genotype Files Into IMPUTE format

After LiftOver to build 37 and ensuring that alleles are reported on the forward strand, you will have to convert input files into IMPUTE format, one per chromosome. Before conversion, remember to sort SNPs by position because LiftOver from earlier genome builds can sometimes change marker order.
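The sorting and per-chromosome split can be sketched as follows. This is a minimal illustration on marker records only, with a hypothetical function name and tuple layout. In a real data set the genotype columns must be reordered together with the markers, which is usually left to a tool such as PLINK.

```python
from collections import defaultdict


def split_and_sort(markers):
    """Group (chrom, snp_id, position) tuples by chromosome and sort each group by position."""
    by_chrom = defaultdict(list)
    for chrom, snp_id, pos in markers:
        # Store position first so that sorting the tuples orders by position.
        by_chrom[chrom].append((pos, snp_id))
    return {chrom: sorted(rows) for chrom, rows in by_chrom.items()}
```

Each key of the returned dictionary corresponds to one per-chromosome file, with its markers in ascending build 37 order.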

GTOOL can be used to convert data from PLINK PED format to IMPUTE format.
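To make the target format concrete: each line of an IMPUTE genotype file describes one SNP with five leading columns (SNP ID, rs ID, position, allele A, allele B), followed by three probabilities per individual for the genotypes AA, AB and BB. The sketch below writes one such line from hard genotype calls. It is not a replacement for GTOOL; the function name is hypothetical, and genotypes are assumed to be coded as the number of copies of allele B, with None for a missing call.

```python
# Probability triple (AA, AB, BB) for each hard call; a missing call is all zeros.
TRIPLE = {0: "1 0 0", 1: "0 1 0", 2: "0 0 1", None: "0 0 0"}


def gen_line(snp_id, rsid, pos, allele_a, allele_b, b_counts):
    """Format one SNP as a line of an IMPUTE genotype file."""
    probs = " ".join(TRIPLE[g] for g in b_counts)
    return f"{snp_id} {rsid} {pos} {allele_a} {allele_b} {probs}"
```

Three individuals with calls AA, AB and missing would therefore contribute nine probability columns to the line.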

Pre-Phasing