Difference between revisions of "Minimac3 Imputation Cookbook"

From Genome Analysis Wiki

Revision as of 20:13, 29 January 2015

Introduction

Minimac3 is a lower-memory, more computationally efficient implementation of minimac2. It is an algorithm for genotype imputation that works on phased genotypes (for example, from MaCH) and is designed to handle very large reference panels more efficiently with no loss of accuracy.

This wiki page gives users a detailed step-by-step description of how to run a typical GWAS imputation experiment.

Imputation Cookbook

This section briefly summarizes the steps required to impute typical GWAS samples. Before pre-phasing and imputation, users must ensure that their data are quality controlled. Standard quality control filters typically exclude markers with a high missingness rate, strong deviations from Hardy-Weinberg equilibrium, high discordance rates (if duplicate copies are available), or excess Mendelian inconsistencies, and remove samples with a high missingness rate, unusual heterozygosity, a high inbreeding coefficient, clear evidence of being genetic ancestry outliers, or evidence of relatedness. All of these steps can be carried out using PLINK. With older genotyping platforms, low frequency SNPs are also often excluded because they are hard to genotype accurately. With more modern genotyping arrays, the accuracy of genotype calls for low frequency SNPs is less of a concern.

Once a quality-controlled dataset is available, we pre-phase the data and then run imputation. The steps are explained below.
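
The marker and sample filters described above might be combined into one PLINK invocation. The following is a minimal sketch, not a recommended protocol: the thresholds (5% marker missingness, 5% sample missingness, Hardy-Weinberg p-value of 1e-6) and the file prefixes are illustrative assumptions, and the helper name qc_command is hypothetical. The function only prints the command so it can be reviewed before it is run.

```shell
# Print (do not run) a PLINK command applying basic marker and sample filters.
# All thresholds are illustrative assumptions; tune them for your own study.
qc_command() {
    in_prefix=$1    # prefix of the PLINK binary fileset (.bed/.bim/.fam)
    out_prefix=$2   # prefix for the filtered fileset
    printf 'plink --bfile %s --geno 0.05 --mind 0.05 --hwe 1e-6 --make-bed --out %s\n' \
        "$in_prefix" "$out_prefix"
}

# Example with hypothetical file prefixes.
qc_command Gwas.chr20.Raw Gwas.chr20.QC
```

Filters such as heterozygosity, ancestry outliers and relatedness need separate steps and are not covered by this sketch.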

Pre-Phasing the GWAS data

Pre-Phasing can be done using either MaCH or SHAPEIT.

MaCH

MaCH is a Markov Chain based haplotyper. It can resolve long haplotypes in samples of unrelated individuals. The source code is available for download here. Check out their home-page for further details.

A typical command line to phase using MaCH looks like this (Gwas.chr20.Unphased.dat and Gwas.chr20.Unphased.ped are the quality controlled GWAS data set in Merlin format):

mach1 -d Gwas.chr20.Unphased.dat \
      -p Gwas.chr20.Unphased.ped \
      --rounds 20 \
      --states 200 \
      --phase \
      --interim 5 \
      --sample 5 \
      --prefix Gwas.Chr20.Phased.Output

SHAPEIT

SHAPEIT is a fast and accurate method for estimation of haplotypes (phasing) from genotype or sequencing data. The source code is available for download here. Check out their home-page for further details. It can be used to phase a small number of samples (reference panel required) as well as a large number of samples (no reference panel required). The reference panels and genetic map files required by SHAPEIT are available for download here.

  • The following example shows a typical SHAPEIT command line to phase a LARGE number (>200) of GWAS samples (Gwas.chr20.Unphased.vcf is the quality controlled GWAS data set in VCF format).
shapeit -V Gwas.chr20.Unphased.vcf \
        -M genetic_map_chr20.txt \
        -O Gwas.Chr20.Phased.Output
  • The following example shows a typical SHAPEIT command line to phase a SMALL number (<200) of GWAS samples (Gwas.chr20.Unphased.vcf is the quality controlled GWAS data set in VCF format).
## The following step lists variants that are mis-aligned between the reference and GWAS panels
shapeit -check \
        -V Gwas.chr20.Unphased.vcf \
        -M genetic_map_chr20.txt \
        --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \
        --output-log gwas.alignments

## The following step phases the GWAS panel using the reference panel while excluding the markers found in the step above.
shapeit -V Gwas.chr20.Unphased.vcf \
        -M genetic_map_chr20.txt \
        --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \
        --exclude-snp gwas.alignments.strand.exclude \
        -O Gwas.Chr20.Phased.Output
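
Which of the two SHAPEIT modes applies depends on the number of samples, which can be read from the VCF header. The sketch below counts samples and suggests a mode using the 200-sample rule of thumb from the examples above. The helper names are hypothetical, and treating exactly 200 samples as "large" is an assumption, since the text only covers >200 and <200.

```shell
# Count the samples in a VCF: columns after the nine fixed ones on the #CHROM line.
count_samples() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}

# Suggest a SHAPEIT mode from the sample count.
# Assumption: exactly 200 samples is treated as "large".
suggest_mode() {
    n=$(count_samples "$1")
    n=${n:-0}   # no #CHROM line found: treat as zero samples
    if [ "$n" -ge 200 ]; then
        echo "$n samples: phase without a reference panel"
    else
        echo "$n samples: phase with a reference panel"
    fi
}

# Tiny demonstration on a two-sample header written to a temporary file.
demo=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' > "$demo"
suggest_mode "$demo"   # prints: 2 samples: phase with a reference panel
rm -f "$demo"
```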

Running Imputation

After pre-phasing, we can run the imputation. Before that, we need to convert the phased GWAS panel files (obtained above) to VCF format, since Minimac3 only accepts VCF input, and download the reference panels required for imputation. This gives the following steps.

Convert GWAS Panel Files into VCF

If pre-phased GWAS data is available in VCF format, users can skip this step. Otherwise, the following steps show how to convert other format files to VCF format.

  • PLINK: Use PLINK2 (available here) as follows:
plink --bfile Gwas.Chr20.Phased.Output \
      --recode vcf \
      --out Gwas.Chr20.Phased.Output.VCF.format
  • MaCH: Use Mach2VCF (coming soon) as follows:
mach2VCF --haps Gwas.Chr20.Phased.Output.hap \
         --snps Gwas.Chr20.Phased.Output.snps \
         --prefix Gwas.Chr20.Phased.Output.VCF.format
  • SHAPEIT: Use SHAPEIT (available here) as follows:
shapeit -convert \
        --input-haps Gwas.Chr20.Phased.Output \
        --output-vcf Gwas.Chr20.Phased.Output.VCF.format.vcf
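
Because Minimac3 works on phased genotypes, it can be worth confirming that the converted VCF really is phased before imputing. The following sketch counts genotype entries that use the unphased "/" separator; a result of 0 suggests the file is fully phased. The function name count_unphased is hypothetical, and the script assumes GT is the first FORMAT subfield, which is where the VCF format places it when present.

```shell
# Count genotype entries written with the unphased '/' separator in a VCF.
# Assumes GT is the first colon-separated subfield of each sample column.
count_unphased() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            for (i = 10; i <= NF; i++) {   # sample columns start at field 10
                split($i, parts, ":")
                if (parts[1] ~ /\//) n++
            }
        }
        END { print n + 0 }
    ' "$1"
}

# Tiny demonstration: two markers, two samples, one unphased genotype.
demo=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >> "$demo"
printf '20\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n' >> "$demo"
printf '20\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0/1\t0|0\n' >> "$demo"
count_unphased "$demo"   # prints 1
rm -f "$demo"
```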

Download Reference Panel

Commonly used reference panels include 1000 Genomes Phase 3 (2,535 samples), 1000 Genomes Phase 1 (1,094 samples), HapMap2 (269 samples), and the Haplotype Reference Consortium (32,914 samples). Users are advised to use either 1000 Genomes Phase 3 (available for download in Reference Panels) or the Haplotype Reference Consortium (which, due to data privacy issues, cannot be shared publicly but can be used for imputation remotely through an imputation server set up at the University of Michigan). Reference panels for different versions of 1000 Genomes, in both VCF and M3VCF format, are available for download in Reference Panels.

Impute Samples

The final step is to run Minimac3 to perform the imputation analysis. Now that we have the pre-phased GWAS panel (in VCF format) and the appropriate reference panel (in VCF or M3VCF format), we can run Minimac3 as follows. Of the two examples below, the first uses a VCF file as the reference (obtained as explained above) and the second uses an M3VCF file (downloaded from the links below or created on a previous run of Minimac3).

../bin/Minimac3 --refHaps ReferencePanel.Chr20.1000Genomes.vcf \
                --haps Gwas.Chr20.Phased.Output.VCF.format.vcf \
                --prefix Gwas.Chr20.Imputed.Output
../bin/Minimac3 --refHaps ReferencePanel.Chr20.1000Genomes.m3vcf \
                --haps Gwas.Chr20.Phased.Output.VCF.format.vcf \
                --prefix Gwas.Chr20.Imputed.Output
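
A whole-genome run usually repeats the command above once per chromosome. The sketch below only prints one Minimac3 command per chromosome so the list can be inspected or handed to a job scheduler. The per-chromosome file names are an assumption that extends the Chr20 naming used in the examples, and the function name minimac3_commands is hypothetical.

```shell
# Print one Minimac3 command per chromosome in the range first..last.
# File names for other chromosomes are assumed to follow the Chr20 pattern.
minimac3_commands() {
    first=$1
    last=$2
    chr=$first
    while [ "$chr" -le "$last" ]; do
        printf '../bin/Minimac3 --refHaps ReferencePanel.Chr%s.1000Genomes.m3vcf --haps Gwas.Chr%s.Phased.Output.VCF.format.vcf --prefix Gwas.Chr%s.Imputed.Output\n' \
            "$chr" "$chr" "$chr"
        chr=$((chr + 1))
    done
}

# Print the commands for all autosomes.
minimac3_commands 1 22
```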

Chromosome X Imputation

Chromosome X has a pseudo-autosomal region (PAR) which can be imputed for males and females together. Imputing the PAR on chromosome X is the same as usual imputation, since both males and females are diploid at these sites. However, the non-pseudo-autosomal region (non-PAR) needs to be imputed for males and females separately, as males are haploid while females are diploid. The PAR and non-PAR regions also need to be imputed separately from each other.

The following example illustrates imputation on the non-PAR of chromosome X for males and females separately (files available in the Minimac3/test/ directory):

Male Samples (Non-PAR)

 ../bin/Minimac3 --refHaps refPanelChrX.Non.Auto.vcf --haps targetStudyChrX.males.vcf --prefix testRun.males

Female Samples (Non-PAR)

 ../bin/Minimac3 --refHaps refPanelChrX.Non.Auto.vcf --haps targetStudyChrX.females.vcf --prefix testRun.females

NOTE: When imputing the non-PAR of chromosome X, users must analyze male and female samples separately; otherwise the program will crash. Users should also ensure that the reference panel contains only the PAR or only the non-PAR region of chromosome X; otherwise the program will crash.
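
One way to obtain files that contain only the PAR or only the non-PAR region is to split a chromosome X VCF by position. The sketch below does this with awk. The boundaries are the GRCh37/hg19 PAR1 and PAR2 coordinates, which is an assumption about the genome build; other assemblies need different values. The function name split_chrx and the output naming are hypothetical.

```shell
# Split a chromosome X VCF into PAR and non-PAR files by position.
# Assumption: GRCh37/hg19 coordinates (PAR1 60001-2699520, PAR2 154931044-155260560).
split_chrx() {
    in_vcf=$1       # uncompressed chromosome X VCF
    out_prefix=$2   # outputs: <prefix>.PAR.vcf and <prefix>.nonPAR.vcf
    awk -F'\t' -v par="$out_prefix.PAR.vcf" -v nonpar="$out_prefix.nonPAR.vcf" '
        /^#/ { print > par; print > nonpar; next }   # copy header lines to both files
        {
            pos = $2 + 0
            in_par = (pos >= 60001 && pos <= 2699520) || (pos >= 154931044 && pos <= 155260560)
            if (in_par) print > par; else print > nonpar
        }
    ' "$in_vcf"
}

# Usage (hypothetical file names):
#   split_chrx refPanelChrX.vcf refPanelChrX
```

Splitting samples by sex for the non-PAR run is a separate step and is not handled here.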

Download

Minimac3 is available as an undocumented release version. The source files (and binary executable) are available for download in Source Files and commonly used reference panels in VCF and M3VCF formats are available for download in Reference Panels.

Contact

For queries or bug reports, please contact Sayantan Das (sayantan@umich.edu).