Difference between revisions of "Minimac3"

From Genome Analysis Wiki

Revision as of 00:23, 29 January 2015

Introduction

Minimac3 is a lower-memory and more computationally efficient implementation of minimac2. It is an algorithm for genotype imputation that works on phased genotypes (for example, from MaCH). Minimac3 is designed to handle very large reference panels more efficiently with no loss of accuracy. The algorithm analyzes only the unique sets of haplotypes in small genomic segments, thereby saving computation time and memory without reducing accuracy.

Apart from performing imputation, Minimac3 also creates M3VCF files (customized Minimac3 VCF files), which store reference panel information in a compact form and thus save on the memory and time required to read large datasets. Users can use the binary either to only convert VCF files to M3VCF files or to perform imputation as well. The program can also take a previously generated M3VCF file as input for the reference panel. M3VCF files can also store pre-calculated estimates of recombination fraction and error, which can be reused in later imputation runs. The latest version of Minimac3 also allows output in VCF format for easier data manipulation in downstream analysis.

Download

Minimac3 is available as an undocumented release version. The source files are available for download here, and commonly used reference panels in M3VCF format are available for download in Reference Panels. The authors would appreciate it if users would run it on their data sets and report any bugs.

  • To Download Minimac3
Description Download Link
Minimac3 Executable UNIX Users
Minimac3-omp Executable (for parallel computing) UNIX Users
Minimac3 Source Files UNIX Users

Usage

Users who downloaded the source files should follow the steps below to compile Minimac3; users who downloaded the binary executable can skip them.

## EXTRACT MINIMAC3 AND COMPILE
 
tar -xzvf Minimac3.v1.0.0.tar.gz
cd Minimac3/
make

A typical Minimac3 command line for imputation is as follows:

../bin/Minimac3 --refHaps refPanel.vcf \
                --haps targetStudy.vcf \
                --prefix testRun

Here refPanel.vcf is the reference panel used in VCF format (e.g. 1000 Genomes), targetStudy.vcf is the phased GWAS data in VCF format, and testRun is the prefix for the output files. Some commonly used reference panels are available for download in Reference Panels. See wiki page on Detailed Usage and Imputation Cookbook for further details on using Minimac3 for imputation analysis.

Users can always type the following for further support:

 ../bin/Minimac3 --help

Examples

To look at the examples, first copy the Minimac3 folder to the user's local directory, then move to its test/ folder:

 cp -r /net/fantasia/home/sayantan/Softwares/Minimac3/ LocalDirectoryMinimac3/
 cd LocalDirectoryMinimac3/test/

The following example uses a VCF reference file [refPanel.vcf] and a VCF target sample file [targetStudy.vcf]:

 ../bin/Minimac3 --refHaps refPanel.vcf \
                 --haps targetStudy.vcf \
                 --prefix testRun

The following example is the same as above but uses Minimac3-omp (which is implemented using OpenMP to enable parallel computing).

 ../bin/Minimac3-omp --refHaps refPanel.vcf \
                     --haps targetStudy.vcf \
                     --prefix testRun \
                     --cpus 5

The following example only converts a VCF reference file into M3VCF. It also estimates parameters from the reference panel using a leave-one-out method and saves them in the M3VCF file. The parameter estimation can be skipped with "--rounds 0". If the option "--processReference" is ON, no imputation is done; the file is only compressed from VCF to M3VCF format.

../bin/Minimac3 --refHaps refPanel.vcf \
                --processReference \
                --prefix testRun

The following example uses an M3VCF file (created in the previous example) and a VCF target sample file (targetStudy.vcf) for imputation.

../bin/Minimac3 --refHaps testRun.m3vcf.gz \
                --haps targetStudy.vcf \
                --prefix testRun

[NOTE: The program uses the saved parameter estimates when they are available in the M3VCF file (as in the example above) and estimates the parameters itself when they are NOT available. In the example below, testRun.m3vcf.gz is created with --rounds 0 and therefore contains no parameter estimates.]

../bin/Minimac3 --refHaps refPanel.vcf \
                --processReference \
                --rounds 0 \
                --prefix testRun
../bin/Minimac3 --refHaps testRun.m3vcf.gz \
                --haps targetStudy.vcf \
                --prefix testRun

The following example also uses an M3VCF reference file [testRun.m3vcf.gz] and a VCF target sample file [targetStudy.vcf]. However, it only analyzes chromosome 6 from position 505988 to 873131 (allowing a buffer of 100 bp on either side). It also outputs a phased haplotype file (using the --hapOutput option) and the usual dosage file (using the --doseOutput option).

../bin/Minimac3 --refHaps testRun.m3vcf.gz \
                --haps targetStudy.vcf \
                --chr 6 \
                --start 505988 \
                --end 873131 \
                --window 100 \
                --prefix testRun \
                --hapOutput \
                --doseOutput
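
A longer region can be handled as a series of such chunks. The sketch below only prints one Minimac3 command per chunk and does not run anything. The function name, the chunk size of 200000 bp, and the per-chunk output prefixes are assumptions made for illustration; the flags are the ones used in the example above.

```shell
#!/bin/sh
# print_chunk_commands CHR START END CHUNK WINDOW
# Prints one Minimac3 command per chunk of CHUNK base pairs; nothing is executed.
# (Hypothetical helper; file names follow the examples on this page.)
print_chunk_commands() {
    chr=$1; start=$2; end=$3; chunk=$4; window=$5
    s=$start
    while [ "$s" -le "$end" ]; do
        e=$((s + chunk - 1))
        # The last chunk is shortened so that it stops at END.
        [ "$e" -gt "$end" ] && e=$end
        echo "../bin/Minimac3 --refHaps testRun.m3vcf.gz --haps targetStudy.vcf" \
             "--chr $chr --start $s --end $e --window $window" \
             "--prefix testRun.chr$chr.$s-$e"
        s=$((e + 1))
    done
}

# Region from the example above, split into chunks of at most 200000 bp.
print_chunk_commands 6 505988 873131 200000 100
```

Each chunk gets its own output prefix so that runs do not overwrite each other; the buffer given by --window is applied on both sides of every chunk.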

For examples of chromosome X imputation, see Chromosome X Imputation.

Imputation Cookbook

This section gives a brief summary of the steps required to go through an experiment of imputation on typical GWAS samples. Before pre-phasing and imputation, users must ensure that their data is quality controlled. Standard quality control filters involve excluding markers with high missingness rate, high deviations from Hardy-Weinberg equilibrium, high discordance rates (if duplicate copies available), excess Mendelian inconsistencies etc. and removing samples with high missingness rate, unusual heterozygosity, high inbreeding coefficient, clear evidence of being genetic ancestry outliers, evidence of relatedness etc. All of these steps can be easily carried out using PLINK. With older genotyping platforms, low frequency SNPs are also often excluded because they are hard to genotype accurately. With more modern genotyping arrays, the accuracy of genotype calls for low frequency SNPs is less of a concern.
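
As a rough illustration of the marker and sample filters described above, the sketch below prints a PLINK call that applies missingness and Hardy-Weinberg filters. The thresholds and the function name are assumptions chosen for illustration, not recommendations from the Minimac3 authors, and the input is assumed to be a PLINK binary fileset. The command is printed rather than executed.

```shell
#!/bin/sh
# Thresholds are illustrative assumptions; tune them for the study at hand.
GENO=0.05    # drop markers with more than 5% missing genotypes
MIND=0.05    # drop samples with more than 5% missing genotypes
HWE=1e-6     # drop markers with a Hardy-Weinberg test p-value below 1e-6

# qc_command PREFIX: print (not run) a PLINK call that applies the filters
# above to the binary fileset PREFIX.bed/.bim/.fam (hypothetical helper).
qc_command() {
    echo "plink --bfile $1 --geno $GENO --mind $MIND --hwe $HWE --make-bed --out $1.qc"
}

qc_command Gwas.chr20.Unphased
```

Checks such as relatedness, heterozygosity, and ancestry outliers need additional steps that are not shown here.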

Once a quality controlled dataset is available we need to pre-phase the data followed by imputation. The steps are explained below.

Pre-Phasing the GWAS data

Pre-Phasing can be done using either MaCH or SHAPEIT.

MaCH

MaCH is a Markov Chain based haplotyper. It can resolve long haplotypes in samples of unrelated individuals. The source code is available for download here. Check out their home-page for further details.

A typical command line to phase using MaCH looks like this (Gwas.chr20.Unphased.dat and Gwas.chr20.Unphased.ped are the quality-controlled GWAS data set in Merlin format):

mach1 -d Gwas.chr20.Unphased.dat \
      -p Gwas.chr20.Unphased.ped \
      --rounds 20 \
      --states 200 \
      --phase \
      --interim 5 \
      --sample 5 \
      --prefix Gwas.Chr20.Phased.Output

SHAPEIT

SHAPEIT is a fast and accurate method for estimation of haplotypes (phasing) from genotype or sequencing data. The source code is available for download here. Check out their home-page for further details. It can be used to phase a small number of samples (a reference panel required) as well as a large number of samples (NO reference panel required). The reference panels and genetic map files required by SHAPEIT are available for download here.

  • The following example shows a typical SHAPEIT command line to phase a LARGE number (>200) of GWAS samples (Gwas.chr20.Unphased.vcf is the quality controlled GWAS data set in VCF format).
shapeit -V Gwas.chr20.Unphased.vcf \
        -M genetic_map_chr20.txt \
        -O Gwas.Chr20.Phased.Output
  • The following example shows a typical SHAPEIT command line to phase a SMALL number (<200) of GWAS samples (Gwas.chr20.Unphased.vcf is the quality controlled GWAS data set in VCF format).
## The following step lists variants that are mis-aligned between the reference and GWAS panels
shapeit -check \
        -V Gwas.chr20.Unphased.vcf \
        -M genetic_map_chr20.txt \
        --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \
        --output-log gwas.alignments

## The following step phases gwas panel using the reference panel while excluding the markers found in the step above.
shapeit -B gwas \
        -V Gwas.chr20.Unphased.vcf \
        --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \
        --exclude-snp gwas.alignments.strand.exclude \
        -O Gwas.Chr20.Phased.Output

Running Imputation

After pre-phasing, we can run the imputation. Before that, however, we need to convert the phased GWAS panel files (obtained above) to VCF format (since Minimac3 can only use VCF format files) and download the reference panels required for imputation. This gives the following steps.

Convert GWAS Panel Files into VCF

If pre-phased GWAS data is available in VCF format, users can skip this step. Otherwise, the following steps show how to convert other format files to VCF format.

  • PLINK: Use PLINK2 (available here) as follows:
plink --bfile Gwas.Chr20.Phased.Output \
      --recode vcf \
      --out Gwas.Chr20.Phased.Output.VCF.format
  • MaCH: Use Mach2VCF (coming soon) as follows:
mach2VCF --haps Gwas.Chr20.Phased.Output.hap \
         --snps Gwas.Chr20.Phased.Output.snps \
         --prefix Gwas.Chr20.Phased.Output.VCF.format
  • SHAPEIT: Use SHAPEIT (available here) as follows:
shapeit -convert \
        --input-haps Gwas.Chr20.Phased.Output \
        --output-vcf Gwas.Chr20.Phased.Output.VCF.format.vcf
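
Before imputation it can be worth confirming that the converted VCF really is phased. In VCF, phased genotypes separate the alleles with `|` and unphased genotypes with `/`. The sketch below counts unphased genotype calls; the function name is hypothetical, and it assumes an uncompressed, tab-delimited VCF in which GT is the first FORMAT subfield.

```shell
#!/bin/sh
# count_unphased FILE: print the number of genotype calls that use the
# unphased separator "/" (hypothetical helper; expects an uncompressed VCF).
count_unphased() {
    awk -F '\t' '
    /^#/ { next }                          # skip header lines
    {
        for (i = 10; i <= NF; i++) {       # sample columns start at field 10
            split($i, f, ":")              # GT is the first FORMAT subfield
            if (f[1] ~ /\//) bad++
        }
    }
    END { print bad + 0 }' "$1"
}

# Small demonstration on an invented two-sample VCF with one unphased call.
tmp=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$tmp"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >> "$tmp"
printf '20\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n' >> "$tmp"
printf '20\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0/1\t0|0\n' >> "$tmp"
echo "unphased genotype calls: $(count_unphased "$tmp")"
rm -f "$tmp"
```

A count of zero suggests the file is fully phased. Haploid calls (for example, males on the non-PAR of chromosome X) have no separator and are not counted.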

Download Reference Panel

Commonly used reference panels include 1000 Genomes Phase 3 (2,535 samples), 1000 Genomes Phase 1 (1,094 samples), HapMap2 (269 samples), and the Haplotype Reference Consortium (32,914 samples). Users are advised to use either 1000 Genomes Phase 3 (available for download in Reference Panels) or the Haplotype Reference Consortium (which, due to data privacy issues, cannot be shared publicly but can be used for imputation remotely through an imputation server set up at the University of Michigan). Reference panels for different versions of 1000 Genomes, in both VCF and M3VCF format, are available for download in Reference Panels.

Impute Samples

The final step is to run Minimac3 to perform the imputation analysis. Now that we have the pre-phased GWAS panel (in VCF format) and the appropriate reference panel (in VCF or M3VCF format), we can run Minimac3 as follows. Of the two examples below, the first uses a VCF file for the reference (obtained as explained above) and the second uses an M3VCF file (either downloaded from the links below or created on a previous run of Minimac3).

../bin/Minimac3 --refHaps ReferencePanel.Chr20.1000Genomes.vcf \
                --haps Gwas.Chr20.Phased.Output.VCF.format.vcf \
                --prefix Gwas.Chr20.Imputed.Output
../bin/Minimac3 --refHaps ReferencePanel.Chr20.1000Genomes.m3vcf \
                --haps Gwas.Chr20.Phased.Output.VCF.format.vcf \
                --prefix Gwas.Chr20.Imputed.Output

Chromosome X Imputation

Chromosome X has a pseudo-autosomal region (PAR) which can be imputed for males and females together. Imputing the PAR on chromosome X is the same as usual imputation, since both males and females are diploid at these sites. However, the non-pseudo-autosomal region needs to be imputed for males and females separately, as males are haploid while females are diploid. The PAR and non-PAR regions therefore also need to be imputed separately.

The following example illustrates imputation on the non-PAR of chromosome X for males and females separately (files available in Minimac3/test/ directory)

Male Samples (Non-PAR)

 ../bin/Minimac3 --refHaps refPanelChrX.Non.Auto.vcf --haps targetStudyChrX.males.vcf --prefix testRun

Female Samples (Non-PAR)

 ../bin/Minimac3 --refHaps refPanelChrX.Non.Auto.vcf --haps targetStudyChrX.females.vcf --prefix testRun

NOTE: When imputing the non-PAR of chromosome X, users must analyze male and female samples separately; otherwise the program will crash. Users should also ensure that the reference panel contains only the PAR or only the non-PAR region of chromosome X; otherwise the program will crash.
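
Separating the samples by sex can be prepared with a small script. The sketch below is an illustration only: it assumes a hypothetical two-column sample sheet (sample ID, then sex coded as M or F) and writes one list of sample IDs per sex. The function name, the file layout, and the sample IDs are all invented.

```shell
#!/bin/sh
# split_by_sex SEXFILE OUTPREFIX
# SEXFILE is assumed to have two whitespace-separated columns: sample ID and
# sex coded as M or F (a hypothetical layout; adapt to the real sample sheet).
split_by_sex() {
    awk -v p="$2" '
    $2 == "M" { print $1 > (p ".males.txt") }
    $2 == "F" { print $1 > (p ".females.txt") }' "$1"
}

# Small demonstration with invented sample IDs.
dir=$(mktemp -d)
printf 'S1 M\nS2 F\nS3 F\nS4 M\n' > "$dir/sex.txt"
split_by_sex "$dir/sex.txt" "$dir/targetStudyChrX"
echo "males:   $(tr '\n' ' ' < "$dir/targetStudyChrX.males.txt")"
echo "females: $(tr '\n' ' ' < "$dir/targetStudyChrX.females.txt")"
rm -rf "$dir"
```

The two lists could then be passed to a VCF subsetting tool (for example, the samples-file option of `bcftools view`) to produce the separate male and female target files.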

List of Options

The following table gives a brief description of all the parameters of Minimac3. A detailed description will be available soon.

Parameter Description
--refHaps filename VCF file or M3VCF file containing haplotype data for reference panel.
--passOnly If ON, only variants with FILTER=PASS will be read from the reference VCF file (does NOT work on M3VCF files yet).
--haps filename File containing haplotype data for target (gwas) samples. Must be a VCF file.
--processReference This option only converts an input VCF file to M3VCF format (for example, for a later run of imputation). If this option is ON, no imputation is performed and all other parameters are ignored (except the parameters for Reference Haplotypes and Subsetting Options). This option also estimates parameters from the reference panel and saves them in the M3VCF file (the estimation can be skipped with --rounds 0).
--prefix output Prefix for all output files generated. By default: [Minimac3.Output]
--updateModel If ON, saved parameter estimates read from an M3VCF file will be further updated using the gwas samples. Ignored if the reference file is a VCF. [Default: OFF]
--nobgzip If ON, output files will NOT be bgzipped.
--doseOutput If ON, imputed data will be output as dosage file as well [Default: OFF].
--hapOutput If ON, phased imputed data will be output as well [Default: OFF].
--format Specifies which fields to output for the FORMAT field in output VCF file. Available handles: GT,DS,GP [Default: GT,DS].
--chr 22 Chromosome number for which we will carry out imputation.
--start 100000 Start position for imputation by chunking. Would not work without --chr option.
--end 200000 End position for imputation by chunking. Would not work without --chr option.
--window 5000 Length of buffer region on either side of --start and --end. By default = 0.
--rec Recombination File from previous run of Minimac/Minimac3. (--err parameter must also be provided, if using this handle)
--err Error File from previous run of Minimac/Minimac3. (--rec parameter must also be provided, if using this handle)
--rounds 5 Rounds of optimization for model parameters, which describe population recombination rates and per SNP error rates. By default = 5.
--states 200 Maximum number of reference (or target) haplotypes to be examined during parameter optimization. By default = 200.
--help A short help on options.
--cpus 5 Number of cpus for parallel computing. Would work only with Minimac3-omp.
--noPhoneHome If ON, code will NOT send a SUCCESS/FAILURE status of the execution to home server.
--phoneHomeThinning 50 Percentage probability of sending SUCCESS/FAILURE status of the execution to home server [Default: 50%]

M3VCF Files

M3VCF stands for "Minimac3 VCF". These files store data on large reference panels in a compact way, thereby reducing the memory required. They are built on the same idea as the imputation method itself: since, in small genomic segments, the number of unique haplotypes is much smaller than the total number of haplotypes, we can store only the unique representatives instead of all the haplotypes. M3VCF files are a convenient way to save large reference panels as compared to VCF files because:

  • They require less space than VCF files. The compression ratio for a panel of 50K samples and 337K markers is ~1200x (unzipped) and ~4x (zipped).
  • They are faster to read when importing data. The reference panel mentioned above was read 20x faster as an M3VCF file than as a VCF file.
  • They are already stored in a way that attains optimal computational complexity during imputation.
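
The idea behind the compression can be sketched on a toy panel. The code below is not Minimac3's implementation; the haplotype strings and the segment boundaries are invented for illustration. It counts how many distinct haplotypes occur within a segment, which is the number of representatives a block for that segment would need to store.

```shell
#!/bin/sh
# Toy panel: one haplotype per line, one character (0/1) per marker.
# These 8 haplotypes over 6 markers are invented for illustration.
panel='010010
010011
010010
110010
010011
010010
110010
010010'

# count_unique START END: number of distinct haplotypes when the panel is
# restricted to markers START..END (1-based, inclusive).
count_unique() {
    printf '%s\n' "$panel" | cut -c "$1-$2" | sort -u | wc -l | tr -d ' '
}

echo "haplotypes in panel:   $(printf '%s\n' "$panel" | wc -l | tr -d ' ')"
echo "unique in markers 1-3: $(count_unique 1 3)"
echo "unique in markers 4-6: $(count_unique 4 6)"
```

In this toy case each three-marker segment has fewer distinct haplotypes than the panel has haplotypes, so storing the distinct ones plus one index per haplotype takes less space than storing every haplotype in full.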


M3VCF files loosely follow the structure of VCF files. An example is shown below. The first few lines are header lines and give the number of haplotypes, the number of markers, and the number of genomic segments. After these, each genomic segment is defined (denoted by <BLOCK:*-*>), followed by the markers contained in that segment (denoted by their original marker IDs). In the example below, a reference panel of 6 samples (12 haplotypes) and 8 markers is reduced to two genomic segments (<BLOCK:0-5> and <BLOCK:5-7>). The first block runs from marker 0 to 5 (6 variants) and the next from marker 5 to 7 (3 variants). Note that two consecutive blocks must overlap at their common marker. On a block line, the INFO column stores the number of markers in the segment (VARIANTS) and the number of unique haplotypes in that segment (REPS). The sample columns that follow give, for each haplotype, the index of the unique haplotype representative it matches in that segment. The unique haplotypes themselves are stored in the following rows in marker x representative format.


In the rows following the block line, the details of the variants are stored (as in a usual VCF file) along with the unique haplotypes (under the FORMAT column). For <BLOCK:0-5>, there are 4 unique haplotypes (given by REPS), which are the four sub-columns of 0's and 1's under the FORMAT column. Similarly, the 2 unique haplotypes for <BLOCK:5-7> are shown in the FORMAT column for its three markers.


##fileformat=M3VCF
##version=1.1
##compression=block
##n_blocks=2
##n_haps=12
##n_markers=8
##<Note=This is NOT a VCF File and cannot be read by vcftools>
#CHROM  POS     ID              REF     ALT     QUAL    FILTER  INFO                 FORMAT    A1    A2    B1    B2    C1    C2    D1    D2    E1    E2    F1    F2
6       73924   <BLOCK:0-5>     .       .       .       .       B1;VARIANTS=6;REPS=4 .         0     1     3     0     0     0     1     0     3     1     0     3
6       73924   chr6:73924:D    AAGAG   A       .       .       B1.M1;R=7;A=5        0000
6       89919   chr6:89919      T       G       .       .       B1.M;R=4;A=3        0100
6       89921   chr6:89921      C       T       .       .       B1.M3;R=2;A=4        0000
6       89932   chr6:89932      A       G       .       .       B1.M4;R=1;A=3        0000
6       89949   chr6:89949      G       A       .       .       B1.M5;R=3;A=1        0010
6       100116  chr6:100116     C       A       .       .       B1.M6;R=2;A=1        0001
6       100116  <BLOCK:5-7>     .       .       .       .       B2;VARIANTS=3;REPS=2  .        0     1     0     0     0     0     1     0     1     1     0     1
6       100116  chr6:100116     T       A       .       .       B1.M8;R=4;A=1        00
6       132285  chr6:132285     T       A       .       .       B1.M9;R=4;A=1        01
6       148689  chr6:148689     TAA     T       .       .       B1.M9;R=4;A=1        01
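
To make the layout concrete, the sketch below expands a block back into one allele per haplotype column. It is an illustration of the format as described above, not code from Minimac3, and the function name is hypothetical. It reads whitespace-separated M3VCF-like text and handles one block at a time (the marker shared by two consecutive blocks would otherwise be printed twice).

```shell
#!/bin/sh
# expand_m3vcf: read M3VCF-like text on stdin and print, for every variant,
# its position followed by one allele per haplotype column.
expand_m3vcf() {
    awk '
    /^#/ { next }                  # skip header lines
    $3 ~ /^<BLOCK/ {               # block line: store the representative index of each column
        n = NF - 9
        for (j = 1; j <= n; j++) rep[j] = $(9 + j)
        next
    }
    {                              # variant line: field 9 has one allele per representative
        out = $2
        for (j = 1; j <= n; j++) out = out " " substr($9, rep[j] + 1, 1)
        print out
    }'
}

# First block of the example above, copied with single-space separators.
expand_m3vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A1 A2 B1 B2 C1 C2 D1 D2 E1 E2 F1 F2
6 73924 <BLOCK:0-5> . . . . B1;VARIANTS=6;REPS=4 . 0 1 3 0 0 0 1 0 3 1 0 3
6 73924 chr6:73924:D AAGAG A . . B1.M1;R=7;A=5 0000
6 89919 chr6:89919 T G . . B1.M;R=4;A=3 0100
6 89921 chr6:89921 C T . . B1.M3;R=2;A=4 0000
6 89932 chr6:89932 A G . . B1.M4;R=1;A=3 0000
6 89949 chr6:89949 G A . . B1.M5;R=3;A=1 0010
6 100116 chr6:100116 C A . . B1.M6;R=2;A=1 0001
EOF
```

For each variant, the allele of a haplotype is looked up by taking that haplotype's representative index from the block line and reading the character at that position in the variant's string of 0's and 1's.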


Reference Panels for Download

Some commonly used reference panels are available for download here. [NOTE: Chromosome X will be available soon]

Reference Panel Format Download Link Internal CSG Copy Link
1000 Genomes Phase 3 VCF Files Coming Soon /net/fantasia/home/sayantan/DATABASE/1000G/PHASE_3/FOR_UPLOAD/G1K_P3/VCF_Files/
1000 Genomes Phase 3 M3VCF Files (With Parameter Estimates) Coming Soon /net/fantasia/home/sayantan/DATABASE/1000G/PHASE_3/FOR_UPLOAD/G1K_P3/M3VCF_Files_With_Estimates/
1000 Genomes Phase 3 M3VCF Files (Without Parameter Estimates) Coming Soon /net/fantasia/home/sayantan/DATABASE/1000G/PHASE_3/FOR_UPLOAD/G1K_P3/M3VCF_Files_No_Estimates/
1000 Genomes Phase 1 VCF Files Coming Soon /net/fantasia/home/sayantan/DATABASE/1000G/PHASE_1_V3/FOR_UPLOAD/G1K_P1/VCF_Files/
1000 Genomes Phase 1 M3VCF Files (With Parameter Estimates) Coming Soon /net/fantasia/home/sayantan/DATABASE/1000G/PHASE_1_V3/FOR_UPLOAD/G1K_P1/M3VCF_Files_With_Estimates/
1000 Genomes Phase 1 M3VCF Files (Without Parameter Estimates) Coming Soon /net/fantasia/home/sayantan/DATABASE/1000G/PHASE_1_V3/FOR_UPLOAD/G1K_P1/M3VCF_Files_No_Estimates/

Contact

For queries or bug reports, please contact Sayantan Das.