Minimac3

From Genome Analysis Wiki
Revision as of 02:54, 25 January 2015 by Santy.8128

Introduction

Minimac3 is a lower-memory and more computationally efficient implementation of minimac2. It is an algorithm for genotype imputation that works on phased genotypes (say, from MaCH). Minimac3 is designed to handle very large reference panels in a more computationally efficient way with no loss of accuracy. The algorithm analyzes only the unique sets of haplotypes in small genomic segments, thereby saving computation time and memory without any loss of accuracy.

Apart from performing imputation, Minimac3 also creates M3VCF files (customized Minimac3 VCF files), which store reference panel information in a compact form and so save the memory and time required to read large datasets. Users can run the binary either to just convert VCF files to M3VCF files or to perform imputation as well. The program can also take a previously generated M3VCF file as input for the reference panel. M3VCF files can also store pre-calculated estimates of recombination fractions and error rates, which can be reused in later imputation runs. The latest version of Minimac3 also allows output in VCF format for easier data manipulation in downstream analysis.

Download

Minimac3 is available as an undocumented release version. The source files and commonly used reference panels in M3VCF format will be available for download here. The authors would appreciate it if users could test it on their data sets and report any bugs that need to be fixed.

You can either copy the directory from fantasia OR download it from the link below.

  • To copy from fantasia:
 cp -r /net/fantasia/home/sayantan/Softwares/Minimac3/ LocalDirectoryMinimac3/
  • To download:
Description                                        Download Link
Minimac3 Executable                                UNIX Users
Minimac3-omp Executable (for parallel computing)   UNIX Users
Minimac3 Source Files                              UNIX Users

Usage

Users can always type the following for further support:

 /bin/Minimac3 --help

A typical Minimac3 command line would have the following parameter options:

Command Line Options:
   Reference Haplotypes : --refHaps [], --passOnly
      Target Haplotypes : --haps []
      Output Parameters : --processReference, --prefix [Minimac3.Output],
                          --updateModel, --nobgzip, --doseOutput, --hapOutput,
                          --format [GT,DS]
      Subset Parameters : --chr [], --start, --end, --window
    Starting Parameters : --rec [], --err []
  Estimation Parameters : --rounds [5], --states [200]
       Other Parameters : --help, --cpus [1], --params
              PhoneHome : --noPhoneHome, --phoneHomeThinning [50]


The most commonly used options are explained below with examples. See the List of Options section below for a description of all available options.

Reference Haplotypes

Minimac3 accepts either VCF files or M3VCF files as input for the reference panel. The program identifies the file type itself, so no extra option is needed for that. M3VCF files are customized files created by Minimac3 in a previous run. They store large reference panels in a compact form to save the memory and computation time involved in reading large files, and can be reused in later runs for faster loading of data. See the section on M3VCF files and the examples below on how to use them.

"--refHaps" denotes the main input reference file, which is either a VCF file or an M3VCF file. The program detects the file type itself.

Target Haplotypes

Minimac3 accepts either VCF files or MaCH files as input for the target/GWAS data. The program identifies the file type itself, so no extra option is needed for that. Note that input VCF files are automatically assumed to be phased.

"--haps" denotes the main input target file, which is either a VCF file (.vcf or .vcf.gz) or a MaCH phased file (.mach or .haps). The extensions are not mandatory. Zipped formats are supported. See examples below.

"--snps" denotes the marker name input file for the target panel. This parameter is mandatory if MaCH phased files are used for the target panel and is ignored if VCF files are used. Markers which are in the target panel but NOT in the reference panel are excluded from the output files. Users must merge these extra markers back from the original data in order to analyze them.

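Because markers that are in the target panel but not in the reference panel are dropped from the output, it can help to list them before running imputation. The following is a minimal Python sketch and not part of Minimac3. The function name and the marker IDs are made up for illustration, and it assumes that marker IDs are directly comparable between the two panels.

```python
def target_only_markers(target_ids, reference_ids):
    """Return target markers absent from the reference, in target order."""
    reference = set(reference_ids)
    return [m for m in target_ids if m not in reference]


# Hypothetical marker IDs, for illustration only.
target = ["chr6:73924", "chr6:80000", "chr6:89919", "chr6:95000"]
reference = ["chr6:73924", "chr6:89919", "chr6:89921"]

excluded = target_only_markers(target, reference)
print(excluded)  # ['chr6:80000', 'chr6:95000']
```

The resulting list names the markers that would have to be merged back from the original target data after imputation.
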
Examples

To try the examples, first copy the Minimac3 folder to your local directory, then move to the folder LocalDirectoryMinimac3/test/:

 cp -r /net/fantasia/home/sayantan/Softwares/Minimac3/ LocalDirectoryMinimac3/
 cd LocalDirectoryMinimac3/test/

The following example uses a VCF reference file [refPanel.vcf] and a VCF target sample file [targetStudy.vcf]

 ../bin/Minimac3 --refHaps refPanel.vcf --haps targetStudy.vcf --prefix testRun

The following example is the same as above but uses Minimac3-omp (which is implemented with OpenMP to enable parallel computing).

 ../bin/Minimac3-omp --refHaps refPanel.vcf --haps targetStudy.vcf --prefix testRun --cpus 5

The following example only converts a VCF reference file into M3VCF format. It also estimates parameters from the reference panel using a leave-one-out method and saves them in the M3VCF file. The parameter estimation can be skipped with "--rounds 0". If the option "--processReference" is ON, no imputation is done; the file is only compressed from VCF to M3VCF format.

 ../bin/Minimac3 --refHaps refPanel.vcf --processReference --prefix testRun

The following example uses an M3VCF file (which was created in the previous example) and a VCF target sample file (targetStudy.vcf) for imputation.

 ../bin/Minimac3 --refHaps testRun.m3vcf.gz --haps targetStudy.vcf --prefix testRun

[NOTE: In the example above, testRun.m3vcf.gz contains saved parameter estimates, and the program uses them. If the M3VCF file was created with --rounds 0, it contains no parameter estimates, and the program estimates the parameters during the imputation run, as in the example below.]

 ../bin/Minimac3 --refHaps refPanel.vcf --processReference --rounds 0 --prefix testRun
 ../bin/Minimac3 --refHaps testRun.m3vcf.gz --haps targetStudy.vcf --prefix testRun

The following example also uses an M3VCF reference file [testRun.m3vcf.gz] and a VCF target sample file [targetStudy.vcf]. However, it only analyzes chromosome 6 from position 505988 to 873131 (allowing a buffer of 100 bp on either side). It also outputs a phased haplotype file (using the --hapOutput option) and the usual dosage file (using the --doseOutput option).

 ../bin/Minimac3 --refHaps testRun.m3vcf.gz --chr 6 --start 505988 --end 873131 --window 100 --haps targetStudy.vcf --prefix testRun --hapOutput --doseOutput
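
To impute a long region in several runs, the subset options can be generated once per chunk. The following Python sketch shows one possible way to do that. The helper `chunk_args` and the chunk size are hypothetical; only the flags --chr, --start, --end and --window come from this page, and no claim is made about how Minimac3 treats chunk edges internally.

```python
def chunk_args(chrom, start, end, chunk_size, window):
    """Yield Minimac3 subset options for consecutive, non-overlapping chunks."""
    pos = start
    while pos <= end:
        chunk_end = min(pos + chunk_size - 1, end)
        yield ["--chr", str(chrom), "--start", str(pos),
               "--end", str(chunk_end), "--window", str(window)]
        pos = chunk_end + 1


# Region from the example above, split into chunks of at most 200000 bp.
for args in chunk_args(6, 505988, 873131, 200000, 100):
    print(" ".join(args))
```

Each printed line can be appended to a command like the one above, presumably with a different --prefix per chunk so that the outputs of one chunk do not replace those of another.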

For examples of chromosome X imputation, see the Chromosome X Imputation section below.

Chromosome X Imputation

Chromosome X has a pseudo-autosomal region (PAR), which can be imputed for males and females together. Imputing the PAR is the same as usual imputation, since both males and females are diploid at these sites. However, the non-pseudo-autosomal region (non-PAR) needs to be imputed for males and females separately, as males are haploid there while females are diploid. The PAR and non-PAR regions therefore also need to be imputed separately.

The following example illustrates imputation on the non-PAR of chromosome X for males and females separately (files available in Minimac3/test/ directory)

Male Samples (Non-PAR)

 ../bin/Minimac3 --refHaps refPanelChrX.Non.Auto.vcf --haps targetStudyChrX.males.vcf --prefix testRun

Female Samples (Non-PAR)

 ../bin/Minimac3 --refHaps refPanelChrX.Non.Auto.vcf --haps targetStudyChrX.females.vcf --prefix testRun

NOTE: When imputing the non-PAR of chromosome X, male and female samples must be analyzed separately; otherwise the program will crash. The reference panel must also contain only the PAR or only the non-PAR region of chromosome X; otherwise the program will crash.
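
Since mixing male and female samples in a non-PAR run makes the program crash, a quick ploidy check on the target VCF before imputation may be useful. The Python sketch below is an illustration only. It assumes (this page does not specify it) that male non-PAR genotypes are written as haploid GT values such as 0 or 1 and diploid ones as 0|1, and that GT is the first FORMAT entry. `gt_ploidy` and `ploidies_in_vcf` are hypothetical helper names, and the VCF lines are made up.

```python
def gt_ploidy(gt):
    """Number of alleles in a GT string, e.g. '1' -> 1, '0|1' -> 2."""
    return len(gt.replace("/", "|").split("|"))


def ploidies_in_vcf(lines):
    """Collect the set of ploidies seen across all samples and records."""
    seen = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        for sample in fields[9:]:        # sample columns start at index 9
            gt = sample.split(":")[0]    # assumes GT is the first FORMAT entry
            seen.add(gt_ploidy(gt))
    return seen


# Tiny made-up VCF body: one haploid and one diploid sample mixed together.
mixed = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tM1\tF1\n",
    "X\t3000000\trs1\tA\tG\t.\tPASS\t.\tGT\t1\t0|1\n",
]
print(ploidies_in_vcf(mixed))  # more than one ploidy: split samples by sex first
```

Under these assumptions, a file that is ready for a non-PAR run yields a set with a single ploidy value.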

List of Options

The following table gives a brief description of all the parameters of Minimac3. A detailed description will be available soon.

Parameter                 Description
--refHaps filename        VCF file or M3VCF file containing haplotype data for the reference panel.
--passOnly                If ON, only variants with FILTER=PASS will be read from the reference VCF file (does NOT work on M3VCF files yet).
--haps filename           File containing haplotype data for target (GWAS) samples. Must be a VCF file.
--processReference        Only converts an input VCF file to M3VCF format (for example, for a later imputation run). If this option is ON, no imputation is performed and all other parameters are ignored (except the Reference Haplotypes and Subset Parameters options). This option also estimates parameters from the reference panel and saves them in the M3VCF file (the estimation can be skipped with --rounds 0).
--prefix output           Prefix for all output files generated. [Default: Minimac3.Output]
--updateModel             If ON, saved parameter estimates read from an M3VCF file will be further updated using the GWAS samples. Ignored if the reference file is a VCF file. [Default: OFF]
--nobgzip                 If ON, output files will NOT be bgzipped.
--doseOutput              If ON, imputed data will also be output as a dosage file. [Default: OFF]
--hapOutput               If ON, phased imputed data will also be output. [Default: OFF]
--format                  Specifies which fields to output in the FORMAT field of the output VCF file. Available handles: GT,DS,GP. [Default: GT,DS]
--chr 22                  Chromosome number for which imputation will be carried out.
--start 100000            Start position for imputation by chunking. Does not work without the --chr option.
--end 200000              End position for imputation by chunking. Does not work without the --chr option.
--window 5000             Length of the buffer region on either side of --start and --end. [Default: 0]
--rec                     Recombination file from a previous run of Minimac/Minimac3 (--err must also be provided when using this option).
--err                     Error file from a previous run of Minimac/Minimac3 (--rec must also be provided when using this option).
--rounds 5                Rounds of optimization for model parameters, which describe population recombination rates and per-SNP error rates. [Default: 5]
--states 200              Maximum number of reference (or target) haplotypes to be examined during parameter optimization. [Default: 200]
--help                    Prints a short help on options.
--cpus 5                  Number of CPUs for parallel computing. Works only with Minimac3-omp.
--noPhoneHome             If ON, the program will NOT send a SUCCESS/FAILURE status of the execution to the home server.
--phoneHomeThinning 50    Percentage probability of sending the SUCCESS/FAILURE status of the execution to the home server. [Default: 50%]

M3VCF Files

M3VCF stands for "Minimac3 VCF". These files store data on large reference panels in a compact way, thereby reducing the memory required. They are based on the same idea as the imputation method itself: since the number of unique haplotypes in a small genomic segment is much smaller than the total number of haplotypes, it is enough to store the unique representatives instead of all the haplotypes. M3VCF files are a more convenient way to store large reference panels than VCF files because:

  • They require less space than VCF files. The compression ratio for a panel of 50K samples and 337K markers is ~1200x (unzipped) and ~4x (zipped).
  • They are faster to read when importing data. The reference panel mentioned above was read 20x faster as an M3VCF file than as a VCF file.
  • They are already stored in a form that gives optimal computational complexity during imputation.
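
The idea of storing only the unique haplotypes of a segment can be sketched in a few lines of Python. This is a toy illustration, not the Minimac3 implementation: haplotypes are plain strings of 0/1 alleles, the segment boundaries are given by hand rather than chosen by any optimization, and `compress_segment` is a hypothetical name.

```python
def compress_segment(haplotypes, start, end):
    """Return (unique_reps, labels) for markers start..end (inclusive).

    unique_reps: distinct allele strings in order of first appearance.
    labels: for each input haplotype, the index of its representative.
    """
    reps, index, labels = [], {}, []
    for hap in haplotypes:
        piece = hap[start:end + 1]
        if piece not in index:
            index[piece] = len(reps)
            reps.append(piece)
        labels.append(index[piece])
    return reps, labels


# Six made-up haplotypes over five markers.
haps = ["00100", "00100", "01100", "00100", "01101", "00101"]

reps, labels = compress_segment(haps, 0, 2)
print(reps)    # ['001', '011']
print(labels)  # [0, 0, 1, 0, 1, 0]
```

In this toy case six haplotypes over three markers reduce to two representatives plus one small label per haplotype, which is where the saving comes from when panels are large.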


M3VCF files loosely follow the structure of VCF files. An example is shown below. The first few lines are header lines and contain the number of haplotypes, the number of markers and the number of genomic segments. After these, each genomic segment is defined (denoted by <BLOCK:*-*>), followed by the markers contained in that segment (denoted by their original marker IDs). In the example below, a reference panel of 6 samples (12 haplotypes) and 8 markers is reduced to two genomic segments (<BLOCK:0-5> and <BLOCK:5-7>). The first block spans markers 0 to 5 (6 variants) and the second spans markers 5 to 7 (3 variants). Note that two consecutive blocks must overlap at their common marker. In a block line, the INFO column stores the number of markers in the segment (VARIANTS) and the number of unique haplotypes in that segment (REPS). The columns after FORMAT, one per haplotype, give the index of the unique haplotype representative that this haplotype matches in the segment. The unique haplotypes themselves are stored in the rows that follow, with one row per marker and one character per unique haplotype.


In the rows following a block line, the details of the variants are stored (as in a usual VCF file) along with the unique haplotypes (under the FORMAT column). For <BLOCK:0-5>, there are 4 unique haplotypes (given by the variable REPS), which are the four sub-columns of 0's and 1's under the FORMAT column. Similarly, the 2 unique haplotypes for <BLOCK:5-7> are shown in the FORMAT column for its three markers.


##fileformat=M3VCF
##version=1.1
##compression=block
##n_blocks=2
##n_haps=12
##n_markers=8
##<Note=This is NOT a VCF File and cannot be read by vcftools>
#CHROM  POS     ID              REF     ALT     QUAL    FILTER  INFO                 FORMAT    A1    A2    B1    B2    C1    C2    D1    D2    E1    E2    F1    F2
6       73924   <BLOCK:0-5>     .       .       .       .       B1;VARIANTS=6;REPS=4 .         0     1     3     0     0     0     1     0     3     1     0     3
6       73924   chr6:73924:D    AAGAG   A       .       .       B1.M1;R=7;A=5        0000
6       89919   chr6:89919      T       G       .       .       B1.M;R=4;A=3        0100
6       89921   chr6:89921      C       T       .       .       B1.M3;R=2;A=4        0000
6       89932   chr6:89932      A       G       .       .       B1.M4;R=1;A=3        0000
6       89949   chr6:89949      G       A       .       .       B1.M5;R=3;A=1        0010
6       100116  chr6:100116     C       A       .       .       B1.M6;R=2;A=1        0001
6       100116  <BLOCK:5-7>     .       .       .       .       B2;VARIANTS=3;REPS=2  .        0     1     0     0     0     0     1     0     1     1     0     1
6       100116  chr6:100116     T       A       .       .       B1.M8;R=4;A=1        00
6       132285  chr6:132285     T       A       .       .       B1.M9;R=4;A=1        01
6       148689  chr6:148689     TAA     T       .       .       B1.M9;R=4;A=1        01
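
To make the layout concrete, the following Python sketch expands <BLOCK:0-5> from the example back into one allele string per haplotype. It is based only on the description on this page and is not a full M3VCF reader: the values are typed in by hand, and `expand_block` is a hypothetical helper.

```python
def expand_block(labels, marker_rows):
    """Rebuild per-haplotype allele strings for one block.

    labels: representative index for each haplotype (the sample columns
            of the <BLOCK:...> line).
    marker_rows: one string per marker; character j is the allele that
                 unique representative j carries at that marker.
    """
    return ["".join(row[rep] for row in marker_rows) for rep in labels]


# Values copied from <BLOCK:0-5> in the example (REPS=4, VARIANTS=6).
labels = [0, 1, 3, 0, 0, 0, 1, 0, 3, 1, 0, 3]
marker_rows = ["0000", "0100", "0000", "0000", "0010", "0001"]
names = ["A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2", "E1", "E2", "F1", "F2"]

for name, hap in zip(names, expand_block(labels, marker_rows)):
    print(name, hap)
```

Here haplotype A2 (representative 1) carries allele 1 only at the second marker of the block, and B1 (representative 3) only at the last one, while every haplotype labelled 0 carries allele 0 throughout.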

Contact

For any queries or bug reports, please contact Sayantan Das.