Difference between revisions of "Minimac4"

From Genome Analysis Wiki

Revision as of 02:57, 12 July 2017

Introduction

Minimac4 is the latest version in the series of genotype imputation software, preceded by Minimac3 (2015), Minimac2 (2014), minimac (2012) and MaCH (2010). Minimac4 is a lower-memory and more computationally efficient implementation of the original algorithms, with a negligible loss in imputation quality.

The Minimac3 mailing list has been renamed the Minimac4 mailing list. If you were already a member, there is no need to re-join. If not, please join our mailing list to get updates about future releases. To report possible bugs, post to the mailing list or email them to Sayantan Das.

Download

Minimac4 (version 1.0.2, updated 6.29.2017) is currently available for testing purposes only (while we still run more tests and wait on feedback about potential bugs). Commonly used reference panels in M3VCF format are available for download in Reference Panels.

GitHub repo: Minimac4 GitHub

What's New

The input file format, output file formats and typical command lines are the same in Minimac4 as they were in Minimac3. Some of the main new features are summarized below:

  • Improved Speed - Minimac4 is approximately 6 times faster for the 1000 Genomes Phase 1 and Phase 3 reference panels and 2 times faster for the HRC reference panels, with a negligible loss in accuracy (details of accuracy for imputing into 10 European samples are given here). The speed can be further improved by tuning the approximation parameters (see below), but we recommend using the default values.
  • Automated Chunking - Minimac4 automatically splits the whole chromosome into overlapping chunks, analyzes each chunk sequentially and then concatenates the imputed chunks back together. This caps the memory usage across chromosomes (larger chromosomes need the same amount of memory as smaller ones). The length of the chunk and the overlap can be controlled by the parameters --chunkLengthMb 20 and --chunkLengthOverlapMb 3, although we recommend using the default values.
  • Approximations - Minimac4 uses some simple approximations to speed up the imputation analyses. The level of approximation can be controlled by the parameters --probThreshold, --diffThreshold, and --topThreshold (details given in Minimac4 Usage). Higher values of these parameters will reduce the compute time but also marginally reduce the imputation accuracy. We recommend using the default values (= 0.01).
  • Improved Chromosome X/Y Support - Minimac4 can handle different ploidies in the same GWAS file for imputation of sex chromosomes. For example, for the non-PAR region on chromosome X, males can be imputed together with females, irrespective of whether males are coded as haploids or diploids. However, each sample must have a fixed ploidy. Thus, PAR and non-PAR regions still need to be imputed separately, but males and females need not be separated. Please see Chromosome X Imputation for more details.
  • Other Helpful Features
    • We introduced a parameter --memUsage that will estimate and report the memory required by the imputation experiment. This feature should be useful for users running their jobs on a compute cluster that requires memory specification.
    • We introduced some other FORMAT options for the output dosage data, which should be sufficient to enable users to retrieve haplotype dosages, genotype probabilities, genotype dosages or any other summary measure that they want.
    • We have fixed the bug related to FILTER=GENOTYPED and FILTER=GENOTYPED_ONLY that was causing a crash in bcftools.
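
The automated chunking described above can be illustrated with a small sketch. This is not Minimac4's actual implementation: the function name make_chunks is hypothetical, the values 20 and 3 are simply taken from the parameter examples above, and padding each chunk by the overlap on both sides is an assumption made for illustration. The point it shows is that the padded chunk length is bounded by the chunk length plus twice the overlap, no matter how long the chromosome is, which is why memory use stays capped.

```python
def make_chunks(chrom_length_mb, chunk_length_mb=20.0, overlap_mb=3.0):
    """Split [0, chrom_length_mb] into consecutive core chunks with overlap padding.

    Hypothetical helper, for illustration only. Returns a list of tuples
    (core_start, core_end, padded_start, padded_end), all in Mb.
    """
    if chunk_length_mb <= 0 or overlap_mb < 0:
        raise ValueError("chunk length must be positive and overlap non-negative")

    chunks = []
    core_start = 0.0
    while core_start < chrom_length_mb:
        # The core region is the part whose results would be kept.
        core_end = min(core_start + chunk_length_mb, chrom_length_mb)
        # Assumption: the overlap is added on both sides, clipped to the chromosome.
        padded_start = max(0.0, core_start - overlap_mb)
        padded_end = min(chrom_length_mb, core_end + overlap_mb)
        chunks.append((core_start, core_end, padded_start, padded_end))
        core_start = core_end
    return chunks


if __name__ == "__main__":
    # Example: a 50 Mb chromosome split into 20 Mb chunks with 3 Mb overlap.
    for core_start, core_end, pad_start, pad_end in make_chunks(50.0):
        print(f"core {core_start}-{core_end} Mb, analyzed {pad_start}-{pad_end} Mb")
```

In this sketch a 250 Mb chromosome and a 50 Mb chromosome both produce chunks of at most 26 Mb, so the per-chunk working set does not grow with chromosome size.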

Reference Panels for Download

Some commonly used reference panels are available for download here:

Reference Panel                   Number of Samples  File Format  Parameter Estimates  Available Chromosomes  Link
1000 Genomes Phase 3 (version 5)  2,504              VCF          -                    1-22,X                 Download
                                                     M3VCF        YES                  1-22,X                 Download
                                                     M3VCF        NO                   1-22,X                 Download
                                                     VCF,M3VCF    YES                  X                      Download
1000 Genomes Phase 1 (version 3)  1,092              VCF          -                    1-22,X                 Download
                                                     M3VCF        YES                  1-22,X                 Download
                                                     M3VCF        NO                   1-22,X                 Download
                                                     VCF,M3VCF    YES                  X                      Download