M3VCF Files

From Genome Analysis Wiki
Revision as of 20:35, 29 January 2015 by Santy.8128 (talk | contribs) (→‎Download)

Introduction

Minimac3 is a lower-memory, more computationally efficient implementation of minimac2. It is an algorithm for genotype imputation that works on phased genotypes (for example, from MaCH) and is designed to handle very large reference panels more efficiently with no loss of accuracy.

This wiki page gives users a detailed explanation of the structure of M3VCF files.

M3VCF Files

M3VCF ("Minimac3 VCF") files store large reference panels compactly, reducing the memory required. They are built on the same idea as the imputation method itself: in a small genomic segment, the number of unique haplotypes is much smaller than the total number of haplotypes, so it is enough to store the unique representatives rather than every haplotype. Compared with VCF files, M3VCF files are a convenient way to save large reference panels because:

  • They require less space than VCF files. The compression ratio for a panel of 50K samples and 337K markers is ~1200x (unzipped) and ~4x (zipped).
  • They are faster to read when importing data. The reference panel mentioned above was imported 20x faster as an M3VCF file than as a VCF file.
  • They are already stored in a form that attains optimal computational complexity during imputation.
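
The idea of keeping only the unique haplotypes in a segment can be illustrated with a short Python sketch. This is not Minimac3's own code: the function names compress_segment and expand are hypothetical, the toy haplotypes are made up, and representatives are numbered here simply in order of first appearance, which is an assumption rather than a documented rule.

```python
def compress_segment(haplotypes):
    """Split one segment into unique representatives plus one label per haplotype.

    `haplotypes` is a list of equal-length allele strings, one per haplotype.
    Representatives are numbered in order of first appearance (an assumption).
    """
    reps = []        # unique allele strings seen so far
    index_of = {}    # allele string -> index into reps
    labels = []      # for each input haplotype, the index of its representative
    for hap in haplotypes:
        if hap not in index_of:
            index_of[hap] = len(reps)
            reps.append(hap)
        labels.append(index_of[hap])
    return reps, labels


def expand(reps, labels):
    """Rebuild the full list of haplotypes from representatives and labels."""
    return [reps[label] for label in labels]


# Toy segment: 6 haplotypes over 4 markers (made-up data).
haps = ["0000", "0100", "0000", "0001", "0100", "0000"]
reps, labels = compress_segment(haps)
print(reps)
print(labels)
```

Only the three distinct strings and six small integers need to be stored, and expand(reps, labels) recovers the original six haplotypes exactly.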


M3VCF files loosely follow the structure of VCF files. An example is shown below. The first few lines are header lines and give the number of haplotypes, the number of markers and the number of genomic segments. After the header, each genomic segment is defined by a block line (denoted by <BLOCK:*-*>), followed by the markers contained in that segment (denoted by their original marker IDs). In the example below, a reference panel of 6 samples (12 haplotypes) and 8 markers was reduced to two genomic segments (<BLOCK:0-5> and <BLOCK:5-7>). The first block runs from marker 0 to marker 5 (6 variants) and the second from marker 5 to marker 7 (3 variants). Note that two consecutive blocks must overlap at their common marker.

On a block line, the INFO column stores the number of markers in the segment (VARIANTS) and the number of unique haplotypes in that segment (REPS). The columns after FORMAT, one per haplotype (two per sample), each hold the index of the unique haplotype representative that the haplotype matches within that genomic segment. The unique haplotypes themselves are stored in the rows that follow, in marker x unique-haplotype format.


In the rows following the block line, the details of each variant are stored (as in a usual VCF file) along with the unique haplotypes (under the FORMAT column). For <BLOCK:0-5>, there are 4 unique haplotypes (given by the variable REPS), which are the four sub-columns of 0's and 1's under the FORMAT column. Similarly, the 2 unique haplotypes for <BLOCK:5-7> are shown in the FORMAT column for its three markers.


##fileformat=M3VCF
##version=1.1
##compression=block
##n_blocks=2
##n_haps=12
##n_markers=8
##<Note=This is NOT a VCF File and cannot be read by vcftools>
#CHROM  POS     ID              REF     ALT     QUAL    FILTER  INFO                 FORMAT    A1    A2    B1    B2    C1    C2    D1    D2    E1    E2    F1    F2
6       73924   <BLOCK:0-5>     .       .       .       .       B1;VARIANTS=6;REPS=4 .         0     1     3     0     0     0     1     0     3     1     0     3
6       73924   chr6:73924:D    AAGAG   A       .       .       B1.M1;R=7;A=5        0000
6       89919   chr6:89919      T       G       .       .       B1.M;R=4;A=3        0100
6       89921   chr6:89921      C       T       .       .       B1.M3;R=2;A=4        0000
6       89932   chr6:89932      A       G       .       .       B1.M4;R=1;A=3        0000
6       89949   chr6:89949      G       A       .       .       B1.M5;R=3;A=1        0010
6       100116  chr6:100116     C       A       .       .       B1.M6;R=2;A=1        0001
6       100116  <BLOCK:5-7>     .       .       .       .       B2;VARIANTS=3;REPS=2  .        0     1     0     0     0     0     1     0     1     1     0     1
6       100116  chr6:100116     T       A       .       .       B1.M8;R=4;A=1        00
6       132285  chr6:132285     T       A       .       .       B1.M9;R=4;A=1        01
6       148689  chr6:148689     TAA     T       .       .       B1.M9;R=4;A=1        01
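
To make the layout concrete, the example above can be read back into per-haplotype allele strings with a small Python sketch. This is only an illustration under stated assumptions, not the Minimac3 reader: the names parse_m3vcf and expand_block are hypothetical, columns are assumed to be whitespace-separated exactly as in the listing, and each block is expanded on its own without stitching blocks together.

```python
EXAMPLE = """\
##fileformat=M3VCF
##n_blocks=2
##n_haps=12
##n_markers=8
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A1 A2 B1 B2 C1 C2 D1 D2 E1 E2 F1 F2
6 73924  <BLOCK:0-5>  .     . . . B1;VARIANTS=6;REPS=4 . 0 1 3 0 0 0 1 0 3 1 0 3
6 73924  chr6:73924:D AAGAG A . . B1.M1;R=7;A=5 0000
6 89919  chr6:89919   T     G . . B1.M;R=4;A=3  0100
6 89921  chr6:89921   C     T . . B1.M3;R=2;A=4 0000
6 89932  chr6:89932   A     G . . B1.M4;R=1;A=3 0000
6 89949  chr6:89949   G     A . . B1.M5;R=3;A=1 0010
6 100116 chr6:100116  C     A . . B1.M6;R=2;A=1 0001
6 100116 <BLOCK:5-7>  .     . . . B2;VARIANTS=3;REPS=2 . 0 1 0 0 0 0 1 0 1 1 0 1
6 100116 chr6:100116  T     A . . B1.M8;R=4;A=1 00
6 132285 chr6:132285  T     A . . B1.M9;R=4;A=1 01
6 148689 chr6:148689  TAA   T . . B1.M9;R=4;A=1 01
"""


def parse_m3vcf(text):
    """Parse M3VCF text into a list of blocks (hypothetical helper)."""
    blocks = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        if fields[2].startswith("<BLOCK:"):
            # Block line: INFO holds VARIANTS and REPS, labels follow FORMAT.
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            blocks.append({
                "id": fields[2],
                "n_variants": int(info["VARIANTS"]),
                "n_reps": int(info["REPS"]),
                "labels": [int(x) for x in fields[9:]],
                "variants": [],  # (marker ID, alleles of the unique haplotypes)
            })
        else:
            # Variant line: the 9th column has one character per unique haplotype.
            blocks[-1]["variants"].append((fields[2], fields[8]))
    return blocks


def expand_block(block):
    """Return one allele string per haplotype for a single block."""
    rows = [alleles for _, alleles in block["variants"]]
    return ["".join(row[label] for row in rows) for label in block["labels"]]


blocks = parse_m3vcf(EXAMPLE)
for block in blocks:
    print(block["id"], expand_block(block))
```

Each label selects one character position in every variant row of its block, so a haplotype labelled 1 in <BLOCK:0-5> takes the second character of each of the six rows.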


Download

Minimac3 is available as an undocumented release version. The source files (and binary executable) are available for download under Source Files, and commonly used reference panels in VCF and M3VCF formats are available for download under Reference Panels.

Useful Wiki Pages

There are a few pages in this Wiki that may be useful for Minimac3 users. Here are links to a few:

Contact

For any queries or bug reports, please contact Sayantan Das.