Changes

4,360 bytes removed ,  00:46, 29 January 2015
Line 211: Line 211:     
NOTE: To impute the non-PAR region of chromosome X, male and female samples must be analyzed separately; otherwise the program will crash. The reference panel must also contain only the PAR region or only the non-PAR region of chromosome X; otherwise the program will crash.
 
  −
= M3VCF Files =
  −
  −
<code>M3VCF</code> stands for "'''M'''inimac'''3''' '''VCF'''". These files store large reference panels compactly, which reduces the memory required. They are built on the same idea as the imputation method itself: within a small genomic segment, the number of unique haplotypes is much smaller than the total number of haplotypes, so storing only the unique representatives instead of every haplotype saves memory. Compared with VCF files, <code>M3VCF</code> files are a convenient way to store large reference panels because:
  −
* They require less space than VCF files. For a panel of 50K samples and 337K markers, the compression ratio is '''~1200x''' (unzipped) and '''~4x''' (zipped).
  −
* They are faster to read when importing data. The reference panel mentioned above was imported '''20x''' faster as an <code>M3VCF</code> file than as a VCF file.
  −
* They are already stored in a layout that gives optimal computational complexity during imputation.
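The idea of keeping only the unique haplotypes in a segment can be illustrated with a minimal Python sketch. This is not Minimac3's implementation: the function name <code>compress_segment</code> and the toy haplotypes are hypothetical, and the sketch does not attempt to choose segment boundaries.

```python
def compress_segment(haplotypes):
    """Collapse identical haplotypes in one segment (illustrative only).

    haplotypes: list of equal-length strings of '0'/'1', one per haplotype.
    Returns (reps, labels): reps holds the unique haplotypes in order of
    first appearance, and labels[i] is the index in reps of haplotype i.
    """
    reps = []
    index_of = {}   # haplotype string -> position in reps
    labels = []
    for hap in haplotypes:
        if hap not in index_of:
            index_of[hap] = len(reps)
            reps.append(hap)
        labels.append(index_of[hap])
    return reps, labels


# Hypothetical toy segment: 6 haplotypes over 3 markers.
toy_haps = ["000", "011", "000", "000", "011", "000"]
reps, labels = compress_segment(toy_haps)
print(reps)
print(labels)
```

Here six haplotypes collapse to two representatives plus one small integer label per haplotype, which is where the saving comes from when a segment has few unique haplotypes.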
  −
  −
  −
<code>M3VCF</code> files ''somewhat'' follow the structure of a VCF file. An example is shown below. The first few lines are header lines that give the number of haplotypes, the number of markers and the number of genomic segments. After the header, each genomic segment is defined by a block line (denoted by <code><BLOCK:*-*></code>), followed by the markers contained in that segment (denoted by their original marker IDs). In the example below, a reference panel of 6 samples (12 haplotypes) and 8 markers was reduced to two genomic segments (<code><BLOCK:0-5></code> and <code><BLOCK:5-7></code>). The first block runs from marker 0 to marker 5 (6 variants) and the second from marker 5 to marker 7 (3 variants). Note that two consecutive blocks must overlap at their common marker. On a block line, the <code>INFO</code> column stores the number of markers in the segment (<code>VARIANTS</code>) and the number of unique haplotypes in that segment (<code>REPS</code>). The sample columns that follow hold one label per haplotype: the number identifies the unique haplotype representative that the haplotype matches in that genomic segment. The unique haplotypes themselves are stored in the rows after the block line, with one row per marker and one sub-column per unique haplotype.
  −
  −
  −
In the rows that follow the block line, the details of each variant are stored (as in a regular VCF file) along with the unique haplotypes (under the <code>FORMAT</code> column). For <code><BLOCK:0-5></code>, there are 4 unique haplotypes (given by <code>REPS</code>), which appear as four sub-columns of 0's and 1's under the <code>FORMAT</code> column. Similarly, the 2 unique haplotypes for <code><BLOCK:5-7></code> are shown in the <code>FORMAT</code> column for its three markers.
  −
  −
  −
##fileformat=M3VCF
  −
##version=1.1
  −
##compression=block
  −
##n_blocks=2
  −
##n_haps=12
  −
##n_markers=8
  −
##<Note=This is NOT a VCF File and cannot be read by vcftools>
  −
#CHROM  POS    ID              REF    ALT    QUAL    FILTER  INFO                FORMAT    A1    A2    B1    B2    C1    C2    D1    D2    E1    E2    F1    F2
  −
6      73924  '''<BLOCK:0-5>'''    .      .      .      .      B1;'''VARIANTS'''=6;'''REPS'''=4 .        0    1    3    0    0    0    1    0    3    1    0    3
  −
6      73924  chr6:73924:D    AAGAG  A      .      .      B1.M1;R=7;A=5        0000
  −
6      89919  chr6:89919      T      G      .      .      B1.M;R=4;A=3        0100
  −
6      89921  chr6:89921      C      T      .      .      B1.M3;R=2;A=4        0000
  −
6      89932  chr6:89932      A      G      .      .      B1.M4;R=1;A=3        0000
  −
6      89949  chr6:89949      G      A      .      .      B1.M5;R=3;A=1        0010
  −
6      100116  chr6:100116    C      A      .      .      B1.M6;R=2;A=1        0001
  −
6      100116  '''<BLOCK:5-7>'''    .      .      .      .      B2;'''VARIANTS'''=3;'''REPS'''=2  .        0    1    0    0    0    0    1    0    1    1    0    1
  −
6      100116  chr6:100116    T      A      .      .      B1.M8;R=4;A=1        00
  −
6      132285  chr6:132285    T      A      .      .      B1.M9;R=4;A=1        01
  −
6      148689  chr6:148689    TAA    T      .      .      B1.M9;R=4;A=1        01
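To make the layout concrete, the following minimal Python sketch expands <code><BLOCK:5-7></code> from the example above back into one haplotype string per sample haplotype. The function name <code>expand_block</code> is hypothetical and this is not Minimac3's reader; it only assumes that each variant row carries one character per unique representative and that each sample column holds the index of a representative.

```python
def expand_block(labels, rows):
    """Rebuild full haplotypes for one block (illustrative only).

    labels: one representative index per haplotype (from the block line).
    rows:   one string per marker, with one '0'/'1' character per
            unique representative (from the variant rows).
    Returns one string per haplotype, with one character per marker.
    """
    return ["".join(row[label] for row in rows) for label in labels]


# Values copied from <BLOCK:5-7> in the example above.
block_labels = [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1]   # columns A1 .. F2
block_rows = ["00", "01", "01"]                        # three markers, REPS=2

haps = expand_block(block_labels, block_rows)
for name, hap in zip(["A1", "A2", "B1", "B2"], haps):
    print(name, hap)
```

Every haplotype labelled 0 expands to <code>000</code> and every haplotype labelled 1 to <code>011</code>, so the block stores only two columns of alleles instead of twelve.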
  −
  −
      
= Reference Panels for Download =  
 