Minimac3 Cookbook : Chromosome X Imputation

From Genome Analysis Wiki

Latest revision as of 13:19, 10 September 2015

Introduction

Minimac3 is a lower-memory, more computationally efficient implementation of minimac2. It is an algorithm for genotype imputation that works on phased genotypes and is designed to handle very large reference panels with no loss of accuracy.

This wiki page gives users a detailed step-by-step description of imputing chromosome X.

Chromosome X Imputation

Chromosome X has a pseudo-autosomal region (PAR) which can be imputed for males and females together. Imputing the PAR on chromosome X is the same as usual imputation, since both males and females are diploid at these sites. However, the non-pseudo-autosomal region (non-PAR) needs to be imputed for males and females separately, as males are haploid there while females are diploid. The PAR and non-PAR regions themselves also need to be imputed separately. The steps involved in imputing chromosome X are as follows.

  • Convert files to VCF Format: Start by converting the unphased, quality-controlled data set into VCF format. See our wiki page on Converting to VCF for more details on how to convert.
  • Split the data into PAR and non-PAR: Separate the pseudo-autosomal part and the non-pseudo-autosomal part into separate files. The non-PAR is located at chrX:2699520-154931043 on build hg19. The split can be done for VCF files as follows.
vcftools --gzvcf gwas.data.vcf.gz \
         --chr X \
         --from-bp 2699520 \
         --to-bp 154931043 \
         --recode \
         --out Non.PAR.gwas.data
 
vcftools --gzvcf gwas.data.vcf.gz \
         --exclude-positions Non.PAR.gwas.data.recode.vcf \
         --recode \
         --out PAR.gwas.data

NOTE: After this step, please verify that the male samples have only one haplotype in Non.PAR.gwas.data.recode.vcf and two haplotypes in PAR.gwas.data.recode.vcf
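The check described in the NOTE can be sketched with a small awk-based shell function. This is only an illustration under stated assumptions: count_ploidy is a hypothetical helper (not part of vcftools or Minimac3), the input is an uncompressed VCF, and the sample list holds one sample ID per line exactly as it appears in the VCF header. A genotype is counted as diploid when its GT field contains "/" or "|", and as haploid otherwise.

```shell
#!/bin/sh
# Hypothetical helper (not part of vcftools or Minimac3): counts haploid and
# diploid genotype calls for the samples named in a sample-list file.
# Usage: count_ploidy <uncompressed.vcf> <sample.list>
count_ploidy() {
    awk -F'\t' '
        # First file: one sample ID per line.
        NR == FNR { if ($1 != "") keep[$1] = 1; next }
        # Meta-information lines carry no genotypes.
        /^##/ { next }
        # Header line: remember the columns that belong to the listed samples.
        /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i in keep) col[i] = 1; next }
        {
            for (i in col) {
                split($i, f, ":")                 # GT is the first FORMAT field
                if (f[1] ~ "[/|]") diploid++      # two alleles, e.g. 0/1 or 0|1
                else haploid++                    # one allele, e.g. 0 or 1
            }
        }
        END { printf "haploid=%d diploid=%d\n", haploid, diploid }
    ' "$2" "$1"
}
```

For example, running count_ploidy on Non.PAR.gwas.data.recode.vcf with a list of male sample IDs would be expected to report diploid=0 if the males are coded as haploid, while the same call on PAR.gwas.data.recode.vcf would be expected to report haploid=0.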

  • Split the non-PAR data by Sex: Separate the non-PAR data by sex, which can also be done with vcftools as follows. Note that PAR.gwas.data.recode.vcf need NOT be split by sex, since both males and females are diploid there.
# Male samples. For females, use female.sample.list and Female.Non.PAR.gwas.data instead.
vcftools --vcf Non.PAR.gwas.data.recode.vcf \
         --keep male.sample.list \
         --recode \
         --out Male.Non.PAR.gwas.data
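The sample lists passed to --keep contain one sample ID per line. As a minimal sketch, they could be built from a PLINK .fam file, assuming that is where the sex information is stored (fifth column, 1 = male, 2 = female) and that the VCF sample IDs equal the individual IDs in the second column. Both points are assumptions about the study data, and make_sex_lists is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: writes male.sample.list and female.sample.list into the
# current directory from a PLINK .fam file (columns: FID IID PID MID SEX PHENO).
# Assumes the VCF sample IDs match the IID column; adjust the printed field
# if the VCF conversion produced different IDs (for example FID_IID).
# Usage: make_sex_lists <study.fam>
make_sex_lists() {
    awk '$5 == 1 { print $2 }' "$1" > male.sample.list     # sex code 1 = male
    awk '$5 == 2 { print $2 }' "$1" > female.sample.list   # sex code 2 = female
}
```

Samples with an unknown sex code (0) end up in neither list, so they would need to be resolved before the split.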
  • Pre-phase PAR data and female non-PAR data: Of the three data sets (PAR, female non-PAR and male non-PAR), only the PAR data and the female non-PAR data have two haplotypes per sample and thus need to be phased; the male non-PAR data is haploid and need not be phased. See our wiki pages on Pre-Phasing and Converting to VCF for further details on pre-phasing and converting files back to VCF format.
  • Impute Data: The following example illustrates how to impute into the phased PAR data (males and females together), the phased female non-PAR data and the haploid male non-PAR data (the file obtained after splitting the non-PAR by sex):
# Phased All Samples (PAR)
 ../bin/Minimac3 --refHaps refPanelChrX.Auto.vcf \
                 --haps Phased.PAR.gwas.data.vcf \
                 --prefix testRun.All.PAR
 
# Phased Female Samples (Non-PAR)
 ../bin/Minimac3 --refHaps refPanelChrX.Non.Auto.vcf \
                 --haps Phased.Female.Non.PAR.gwas.data.vcf \
                 --prefix testRun.females.Non.PAR
 
# Haploid Male Samples (Non-PAR)
 ../bin/Minimac3 --refHaps refPanelChrX.Non.Auto.vcf \
                 --haps Male.Non.PAR.gwas.data.recode.vcf \
                 --prefix testRun.males.Non.PAR
  • NOTE: When imputing the non-PAR of chromosome X, male and female samples must be analyzed separately; otherwise the program will crash. The reference panel must also contain only the PAR or only the non-PAR region of chromosome X; otherwise the program will crash.
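Before running Minimac3, a quick pre-flight check can confirm that a file covers only one of the two regions. The sketch below is an illustration, not part of Minimac3: classify_chrx_region is a hypothetical helper that assumes an uncompressed, chromosome-X-only VCF on build hg19 and uses the non-PAR interval quoted above (2699520-154931043). It does not read M3VCF or gzipped files.

```shell
#!/bin/sh
# Hypothetical helper: reports whether the sites of a chrX VCF (build hg19)
# fall inside the non-PAR interval 2699520-154931043, outside it, or both.
# Usage: classify_chrx_region <uncompressed.vcf>
classify_chrx_region() {
    awk -F'\t' -v lo=2699520 -v hi=154931043 '
        /^#/ { next }                                   # skip header lines
        ($2 + 0) >= lo && ($2 + 0) <= hi { nonpar++; next }
        { par++ }                                       # any site outside the interval
        END {
            if (nonpar && par) print "mixed"
            else if (nonpar)   print "non-PAR"
            else if (par)      print "PAR"
            else               print "empty"
        }
    ' "$1"
}
```

A result of "mixed" for either the reference panel or the target file indicates that the PAR/non-PAR split should be redone before imputation.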

Download

Minimac3 is currently available as a pre-release. The source files (and binary executable) are available for download in Source Files, and commonly used reference panels in VCF and M3VCF formats are available for download in Reference Panels.

Useful Wiki Pages

There are a few pages in this Wiki that may be useful for Minimac3 users. Here are links to a few:

Contact

For any queries or bug reports, please contact Sayantan Das.