Changes

From Genome Analysis Wiki
7,004 bytes removed, 00:47, 29 January 2015
Line 47: Line 47:     
   /bin/Minimac3 --help
 
  −
= Imputation Cookbook =
  −
  −
This section briefly summarizes the steps required to run an imputation experiment on typical GWAS samples. Before pre-phasing and imputation, users must ensure that their data have been quality controlled. Standard quality control filters exclude markers with a high missingness rate, large deviations from Hardy-Weinberg equilibrium, high discordance rates (if duplicate copies are available), or excess Mendelian inconsistencies, and remove samples with a high missingness rate, unusual heterozygosity, a high inbreeding coefficient, clear evidence of being genetic ancestry outliers, or evidence of relatedness. All of these steps can be carried out using [http://pngu.mgh.harvard.edu/~purcell/plink/plink2.shtml PLINK]. With older genotyping platforms, low-frequency SNPs are also often excluded because they are hard to genotype accurately. With more modern genotyping arrays, the accuracy of genotype calls for low-frequency SNPs is less of a concern.
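As a rough illustration of the marker and sample filters described above, the following shell sketch assembles a PLINK command. The file prefixes (<code>Gwas.chr20.Raw</code>, <code>Gwas.chr20.QC</code>) and all thresholds are assumptions chosen for illustration, not recommendations from this page. Heterozygosity, relatedness, and ancestry checks need separate steps and are not shown.

```shell
# Hypothetical input/output prefixes (PLINK binary format)
IN=Gwas.chr20.Raw
OUT=Gwas.chr20.QC

# Illustrative thresholds only:
#   --geno 0.05  drop markers with more than 5% missing genotypes
#   --mind 0.05  drop samples with more than 5% missing genotypes
#   --hwe 1e-6   drop markers with Hardy-Weinberg test p-value below 1e-6
QC_CMD="plink --bfile $IN --geno 0.05 --mind 0.05 --hwe 1e-6 --make-bed --out $OUT"

# Always show the command; run it only if plink and the input files are present
echo "$QC_CMD"
if command -v plink >/dev/null 2>&1 && [ -f "$IN.bed" ]; then
    $QC_CMD
fi
```

Printing the command before running it makes it easy to review the filters and adjust the thresholds for a particular genotyping platform.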
  −
  −
Once a quality-controlled dataset is available, the data must be pre-phased and then imputed. The steps are explained below.
  −
  −
== Pre-Phasing the GWAS data ==
  −
  −
Pre-Phasing can be done using either [http://www.sph.umich.edu/csg/abecasis/MaCH/ MaCH] or [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html SHAPEIT].
  −
  −
=== MaCH ===
  −
  −
'''MaCH''' is a Markov Chain based haplotyper. It can resolve long haplotypes in samples of unrelated individuals. The source code is available for download [http://www.sph.umich.edu/csg/abecasis/MaCH/download/ here]. Check out their [http://www.sph.umich.edu/csg/abecasis/MaCH/ home-page] for further details.
  −
  −
A typical command line for phasing with MaCH looks like this (<code>Gwas.chr20.Unphased.dat</code> and <code>Gwas.chr20.Unphased.ped</code> are the quality-controlled GWAS data set in [http://www.sph.umich.edu/csg/abecasis/Merlin/ Merlin] format):
  −
  −
mach1 -d Gwas.chr20.Unphased.dat \
  −
      -p Gwas.chr20.Unphased.ped \
  −
      --rounds 20 \
  −
      --states 200 \
  −
      --phase \
  −
      --interim 5 \
  −
      --sample 5 \
  −
      --prefix Gwas.Chr20.Phased.Output
  −
  −
=== SHAPEIT ===
  −
  −
'''SHAPEIT''' is a fast and accurate method for estimating haplotypes (phasing) from genotype or sequencing data. The source code is available for download [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#download here]; see the [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html home page] for further details. It can phase a small number of samples (a reference panel is required) as well as a large number of samples (no reference panel is required). The reference panels and genetic map files required by SHAPEIT are available for download [https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference here].
  −
  −
* The following example shows a typical SHAPEIT command line to phase a LARGE number (>200) of GWAS samples (<code>Gwas.chr20.Unphased.vcf</code> is the quality controlled GWAS data set in VCF format).
  −
  −
shapeit -V Gwas.chr20.Unphased.vcf \
  −
        -M genetic_map_chr20.txt \
  −
        -O Gwas.Chr20.Phased.Output
  −
  −
* The following example shows a typical SHAPEIT command line to phase a SMALL number (<200) of GWAS samples (<code>Gwas.chr20.Unphased.vcf</code> is the quality controlled GWAS data set in VCF format).
  −
  −
## The following step identifies variants that are misaligned between the reference and GWAS panels
  −
shapeit -check \
  −
        -V Gwas.chr20.Unphased.vcf \
  −
        -M genetic_map_chr20.txt \
  −
        --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \
  −
        --output-log gwas.alignments
  −
  −
## The following step phases the GWAS panel using the reference panel while excluding the markers found in the step above.
  −
shapeit \
  −
        -V Gwas.chr20.Unphased.vcf \
  −
        --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \
  −
        --exclude-snp gwas.alignments.snp.strand.exclude \
  −
        -O Gwas.Chr20.Phased.Output
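The two SHAPEIT examples above differ by sample count, so it can help to count the samples in the VCF before choosing one. The sketch below is a minimal illustration: the function names and the demo file are hypothetical, and treating exactly 200 samples as "large" is an assumption, since the text only gives >200 and <200.

```shell
# Number of samples = columns on the #CHROM header line minus the 9 fixed VCF columns
count_samples() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}

# The threshold of 200 comes from the examples above;
# treating exactly 200 as "large" is an assumption
phasing_mode() {
    if [ "$1" -lt 200 ]; then
        echo "with-reference"
    else
        echo "without-reference"
    fi
}

# Tiny header-only VCF with 3 made-up samples, so the sketch runs as-is
demo=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' >> "$demo"

n=$(count_samples "$demo")
echo "$n samples: phase $(phasing_mode "$n")"
rm -f "$demo"
```

With real data, <code>count_samples Gwas.chr20.Unphased.vcf</code> would replace the demo file.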
  −
  −
== Running Imputation ==
  −
  −
After pre-phasing is complete, imputation can begin. First, however, the phased GWAS panel files (obtained above) must be converted to VCF format (Minimac3 reads the GWAS panel only in VCF format), and the reference panels required for imputation must be downloaded. The steps are as follows.
  −
  −
=== Convert GWAS Panel Files into VCF ===
  −
  −
If pre-phased GWAS data is already available in VCF format, users can skip this step. Otherwise, the following examples show how to convert other file formats to VCF.
  −
  −
* '''PLINK:''' Use PLINK2 (available [https://www.cog-genomics.org/plink2 here]) as follows:
  −
  −
plink --bfile Gwas.Chr20.Phased.Output \
  −
      --recode vcf \
  −
      --out Gwas.Chr20.Phased.Output.VCF.format
  −
  −
* '''MaCH:''' Use Mach2VCF (coming soon) as follows:
  −
  −
mach2VCF --haps Gwas.Chr20.Phased.Output.hap \
  −
          --snps Gwas.Chr20.Phased.Output.snps \
  −
          --prefix Gwas.Chr20.Phased.Output.VCF.format
  −
  −
* '''SHAPEIT:''' Use SHAPEIT (available [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#download here]) as follows:
  −
  −
shapeit -convert \
  −
        --input-haps Gwas.Chr20.Phased.Output \
  −
        --output-vcf Gwas.Chr20.Phased.Output.VCF.format.vcf
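Because imputation starts from pre-phased haplotypes, a quick sanity check on the converted VCF can be useful. In VCF, phased genotypes use <code>|</code> as the allele separator and unphased genotypes use <code>/</code>. The following sketch counts unphased genotype entries; the function name and the demo records are hypothetical, and it assumes <code>GT</code> is the first FORMAT subfield.

```shell
# Count genotype entries (column 10 onward) whose GT subfield uses "/"
count_unphased() {
    awk -F'\t' '
        !/^#/ {
            for (i = 10; i <= NF; i++) {
                split($i, field, ":")          # GT is the first subfield when present
                if (index(field[1], "/") > 0) n++
            }
        }
        END { print n + 0 }
    ' "$1"
}

# Small made-up VCF: two records, two samples, one unphased genotype (0/1)
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >> "$demo_vcf"
printf '20\t61098\trs1\tC\tT\t.\tPASS\t.\tGT\t0|1\t1|1\n' >> "$demo_vcf"
printf '20\t61795\trs2\tG\tT\t.\tPASS\t.\tGT\t0|0\t0/1\n' >> "$demo_vcf"

unphased=$(count_unphased "$demo_vcf")
echo "unphased genotype entries: $unphased"
rm -f "$demo_vcf"
```

A count of zero on <code>Gwas.Chr20.Phased.Output.VCF.format.vcf</code> would suggest that the file contains only phased genotypes.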
  −
  −
=== Download Reference Panel ===
  −
  −
Commonly used reference panels include 1000 Genomes Phase 3 (2,535 samples), 1000 Genomes Phase 1 (1,094 samples), HapMap2 (269 samples), and the Haplotype Reference Consortium (32,914 samples). Users are advised to use either 1000 Genomes Phase 3 (available for download in [[#Reference Panels for Download|Reference Panels]]) or the Haplotype Reference Consortium panel, which cannot be shared publicly because of data privacy restrictions but can be used for imputation remotely through an [http://imputationserver.sph.umich.edu/ imputation server] set up at the University of Michigan. Reference panels for different versions of 1000 Genomes, in both VCF and <code>M3VCF</code> format, are available for download in [[#Reference Panels for Download|Reference Panels]].
  −
  −
=== Impute Samples ===
  −
  −
The final step is to run '''Minimac3''' to perform the imputation. With the pre-phased GWAS panel (in VCF format) and an appropriate reference panel (in VCF or <code>M3VCF</code> format) in hand, Minimac3 can be run as follows. The first example uses a VCF file as the reference (obtained as explained above); the second uses an <code>M3VCF</code> file (downloaded from the links [[#Reference Panels for Download|below]] or created by a previous run of Minimac3).
  −
  −
../bin/Minimac3 --refHaps ReferencePanel.Chr20.1000Genomes.vcf \
  −
                --haps Gwas.Chr20.Phased.Output.VCF.format.vcf \
  −
                --prefix Gwas.Chr20.Imputed.Output
  −
  −
../bin/Minimac3 --refHaps ReferencePanel.Chr20.1000Genomes.m3vcf \
  −
                --haps Gwas.Chr20.Phased.Output.VCF.format.vcf \
  −
                --prefix Gwas.Chr20.Imputed.Output
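The chromosome 20 commands above generalize to the other autosomes. The sketch below prints one Minimac3 command per chromosome as a dry run. It uses only the <code>--refHaps</code>, <code>--haps</code>, and <code>--prefix</code> options shown above; the per-chromosome file names simply follow the chromosome 20 naming pattern and are assumptions about how the files are organized.

```shell
MINIMAC=../bin/Minimac3

# Print the Minimac3 command for one chromosome (file name patterns are hypothetical)
minimac_cmd() {
    printf '%s --refHaps ReferencePanel.Chr%s.1000Genomes.m3vcf --haps Gwas.Chr%s.Phased.Output.VCF.format.vcf --prefix Gwas.Chr%s.Imputed.Output\n' \
        "$MINIMAC" "$1" "$1" "$1"
}

# Dry run: list the commands for chromosomes 1 to 22 without executing them
chr=1
while [ "$chr" -le 22 ]; do
    minimac_cmd "$chr"
    chr=$((chr + 1))
done
```

Piping the printed lines to <code>sh</code>, or submitting each line as a separate cluster job, would run the imputation once the listed files exist.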
      
= Chromosome X Imputation =
 