Minimac: 1000 Genomes Imputation Cookbook

This page documents how to impute 1000 Genomes SNPs using Minimac, which is typically the preferred approach for imputation using large reference panels such as the 1000 Genomes data. For pointers on how to carry out 1000 Genomes imputation using IMPUTE2, see IMPUTE2: 1000 Genomes Imputation Cookbook.

Before reading this tutorial, you might find it useful to spend a few minutes reading through the main Minimac documentation.

Getting Started

Your Own Data

To get started, you will need to store your data in Merlin format pedigree and data files, one per chromosome. For details of the Merlin file format, see the Merlin tutorial (http://csg.sph.umich.edu/abecasis/Merlin/tour/input_files.html).

Within each file, markers should be ordered by chromosome position. Alleles should be reported on the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele).

The latest reference panel generated by the 1000 Genomes project uses NCBI Build 37 (hg19). Make sure that your data is on Build 37 (otherwise Minimac may ignore genotyped markers whose names have changed in Build 37). If you are trying to convert your data from an earlier genome build to Build 37, you'll probably find the dbSNP merge table (table description on the NCBI website), which logs rs# changes between dbSNP builds, and the UCSC online liftOver tool, which converts genome positions between different genome builds, to be quite useful. We have also documented a general procedure for converting genome positions and rs numbers between builds on the LiftOver page of this wiki.
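
As an illustration, the liftOver step might look like the sketch below. The map file chr1.hg18.map (chromosome, marker name, Build 36 position) and the other file names are hypothetical; the chain file is the standard hg18ToHg19.over.chain.gz distributed by UCSC.

# Build a BED file (0-based start, 1-based end) from a simple map file: chromosome, marker, position
awk '{ print "chr"$1"\t"$3-1"\t"$3"\t"$2 }' chr1.hg18.map > chr1.hg18.bed

# Remap positions from Build 36 (hg18) to Build 37 (hg19); markers that fail to lift end up in chr1.unmapped.bed
liftOver chr1.hg18.bed hg18ToHg19.over.chain.gz chr1.hg19.bed chr1.unmapped.bed

# Rebuild the map file with the updated Build 37 positions
awk '{ sub(/^chr/, "", $1); print $1"\t"$4"\t"$3 }' chr1.hg19.bed > chr1.hg19.map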

If you are planning to impute MetaboChip data, you may find it convenient to have a list of SNPs whose order differs between NCBI genome Builds 36 and 37: List of Metabochip SNPs Whose Order Changes With Build (http://csg.sph.umich.edu/cfuchsb/metab_order_changed.txt).

Note: with the most recent reference panels in VCF format, GWAS SNPs are by default expected to be named in chr:pos format (e.g. 1:1000); if your GWAS SNPs use rs names, you must set the --rs flag.
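
If you prefer to rename your GWAS markers to chr:pos up front instead, a minimal sketch follows; rs_to_pos.txt is a hypothetical two-column lookup file (rs name, chr:pos) that you would prepare from your array annotation.

# Rename markers in a Merlin .dat file from rs names to chr:pos using a lookup table
awk 'NR == FNR { map[$1] = $2; next }
     $1 == "M" && ($2 in map) { $2 = map[$2] }
     { print }' rs_to_pos.txt chr1.dat > chr1.chrpos.dat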

Quality Control

You should apply standard quality control filters to the set of SNPs that you use as input to the imputation procedure. These filters are typically study specific, but usually include a series of per-marker and per-individual quality checks (a minimal example is sketched after the lists below).

Per marker quality checks typically evaluate:

  • Genotype completeness (markers with relatively high missingness rates are typically excluded)
  • Hardy Weinberg equilibrium (markers with clear deviations from Hardy Weinberg proportions are typically excluded)
  • Duplicate concordance (markers with high discordance rates are excluded)
  • Mendelian inconsistencies (when parent-offspring pairs are available, markers with an excess of Mendelian inconsistencies are typically excluded).
  • Polymorphism check (monomorphic markers can represent assay failures and are typically excluded).

Per individual quality checks typically include:

  • Check for per sample completeness (samples with high missingness rates are typically excluded)
  • Check for per sample heterozygosity (samples with unusual heterozygosity are typically excluded)
  • Inspection of principal component or multi-dimensional scaling plots (genetic ancestry outliers are usually excluded)
  • Check for related individuals (if related samples are identified, the analysis plan must be adjusted or they must be removed)
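
As a minimal illustration (not part of the original protocol), many of these per-marker and per-individual filters can be applied with PLINK before exporting the data for phasing; the thresholds below are common defaults and should be tuned to your study.

# Illustrative PLINK quality-control pass:
#   --mind 0.05   drop samples with more than 5% missing genotypes
#   --geno 0.05   drop markers with more than 5% missing genotypes
#   --hwe 1e-6    drop markers with strong deviations from Hardy-Weinberg proportions
plink --file mystudy --mind 0.05 --geno 0.05 --hwe 1e-6 --recode --out mystudy.qc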

With older genotyping platforms, low frequency SNPs are also often excluded because they are hard to genotype accurately. With more modern genotyping arrays, the accuracy of genotype calls for low frequency SNPs is less of a concern.

Two good ways to verify that you have used appropriate quality control steps are to generate a Q-Q plot for your dataset and to calculate a genomic control parameter. If either of these is not satisfactory, it is worth investigating further before proceeding to imputation.

Reference Haplotypes

Reference haplotypes generated by the 1000 Genomes project and formatted so that they are ready for analysis are available from the MaCH download page (http://csg.sph.umich.edu/abecasis/MaCH/download/). In our hands, it is best to always use the most recent release, since additional sequence data and improvements in variant discovery, genotyping and haplotyping strategies typically result in noticeable improvements in data quality. When this page was written, the most recent release of 1000 Genomes project haplotypes was the Integrated Phase I release, including 1092 individuals. We recommend using the reduced GIANT all-population panel (monomorphic and singleton sites removed): GIANT.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz.
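
For example, this panel can be downloaded and unpacked from the command line (the ftp location below is the one published for this release):

wget ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/GIANT.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz
tar xzvf GIANT.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz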

We also provide a Metabochip-specific reference panel (GIANT.metabo.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz, available from the same location). This reduced reference panel is a time saver, since it focuses on the well-imputable fine-mapping regions.

Minimac Imputation

Minimac relies on a two-step approach. First, the samples to be analyzed must be phased into a series of estimated haplotypes. Second, imputation is carried out directly into these phased haplotypes. As newer reference panels become available, only the second step must be repeated.

Pre-phasing - MaCH

A convenient way to haplotype your sample is to use MaCH. A typical MaCH command line to estimate phased haplotypes might look like this:

 mach1 -d chr1.dat -p chr1.ped --rounds 20 --states 200 --phase --interim 5 --sample 5 --prefix chr1.haps

This will request that MaCH estimate haplotypes for your sample, using 20 iterations of its Markov sampler and conditioning each update on up to 200 haplotypes. A summary description of these parameters follows (but for a more complete description, you should go to the MaCH website):

  • -d sample.dat : Data file in Merlin format. Markers should be listed according to their order along the chromosome.
  • -p sample.ped : Pedigree file in Merlin format. Alleles should be labeled on the forward strand.
  • --states 200 : Number of haplotypes to consider during each update. Increasing this value will typically lead to better haplotypes, but can dramatically increase computing time and memory use. A value of 200 - 400 is typical. If computing time is a constraint, you may consider decreasing this parameter to 100. If you have substantial computing resources, consider increasing this value to 600 or even 800.
  • --rounds 20 : Iterations of the Markov sampler to use for haplotyping. Typically, using 20-30 rounds should give good results. To obtain better results, it is usually better to increase the --states parameter.
  • --interim 5 : Request that intermediate results be saved to disk periodically. These will facilitate analyses in case a run doesn't complete.
  • --sample 5 : Request that random (but plausible) sets of haplotypes for each individual be drawn every 5 iterations. This parameter is optional, but for some rare variant analyses these alternative haplotypes can be very useful. If you are not short of disk space, you should consider enabling this parameter.
  • --phase : Tell MaCH to estimate phased haplotypes for each individual.
  • --compact : Reduce memory use at the cost of approximately doubling runtime.


You should be able to run this step in parallel; for example, on our cluster we'd use:

   foreach chr (`seq 1 22`)

     runon -m 4096 mach -d chr$chr.dat -p chr$chr.ped --rounds 20 --states 200 --phase --interim 5 --sample 5 --prefix chr$chr.haps

   end

Imputation into Phased Haplotypes - minimac

Imputing genotypes using minimac is a straightforward process: after selecting a set of reference haplotypes, plugging in the target haplotypes from the previous step, and setting the number of rounds to use for estimating model parameters (which describe the length and conservation of haplotype stretches shared between the reference panel and your study samples), imputation should proceed rapidly. Because marker names can change between dbSNP versions, it is usually a good idea to include an aliases file that maps earlier marker names to the current preferred name for each polymorphism.

A typical minimac command line, using a VCF reference panel, might look like this:

 minimac --vcfReference --refHaps ref.vcf.gz --haps target.hap.gz --snps target.snps.gz --rounds 5 --states 200 --prefix results

Note: GWAS SNPs (the file passed to --snps, here target.snps.gz) are by default expected to be named in chr:pos format (e.g. 1:1000) and mapped to Build 37/hg19. Otherwise, set the --rs flag and include an aliases file via --snpAliases (e.g. dbsnp134-merges.txt.gz, available from http://www.sph.umich.edu/csg/abecasis/downloads/dbsnp134-merges.txt.gz).
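
For GWAS SNPs with rs names, the same command with those two extra options might look like this (a sketch; file names as above):

 minimac --vcfReference --refHaps ref.vcf.gz --rs --snpAliases dbsnp134-merges.txt.gz \
         --haps target.hap.gz --snps target.snps.gz --rounds 5 --states 200 --prefix results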


A detailed description of all minimac options is available on the Minimac Command Reference page of this wiki. Here is a brief description of the parameters used above:

  • --refHaps ref.vcf.gz : Reference haplotypes (e.g. from the MaCH download page).
  • --vcfReference : Specifies that the file passed to --refHaps is in VCF format; no separate --refSnps file is needed.
  • --snps target.snps.gz : SNPs in the phased haplotypes. These should largely be a subset of the SNPs in the reference panel.
  • --haps target.hap.gz : Phased haplotypes where missing genotypes will be imputed.
  • --rounds 5 : Rounds of optimization for model parameters, which describe population recombination rates and per-SNP error rates.
  • --states 200 : Maximum number of reference (or target) haplotypes to be examined during parameter optimization.
  • --prefix results : Optionally, a string that is used to help generate output file names.


Again, you can speed things up by running imputation in parallel. For example, on our cluster we'd use:

   foreach chr (`seq 1 22`)

     runon -m 1024 minimac --refHaps ref.hap.$chr.gz  --vcfReference  \
                           --haps chr$chr.haps.gz --snps chr$chr.snps --prefix chr$chr.imputed
   end


Furthermore, minimac itself can be run in parallel by launching the minimac-omp version. On our cluster, 4 CPUs per minimac job is optimal (--cpus 4).

   foreach chr (`seq 1 22`)

     runon -m 1024 minimac-omp --cpus 4 --refHaps ref.hap.$chr.gz  --vcfReference  \
                           --haps chr$chr.haps.gz --snps chr$chr.snps --prefix chr$chr.imputed
   end

Further Time Savings

Target based chunking

The recipe above imputes whole chromosomes, one at a time. Further time savings are possible by imputing chromosomes in chunks, a process that can be facilitated with the ChunkChromosome tool. This tool automates some of the manual editing steps that would otherwise be required to divide each chromosome into more manageable chunks. It also interfaces with minimac to ensure that SNPs which overlap between chunks are only imputed once.

Recall that the original analysis script might have looked like this:

#!/bin/tcsh

# Build SNP lists for each chromosome
foreach chr (`seq 1 22`)
   awk '{ if ($1 == "M") print $2; }' < chr${chr}.dat > chr${chr}.snps
end

# Estimate haplotypes for all individuals, one chromosome at a time
foreach chr (`seq 1 22`)
   mach -d chr$chr.dat -p chr$chr.ped --rounds 20 --states 200 --phase --sample 5 --prefix chr$chr.haps >& chr$chr-mach.log &
end
wait

# Impute into phased haplotypes
foreach chr (`seq 1 22`)
   minimac --refHaps ref.hap.$chr.gz  --vcfReference  \
           --haps chr$chr.haps.gz --snps chr$chr.snps --prefix chr$chr.imputed >& chr$chr-minimac.log &
end
wait

A modified script using ChunkChromosome would look like this (for GWAS SNPs with chr:pos IDs):

#!/bin/tcsh

@ length = 2500
@ overlap = 500

# Estimate haplotypes for all individuals, in 2500 marker chunks, with 500 marker overhang
foreach chr (`seq 1 22`)

   ChunkChromosome -d chr$chr.dat -n $length -o $overlap

   foreach chunk (chunk*-chr$chr.dat)

      mach -d $chunk -p chr$chr.ped --prefix ${chunk:r} \
           --rounds 20 --states 200 --phase --sample 5 >& ${chunk:r}-mach.log &

   end

end
wait

# Impute into phased haplotypes
foreach chr (`seq 1 22`)

   foreach chunk (chunk*-chr$chr.dat)

      set haps = /data/1000g/hap/all/20101123.chr$chr.hap.gz
      set snps = /data/1000g/snps/chr$chr.snps

      minimac --refHaps $haps --refSnps $snps  --vcfReference  \
              --haps ${chunk:r}.gz --snps ${chunk}.snps  --autoClip autoChunk-chr$chr.dat  \
              --prefix ${chunk:r}.imputed >& ${chunk:r}-minimac.log &

   end

end
wait


The same approach, for GWAS SNPs with rs IDs (note the additional --rs and --snpAliases options passed to minimac), would look like this:

#!/bin/tcsh

@ length = 2500
@ overlap = 500

# Estimate haplotypes for all individuals, in 2500 marker chunks, with 500 marker overhang
foreach chr (`seq 1 22`)

   ChunkChromosome -d chr$chr.dat -n $length -o $overlap

   foreach chunk (chunk*-chr$chr.dat)

      mach -d $chunk -p chr$chr.ped --prefix ${chunk:r} \
           --rounds 20 --states 200 --phase --sample 5 >& ${chunk:r}-mach.log &

   end

end
wait

# Impute into phased haplotypes
foreach chr (`seq 1 22`)

   foreach chunk (chunk*-chr$chr.dat)

      set haps = /data/1000g/hap/all/20101123.chr$chr.hap.gz
      set snps = /data/1000g/snps/chr$chr.snps

      minimac --refHaps $haps --refSnps $snps  --vcfReference  --rs --snpAliases dbsnp134-merges.txt.gz \
              --haps ${chunk:r}.gz --snps ${chunk}.snps  --autoClip autoChunk-chr$chr.dat  \
              --prefix ${chunk:r}.imputed >& ${chunk:r}-minimac.log &

   end

end
wait

Although the loss in accuracy will typically be small, analyses of small chromosome chunks will necessarily result in slightly less accurate results. The differences will be larger in populations where haplotype sharing is expected to extend further (for example, in Finland and Sardinia we find that haplotypes can be estimated accurately over several megabases, and focusing on small chunks of chromosome makes these matches harder to identify).

Reference based chunking

Remark: this works only with SNP IDs in the chr:pos format; rs numbers are not supported here.

If you did not split your data with ChunkChromosome, we highly recommend running imputation with our reference based chunking method (minimac.11.16.12 or newer required):

#!/bin/sh

chromend=57000000 # for chr 20 
chunksize=7000000
overlap=250000
vcfref="your_vcf_ref.vcf"
gwas="your_gwas_filename"
out="your_output_filename"

for start in `seq 0 $chunksize $chromend`
do

   end=$(($start+$chunksize))
   start=$((start+1))

   minimac --vcfReference \
   --refHaps $vcfref \
   --snps ${gwas}.snps --haps ${gwas}.haps.gz \
   --vcfstart $start --vcfend $end \
   --vcfwindow $overlap \
   --prefix $out.$start.$end > $out.$start.$end.log &

done

X Chromosome Imputation

minimac (version 9.22.12 or newer) supports the imputation of genotypes on the X chromosome. Reference haplotypes can be found at ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/GIANT.chrX.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz.

X chromosome imputation consists of the following 3 steps:

1. Split X chromosome pedigree file by sex

  1. For females: follow the same protocol as for autosomes
  2. For males: they have only one X chromosome and are therefore already phased. Simply convert the pedigree file into a MaCH haplotype file (missing genotypes should be encoded as "0", "." or "N"); see the example files and the sketch below.


<Example of a male only pedigree file; hemizygous males appear as homozygotes >
FAM1003 ID1234 0 0 M A/A A/A C/C
FAM1004 ID5678 0 0 M 0/0 C/C G/G
...
<End of pedigree file>


<Example of the corresponding haplotype file>
FAM1003->ID1234 HAPLO1 AAC
FAM1003->ID1234 HAPLO2 AAC
FAM1004->ID5678 HAPLO1 0CG
FAM1004->ID5678 HAPLO2 0CG
...
<End of the corresponding haplotype file>
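
As an illustration (not part of the original recipe), the split and the male conversion could be done with awk, assuming a standard Merlin pedigree file with sex in column 5 and genotypes coded as single fields such as 'A/A' from column 6 onward, as in the example above.

# Split the X chromosome pedigree file by sex (column 5: M or 1 = male, F or 2 = female)
awk '$5 == "M" || $5 == 1' chrX.ped > males.ped
awk '$5 == "F" || $5 == 2' chrX.ped > females.ped

# Convert the male pedigree file into a MaCH haplotype file; males are hemizygous,
# so keep the first allele of each genotype ("A/A" -> A, missing "0/0" -> 0)
awk '{
  hap = ""
  for (i = 6; i <= NF; i++) {
    split($i, a, "/")
    hap = hap a[1]
  }
  print $1 "->" $2 " HAPLO1 " hap
  print $1 "->" $2 " HAPLO2 " hap
}' males.ped > males.hap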

2. Impute non-pseudo-autosomal part by sex

Impute females and males separately:

minimac --refHaps chrX.no.auto.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz --vcfReference \
        --haps females.hap --snps females.snps

minimac --refHaps chrX.no.auto.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz --vcfReference \
        --haps males.hap --snps males.snps

3. Impute pseudo-autosomal part by sex

Impute females and males separately:

minimac --refHaps chrX.auto.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz --vcfReference \
        --haps females.hap --snps females.snps

minimac --refHaps chrX.auto.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz --vcfReference \
        --haps males.hap --snps males.snps

Imputation quality evaluation

Minimac hides each of the genotyped SNPs in turn and then calculates 3 statistics:

  • looRSQ - this is the estimated rsq for that SNP (as if the SNP weren't typed).
  • empR - this is the empirical correlation between true and imputed genotypes for the SNP. If this is negative, the SNP alleles are probably flipped.
  • empRSQ - this is the actual R2 value, comparing imputed and true genotypes.

These statistics can be found in the *.info file.
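
In practice, these per-marker statistics and the overall Rsq estimate are often used to filter markers for downstream analysis. A minimal sketch, assuming Rsq is the 7th column of the .info file (check the header of your own output before relying on this):

# Keep markers with estimated imputation quality Rsq > 0.3 (column index is an assumption)
awk 'NR == 1 || $7 > 0.3' chr20.imputed.info > chr20.imputed.highquality.info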

Be aware that, unfortunately, imputation quality statistics are not directly comparable between different imputation programs (MaCH/minimac vs. Impute vs. Beagle etc.).

Reference

If you use minimac, please cite:

Howie B, Fuchsberger C, Stephens M, Marchini J, and Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 2012. http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.2354.html

Questions and Comments

Please contact Goncalo Abecasis (goncalo@umich.edu), Christian Fuchsberger (cfuchsb@umich.edu; minimac) or Yun Li (yunli@med.unc.edu; MaCH).