Changes

From Genome Analysis Wiki
Jump to navigationJump to search
no edit summary
Line 7: Line 7:  
=== Your Own Data ===
 
=== Your Own Data ===
   −
To get started, you will need to store your data in [[Merlin]] format pedigree and data files, one per chromosome. For details, of the Merlin file format, see the [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin tutorial].
+
To get started, you will need to store your data in [[Merlin]] format pedigree and data files, one per chromosome. For details, of the Merlin file format, see the [http://csg.sph.umich.edu/abecasis/Merlin/tour/input_files.html Merlin tutorial].
    
Within each file, markers should be stored by chromosome position. Alleles should be stored in the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele).  
 
Within each file, markers should be stored by chromosome position. Alleles should be stored in the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele).  
   −
The latest reference panel generated by the 1000 Genomes project uses NCBI Build 37 (HG 19). Make sure that your data is on Build 37 (or Minimac may ignore genotyped markers whose names have changed in build 37). If you are trying to convert your data from an earlier genome build to build 37, you'll probably find the [ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/database/organism_data/RsMergeArch.bcp.gz dbSNP merge table] ([http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch table description on the NCBI website]), which logs rs# changes between dbSNP builds, and the UCSC online [http://genome.ucsc.edu/cgi-bin/hgLiftOver liftOver tool], which converts genome positions between different genome builds, to be quite useful. We have aslo documented ([[LiftOver | Link]]) a general procedure to convert genome positions and rs number between builds.
+
The latest reference panel generated by the 1000 Genomes project uses NCBI Build 37 (HG 19). Make sure that your data is on Build 37 (or Minimac may ignore genotyped markers whose names have changed in Build 37). If you are trying to convert your data from an earlier genome build to Build 37, you'll probably find the [ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/database/organism_data/RsMergeArch.bcp.gz dbSNP merge table] ([http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch table description on the NCBI website]), which logs rs# changes between dbSNP builds, and the UCSC online [http://genome.ucsc.edu/cgi-bin/hgLiftOver liftOver tool], which converts genome positions between different genome builds, to be quite useful. We have also documented ([[LiftOver | Link]]) a general procedure to convert genome positions and rs number between builds.
   −
If you are planning to use imputation with the MetaboChip, you might find a list of SNPs whose order varies between NCBI genome build 36 and 37 convenient. Here it is: [http://www.sph.umich.edu/csg/cfuchsb/metab_order_changed.txt List of Metabochip SNPs Whose Order Changes With Build]
+
If you are planning to use imputation with the MetaboChip, you might find a list of SNPs whose order varies between NCBI genome Build 36 and 37 convenient. Here it is: [http://csg.sph.umich.edu/cfuchsb/metab_order_changed.txt List of Metabochip SNPs Whose Order Changes With Build]
   −
Note: For the most recent reference panel in VCF format by default GWAS SNPs are expected to be in the chr:pos format e.g. 1:1000; otherwise, for GWAS SNPs in the rs..... format you have to set the --rs flag
+
Note: For the most recent reference panel in VCF format by default GWAS SNPs are expected to be in the chr:pos format e.g. 1:1000; otherwise, for GWAS SNPs in the rs format you have to set the --rs flag
    
=== Quality Control ===
 
=== Quality Control ===
Line 38: Line 38:  
With older genotyping platforms, low frequency SNPs are also often excluded because they are hard to genotype accurately. With more modern genotyping arrays, the accuracy of genotype calls for low frequency SNPs is less of a concern.
 
With older genotyping platforms, low frequency SNPs are also often excluded because they are hard to genotype accurately. With more modern genotyping arrays, the accuracy of genotype calls for low frequency SNPs is less of a concern.
   −
Two good ways to verify that you have used appropriate quality control steps are to generate a Q-Q plot for your dataset and to calculate a genomic control parameter. If these are not satisfactory, it may be useful to investigate before proceeding to imputation.
+
Two good ways to verify that you have used appropriate quality control steps are to generate a Q-Q plot for your dataset and to calculate a genomic control parameter. If these verifications are not satisfactory, it may be useful to investigate before proceeding to imputation.
    
=== Reference Haplotypes ===
 
=== Reference Haplotypes ===
   −
Reference haplotypes generated by the 1000 Genomes project and formatted so that they are ready for analysis are available from the [http://www.sph.umich.edu/csg/abecasis/MaCH/download/ MaCH download page]. In our hands, it is ideal to always use the most recent release since generation of additional sequence data, improvements in variant discovery, genotyping and haplotyping strategies typically result in noticeable improvements in data quality. When this page was written, the most recent release of 1000 Genomes project haplotypes was the [http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-03-14.html Integrated Phase I release], including 1092 individuals. We recommend to use the reduced GIANT all population panel (no monomorphic and singleton sites)
+
Reference haplotypes generated by the 1000 Genomes project and formatted so that they are ready for analysis are available from the [http://csg.sph.umich.edu/abecasis/MaCH/download/ MaCH download page]. In our hands, it is ideal to always use the most recent release since generation of additional sequence data, improvements in variant discovery, genotyping and haplotyping strategies typically result in noticeable improvements in data quality. When this page was written, the most recent release of 1000 Genomes project haplotypes was the [http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-03-14.html Integrated Phase I release], including 1092 individuals. We recommend using the reduced GIANT all population panel (no monomorphic and singleton sites)
 
[ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/GIANT.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz GIANT.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz]
 
[ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/GIANT.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz GIANT.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz]
   Line 57: Line 57:  
   mach1 -d chr1.dat -p chr1.ped --rounds 20 --states 200 --phase --interim 5 --sample 5 --prefix chr$chr.haps
 
   mach1 -d chr1.dat -p chr1.ped --rounds 20 --states 200 --phase --interim 5 --sample 5 --prefix chr$chr.haps
   −
This will request that MaCH estimate haplotypes for your sample, using 20 iterations of its Markov sampler and conditioning each update on up to 200 haplotypes. A summary description of these parameters follows (but for a more complete description, you should go to the MaCH website):
+
This will request that MaCH estimate haplotypes for your sample, using 20 iterations of its Markov sampler and conditioning each update on up to 200 haplotypes. A summary description of these parameters follows (but for a more complete description, you should go to the [http://csg.sph.umich.edu/abecasis/MaCH/ MaCH website]):
    
{| class="wikitable" border="1" cellpadding="2"
 
{| class="wikitable" border="1" cellpadding="2"
Line 65: Line 65:  
|-  
 
|-  
 
|style=white-space:nowrap|<code>-d sample.dat</code>
 
|style=white-space:nowrap|<code>-d sample.dat</code>
| Data file in [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin format]. Markers should be listed according to their order along the chromosome.
+
| Data file in [http://csg.sph.umich.edu/abecasis/Merlin/tour/input_files.html Merlin format]. Markers should be listed according to their order along the chromosome.
 
|-  
 
|-  
 
| <code>-p sample.ped</code>
 
| <code>-p sample.ped</code>
| Pedigree file in [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin format]. Alleles should be labeled on the forward strand.
+
| Pedigree file in [http://csg.sph.umich.edu/abecasis/Merlin/tour/input_files.html Merlin format]. Alleles should be labeled on the forward strand.
 
|-
 
|-
 
| <code>--states 200</code>
 
| <code>--states 200</code>
Line 102: Line 102:  
=== Imputation into Phased Haplotypes - minimac ===
 
=== Imputation into Phased Haplotypes - minimac ===
   −
Imputing genotypes using '''minimac''' is a straightforward process: after selecting a set of reference haplotypes, plugging-in the target haplotypes from the previous step and setting the number of rounds to use for estimating model parameters (which describe the length and conservation of haplotype stretches shared between the reference panel and your study samples), imputation should proceed rapidly. Because marker names can change between dbSNP versions, it is usually a good idea to include ''aliases'' file that provides mappings between earlier marker names and the current preferred name for each polymorphism.
+
Imputing genotypes using '''minimac''' is a straightforward process: after selecting a set of reference haplotypes, plugging-in the target haplotypes from the previous step and setting the number of rounds to use for estimating model parameters (which describe the length and conservation of haplotype stretches shared between the reference panel and your study samples), imputation should proceed rapidly. Because marker names can change between dbSNP versions, it is usually a good idea to include an ''aliases'' file that provides mappings between earlier marker names and the current preferred name for each polymorphism.
    
A typical minimac command line, where the string $chr should be replaced with an appropriate chromosome number, might look like this:
 
A typical minimac command line, where the string $chr should be replaced with an appropriate chromosome number, might look like this:
Line 109: Line 109:  
  minimac --vcfReference --refHaps ref.vcf.gz --haps target.hap.gz --snps target.snps.gz --rounds 5 --states 200 --prefix results
 
  minimac --vcfReference --refHaps ref.vcf.gz --haps target.hap.gz --snps target.snps.gz --rounds 5 --states 200 --prefix results
 
  Note: GWAS SNPs (file --snps target.snps.gz) are by default expected to be in the chr:pos format e.g. 1:1000 and on build37/hg19;  
 
  Note: GWAS SNPs (file --snps target.snps.gz) are by default expected to be in the chr:pos format e.g. 1:1000 and on build37/hg19;  
           otherwise, please set the --rs flag  
+
           otherwise, please set the --rs flag and include an aliases file --snpAliase [http://www.sph.umich.edu/csg/abecasis/downloads/dbsnp134-merges.txt.gz dbsnp134-merges.txt.gz]
    
   
 
   
Line 120: Line 120:  
|-  
 
|-  
 
| <code>--refHaps ref.hap.gz </code>  
 
| <code>--refHaps ref.hap.gz </code>  
| Reference haplotypes (e.g. from [http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G-2010-06.html MaCH download page])
+
| Reference haplotypes (e.g. from [http://csg.sph.umich.edu/abecasis/MACH/download/1000G-2010-06.html MaCH download page])
 
|-
 
|-
 
| <code>--vcfReference </code>  
 
| <code>--vcfReference </code>  
Line 143: Line 143:       −
Again, you can speed-up things by running this step in parallel and in our cluster we'd use:
+
Again, you can speed up things by running imputation in parallel. For example, in our cluster we'd use:
    
<source lang="text">
 
<source lang="text">
Line 150: Line 150:  
     runon -m 1024 minimac --refHaps ref.hap.$chr.gz  --vcfReference  \
 
     runon -m 1024 minimac --refHaps ref.hap.$chr.gz  --vcfReference  \
 
                           --haps chr$chr.haps.gz --snps chr.$chr.snps --prefix chr$chr.imputed                           
 
                           --haps chr$chr.haps.gz --snps chr.$chr.snps --prefix chr$chr.imputed                           
 +
  end
 +
</source>
 +
 +
 +
Furthermore, minimac itself can be run in parallel by launching the [http://genome.sph.umich.edu/wiki/Minimac#Multiprocessor_Version minimac-omp] version. On our cluster 4 cpus per minimac is optimal (--cpus 4).
 +
 +
<source lang="text">
 +
  foreach chr (`seq 1 22`)
 +
 +
    runon -m 1024 minimac-omp --cpus 4 --refHaps ref.hap.$chr.gz  --vcfReference  \
 +
                          --haps chr$chr.haps.gz --snps chr.$chr.snps --prefix chr$chr.imputed                   
 
   end
 
   end
 
</source>
 
</source>
    
== Further Time Savings ==
 
== Further Time Savings ==
 +
 +
=== Target based chunking ===
    
The recipe above imputes whole chromosomes, one a time. A further time savings is possible by imputing chromosomes in chunks, a process that can be facilitated using the [[ChunkChromosome]] tool. This tool automates some of the manual editing steps that would be required to divide each chromosome into more manageable chunks. It also interfaces with minimac to ensure SNPs that overlap between chunks are only imputed once.
 
The recipe above imputes whole chromosomes, one a time. A further time savings is possible by imputing chromosomes in chunks, a process that can be facilitated using the [[ChunkChromosome]] tool. This tool automates some of the manual editing steps that would be required to divide each chromosome into more manageable chunks. It also interfaces with minimac to ensure SNPs that overlap between chunks are only imputed once.
Line 267: Line 280:     
Although the loss in accuracy will typically be small, analyses of small chromosome chunks will necessarily result in slightly less accurate results. The differences will be large in populations where haplotype sharing is expected to extend further (for example, in Finland and Sardinia we find that haplotypes can be estimated accurately over several megabases and focusing on small chunks of chromosome will make these matches harder to identify).
 
Although the loss in accuracy will typically be small, analyses of small chromosome chunks will necessarily result in slightly less accurate results. The differences will be large in populations where haplotype sharing is expected to extend further (for example, in Finland and Sardinia we find that haplotypes can be estimated accurately over several megabases and focusing on small chunks of chromosome will make these matches harder to identify).
 +
 +
=== Reference based chunking ===
 +
 +
REMARK: this works only with SNPIDs in the chr:pos format, or in other words, rs numbers are not supported here.
 +
 +
In case you didn't split your data with ChunkChromosome, we highly recommend running imputation using our reference based chunking method (minimac.11.16.12 or newer required)
 +
 +
<source lang="bash">
 +
#!/bin/sh
 +
 +
chromend=57000000 # for chr 20
 +
chunksize=7000000
 +
overlap=250000
 +
vcfref="your_vcf_ref.vcf"
 +
gwas="your_gwas_filename"
 +
out="your_output_filename"
 +
 +
for start in `seq 0 $chunksize $chromend`
 +
do
 +
 +
  end=$(($start+$chunksize))
 +
  start=$((start+1))
 +
 +
  minimac --vcfReference \
 +
  --refHaps $vcfref \
 +
  --snps ${gwas}.snps --haps ${gwas}.haps.gz \
 +
  --vcfstart $start --vcfend $end \
 +
  --vcfwindow $overlap \
 +
  --prefix $out.$start.$end > $out.$start.$end.log &
 +
 +
done
 +
 +
</source>
    
== X Chromosome Imputation ==
 
== X Chromosome Imputation ==
minimac (version 8.15.12 or newer) supports the imputation of genotypes on the X chromosome
+
minimac (version 9.22.12 or newer) supports the imputation of genotypes on the X chromosome
 
(reference haplotypes can be found [ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/GIANT.chrX.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz here])
 
(reference haplotypes can be found [ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/GIANT.chrX.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz.tgz here])
 +
 +
X chromosome imputation consists of the following 3 steps:
    
=== 1. Split X chromosome pedigree file by sex  ===
 
=== 1. Split X chromosome pedigree file by sex  ===
    
# For females: follow the same protocol as for autosomes
 
# For females: follow the same protocol as for autosomes
# For males
+
# For males: they have only one X chromosome and are therefore already phased. Simply convert the pedigree file into a MaCH haplotype file (missing genotypes should be encoded as:  "0" or "." or "N" )
## Convert the pedigree file into a MaCH haplotype file (missing genotypes should be encoded as:  "0" or "." or "N" )
        Line 322: Line 369:  
Be aware that, unfortunately, imputation quality statistics are not directly comparable between different imputation programs (MaCH/minimac vs. Impute vs. Beagle etc.).
 
Be aware that, unfortunately, imputation quality statistics are not directly comparable between different imputation programs (MaCH/minimac vs. Impute vs. Beagle etc.).
   −
== GWAS result QC and SNP harmonization ==
+
= Reference =
 
  −
More details to follow; in the meantime please check our GWAtoolbox R-package out:
  −
 
  −
[http://www.eurac.edu/en/research/institutes/centerofbiomedicine/Software/GWAtoolbox.html GWAtoolbox]
      +
If you use minimac, please cite:
    +
Howie B, Fuchsberger C, Stephens M, Marchini J, and Abecasis GR.
 +
Fast and accurate genotype imputation in genome-wide association studies
 +
through pre-phasing. Nature Genetics 2012 [http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.2354.html]
    
== Questions and Comments ==
 
== Questions and Comments ==
    
Please contact [mailto:goncalo@umich.edu Goncalo Abecasis], [mailto:cfuchsb@umich.edu Christian Fuchsberger (minimac)] or [mailto:yunli@med.unc.edu Yun Li (MaCH)].
 
Please contact [mailto:goncalo@umich.edu Goncalo Abecasis], [mailto:cfuchsb@umich.edu Christian Fuchsberger (minimac)] or [mailto:yunli@med.unc.edu Yun Li (MaCH)].
550

edits

Navigation menu