Changes

From Genome Analysis Wiki
Jump to navigationJump to search
Line 8: Line 8:     
This page documents how to carry out imputation using [http://mathgen.stats.ox.ac.uk/impute/impute_v2.html IMPUTE2] software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes.
 
This page documents how to carry out imputation using [http://mathgen.stats.ox.ac.uk/impute/impute_v2.html IMPUTE2] software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes.
 +
For information on how to carry out 1000 Genomes Imputation using [[Minimac]], see [[Minimac:_1000_Genomes_Imputation_Cookbook]].
 +
 +
As of now, the most recent version of the 1000 Genome Project haplotypes are the Phase I Integrated release haplotypes (available in IMPUTE2 format from [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html this link]). This haplotype set is more complete and more accurate than prior 1000 Genome Project haplotypes and is recommended for all imputation based analyses.
    
= Before Imputation =
 
= Before Imputation =
Line 33: Line 36:  
[http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html GTOOL] can be used to convert data from PLINK PED format to IMPUTE format.
 
[http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html GTOOL] can be used to convert data from PLINK PED format to IMPUTE format.
   −
= Pre-Phasing =
+
If you use SHAPEIT for pre-phasing (see next section), you do not have to convert autosomal chromosomes to IMPUTE format because SHAPEIT can take input data in either PLINK PED or BED format, or IMPUTE format.
 +
 
 +
= Pre-Phasing (autosomal chromosomes only) =
    
With large reference panels, such as those now produced by the 1000 Genome Project, it is recommended that  
 
With large reference panels, such as those now produced by the 1000 Genome Project, it is recommended that  
 
imputation should be carried out in two steps. In a first pre-phasing step, haplotypes are estimated for all available samples. In a second step, missing alleles are imputed directly onto these phased haplotypes. The process is computationally efficient and, if 1000 Genome Project haplotypes are updated, only the second step must be repeated.  
 
imputation should be carried out in two steps. In a first pre-phasing step, haplotypes are estimated for all available samples. In a second step, missing alleles are imputed directly onto these phased haplotypes. The process is computationally efficient and, if 1000 Genome Project haplotypes are updated, only the second step must be repeated.  
   −
Drs. Bryan Howie and Jonathan Marchini provide examples of this two step imputation process [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz on their website].
+
Drs. Bryan Howie and Jonathan Marchini provide descriptive and examples of this two step imputation process [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz on their website].  
 +
 
 +
There are two programs that can be used for pre-phasing: IMPUTE2 or SHAPEIT. Drs. Bryan Howie and Jonathan Marchini recommend SHAPEIT which uses an accurate phasing method. Output from SHAPEIT is in IMPUTE2 format for next imputation step.
 +
 
    
== Sliding Window Analyses ==
 
== Sliding Window Analyses ==
Line 48: Line 56:  
Centromere locations are available here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
 
Centromere locations are available here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
   −
== Pre-Phasing using IMPUTE2 ===
+
== Pre-Phasing using IMPUTE2 ==
   −
You should now be ready to run the first analysis step, which is to estimate phased haplotypes for each sample. The script below, named ''prototype_phasing_job.sh''(based on the [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz example script] on the IMPUTE2 website), illustrates how to do this. It requires three parameters (chromosome number, interval start and interval end) and assumes that input data files will in a directory called <code>data_files</code>, with results in <code>results</code> directory and estimated haplotypes in a <code>sampled_haps</code> directory. It has been modified to specify the build 37 recombination map should be used and to include the <source>-allow_large_regions</source> parameter.
+
You should now be ready to run the first analysis step, which is to estimate phased haplotypes for each sample. The script below, named ''prototype_phasing_job.sh''(based on the [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz example script] and an outdated document on the IMPUTE2 website), illustrates how to do this. It requires three parameters (chromosome number, interval start and interval end) and assumes that input data files will in a directory called <code>data_files</code>, with results in <code>results</code> directory. It has been modified to specify the build 37 recombination map should be used and to include the <code>-allow_large_regions</code> parameter and to set the effective population size NE=20,000 (as recommended in the IMPUTE2 website for the more recent 1000 Genome Project builds).
    
<source lang="bash">
 
<source lang="bash">
Line 64: Line 72:  
DATA_DIR=${ROOT_DIR}data_files/
 
DATA_DIR=${ROOT_DIR}data_files/
 
RESULTS_DIR=${ROOT_DIR}results/
 
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/
     −
# executable
+
# executable (must be IMPUTE version 2.2.0 or later)
 
IMPUTE2_EXEC=${ROOT_DIR}impute2
 
IMPUTE2_EXEC=${ROOT_DIR}impute2
    
# parameters
 
# parameters
K=80
+
NE=20000
ITER=30
  −
BURNIN=10
  −
NE=11500
      
# reference data files
 
# reference data files
Line 86: Line 90:  
## phase GWAS genotypes
 
## phase GWAS genotypes
 
$IMPUTE2_EXEC \
 
$IMPUTE2_EXEC \
 +
    -prephase_g \
 
     -m $GENMAP_FILE \
 
     -m $GENMAP_FILE \
 
     -g $GWAS_GTYPE_FILE \
 
     -g $GWAS_GTYPE_FILE \
    -phase \
  −
    -include_buffer_in_output \
  −
    -stage_one \
  −
    -hap_samp_dir $HAP_SAMP_DIR \
  −
    -k $K \
  −
    -iter $ITER \
  −
    -burnin $BURNIN \
   
     -Ne $NE \
 
     -Ne $NE \
 
     -int $CHUNK_START $CHUNK_END \
 
     -int $CHUNK_START $CHUNK_END \
-o $OUTPUT_FILE \
+
    -o $OUTPUT_FILE \
-allow_large_regions
+
    -allow_large_regions
 
</source>
 
</source>
   −
where the red-coloured text were modified by us to allow large regions and to use build 37 map. Then you can submit batch jobs using SGE (sun grid engine) for parallel analysis.  The batch file is something like
+
Then you can submit batch jobs using a grid engine for parallel analysis.  The series of commands to run would look something like:
    
<source lang="bash">
 
<source lang="bash">
Line 109: Line 107:  
</source>
 
</source>
   −
where sge is SGE command for submitting jobs (it may be qsub in your system), the three numbers are chr and the pair of chunk region.
+
where <code>sge</code> should be replaced with the appropriate command for submitting jobs to your cluster (<code>sge</code> applies to sun grid engine, other common choices might be <code>qsub</code> and <code>mosrun</code>). The three numbers correspond to chromosome and chunk start and end positions.
   −
Timing: with 30 nodes cluster, it took 7 hours for a 1400 samples data.
+
= Pre-Phasing using SHAPEIT (recommended) =
 
+
It is recommended to phase a whole chromosome using SHAPEIT. The script below, named shapeit_phasing_job.sh(based on the example script on the SHAPEIT website), illustrates how to do this. The input GWAS data in the script below is in PLINK BED format, while data in PLINK PED format and IMPUTE format are also accepted. It requires one parameter (chromosome number). The option parameters --effective-size 11418 and --thread 8 reflect an example European population and 8-core computer.  
6. Imputation:
  −
 
  −
There are two choices for imputation step: imputing from best-guess haplotypes, or imputing from posterior haplotypes.
  −
 
  −
Imputing from best-guess haplotypes uses the best-guess haplotypes output from the phasing step. It is simple and fasterRefer to the script “prototype_imputation_job_best_guess_haps.sh”:
      +
<source lang="bash">
 
#!/bin/bash
 
#!/bin/bash
 
#$ -cwd
 
#$ -cwd
 
+
 
CHR=$1
 
CHR=$1
CHUNK_START=`printf "%.0f" $2`
+
CHUNK_END=`printf "%.0f" $3`
  −
 
   
# directories
 
# directories
 
ROOT_DIR=./
 
ROOT_DIR=./
 
DATA_DIR=${ROOT_DIR}data_files/
 
DATA_DIR=${ROOT_DIR}data_files/
 
RESULTS_DIR=${ROOT_DIR}results/
 
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/
+
 
   
# executable
 
# executable
IMPUTE2_EXEC=${ROOT_DIR}impute2
+
SHAPEIT_EXEC=${ROOT_DIR}shapeit
 
+
 
# parameters
 
# parameters
NE=11500
+
THREAT=4
 
+
EFFECTIVESIZE=11418
 +
 
# reference data files
 
# reference data files
 
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
 
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
HAPS_FILE=${DATA_DIR}EUR.chr${CHR}.impute.hap
+
LEGEND_FILE=${DATA_DIR}EUR.chr${CHR}.impute.legend
+
# GWAS data files in PLINK BED format
 +
GWASDATA_BED=${DATA_DIR}gwas_data_chr${CHR}.bed
 +
GWASDATA_BIM=${DATA_DIR}gwas_data_chr${CHR}.bim
 +
GWASDATA_FAM=${DATA_DIR}gwas_data_chr${CHR}.fam
 +
 
 +
# main output file
 +
OUTPUT_HAPS=${RESULTS_DIR}chr${CHR}.haps
 +
OUTPUT_SAMPLE=${RESULTS_DIR}chr${CHR}.sample
 +
OUTPUT_LOG=${RESULTS_DIR}chr${CHR}.log
 +
 +
## phase GWAS genotypes
 +
$SHAPEIT_EXEC \
 +
--input-bed $GWASDATA_BED $GWASDATA_BIM $GWASDATA_FAM \
 +
--input-map $GENMAP_FILE \
 +
--thread $THREAT \
 +
--effective-size $EFFECTIVESIZE \
 +
--output-max $OUTPUT_HAPS $OUTPUT_SAMPLE \
 +
--output-log $OUTPUT_LOG
 +
</source>
   −
# best-guess haplotypes from phasing run
  −
GWAS_BG_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps
     −
# main output file
+
Similarly, you can submit batch jobs. The series of commands to run would look something like:
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.best_guess_haps_imputation.impute2
     −
## impute genotypes from best-guess GWAS haplotypes
+
<source lang="bash">
$IMPUTE2_EXEC \
+
sge ./shapeit_phasing_job.sh 1
-m $GENMAP_FILE \
+
sge ./shapeit_phasing_job.sh 2
-known_haps_g $GWAS_BG_HAPS_FILE \
+
sge ./shapeit_phasing_job.sh 3
-h $HAPS_FILE \
+
</source>
-l $LEGEND_FILE \
  −
-Ne $NE \
  −
-int $CHUNK_START $CHUNK_END \
  −
-o $OUTPUT_FILE \
  −
-allow_large_regions \
  −
-seed 367946
     −
where option –seed is used to make sure the imputation results can be duplicated.
+
= Imputation =
   −
Imputing from posterior haplotypes uses the sampled haplotypes from phasing step to impute the untyped SNPs and then average across these imputations. It would take, for example, N times longer than it would take by imputing from best-guess haplotypes, where N  = No. iterations – No.burnin if you set option “-thin 1”. Refer to the script “prototype_imputation_job_posterior_sampled_haps.sh”:
+
Below is a lightly modified version of script “prototype_imputation_job_best_guess_haps.sh” that accomplishes imputation. The original script is included in the package of [http://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz IMPUTE2 examples]. This version has been modified to reference the most recent set of 1000 Genome Haplotypes (currently, the Phase I integrated haplotypes [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html available from the IMPUTE2 website]). We recommend to use the "ALL (macGT1)" reference haplotypes and to include the <code>-seed</code> parameter, which ensures results can be reproduced by running the script again.
    +
<source lang="bash">
 
#!/bin/bash
 
#!/bin/bash
 
#$ -cwd
 
#$ -cwd
 
+
 
CHR=$1
 
CHR=$1
 
CHUNK_START=`printf "%.0f" $2`
 
CHUNK_START=`printf "%.0f" $2`
 
CHUNK_END=`printf "%.0f" $3`
 
CHUNK_END=`printf "%.0f" $3`
 
+
 
# directories
 
# directories
 
ROOT_DIR=./
 
ROOT_DIR=./
 
DATA_DIR=${ROOT_DIR}data_files/
 
DATA_DIR=${ROOT_DIR}data_files/
 
RESULTS_DIR=${ROOT_DIR}results/
 
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/
+
 
+
# executable (must be IMPUTE version 2.2.0 or later)
# executable
   
IMPUTE2_EXEC=${ROOT_DIR}impute2
 
IMPUTE2_EXEC=${ROOT_DIR}impute2
 
+
 
# parameters
 
# parameters
ITER=30
+
NE=20000
BURNIN=10
+
THIN=1
+
## MODIFY THE FOLLOWING THREE LINES TO ACCOMODATE OTHER PANELS
NE=11500
  −
 
   
# reference data files
 
# reference data files
 
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
 
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
HAPS_FILE=${DATA_DIR}EUR.chr${CHR}.impute.hap
+
HAPS_FILE=${DATA_DIR}ALL_1000G_phase1integrated_v3_chr${CHR}_impute.hap.gz
LEGEND_FILE=${DATA_DIR}EUR.chr${CHR}.impute.legend
+
LEGEND_FILE=${DATA_DIR}ALL_1000G_phase1integrated_v3_chr${CHR}_impute.legend.gz
 +
 +
## THESE HAPLOTYPES WOULD BE GENERATED BY THE PREVIOUS SCRIPT
 +
## SELECT ONE FROM METHOD-A AND METHOD-B BELOW
 +
# METHOD-A: haplotypes from IMPUTE2 phasing run
 +
GWAS_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps
 +
 
 +
# METHOD-B: haplotypes from SHAPEIT phasing run
 +
GWAS_HAPS_FILE=${RESULTS_DIR}chr${CHR}.haps
 +
 +
# main output file
 +
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.impute2
 +
 +
## impute genotypes from GWAS haplotypes
 +
$IMPUTE2_EXEC \
 +
  -m $GENMAP_FILE \
 +
  -known_haps_g $GWAS_HAPS_FILE \
 +
  -h $HAPS_FILE \
 +
  -l $LEGEND_FILE \
 +
  -Ne $NE \
 +
  -int $CHUNK_START $CHUNK_END \
 +
  -o $OUTPUT_FILE \
 +
  -allow_large_regions \
 +
  -seed 367946
 +
</source>
   −
# GWAS data files
+
Filtering Option -filt_rules_l is used in this example to remove reference SNPs with MAF=0 in European reference panel (for example, a European cohort).
GWAS_GTYPE_FILE=${DATA_DIR}gwas_data_chr${CHR}.gen
     −
# main output file
+
The syntax for starting each of these jobs would be similar for the phasing jobs and, again, you should use a suitable grid or cluster engine to submit multiple jobs in parallel.
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.posterior_sampled_haps_imputation.impute2
     −
## impute genotypes from posterior-sampled GWAS haplotypes
+
= X-Chromosome Imputation =
$IMPUTE2_EXEC \
  −
-m $GENMAP_FILE \
  −
-g $GWAS_GTYPE_FILE \
  −
-stage_two \
  −
-hap_samp_dir $HAP_SAMP_DIR \
  −
-h $HAPS_FILE \
  −
-l $LEGEND_FILE \
  −
-iter $ITER \
  −
-burnin $BURNIN \
  −
-thin $THIN \
  −
-Ne $NE \
  −
-int $CHUNK_START $CHUNK_END \
  −
-o $OUTPUT_FILE \
  −
-allow_large_regions \
  −
-seed 367946
      +
Please refer to IMPUTE2 example for chromosome X imputation and IMPUTE2 chromosome X options.  Please note the difference on the pseudoautosomal and non-pseudoautosomal regions.
   −
Again, submit batch jobs using SGE for parallel analysis.
+
= Association Analysis =
   −
Timing: it took 46 hours for a 1400 samples data by imputing from posterior haplotypes with setting –iter 30, -burnin 10 and –thin 1.  It would take only about 2.3 hours (46/20) if imputing from best-guess haplotypes.
+
If you got this far, the final step is to run SNPTEST or another appropriate tool using the imputation results as input.  
   −
If using the suggested parameters for IMPUTE2, ie, K=80, ITER=30, BURNIN=10 and NE=11500, the results using best guess haplotypes and results using posterior haplotypes are extremely similar if well phased haplotypes were generated (see Goncalo’s email on 1st April 2011).  Therefore, if timing is a concern, imputing from best guess haplotypes may be your choice.
+
If you are running SNP test, two useful pieces of advice are to:
   −
7. Final step is to combine the imputed results into the shape for SNPTEST. (In order to feed our 30 nodes cluster and do parallel SNPTEST analysis, we combined our imputed results by chromosome and then split each chromosome into 30 partitions. Each partition contains about the same number of SNPs within a chromosome. It takes less than an hour to do a SNPTEST for 1400 samples)
+
* Concatenate the dosage files for each chromosome (so that you have a 22 files to manage rather than several hundred).  
 +
* Remember that the per SNP imputation accuracy score (IMPUTE's ''INFO'' field) is calculate by SNPTEST and will be present in the SNPTEST output.
550

edits

Navigation menu