Changes

From Genome Analysis Wiki
Jump to navigationJump to search
Line 8: Line 8:     
This page documents how to carry out imputation using [http://mathgen.stats.ox.ac.uk/impute/impute_v2.html IMPUTE2] software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes.
 
This page documents how to carry out imputation using [http://mathgen.stats.ox.ac.uk/impute/impute_v2.html IMPUTE2] software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes.
 +
For information on how to carry out 1000 Genomes Imputation using [[Minimac]], see [[Minimac:_1000_Genomes_Imputation_Cookbook]].
 +
 +
As of now, the most recent version of the 1000 Genome Project haplotypes are the Phase I Integrated release haplotypes (available in IMPUTE2 format from [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html this link]). This haplotype set is more complete and more accurate than prior 1000 Genome Project haplotypes and is recommended for all imputation based analyses.
    
= Before Imputation =
 
= Before Imputation =
Line 33: Line 36:  
[http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html GTOOL] can be used to convert data from PLINK PED format to IMPUTE format.
 
[http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html GTOOL] can be used to convert data from PLINK PED format to IMPUTE format.
   −
= Pre-Phasing =
+
If you use SHAPEIT for pre-phasing (see next section), you do not have to convert autosomal chromosomes to IMPUTE format because SHAPEIT can take input data in either PLINK PED or BED format, or IMPUTE format.
 +
 
 +
= Pre-Phasing (autosomal chromosomes only) =
    
With large reference panels, such as those now produced by the 1000 Genome Project, it is recommended that  
 
With large reference panels, such as those now produced by the 1000 Genome Project, it is recommended that  
 
imputation should be carried out in two steps. In a first pre-phasing step, haplotypes are estimated for all available samples. In a second step, missing alleles are imputed directly onto these phased haplotypes. The process is computationally efficient and, if 1000 Genome Project haplotypes are updated, only the second step must be repeated.  
 
imputation should be carried out in two steps. In a first pre-phasing step, haplotypes are estimated for all available samples. In a second step, missing alleles are imputed directly onto these phased haplotypes. The process is computationally efficient and, if 1000 Genome Project haplotypes are updated, only the second step must be repeated.  
   −
Drs. Bryan Howie and Jonathan Marchini provide examples of this two step imputation process [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz on their website].
+
Drs. Bryan Howie and Jonathan Marchini provide descriptive and examples of this two step imputation process [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz on their website].  
 +
 
 +
There are two programs that can be used for pre-phasing: IMPUTE2 or SHAPEIT. Drs. Bryan Howie and Jonathan Marchini recommend SHAPEIT which uses an accurate phasing method. Output from SHAPEIT is in IMPUTE2 format for next imputation step.
 +
 
    
== Sliding Window Analyses ==
 
== Sliding Window Analyses ==
Line 50: Line 58:  
== Pre-Phasing using IMPUTE2 ==
 
== Pre-Phasing using IMPUTE2 ==
   −
You should now be ready to run the first analysis step, which is to estimate phased haplotypes for each sample. The script below, named ''prototype_phasing_job.sh''(based on the [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz example script] on the IMPUTE2 website), illustrates how to do this. It requires three parameters (chromosome number, interval start and interval end) and assumes that input data files will in a directory called <code>data_files</code>, with results in <code>results</code> directory and estimated haplotypes in a <code>sampled_haps</code> directory. It has been modified to specify the build 37 recombination map should be used and to include the <code>-allow_large_regions</code> parameter.
+
You should now be ready to run the first analysis step, which is to estimate phased haplotypes for each sample. The script below, named ''prototype_phasing_job.sh''(based on the [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz example script] and an outdated document on the IMPUTE2 website), illustrates how to do this. It requires three parameters (chromosome number, interval start and interval end) and assumes that input data files will in a directory called <code>data_files</code>, with results in <code>results</code> directory. It has been modified to specify the build 37 recombination map should be used and to include the <code>-allow_large_regions</code> parameter and to set the effective population size NE=20,000 (as recommended in the IMPUTE2 website for the more recent 1000 Genome Project builds).
    
<source lang="bash">
 
<source lang="bash">
Line 64: Line 72:  
DATA_DIR=${ROOT_DIR}data_files/
 
DATA_DIR=${ROOT_DIR}data_files/
 
RESULTS_DIR=${ROOT_DIR}results/
 
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/
     −
# executable
+
# executable (must be IMPUTE version 2.2.0 or later)
 
IMPUTE2_EXEC=${ROOT_DIR}impute2
 
IMPUTE2_EXEC=${ROOT_DIR}impute2
    
# parameters
 
# parameters
K=80
+
NE=20000
ITER=30
  −
BURNIN=10
  −
NE=11500
      
# reference data files
 
# reference data files
Line 86: Line 90:  
## phase GWAS genotypes
 
## phase GWAS genotypes
 
$IMPUTE2_EXEC \
 
$IMPUTE2_EXEC \
 +
    -prephase_g \
 
     -m $GENMAP_FILE \
 
     -m $GENMAP_FILE \
 
     -g $GWAS_GTYPE_FILE \
 
     -g $GWAS_GTYPE_FILE \
    -phase \
  −
    -include_buffer_in_output \
  −
    -stage_one \
  −
    -hap_samp_dir $HAP_SAMP_DIR \
  −
    -k $K \
  −
    -iter $ITER \
  −
    -burnin $BURNIN \
   
     -Ne $NE \
 
     -Ne $NE \
 
     -int $CHUNK_START $CHUNK_END \
 
     -int $CHUNK_START $CHUNK_END \
Line 101: Line 99:  
</source>
 
</source>
   −
where the red-coloured text were modified by us to allow large regions and to use build 37 map. Then you can submit batch jobs using a grid engine for parallel analysis.  The series of commands to run would look something like:
+
Then you can submit batch jobs using a grid engine for parallel analysis.  The series of commands to run would look something like:
    
<source lang="bash">
 
<source lang="bash">
Line 111: Line 109:  
where <code>sge</code> should be replaced with the appropriate command for submitting jobs to your cluster (<code>sge</code> applies to sun grid engine, other common choices might be <code>qsub</code> and <code>mosrun</code>). The three numbers correspond to chromosome and chunk start and end positions.
 
where <code>sge</code> should be replaced with the appropriate command for submitting jobs to your cluster (<code>sge</code> applies to sun grid engine, other common choices might be <code>qsub</code> and <code>mosrun</code>). The three numbers correspond to chromosome and chunk start and end positions.
   −
On a 30 node cluster, phasing should take approximately 5 hours per 1000 individuals.
+
= Pre-Phasing using SHAPEIT (recommended) =
 +
It is recommended to phase a whole chromosome using SHAPEIT.  The script below, named shapeit_phasing_job.sh(based on the example script on the SHAPEIT website), illustrates how to do this. The input GWAS data in the script below is in PLINK BED format, while data in PLINK PED format and IMPUTE format are also accepted. It requires one parameter (chromosome number). The option parameters --effective-size 11418 and --thread 8  reflect an example European population and 8-core computer.
 +
 
 +
<source lang="bash">
 +
#!/bin/bash
 +
#$ -cwd
 +
 +
CHR=$1
 +
 +
# directories
 +
ROOT_DIR=./
 +
DATA_DIR=${ROOT_DIR}data_files/
 +
RESULTS_DIR=${ROOT_DIR}results/
 +
 +
# executable
 +
SHAPEIT_EXEC=${ROOT_DIR}shapeit
 +
 +
# parameters
 +
THREAT=4
 +
EFFECTIVESIZE=11418
 +
 +
# reference data files
 +
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
 +
 +
# GWAS data files in PLINK BED format
 +
GWASDATA_BED=${DATA_DIR}gwas_data_chr${CHR}.bed
 +
GWASDATA_BIM=${DATA_DIR}gwas_data_chr${CHR}.bim
 +
GWASDATA_FAM=${DATA_DIR}gwas_data_chr${CHR}.fam
 +
 
 +
# main output file
 +
OUTPUT_HAPS=${RESULTS_DIR}chr${CHR}.haps
 +
OUTPUT_SAMPLE=${RESULTS_DIR}chr${CHR}.sample
 +
OUTPUT_LOG=${RESULTS_DIR}chr${CHR}.log
 +
 +
## phase GWAS genotypes
 +
$SHAPEIT_EXEC \
 +
--input-bed $GWASDATA_BED $GWASDATA_BIM $GWASDATA_FAM \
 +
--input-map $GENMAP_FILE \
 +
--thread $THREAT \
 +
--effective-size $EFFECTIVESIZE \
 +
--output-max $OUTPUT_HAPS $OUTPUT_SAMPLE \
 +
--output-log $OUTPUT_LOG
 +
</source>
 +
 
 +
 
 +
Similarly, you can submit batch jobs. The series of commands to run would look something like:
 +
 
 +
<source lang="bash">
 +
sge ./shapeit_phasing_job.sh 1
 +
sge ./shapeit_phasing_job.sh 2
 +
sge ./shapeit_phasing_job.sh 3
 +
</source>
    
= Imputation =
 
= Imputation =
   −
There are two choices for imputation step: imputing from best-guess haplotypes and imputing from a sample of alternate haplotype configurations.
+
Below is a lightly modified version of script “prototype_imputation_job_best_guess_haps.sh” that accomplishes imputation. The original script is included in the package of [http://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz IMPUTE2 examples]. This version has been modified to reference the most recent set of 1000 Genome Haplotypes (currently, the Phase I integrated haplotypes [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html available from the IMPUTE2 website]). We recommend to use the "ALL (macGT1)" reference haplotypes and to include the <code>-seed</code> parameter, which ensures results can be reproduced by running the script again.
 
  −
Imputing from best-guess haplotypes uses the best-guess haplotypes is much faster and we recommend it. Below is a lightly modified version of script “prototype_imputation_job_best_guess_haps.sh” that accomplishes imputation. The script has been modified to reference the most recent set of 1000 Genome Haplotypes (currently, the interim Phase I haplotypes [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_interim.html available from the IMPUTE2 website]) and to include the <code>-seed</code> which ensures results can be reproduced by running the script again.
      
<source lang="bash">
 
<source lang="bash">
 
#!/bin/bash
 
#!/bin/bash
 
#$ -cwd
 
#$ -cwd
 
+
 
CHR=$1
 
CHR=$1
 
CHUNK_START=`printf "%.0f" $2`
 
CHUNK_START=`printf "%.0f" $2`
 
CHUNK_END=`printf "%.0f" $3`
 
CHUNK_END=`printf "%.0f" $3`
 
+
 
# directories
 
# directories
 
ROOT_DIR=./
 
ROOT_DIR=./
 
DATA_DIR=${ROOT_DIR}data_files/
 
DATA_DIR=${ROOT_DIR}data_files/
 
RESULTS_DIR=${ROOT_DIR}results/
 
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/
+
 
+
# executable (must be IMPUTE version 2.2.0 or later)
# executable
   
IMPUTE2_EXEC=${ROOT_DIR}impute2
 
IMPUTE2_EXEC=${ROOT_DIR}impute2
 
+
 
# parameters
 
# parameters
NE=11500
+
NE=20000
 
+
 
## MODIFY THE FOLLOWING THREE LINES TO ACCOMODATE OTHER PANELS
 
## MODIFY THE FOLLOWING THREE LINES TO ACCOMODATE OTHER PANELS
 
# reference data files
 
# reference data files
 
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
 
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
HAPS_FILE=${DATA_DIR}ALL_1000G_phase1interim_jun2011_chr${CHR}_impute.hap.gz
+
HAPS_FILE=${DATA_DIR}ALL_1000G_phase1integrated_v3_chr${CHR}_impute.hap.gz
LEGEND_FILE=${DATA_DIR}ALL_1000G_phase1interim_jun2011_chr${CHR}_impute.legend
+
LEGEND_FILE=${DATA_DIR}ALL_1000G_phase1integrated_v3_chr${CHR}_impute.legend.gz
 
+
 
## THESE HAPLOTYPES WOULD BE GENERATED BY THE PREVIOUS SCRIPT
 
## THESE HAPLOTYPES WOULD BE GENERATED BY THE PREVIOUS SCRIPT
# best-guess haplotypes from phasing run
+
## SELECT ONE FROM METHOD-A AND METHOD-B BELOW
GWAS_BG_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps
+
# METHOD-A: haplotypes from IMPUTE2 phasing run
 +
GWAS_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps
    +
# METHOD-B: haplotypes from SHAPEIT phasing run
 +
GWAS_HAPS_FILE=${RESULTS_DIR}chr${CHR}.haps
 +
 
# main output file
 
# main output file
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.best_guess_haps_imputation.impute2
+
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.impute2
 
+
## impute genotypes from best-guess GWAS haplotypes
+
## impute genotypes from GWAS haplotypes
 
$IMPUTE2_EXEC \
 
$IMPUTE2_EXEC \
 
   -m $GENMAP_FILE \
 
   -m $GENMAP_FILE \
   -known_haps_g $GWAS_BG_HAPS_FILE \
+
   -known_haps_g $GWAS_HAPS_FILE \
 
   -h $HAPS_FILE \
 
   -h $HAPS_FILE \
 
   -l $LEGEND_FILE \
 
   -l $LEGEND_FILE \
Line 165: Line 215:  
</source>
 
</source>
   −
The syntax for starting each of these jobs would be similar for the phasing jobs and, again, you should use a suitable grid or cluster engine to submit multiple jobs in parallel.  
+
Filtering Option -filt_rules_l is used in this example to remove reference SNPs with MAF=0 in European reference panel (for example, a European cohort).
 +
 
 +
The syntax for starting each of these jobs would be similar for the phasing jobs and, again, you should use a suitable grid or cluster engine to submit multiple jobs in parallel.
 +
 
 +
= X-Chromosome Imputation =
 +
 
 +
Please refer to IMPUTE2 example for chromosome X imputation and IMPUTE2 chromosome X options.  Please note the difference on the pseudoautosomal and non-pseudoautosomal regions.
 +
 
 +
= Association Analysis =
   −
Imputing from a sample of alternate haplotype configurations could be achieved by modifying the “prototype_imputation_job_posterior_sampled_haps.sh” script.
+
If you got this far, the final step is to run SNPTEST or another appropriate tool using the imputation results as input.  
   −
= Association Analysis ==
+
If you are running SNP test, two useful pieces of advice are to:
   −
If you got this far, the final step is to run SNPTEST or another appropriate tool using the imputation results as input.
+
* Concatenate the dosage files for each chromosome (so that you have a 22 files to manage rather than several hundred).
 +
* Remember that the per SNP imputation accuracy score (IMPUTE's ''INFO'' field) is calculate by SNPTEST and will be present in the SNPTEST output.
550

edits

Navigation menu