Changes

From Genome Analysis Wiki
Jump to navigationJump to search
Line 8: Line 8:     
This page documents how to carry out imputation using [http://mathgen.stats.ox.ac.uk/impute/impute_v2.html IMPUTE2] software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes.
 
This page documents how to carry out imputation using [http://mathgen.stats.ox.ac.uk/impute/impute_v2.html IMPUTE2] software (developed by Jonathan Marchini and Bryan Howie) and 1000 Genomes reference panel haplotypes.
 +
For information on how to carry out 1000 Genomes Imputation using [[Minimac]], see [[Minimac:_1000_Genomes_Imputation_Cookbook]].
 +
 +
As of now, the most recent version of the 1000 Genome Project haplotypes are the Phase I Integrated release haplotypes (available in IMPUTE2 format from [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html this link]). This haplotype set is more complete and more accurate than prior 1000 Genome Project haplotypes and is recommended for all imputation based analyses.
    
= Before Imputation =
 
= Before Imputation =
Line 33: Line 36:  
[http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html GTOOL] can be used to convert data from PLINK PED format to IMPUTE format.
 
[http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html GTOOL] can be used to convert data from PLINK PED format to IMPUTE format.
   −
= Pre-Phasing =
+
If you use SHAPEIT for pre-phasing (see next section), you do not have to convert autosomal chromosomes to IMPUTE format because SHAPEIT can take input data in either PLINK PED or BED format, or IMPUTE format.
 +
 
 +
= Pre-Phasing (autosomal chromosomes only) =
 +
 
 +
With large reference panels, such as those now produced by the 1000 Genome Project, it is recommended that
 +
imputation should be carried out in two steps. In a first pre-phasing step, haplotypes are estimated for all available samples. In a second step, missing alleles are imputed directly onto these phased haplotypes. The process is computationally efficient and, if 1000 Genome Project haplotypes are updated, only the second step must be repeated.
 +
 
 +
Drs. Bryan Howie and Jonathan Marchini provide descriptive and examples of this two step imputation process [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz on their website].
 +
 
 +
There are two programs that can be used for pre-phasing: IMPUTE2 or SHAPEIT. Drs. Bryan Howie and Jonathan Marchini recommend SHAPEIT which uses an accurate phasing method. Output from SHAPEIT is in IMPUTE2 format for next imputation step.
 +
 
 +
 
 +
== Sliding Window Analyses ==
 +
 
 +
The IMPUTE2 authors recommend that chromosomes should analyzed in blocks of a few megabases each. In a cluster environment, these blocks can be conveniently analyzed at the same time. You won't need to physically split the genotype data - instead, you can define the region to be analyzed using the <code>-int ''start'' ''stop''</code> command line option. For example, to analyze a 5-Mb region starting at 10-Mb along the chromosome, you would specify <code>-int 10000001 15000000</code>.
 +
 
 +
There are two small wrinkles to consider.  First, if the chunks are all of the same size (in Mb), it is possible that a few chunks will have too many (or too few) SNPs to analyze. Second, in intervals that cross the centromere (which typically corresponds to large gap in 1000 Genome SNPs), it may happen that only a few SNPs from one arm are chunked together with SNPs from a different chromosome arm. To overcome these two concerns, we suggest that you merge adjacent chunks when one has very few (say, less than 200) SNPs in a chunk and that you try to avoid creating region splits that span the centromere. If a merged chunk extends further than 7-Mb, you will have to use the setting <code>-allow_large_regions</code> when you run IMPUTE2.
 +
 
 +
Centromere locations are available here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
 +
 
 +
== Pre-Phasing using IMPUTE2 ==
 +
 
 +
You should now be ready to run the first analysis step, which is to estimate phased haplotypes for each sample. The script below, named ''prototype_phasing_job.sh''(based on the [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz example script] and an outdated document on the IMPUTE2 website), illustrates how to do this. It requires three parameters (chromosome number, interval start and interval end) and assumes that input data files will in a directory called <code>data_files</code>, with results in <code>results</code> directory. It has been modified to specify the build 37 recombination map should be used and to include the <code>-allow_large_regions</code> parameter and to set the effective population size NE=20,000 (as recommended in the IMPUTE2 website for the more recent 1000 Genome Project builds).
 +
 
 +
<source lang="bash">
 +
#!/bin/bash
 +
#$ -cwd
 +
 
 +
CHR=$1
 +
CHUNK_START=`printf "%.0f" $2`
 +
CHUNK_END=`printf "%.0f" $3`
 +
 
 +
# directories
 +
ROOT_DIR=./
 +
DATA_DIR=${ROOT_DIR}data_files/
 +
RESULTS_DIR=${ROOT_DIR}results/
 +
 
 +
# executable (must be IMPUTE version 2.2.0 or later)
 +
IMPUTE2_EXEC=${ROOT_DIR}impute2
 +
 
 +
# parameters
 +
NE=20000
 +
 
 +
# reference data files
 +
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
 +
 
 +
# GWAS data files
 +
GWAS_GTYPE_FILE=${DATA_DIR}gwas_data_chr${CHR}.gen
 +
 
 +
# main output file
 +
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2
 +
 
 +
## phase GWAS genotypes
 +
$IMPUTE2_EXEC \
 +
    -prephase_g \
 +
    -m $GENMAP_FILE \
 +
    -g $GWAS_GTYPE_FILE \
 +
    -Ne $NE \
 +
    -int $CHUNK_START $CHUNK_END \
 +
    -o $OUTPUT_FILE \
 +
    -allow_large_regions
 +
</source>
 +
 
 +
Then you can submit batch jobs using a grid engine for parallel analysis.  The series of commands to run would look something like:
 +
 
 +
<source lang="bash">
 +
sge ./prototype_phasing_job.sh 1 1 5000000
 +
sge ./prototype_phasing_job.sh 1 5000001 10000000
 +
sge ./prototype_phasing_job.sh 1 10000001 15000000
 +
</source>
 +
 
 +
where <code>sge</code> should be replaced with the appropriate command for submitting jobs to your cluster (<code>sge</code> applies to sun grid engine, other common choices might be <code>qsub</code> and <code>mosrun</code>). The three numbers correspond to chromosome and chunk start and end positions.
 +
 
 +
= Pre-Phasing using SHAPEIT (recommended) =
 +
It is recommended to phase a whole chromosome using SHAPEIT.  The script below, named shapeit_phasing_job.sh(based on the example script on the SHAPEIT website), illustrates how to do this. The input GWAS data in the script below is in PLINK BED format, while data in PLINK PED format and IMPUTE format are also accepted. It requires one parameter (chromosome number). The option parameters --effective-size 11418 and --thread 8  reflect an example European population and 8-core computer.
 +
 
 +
<source lang="bash">
 +
#!/bin/bash
 +
#$ -cwd
 +
 +
CHR=$1
 +
 +
# directories
 +
ROOT_DIR=./
 +
DATA_DIR=${ROOT_DIR}data_files/
 +
RESULTS_DIR=${ROOT_DIR}results/
 +
 +
# executable
 +
SHAPEIT_EXEC=${ROOT_DIR}shapeit
 +
 +
# parameters
 +
THREAT=4
 +
EFFECTIVESIZE=11418
 +
 +
# reference data files
 +
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
 +
 +
# GWAS data files in PLINK BED format
 +
GWASDATA_BED=${DATA_DIR}gwas_data_chr${CHR}.bed
 +
GWASDATA_BIM=${DATA_DIR}gwas_data_chr${CHR}.bim
 +
GWASDATA_FAM=${DATA_DIR}gwas_data_chr${CHR}.fam
 +
 
 +
# main output file
 +
OUTPUT_HAPS=${RESULTS_DIR}chr${CHR}.haps
 +
OUTPUT_SAMPLE=${RESULTS_DIR}chr${CHR}.sample
 +
OUTPUT_LOG=${RESULTS_DIR}chr${CHR}.log
 +
 +
## phase GWAS genotypes
 +
$SHAPEIT_EXEC \
 +
--input-bed $GWASDATA_BED $GWASDATA_BIM $GWASDATA_FAM \
 +
--input-map $GENMAP_FILE \
 +
--thread $THREAT \
 +
--effective-size $EFFECTIVESIZE \
 +
--output-max $OUTPUT_HAPS $OUTPUT_SAMPLE \
 +
--output-log $OUTPUT_LOG
 +
</source>
 +
 
 +
 
 +
Similarly, you can submit batch jobs. The series of commands to run would look something like:
 +
 
 +
<source lang="bash">
 +
sge ./shapeit_phasing_job.sh 1
 +
sge ./shapeit_phasing_job.sh 2
 +
sge ./shapeit_phasing_job.sh 3
 +
</source>
 +
 
 +
= Imputation =
 +
 
 +
Below is a lightly modified version of script “prototype_imputation_job_best_guess_haps.sh” that accomplishes imputation. The original script is included in the package of [http://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz IMPUTE2 examples]. This version has been modified to reference the most recent set of 1000 Genome Haplotypes (currently, the Phase I integrated haplotypes [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html available from the IMPUTE2 website]). We recommend to use the "ALL (macGT1)" reference haplotypes and to include the <code>-seed</code> parameter, which ensures results can be reproduced by running the script again.
 +
 
 +
<source lang="bash">
 +
#!/bin/bash
 +
#$ -cwd
 +
 +
CHR=$1
 +
CHUNK_START=`printf "%.0f" $2`
 +
CHUNK_END=`printf "%.0f" $3`
 +
 +
# directories
 +
ROOT_DIR=./
 +
DATA_DIR=${ROOT_DIR}data_files/
 +
RESULTS_DIR=${ROOT_DIR}results/
 +
 +
# executable (must be IMPUTE version 2.2.0 or later)
 +
IMPUTE2_EXEC=${ROOT_DIR}impute2
 +
 +
# parameters
 +
NE=20000
 +
 +
## MODIFY THE FOLLOWING THREE LINES TO ACCOMODATE OTHER PANELS
 +
# reference data files
 +
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
 +
HAPS_FILE=${DATA_DIR}ALL_1000G_phase1integrated_v3_chr${CHR}_impute.hap.gz
 +
LEGEND_FILE=${DATA_DIR}ALL_1000G_phase1integrated_v3_chr${CHR}_impute.legend.gz
 +
 +
## THESE HAPLOTYPES WOULD BE GENERATED BY THE PREVIOUS SCRIPT
 +
## SELECT ONE FROM METHOD-A AND METHOD-B BELOW
 +
# METHOD-A: haplotypes from IMPUTE2 phasing run
 +
GWAS_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps
 +
 
 +
# METHOD-B: haplotypes from SHAPEIT phasing run
 +
GWAS_HAPS_FILE=${RESULTS_DIR}chr${CHR}.haps
 +
 +
# main output file
 +
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.impute2
 +
 +
## impute genotypes from GWAS haplotypes
 +
$IMPUTE2_EXEC \
 +
  -m $GENMAP_FILE \
 +
  -known_haps_g $GWAS_HAPS_FILE \
 +
  -h $HAPS_FILE \
 +
  -l $LEGEND_FILE \
 +
  -Ne $NE \
 +
  -int $CHUNK_START $CHUNK_END \
 +
  -o $OUTPUT_FILE \
 +
  -allow_large_regions \
 +
  -seed 367946
 +
</source>
 +
 
 +
Filtering Option -filt_rules_l is used in this example to remove reference SNPs with MAF=0 in European reference panel (for example, a European cohort).
 +
 
 +
The syntax for starting each of these jobs would be similar for the phasing jobs and, again, you should use a suitable grid or cluster engine to submit multiple jobs in parallel.
 +
 
 +
= X-Chromosome Imputation =
 +
 
 +
Please refer to IMPUTE2 example for chromosome X imputation and IMPUTE2 chromosome X options.  Please note the difference on the pseudoautosomal and non-pseudoautosomal regions.
 +
 
 +
= Association Analysis =
 +
 
 +
If you got this far, the final step is to run SNPTEST or another appropriate tool using the imputation results as input.
 +
 
 +
If you are running SNP test, two useful pieces of advice are to:
 +
 
 +
* Concatenate the dosage files for each chromosome (so that you have a 22 files to manage rather than several hundred).
 +
* Remember that the per SNP imputation accuracy score (IMPUTE's ''INFO'' field) is calculate by SNPTEST and will be present in the SNPTEST output.
550

edits

Navigation menu