= Pre-Phasing =
With large reference panels, such as those now produced by the 1000 Genomes Project, it is recommended that imputation be carried out in two steps. In the first, pre-phasing step, haplotypes are estimated for all available samples. In the second step, missing alleles are imputed directly onto these phased haplotypes. This process is computationally efficient and, if the 1000 Genomes Project haplotypes are updated, only the second step must be repeated.

Drs. Bryan Howie and Jonathan Marchini provide examples of this two-step imputation process [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz on their website].

== Sliding Window Analyses ==

The IMPUTE2 authors recommend that chromosomes be analyzed in blocks of a few megabases each. In a cluster environment, these blocks can conveniently be analyzed at the same time. You won't need to physically split the genotype data; instead, you can define the region to be analyzed with the <code>-int ''start'' ''stop''</code> command line option. For example, to analyze a 5-Mb region starting 10 Mb along the chromosome, you would specify <code>-int 10000001 15000000</code>.

There are two small wrinkles to consider. First, if the chunks are all the same size (in Mb), a few chunks may have too many (or too few) SNPs to analyze. Second, in intervals that cross the centromere (which typically corresponds to a large gap in 1000 Genomes SNPs), a few SNPs from one arm may be chunked together with SNPs from the other arm. To address both concerns, we suggest that you merge adjacent chunks when one contains very few (say, fewer than 200) SNPs and that you avoid creating chunks that span the centromere. If a merged chunk extends beyond 7 Mb, you will have to use the <code>-allow_large_regions</code> option when you run IMPUTE2.
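
As a starting point, a minimal sketch like the one below can count SNPs per 5-Mb chunk so that sparse chunks can be spotted and merged. It assumes SNP positions are in the third column, as in standard GEN-format files, and uses the chromosome 1 file name from the scripts below.

<source lang="bash">
# Count SNPs per 5-Mb chunk; chunks with fewer than ~200 SNPs are
# candidates for merging with a neighbouring chunk.
awk '{ chunk = int(($3 - 1) / 5000000); count[chunk]++ }
     END { for (c in count)
               printf "%d\t%d\t%d\n", c * 5000000 + 1, (c + 1) * 5000000, count[c] }' \
    data_files/gwas_data_chr1.gen | sort -n
</source>
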
Centromere locations are available here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
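
In that file, the fifth column is the Giemsa stain annotation and centromere bands are labelled <code>acen</code>, so the centromere coordinates can be listed with, for example:

<source lang="bash">
# Print the centromere bands (stain "acen") for all chromosomes.
zcat cytoBand.txt.gz | awk '$5 == "acen"'
</source>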

== Pre-Phasing using IMPUTE2 ==

You should now be ready to run the first analysis step, which is to estimate phased haplotypes for each sample. The script below, named ''prototype_phasing_job.sh'' (based on the [https://mathgen.stats.ox.ac.uk/impute/prephasing_and_imputation_with_impute2.tgz example script] on the IMPUTE2 website), illustrates how to do this. It requires three parameters (chromosome number, interval start and interval end) and assumes that input data files will be in a directory called <code>data_files</code>, with results written to a <code>results</code> directory and estimated haplotypes to a <code>sampled_haps</code> directory. It has been modified to specify that the build 37 recombination map should be used and to include the <code>-allow_large_regions</code> option.
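
If these directories do not exist yet, they can be created up front, for example:

<source lang="bash">
# Create the directory layout assumed by the scripts on this page.
mkdir -p data_files results sampled_haps
</source>
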
<source lang="bash">
#!/bin/bash
#$ -cwd

CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`

# directories
ROOT_DIR=./
DATA_DIR=${ROOT_DIR}data_files/
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/

# executable
IMPUTE2_EXEC=${ROOT_DIR}impute2

# parameters
K=80
ITER=30
BURNIN=10
NE=11500

# reference data files
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt

# GWAS data files
GWAS_GTYPE_FILE=${DATA_DIR}gwas_data_chr${CHR}.gen

# main output file
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2

## phase GWAS genotypes
$IMPUTE2_EXEC \
    -m $GENMAP_FILE \
    -g $GWAS_GTYPE_FILE \
    -phase \
    -include_buffer_in_output \
    -stage_one \
    -hap_samp_dir $HAP_SAMP_DIR \
    -k $K \
    -iter $ITER \
    -burnin $BURNIN \
    -Ne $NE \
    -int $CHUNK_START $CHUNK_END \
    -o $OUTPUT_FILE \
    -allow_large_regions
</source>

The <code>-allow_large_regions</code> option and the build 37 genetic map are the modifications we made to the original example script. You can then submit batch jobs using SGE (Sun Grid Engine) for parallel analysis. The batch file looks something like:

<source lang="bash">
sge ./prototype_phasing_job.sh 1 1 5000000
sge ./prototype_phasing_job.sh 1 5000001 10000000
sge ./prototype_phasing_job.sh 1 10000001 15000000
</source>

where <code>sge</code> is the SGE command for submitting jobs (it may be <code>qsub</code> on your system), and the three arguments are the chromosome number and the start and end positions of the chunk.
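
Rather than writing these lines by hand, a short loop can generate the submissions for a whole chromosome. This is a sketch; the chromosome length shown is for chromosome 1 in build 37 and would need to be set per chromosome:

<source lang="bash">
# Submit one phasing job per 5-Mb chunk across the whole chromosome.
CHR=1
CHR_LEN=249250621   # length of chromosome 1 in build 37
CHUNK=5000000
for START in $(seq 1 $CHUNK $CHR_LEN); do
    sge ./prototype_phasing_job.sh $CHR $START $((START + CHUNK - 1))
done
</source>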

Timing: with a 30-node cluster, phasing took about 7 hours for 1,400 samples.

== Imputation ==

There are two choices for the imputation step: imputing from best-guess haplotypes or imputing from posterior-sampled haplotypes.

Imputing from best-guess haplotypes uses the best-guess haplotypes output by the phasing step. It is simpler and faster. Refer to the script ''prototype_imputation_job_best_guess_haps.sh'':

<source lang="bash">
#!/bin/bash
#$ -cwd

CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`

# directories
ROOT_DIR=./
DATA_DIR=${ROOT_DIR}data_files/
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/

# executable
IMPUTE2_EXEC=${ROOT_DIR}impute2

# parameters
NE=11500

# reference data files
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
HAPS_FILE=${DATA_DIR}EUR.chr${CHR}.impute.hap
LEGEND_FILE=${DATA_DIR}EUR.chr${CHR}.impute.legend

# best-guess haplotypes from phasing run
GWAS_BG_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps

# main output file
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.best_guess_haps_imputation.impute2

## impute genotypes from best-guess GWAS haplotypes
$IMPUTE2_EXEC \
    -m $GENMAP_FILE \
    -known_haps_g $GWAS_BG_HAPS_FILE \
    -h $HAPS_FILE \
    -l $LEGEND_FILE \
    -Ne $NE \
    -int $CHUNK_START $CHUNK_END \
    -o $OUTPUT_FILE \
    -allow_large_regions \
    -seed 367946
</source>

where the <code>-seed</code> option is used to make sure the imputation results can be reproduced.

Imputing from posterior haplotypes uses the haplotypes sampled during the phasing step to impute the untyped SNPs, then averages across these imputations. With the option <code>-thin 1</code>, this takes roughly N times longer than imputing from best-guess haplotypes, where N = number of iterations - number of burn-in iterations (here, N = 30 - 10 = 20). Refer to the script ''prototype_imputation_job_posterior_sampled_haps.sh'':

<source lang="bash">
#!/bin/bash
#$ -cwd

CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`

# directories
ROOT_DIR=./
DATA_DIR=${ROOT_DIR}data_files/
RESULTS_DIR=${ROOT_DIR}results/
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/

# executable
IMPUTE2_EXEC=${ROOT_DIR}impute2

# parameters
ITER=30
BURNIN=10
THIN=1
NE=11500

# reference data files
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
HAPS_FILE=${DATA_DIR}EUR.chr${CHR}.impute.hap
LEGEND_FILE=${DATA_DIR}EUR.chr${CHR}.impute.legend

# GWAS data files
GWAS_GTYPE_FILE=${DATA_DIR}gwas_data_chr${CHR}.gen

# main output file
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.posterior_sampled_haps_imputation.impute2

## impute genotypes from posterior-sampled GWAS haplotypes
$IMPUTE2_EXEC \
    -m $GENMAP_FILE \
    -g $GWAS_GTYPE_FILE \
    -stage_two \
    -hap_samp_dir $HAP_SAMP_DIR \
    -h $HAPS_FILE \
    -l $LEGEND_FILE \
    -iter $ITER \
    -burnin $BURNIN \
    -thin $THIN \
    -Ne $NE \
    -int $CHUNK_START $CHUNK_END \
    -o $OUTPUT_FILE \
    -allow_large_regions \
    -seed 367946
</source>

Again, submit batch jobs using SGE for parallel analysis.
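
For example, for the first chunk of chromosome 1 (loop over chunks as in the phasing step):

<source lang="bash">
sge ./prototype_imputation_job_best_guess_haps.sh 1 1 5000000
sge ./prototype_imputation_job_posterior_sampled_haps.sh 1 1 5000000
</source>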

Timing: imputing from posterior haplotypes took 46 hours for 1,400 samples with <code>-iter 30</code>, <code>-burnin 10</code> and <code>-thin 1</code>. Imputing from best-guess haplotypes would take only about 2.3 hours (46/20).

If you use the suggested IMPUTE2 parameters, i.e. K=80, ITER=30, BURNIN=10 and NE=11500, the results from best-guess haplotypes and from posterior haplotypes are extremely similar, provided that well-phased haplotypes were generated (see Goncalo's email of 1 April 2011). Therefore, if timing is a concern, imputing from best-guess haplotypes may be the better choice.

== Combining Results for SNPTEST ==

The final step is to combine the imputed results into the format expected by SNPTEST. (To keep our 30-node cluster busy and run SNPTEST in parallel, we combined the imputed results by chromosome and then split each chromosome into 30 partitions, each containing about the same number of SNPs. A SNPTEST run for 1,400 samples then takes less than an hour per partition.)
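
A minimal sketch of this combine-and-split step, assuming the output file names from the scripts above and GNU coreutils <code>split</code> (the partition count of 30 matches our cluster; adjust as needed):

<source lang="bash">
# Concatenate the per-chunk imputation results for chromosome 1.
# Note: the shell glob sorts lexicographically, not numerically; list the
# chunk files explicitly if position order matters downstream.
cat results/gwas_data_chr1.pos*.best_guess_haps_imputation.impute2 \
    > results/gwas_data_chr1.all.impute2

# Split into 30 partitions with roughly equal numbers of SNPs
# ("l/30" splits by lines, so no SNP record is broken in two).
split -n l/30 -d results/gwas_data_chr1.all.impute2 results/gwas_data_chr1.part.
</source>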
