Changes

From Genome Analysis Wiki
Line 97: Line 97:  
     -Ne $NE \
 
     -Ne $NE \
 
     -int $CHUNK_START $CHUNK_END \
 
     -int $CHUNK_START $CHUNK_END \
-o $OUTPUT_FILE \
+
    -o $OUTPUT_FILE \
-allow_large_regions
+
    -allow_large_regions
 
</source>
 
</source>
   −
where the red-coloured text were modified by us to allow large regions and to use build 37 map. Then you can submit batch jobs using SGE (sun grid engine) for parallel analysis.  The batch file is something like
+
where the red-coloured text was modified by us to allow large regions and to use the build 37 map. Then you can submit batch jobs using a grid engine for parallel analysis. The series of commands to run would look something like:
    
<source lang="bash">
 
<source lang="bash">
Line 109: Line 109:  
</source>
 
</source>
   −
where sge is SGE command for submitting jobs (it may be qsub in your system), the three numbers are chr and the pair of chunk region.
+
where <code>sge</code> should be replaced with the appropriate command for submitting jobs to your cluster (<code>sge</code> applies to Sun Grid Engine; other common choices might be <code>qsub</code> and <code>mosrun</code>). The three numbers correspond to the chromosome and the chunk start and end positions.
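As a rough illustration of how such a series of commands could be generated rather than typed by hand, the sketch below prints one submission line per chunk. The helper name <code>print_chunk_jobs</code>, the variables <code>SUBMIT_CMD</code> and <code>JOB_SCRIPT</code>, the phasing script name, and the region length and chunk size in the example call are all assumptions made for illustration; substitute the values that apply to your data and cluster.

```shell
#!/bin/bash
# Hypothetical helper (not part of the IMPUTE2 example scripts):
# print one job submission command per chunk of a chromosome.
# Arguments: chromosome, region length in bp, chunk size in bp.
print_chunk_jobs() {
    local chr=$1 chr_len=$2 chunk_size=$3
    local submit_cmd=${SUBMIT_CMD:-sge}                         # e.g. qsub or mosrun on other systems
    local job_script=${JOB_SCRIPT:-./prototype_phasing_job.sh}  # assumed script name
    local start=1 end
    while [ "$start" -le "$chr_len" ]; do
        end=$(( start + chunk_size - 1 ))
        # The last chunk is truncated at the end of the region.
        if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi
        echo "$submit_cmd $job_script $chr $start $end"
        start=$(( end + 1 ))
    done
}

# Illustrative numbers only: a 12 Mb region of chromosome 22 in 5 Mb chunks.
print_chunk_jobs 22 12000000 5000000
```

The commands are only printed, so the list can be inspected before being piped to a shell or saved as a batch file.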
   −
Timing: with 30 nodes cluster, it took 7 hours for a 1400 samples data.
+
On a 30 node cluster, phasing should take approximately 5 hours per 1000 individuals.
   −
6. Imputation:
+
= Imputation =
   −
There are two choices for imputation step: imputing from best-guess haplotypes, or imputing from posterior haplotypes.
+
There are two choices for the imputation step: imputing from best-guess haplotypes, or imputing from a sample of alternate haplotype configurations.
   −
Imputing from best-guess haplotypes uses the best-guess haplotypes output from the phasing step.  It is simple and faster. Refer to the script “prototype_imputation_job_best_guess_haps.sh”:
+
Imputing from best-guess haplotypes uses the best-guess haplotypes output from the phasing step; it is much faster and we recommend it. Below is a lightly modified version of the script “prototype_imputation_job_best_guess_haps.sh” that accomplishes imputation. The script has been modified to reference the most recent set of 1000 Genomes haplotypes (currently, the interim Phase I haplotypes [http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_interim.html available from the IMPUTE2 website]) and to include the <code>-seed</code> option, which ensures results can be reproduced by running the script again.
    +
<source lang="bash">
 
#!/bin/bash
 
#!/bin/bash
 
#$ -cwd
 
#$ -cwd
Line 138: Line 139:  
NE=11500
 
NE=11500
    +
## MODIFY THE FOLLOWING THREE LINES TO ACCOMMODATE OTHER PANELS
 
# reference data files
 
# reference data files
 
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
 
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
HAPS_FILE=${DATA_DIR}EUR.chr${CHR}.impute.hap
+
HAPS_FILE=${DATA_DIR}ALL_1000G_phase1interim_jun2011_chr${CHR}_impute.hap.gz
LEGEND_FILE=${DATA_DIR}EUR.chr${CHR}.impute.legend
+
LEGEND_FILE=${DATA_DIR}ALL_1000G_phase1interim_jun2011_chr${CHR}_impute.legend
    +
## THESE HAPLOTYPES WOULD BE GENERATED BY THE PREVIOUS SCRIPT
 
# best-guess haplotypes from phasing run
 
# best-guess haplotypes from phasing run
 
GWAS_BG_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps
 
GWAS_BG_HAPS_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.phasing.impute2_haps
Line 151: Line 154:  
## impute genotypes from best-guess GWAS haplotypes
 
## impute genotypes from best-guess GWAS haplotypes
 
$IMPUTE2_EXEC \
 
$IMPUTE2_EXEC \
-m $GENMAP_FILE \
+
  -m $GENMAP_FILE \
-known_haps_g $GWAS_BG_HAPS_FILE \
+
  -known_haps_g $GWAS_BG_HAPS_FILE \
-h $HAPS_FILE \
+
  -h $HAPS_FILE \
-l $LEGEND_FILE \
+
  -l $LEGEND_FILE \
-Ne $NE \
+
  -Ne $NE \
-int $CHUNK_START $CHUNK_END \
+
  -int $CHUNK_START $CHUNK_END \
-o $OUTPUT_FILE \
+
  -o $OUTPUT_FILE \
-allow_large_regions \
+
  -allow_large_regions \
-seed 367946
+
  -seed 367946
 
+
</source>
where option –seed is used to make sure the imputation results can be duplicated.
  −
 
  −
Imputing from posterior haplotypes uses the sampled haplotypes from phasing step to impute the untyped SNPs and then average across these imputations. It would take, for example, N times longer than it would take by imputing from best-guess haplotypes, where N  = No. iterations – No.burnin if you set option “-thin 1”.  Refer to the script “prototype_imputation_job_posterior_sampled_haps.sh”:
  −
 
  −
#!/bin/bash
  −
#$ -cwd
  −
 
  −
CHR=$1
  −
CHUNK_START=`printf "%.0f" $2`
  −
CHUNK_END=`printf "%.0f" $3`
  −
 
  −
# directories
  −
ROOT_DIR=./
  −
DATA_DIR=${ROOT_DIR}data_files/
  −
RESULTS_DIR=${ROOT_DIR}results/
  −
HAP_SAMP_DIR=${ROOT_DIR}sampled_haps/
  −
 
  −
# executable
  −
IMPUTE2_EXEC=${ROOT_DIR}impute2
  −
 
  −
# parameters
  −
ITER=30
  −
BURNIN=10
  −
THIN=1
  −
NE=11500
  −
 
  −
# reference data files
  −
GENMAP_FILE=${DATA_DIR}genetic_map_chr${CHR}_combined_b37.txt
  −
HAPS_FILE=${DATA_DIR}EUR.chr${CHR}.impute.hap
  −
LEGEND_FILE=${DATA_DIR}EUR.chr${CHR}.impute.legend
  −
 
  −
# GWAS data files
  −
GWAS_GTYPE_FILE=${DATA_DIR}gwas_data_chr${CHR}.gen
  −
 
  −
# main output file
  −
OUTPUT_FILE=${RESULTS_DIR}gwas_data_chr${CHR}.pos${CHUNK_START}-${CHUNK_END}.posterior_sampled_haps_imputation.impute2
  −
 
  −
## impute genotypes from posterior-sampled GWAS haplotypes
  −
$IMPUTE2_EXEC \
  −
-m $GENMAP_FILE \
  −
-g $GWAS_GTYPE_FILE \
  −
-stage_two \
  −
-hap_samp_dir $HAP_SAMP_DIR \
  −
-h $HAPS_FILE \
  −
-l $LEGEND_FILE \
  −
-iter $ITER \
  −
-burnin $BURNIN \
  −
-thin $THIN \
  −
-Ne $NE \
  −
-int $CHUNK_START $CHUNK_END \
  −
-o $OUTPUT_FILE \
  −
-allow_large_regions \
  −
-seed 367946
  −
 
     −
Again, submit batch jobs using SGE for parallel analysis.  
+
The syntax for starting each of these jobs would be similar to that for the phasing jobs and, again, you should use a suitable grid or cluster engine to submit multiple jobs in parallel.
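Because each imputation job reads the best-guess haplotypes written by the matching phasing job, it may help to check that the phasing output exists before submitting. The sketch below rebuilds the haplotype file name used in the imputation script and prints the submission command only when that file is present and non-empty. The helper names <code>haps_file_for_chunk</code> and <code>submit_if_phased</code> are hypothetical, and <code>sge</code> stands in for whatever submission command your cluster uses.

```shell
#!/bin/bash
# Directory holding the phasing results (same layout as in the scripts).
RESULTS_DIR=${RESULTS_DIR:-./results/}

# Hypothetical helper: build the best-guess haplotype file name for a chunk.
haps_file_for_chunk() {
    local chr=$1
    local chunk_start chunk_end
    chunk_start=$(printf "%.0f" "$2")   # accepts 5e6 as well as 5000000
    chunk_end=$(printf "%.0f" "$3")
    echo "${RESULTS_DIR}gwas_data_chr${chr}.pos${chunk_start}-${chunk_end}.phasing.impute2_haps"
}

# Hypothetical helper: print the submission command only if phasing output exists.
submit_if_phased() {
    local haps_file
    haps_file=$(haps_file_for_chunk "$1" "$2" "$3")
    if [ -s "$haps_file" ]; then
        echo "sge ./prototype_imputation_job_best_guess_haps.sh $1 $2 $3"
    else
        echo "skipping chr$1 $2-$3: missing $haps_file" >&2
        return 1
    fi
}

# Example call; reports a skipped chunk on stderr if phasing has not been run.
submit_if_phased 22 1 5e6 || true
```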
   −
Timing: it took 46 hours for a 1400 samples data by imputing from posterior haplotypes with setting –iter 30, -burnin 10 and –thin 1. It would take only about 2.3 hours (46/20) if imputing from best-guess haplotypes.
+
Imputing from a sample of alternate haplotype configurations could be achieved by modifying the “prototype_imputation_job_posterior_sampled_haps.sh” script.
   −
If using the suggested parameters for IMPUTE2, ie, K=80, ITER=30, BURNIN=10 and NE=11500, the results using best guess haplotypes and results using posterior haplotypes are extremely similar if well phased haplotypes were generated (see Goncalo’s email on 1st April 2011).  Therefore, if timing is a concern, imputing from best guess haplotypes may be your choice.
+
= Association Analysis =
   −
7. Final step is to combine the imputed results into the shape for SNPTEST. (In order to feed our 30 nodes cluster and do parallel SNPTEST analysis, we combined our imputed results by chromosome and then split each chromosome into 30 partitions. Each partition contains about the same number of SNPs within a chromosome. It takes less than an hour to do a SNPTEST for 1400 samples)
+
If you got this far, the final step is to run SNPTEST or another appropriate tool using the imputation results as input.
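Before running an association tool, the per-chunk imputation outputs generally need to be combined, for example by chromosome. A minimal sketch of that step follows. It assumes the chunk outputs sit in one directory and are named <code>gwas_data_chr&lt;CHR&gt;.pos&lt;START&gt;-&lt;END&gt;.&lt;suffix&gt;.impute2</code>, mirroring the naming in the scripts; the helper name <code>combine_chunks</code> and the combined file name in the usage comment are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: concatenate the per-chunk outputs of one chromosome
# in positional order (sorted numerically by chunk start).
# Arguments: chromosome, results directory (with trailing slash), output file.
combine_chunks() {
    local chr=$1 results_dir=$2 out_file=$3
    local prefix="${results_dir}gwas_data_chr${chr}.pos"
    local f rest
    : > "$out_file"                      # start from an empty output file
    for f in "${prefix}"*.impute2; do
        [ -e "$f" ] || continue          # no matching chunk files
        rest=${f#"$prefix"}              # "START-END.suffix.impute2"
        echo "${rest%%-*} $f"            # sort key (chunk start), then path
    done | sort -n | while read -r _ f; do
        cat "$f" >> "$out_file"
    done
}

# Usage (assumed paths):
# combine_chunks 22 ./results/ ./results/gwas_data_chr22.combined.impute2
```

Sorting on the chunk start matters because a plain glob lists <code>pos10000001</code> before <code>pos5000001</code>, which would leave the combined file out of positional order.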
