EPACTS for DIAGRAM

From Genome Analysis Wiki
Revision as of 22:41, 27 September 2012 by Clement Ma (talk | contribs)
Jump to navigationJump to search

Motivation and Rationale

EPACTS is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data.  The main motivation for using EPACTS is to use a consistent analysis framework for association analysis in the DIAGRAM consortium.  In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), standard logistic regression Wald or likelihood ratio tests found in existing association software are conservative or anti-conservative respectively.  We implemented two statistical tests recommended for analysis of low frequency variants: (1) logistic regresion-based score test and (2) Firth bias-corrected logistic regression (Firth, 1993).  For analysis of common variants, any asyptotic logistic regression test has well-controlled type I error rates and asymptotically equivalent power.  For simplicity and consistency, we propose the use of both score and Firth tests for testing all allele frequencies.

Outline of analysis protocol

This is an overview of the analysis protocol for analyzing imputed DIAGRAM datasets using the EPACTS pipeline.  We assume that your dataset has been imputed using minimac or Impute2.  Starting with minimac or impute2 output:

  1. Download and install EPACTS
  2. Prepare VCF file with genotypes / dosages
  3. Prepare PED file with phenotypes and covariates
  4. Run EPACTS association pipeline
  5. Report association results in appropriate format

1.  Download and install EPACTS

For external Users

For external users, follow the instruction at EPACTS page, summarized below.

 tar xzvf epacts_v2.1.noref_binary.2012_07_03.tar.gz
  • Download the reference FASTA files by running the following commands
 cd epacts2.1/
./ref_download.sh (Or copy the FASTA and index file locally you have to ${EPACTS_DIR}/ext/ref/)
  • Perform a test run by running the following command
 example/test_run_epacts.sh

For Local Users in CSG

For users in CSG, EPACTS can be found here:

/net/fantasia/home/hmkang/sw/epacts2.1/

Getting started with an example

Once installed, test out the software by running a quick example using the test data provided in the "example" directory. The example VCF and PED files are:

$ epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz

$ epacts2.1/example/1000G_dummy_pheno.ped


Run the single variant score test on the example data using this command:

$ epacts2.1/epacts single 
--vcf epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz 
--ped epacts2.1/example/1000G_dummy_pheno.ped 
--min-maf 0.001 --chr 20 --pheno DISEASE --cov AGE --cov SEX --test b.score --anno 
--out {OUTPUT_DIR}/test --run 2 &

This command will run the single variant test on the input VCF and PED files, with a minimum MAF threshold of 0.001.  The phenotype is "DISEASE" and we are adjusting the analysis with covariates AGE and SEX.  The output file directory prefix is {OUTPUT_DIR}/test.  Finally, EPACTS will run the analysis in parallel on 2 CPUs.

A more detailed description of the example can be found here.

2.  Prepare VCF file with genotypes / dosages

EPACTS requires input genotype / doseage information in VCF format.  From minimac or Impute2, you wil start with your imputed dosage file.

A.  Convert doseage file into VCF format

Use the wrapper program "dose2vcf" to convert your doseage output to VCF format.  Download and install the wrapper program here. [ PLEASE COMPLETE ]

To run the wrapper program, use the following command

> dose2vcf/dose2vcf --dose [ .dose file ]  --info [ .info file ] --out [ output file prefix ]

For example, for imputed chromosome 1 data from FUSION:

> dose2vcf/dose2vcf --dose FUSION.GWAS.1KG.imp.chr1.dose --info FUSION.GWAS.1KG.imp.chr1.info --out out/FUSION.GWAS.1KG.imp.chr1

The expected output file is:

out/FUSION.GWAS.1KG.imp.chr1.vcf

B.  bgzip and tabix VCF files

Input VCF file must be bgzipped and tabixed before running association to allow efficient random access of the file. Below is an example command to conver plain VCF into bgzipped and tabixed VCF

bgzip input.vcf ## this command will produce input.vcf.gz
tabix -pvcf -f input.vcf.gz ## this command will produce input.vcf.gz.tbi

If the VCF file is separated by chromosome, the VCF file must contain the string "chr1" in the chromosome 1 file, and corresponding chromosome name for other chromosomes.
Sample IDs in the VCF file must be consistent to those from PED file

3.  Prepare PED file for phenotypes and covariates

EPACTS accepts the PED format supported by MERLIN or PLINK to represent the phenotypes and covariates.  You may prepare either (1) a PED file without column headers + accompanying DAT file, or (2) a PED file with column headers.  The standard PED format has 6 mandatory columns:

  1. Family ID
  2. Individual ID
  3. Paternal ID
  4. Maternal ID
  5. Sex (1=male; 2=female; other=unknown)
  6. Phenotype (1 = control; 2 = case)

Columns 7 and onwards are additonal covariates and or phenotypes.  For example

  1. QT
  2. AGE

etc.

An example PED file with a header is as follows.  Note that the header must start with a "#" symbol.

#FAM_ID IND_ID FAT_ID MOT_ID SEX DISEASE QT AGE
13281 NA12344 NA12347 NA12348 1 1 94.17 66.1
13281 NA12347 0 0 1 1 109.54 44.0
13281 NA12348 0 0 2 2 119.40 46.6
1328 NA06984 0 0 1 2 87.72 39.3
1328 NA06989 0 0 2 1 100.60 41.7
1328 NA12329 NA06984 NA06989 2 1 100.85 46.4
13291 NA06986 0 0 1 2 91.94 61.9
13291 NA06995 NA07435 NA07037 1 2 104.36 57.4
13291 NA06997 NA06986 NA07045 2 2 107.53 53.1

Alternatively, you can prepare a PED file without a header, and include a corresponding DAT file describing the column headers.

13281 NA12344 NA12347 NA12348 1 1 94.17 66.1
13281 NA12347 0 0 1 1 109.54 44.0
13281 NA12348 0 0 2 2 119.40 46.6
1328 NA06984 0 0 1 2 87.72 39.3
1328 NA06989 0 0 2 1 100.60 41.7
1328 NA12329 NA06984 NA06989 2 1 100.85 46.4
13291 NA06986 0 0 1 2 91.94 61.9
13291 NA06995 NA07435 NA07037 1 2 104.36 57.4
13291 NA06997 NA06986 NA07045 2 2 107.53 53.1

The corresponding DAT file is:

M DISEASE
M QT
M AGE

4.  Run EPACTS association pipeline

The basic EPACTS command for single variant tests is:

epacts2.1/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \
-test b.score -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -run 10

This command will run single variant analysis using the score test logistic regression on the DISEASE phenotype adjusting for AGE.  The VCF files are separated by chromosomes (option -sepchr).  All variants with at least one minor allele count will be analyzed (option -min-mac 1).  It will annotate results by functional category (option -anno) and run the analysis on 10 parallel CPUs (option -run 10).

To run the Firth test as well, use:

epacts2.1/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \
-test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -run 10

Note that the Firth bias-corrected test is more computationally intensive than the score test.

For detailed description of options, use:

epacts2.1/epacts single -man

5.  Report association results in appropriate format

Once the EPACTS pipeline has completed, you will find all of the results in the .epacts.gz file.  Here is an example of score test results:

#CHROM BEGIN END MARKER_ID NS AC CALLRATE MAF PVALUE SCORE
20 68303 68303 20:68303_A/G_Upstream:DEFB125 266 1 1 0.0018797 NA NA
20 68319 68319 20:68319_C/A_Upstream:DEFB125 266 0 1 0 NA NA
20 68396 68396 20:68396_C/T_Nonsynonymous:DEFB125 266 1 1 0.0018797 NA NA
20 76635 76635 20:76635_A/T_Intron:DEFB125 266 0 1 0 NA NA
20 76689 76689 20:76689_T/C_Synonymous:DEFB125 266 0 1 0 NA NA
20 76690 76690 20:76690_T/C_Nonsynonymous:DEFB125 266 1 1 0.0018797 NA NA
20 76700 76700 20:76700_G/A_Nonsynonymous:DEFB125 266 0 1 0 NA NA
20 76726 76726 20:76726_C/G_Nonsynonymous:DEFB125 266 0 1 0 NA NA
20 76771 76771 20:76771_C/T_Nonsynonymous:DEFB125 266 3 1 0.0056391 0.68484 0.40587

You will need to parse the information into the format as required here [NEED FINAL DRAFT OF ASSOCIATION ANALYSIS FILE AND COLUMN SPECIFICATIONS].