Difference between revisions of "EPACTS for DIAGRAM"

From Genome Analysis Wiki
Jump to navigationJump to search
Line 8: Line 8:
  
 
#Download and install EPACTS  
 
#Download and install EPACTS  
#Convert the minimac or Impute2 output into VCF format
+
#Prepare VCF file with genotypes / dosages
#Prepre PED file for phenotypes and covariates  
+
#Prepare PED file with phenotypes and covariates  
 
#Run EPACTS association pipeline  
 
#Run EPACTS association pipeline  
 
#Report association results in appropriate format
 
#Report association results in appropriate format
Line 15: Line 15:
 
== 1.  Download and install EPACTS  ==
 
== 1.  Download and install EPACTS  ==
  
=== For external Users ===
+
=== For external Users ===
 +
 
 +
For external users, follow the instruction at [[EPACTS]] page, summarized below.
 +
 
 +
*Download EPACTS binary at http://www.sph.umich.edu/csg/kang/epacts/download/epacts_v2.1.noref_binary.2012_07_03.tar.gz (101MB)
 +
*Uncompress EPACTS package to the directory you would like to install
  
For external users, follow the instruction at [[EPACTS]] page, summarized below.
 
* Download EPACTS binary at http://www.sph.umich.edu/csg/kang/epacts/download/epacts_v2.1.noref_binary.2012_07_03.tar.gz (101MB)
 
* Uncompress EPACTS package to the directory you would like to install 
 
 
   tar xzvf epacts_v2.1.noref_binary.2012_07_03.tar.gz
 
   tar xzvf epacts_v2.1.noref_binary.2012_07_03.tar.gz
* Download the reference FASTA files by running the following commands
+
 
 +
*Download the reference FASTA files by running the following commands
 +
 
 
   cd epacts2.1/
 
   cd epacts2.1/
  ./ref_download.sh (Or copy the FASTA and index file locally you have to ${EPACTS_DIR}/ext/ref/)
+
./ref_download.sh (Or copy the FASTA and index file locally you have to ${EPACTS_DIR}/ext/ref/)
* Perform a test run by running the following command
+
 
 +
*Perform a test run by running the following command
 +
 
 
   example/test_run_epacts.sh
 
   example/test_run_epacts.sh
  
=== For Local Users in CSG ===
+
=== For Local Users in CSG ===
 +
 
 
For users in CSG, EPACTS can be found here:  
 
For users in CSG, EPACTS can be found here:  
 
<pre>/net/fantasia/home/hmkang/sw/epacts2.1/</pre>  
 
<pre>/net/fantasia/home/hmkang/sw/epacts2.1/</pre>  
Line 42: Line 49:
 
--out {OUTPUT_DIR}/test --run 2 &amp;
 
--out {OUTPUT_DIR}/test --run 2 &amp;
 
</pre>  
 
</pre>  
This command will run the single variant test on the input VCF and PED files, with a minimum MAF threshold of 0.001. &nbsp;The phenotype is "DISEASE" and we are adjusting the analysis with covariates AGE and SEX. &nbsp;The output file directory prefix is {OUTPUT_DIR}/test. &nbsp;Finally, EPACTS will run the analysis in parallel on 2 CPUs.
+
This command will run the single variant test on the input VCF and PED files, with a minimum MAF threshold of 0.001. &nbsp;The phenotype is "DISEASE" and we are adjusting the analysis with covariates AGE and SEX. &nbsp;The output file directory prefix is {OUTPUT_DIR}/test. &nbsp;Finally, EPACTS will run the analysis in parallel on 2 CPUs.  
  
== 2. &nbsp;Convert the minimac or Impute2 output into VCF format  ==
+
== 2. &nbsp;Prepare VCF file with genotypes / dosages ==
  
EPACTS requires input genotype / doseage information in VCF format. &nbsp;After imputation with either minimac or Impute2, use this wrapper program "dose2vcf" to convert your doseage output to VCF format.  
+
EPACTS requires input genotype / doseage information in VCF format. &nbsp;From minimac or Impute2, you wil start with your imputed dosage file.
  
Download and install the wrapper program here. [ PLEASE COMPLETE ]
+
=== A. &nbsp;Convert doseage file into VCF format ===
  
<br>
+
Use the wrapper program "dose2vcf" to convert your doseage output to VCF format. &nbsp;Download and install the wrapper program here. [ PLEASE COMPLETE ]
  
 
To run the wrapper program, use the following command  
 
To run the wrapper program, use the following command  
Line 58: Line 65:
 
<pre>&gt; dose2vcf/dose2vcf --dose FUSION.GWAS.1KG.imp.chr1.dose --info FUSION.GWAS.1KG.imp.chr1.info --out out/FUSION.GWAS.1KG.imp.chr1</pre>  
 
<pre>&gt; dose2vcf/dose2vcf --dose FUSION.GWAS.1KG.imp.chr1.dose --info FUSION.GWAS.1KG.imp.chr1.info --out out/FUSION.GWAS.1KG.imp.chr1</pre>  
 
The expected output file is:  
 
The expected output file is:  
<pre>out/FUSION.GWAS.1KG.imp.chr1.vcf</pre>
+
<pre>out/FUSION.GWAS.1KG.imp.chr1.vcf</pre>  
 +
=== B. &nbsp;bgzip and tabix VCF files ===
 +
 
 +
Input VCF file must be bgzipped and tabixed before running association to allow efficient random access of the file. Below is an example command to conver plain VCF into bgzipped and tabixed VCF
 +
<pre>bgzip input.vcf ## this command will produce input.vcf.gz
 +
tabix -pvcf -f input.vcf.gz ## this command will produce input.vcf.gz.tbi
 +
</pre>
 +
If the VCF file is separated by chromosome, the VCF file must contain the string "chr1" in the chromosome 1 file, and corresponding chromosome name for other chromosomes.<br>Sample IDs in the VCF file must be consistent to those from PED file
  
 
== 3. &nbsp;Prepare PED file for phenotypes and covariates  ==
 
== 3. &nbsp;Prepare PED file for phenotypes and covariates  ==
Line 78: Line 92:
 
etc.  
 
etc.  
  
An example PED file with a header is as follows. &nbsp;Note that the header must start with a "#" symbol.
+
An example PED file with a header is as follows. &nbsp;Note that the header must start with a "#" symbol.  
 
<pre>#FAM_ID IND_ID FAT_ID MOT_ID SEX DISEASE QT AGE
 
<pre>#FAM_ID IND_ID FAT_ID MOT_ID SEX DISEASE QT AGE
 
13281 NA12344 NA12347 NA12348 1 1 94.17 66.1
 
13281 NA12344 NA12347 NA12348 1 1 94.17 66.1
Line 106: Line 120:
 
M QT
 
M QT
 
M AGE
 
M AGE
</pre>
+
</pre>  
 
 
 
== 4. &nbsp;Run EPACTS association pipeline  ==
 
== 4. &nbsp;Run EPACTS association pipeline  ==
  
Line 118: Line 131:
 
To run the Firth test as well, use:  
 
To run the Firth test as well, use:  
 
<pre>epacts2.1/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \
 
<pre>epacts2.1/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \
-test b.firth -pheno DISEASE -cov AGE -sepchr -anno -run 10
+
-test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -run 10
 
</pre>  
 
</pre>  
 
Note that the Firth bias-corrected test is more computationally intensive than the score test.  
 
Note that the Firth bias-corrected test is more computationally intensive than the score test.  
Line 124: Line 137:
 
For detailed description of options, use:  
 
For detailed description of options, use:  
 
<pre>epacts2.1/epacts single -man
 
<pre>epacts2.1/epacts single -man
</pre>
+
</pre>  
 
 
 
== 5. &nbsp;Report association results in appropriate format<br>  ==
 
== 5. &nbsp;Report association results in appropriate format<br>  ==
  

Revision as of 22:36, 27 September 2012

Motivation and Rationale

EPACTS is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data.  The main motivation for using EPACTS is to use a consistent analysis framework for association analysis in the DIAGRAM consortium.  In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), standard logistic regression Wald or likelihood ratio tests found in existing association software are conservative or anti-conservative respectively.  We implemented two statistical tests recommended for analysis of low frequency variants: (1) logistic regresion-based score test and (2) Firth bias-corrected logistic regression (Firth, 1993).  For analysis of common variants, any asyptotic logistic regression test has well-controlled type I error rates and asymptotically equivalent power.  For simplicity and consistency, we propose the use of both score and Firth tests for testing all allele frequencies.

Outline of analysis protocol

This is an overview of the analysis protocol for analyzing imputed DIAGRAM datasets using the EPACTS pipeline.  We assume that your dataset has been imputed using minimac or Impute2.  Starting with minimac or impute2 output:

  1. Download and install EPACTS
  2. Prepare VCF file with genotypes / dosages
  3. Prepare PED file with phenotypes and covariates
  4. Run EPACTS association pipeline
  5. Report association results in appropriate format

1.  Download and install EPACTS

For external Users

For external users, follow the instruction at EPACTS page, summarized below.

 tar xzvf epacts_v2.1.noref_binary.2012_07_03.tar.gz
  • Download the reference FASTA files by running the following commands
 cd epacts2.1/
./ref_download.sh (Or copy the FASTA and index file locally you have to ${EPACTS_DIR}/ext/ref/)
  • Perform a test run by running the following command
 example/test_run_epacts.sh

For Local Users in CSG

For users in CSG, EPACTS can be found here:

/net/fantasia/home/hmkang/sw/epacts2.1/


Once installed, test out the software by running a quick example using the test data provided in the "example" directory. The example VCF and PED files are:

$ epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz

$ epacts2.1/example/1000G_dummy_pheno.ped


Run the single variant score test on the example data using this command:

$ epacts2.1/epacts single 
--vcf epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz 
--ped epacts2.1/example/1000G_dummy_pheno.ped 
--min-maf 0.001 --chr 20 --pheno DISEASE --cov AGE --cov SEX --test b.score --anno 
--out {OUTPUT_DIR}/test --run 2 &

This command will run the single variant test on the input VCF and PED files, with a minimum MAF threshold of 0.001.  The phenotype is "DISEASE" and we are adjusting the analysis with covariates AGE and SEX.  The output file directory prefix is {OUTPUT_DIR}/test.  Finally, EPACTS will run the analysis in parallel on 2 CPUs.

2.  Prepare VCF file with genotypes / dosages

EPACTS requires input genotype / doseage information in VCF format.  From minimac or Impute2, you wil start with your imputed dosage file.

A.  Convert doseage file into VCF format

Use the wrapper program "dose2vcf" to convert your doseage output to VCF format.  Download and install the wrapper program here. [ PLEASE COMPLETE ]

To run the wrapper program, use the following command

> dose2vcf/dose2vcf --dose [ .dose file ]  --info [ .info file ] --out [ output file prefix ]

For example, for imputed chromosome 1 data from FUSION:

> dose2vcf/dose2vcf --dose FUSION.GWAS.1KG.imp.chr1.dose --info FUSION.GWAS.1KG.imp.chr1.info --out out/FUSION.GWAS.1KG.imp.chr1

The expected output file is:

out/FUSION.GWAS.1KG.imp.chr1.vcf

B.  bgzip and tabix VCF files

Input VCF file must be bgzipped and tabixed before running association to allow efficient random access of the file. Below is an example command to conver plain VCF into bgzipped and tabixed VCF

bgzip input.vcf ## this command will produce input.vcf.gz
tabix -pvcf -f input.vcf.gz ## this command will produce input.vcf.gz.tbi

If the VCF file is separated by chromosome, the VCF file must contain the string "chr1" in the chromosome 1 file, and corresponding chromosome name for other chromosomes.
Sample IDs in the VCF file must be consistent to those from PED file

3.  Prepare PED file for phenotypes and covariates

EPACTS accepts the PED format supported by MERLIN or PLINK to represent the phenotypes and covariates.  You may prepare either (1) a PED file without column headers + accompanying DAT file, or (2) a PED file with column headers.  The standard PED format has 6 mandatory columns:

  1. Family ID
  2. Individual ID
  3. Paternal ID
  4. Maternal ID
  5. Sex (1=male; 2=female; other=unknown)
  6. Phenotype

Columns 7 and onwards are additonal covariates and or phenotypes.  For example

  1. QT
  2. AGE

etc.

An example PED file with a header is as follows.  Note that the header must start with a "#" symbol.

#FAM_ID IND_ID FAT_ID MOT_ID SEX DISEASE QT AGE
13281 NA12344 NA12347 NA12348 1 1 94.17 66.1
13281 NA12347 0 0 1 1 109.54 44.0
13281 NA12348 0 0 2 2 119.40 46.6
1328 NA06984 0 0 1 2 87.72 39.3
1328 NA06989 0 0 2 1 100.60 41.7
1328 NA12329 NA06984 NA06989 2 1 100.85 46.4
13291 NA06986 0 0 1 2 91.94 61.9
13291 NA06995 NA07435 NA07037 1 2 104.36 57.4
13291 NA06997 NA06986 NA07045 2 2 107.53 53.1

Alternatively, you can prepare a PED file without a header, and include a corresponding DAT file describing the column headers.

13281 NA12344 NA12347 NA12348 1 1 94.17 66.1
13281 NA12347 0 0 1 1 109.54 44.0
13281 NA12348 0 0 2 2 119.40 46.6
1328 NA06984 0 0 1 2 87.72 39.3
1328 NA06989 0 0 2 1 100.60 41.7
1328 NA12329 NA06984 NA06989 2 1 100.85 46.4
13291 NA06986 0 0 1 2 91.94 61.9
13291 NA06995 NA07435 NA07037 1 2 104.36 57.4
13291 NA06997 NA06986 NA07045 2 2 107.53 53.1

The corresponding DAT file is:

M DISEASE
M QT
M AGE

4.  Run EPACTS association pipeline

The basic EPACTS command for single variant tests is:

epacts2.1/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \
-test b.score -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -run 10

This command will run single variant analysis using the score test logistic regression on the DISEASE phenotype adjusting for AGE.  The VCF files are separated by chromosomes (option -sepchr).  All variants with at least one minor allele count will be analyzed (option -min-mac 1).  It will annotate results by functional category (option -anno) and run the analysis on 10 parallel CPUs (option -run 10).

To run the Firth test as well, use:

epacts2.1/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \
-test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -run 10

Note that the Firth bias-corrected test is more computationally intensive than the score test.

For detailed description of options, use:

epacts2.1/epacts single -man

5.  Report association results in appropriate format

Once the EPACTS pipeline has completed, you will find all of the results in the .epacts.gz file.  Here is an example of score test results:

#CHROM BEGIN END MARKER_ID NS AC CALLRATE MAF PVALUE SCORE
20 68303 68303 20:68303_A/G_Upstream:DEFB125 266 1 1 0.0018797 NA NA
20 68319 68319 20:68319_C/A_Upstream:DEFB125 266 0 1 0 NA NA
20 68396 68396 20:68396_C/T_Nonsynonymous:DEFB125 266 1 1 0.0018797 NA NA
20 76635 76635 20:76635_A/T_Intron:DEFB125 266 0 1 0 NA NA
20 76689 76689 20:76689_T/C_Synonymous:DEFB125 266 0 1 0 NA NA
20 76690 76690 20:76690_T/C_Nonsynonymous:DEFB125 266 1 1 0.0018797 NA NA
20 76700 76700 20:76700_G/A_Nonsynonymous:DEFB125 266 0 1 0 NA NA
20 76726 76726 20:76726_C/G_Nonsynonymous:DEFB125 266 0 1 0 NA NA
20 76771 76771 20:76771_C/T_Nonsynonymous:DEFB125 266 3 1 0.0056391 0.68484 0.40587

You will need to parse the information into the format as required here [NEED FINAL DRAFT OF ASSOCIATION ANALYSIS FILE AND COLUMN SPECIFICATIONS].