Changes

5,087 bytes added , 10:05, 2 February 2017

→‎A. Convert dosage file into VCF format

Line 19: Line 19:

For external users, follow the instructions on the [[EPACTS]] page, summarized below.

−

*~~Download~~ EPACTS ~~binary at~~ http://~~www~~.sph.umich.edu/~~csg~~/kang/epacts/download/~~epacts_v2.12.noref_binary.2012_10_01.tar.gz (101MB)~~

+

*Please download the latest version of EPACTS here:  http://csg.sph.umich.edu//kang/epacts/download/

−

*Uncompress EPACTS package to the directory you would like to install

+

*Uncompress the EPACTS package into the directory where you would like to install it, then type the following commands:

−

tar xzvf ~~epacts_v2~~.12.~~noref_binary.2012_10_01~~.tar.gz

+

>  tar xzvf EPACTS-3.0.0.tar.gz

+

> cd EPACTS-3.0.0

+

> ./configure --prefix [INSTALL DIRECTORY]

+

> make

+

> make install

+

−

*Download the reference FASTA files from 1000 Genomes FTP automatically by running the following commands

−

~~<pre>cd epacts2.1/~~

−

~~./ref_download.sh~~

−

~~(For advanced users, to save time for downloading the FASTA files (~900MB), you may copy a local copy of GRCh37 FASTA file and the index file to ${EPACTS_DIR}/ext/ref/)~~

−

~~</pre>~~

*Perform a test run by running the following command

−

example/test_run_epacts.sh

+

EPACTS-3.0.0/example/test_run_epacts.sh

=== For Local Users in CSG ===

Line 40: Line 40:

Once installed, test out the software by running a quick example using the test data provided in the "example" directory. The example VCF and PED files are:

−

<pre>$ ~~epacts2~~.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz

+

<pre>$ EPACTS-3.0.0/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz

−

$ ~~epacts2~~.1/example/1000G_dummy_pheno.ped

+

$ EPACTS-3.0.0/example/1000G_dummy_pheno.ped

</pre>

Run the single variant score test on the example data using this command:

−

<pre>$ ~~epacts2~~.1/epacts single

+

<pre>$ EPACTS-3.0.0/epacts single

--vcf EPACTS-3.0.0/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz

--ped EPACTS-3.0.0/example/1000G_dummy_pheno.ped

Line 57: Line 57:

== 2.  Prepare VCF file with genotypes / dosages ==

−

EPACTS requires input genotype / ~~doseage~~ information in VCF format.  From minimac or Impute2, you will start with your imputed dosage file.

+

EPACTS requires input genotype / dosage information in VCF format.  From minimac or Impute2, you will start with your imputed dosage file.

=== A.  Convert dosage file into VCF format ===

−

Use the wrapper program "dose2vcf" to convert your dosage output to VCF format.  Download the tool from [http://~~www~~.sph.umich.edu/~~csg~~/cfuchsb/dose2vcf_v0.4.~~tgz~~ here]. If you used rs numbers during imputation, you can find mapping tables here

+

Use the wrapper program "dose2vcf" to convert your dosage output to pseudo VCF format.  Download the tool from [http://csg.sph.umich.edu//cfuchsb/dose2vcf_v0.5.gz here]. If you used rs numbers during imputation, you can find mapping tables ready for dose2vcf [http://csg.sph.umich.edu//cfuchsb/mapping_rs_ALL.GIANT.phase1_release_v3.20101123.tgz here (214 Mb) ]

−

+

To run the wrapper program, use the following command

Line 131: Line 131:

<pre>A DISEASE

T QT

−

C AGE</pre>

+

C AGE</pre>

−

Key:  A = binary trait; T = quantitative trait; C = covariate

+

Key:  A = binary trait; T = quantitative trait; C = covariate

== 4.  Run EPACTS association pipeline ==

For detailed description of options, use:

−

<pre>~~epacts2~~.1/epacts single -man

+

<pre>EPACTS-3.0.0/epacts single -man

</pre>

−

+

−

=== Primary analyses ===

+

=== Primary analyses (without BMI) [please submit as soon as the analysis is complete] ===

−

There are '''4''' separate association analyses to be completed.

+

There are '''2''' separate association analyses to be completed '''without adjusting for BMI'''.

{| width="1650" border="1" align="left" cellpadding="1" cellspacing="1"

Line 164: Line 164:

|-

|

−

[http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#A._Typical_DIAGRAM_analysis_using_existing_association_pipeline 1.  Typical DIAGRAM analysis using existing association pipeline]

+

[http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#1._Typical_DIAGRAM_analysis_using_existing_association_pipeline A.  Typical DIAGRAM analysis using existing association pipeline]

−

~~(with or without BMI)~~

+

|

Line 182: Line 182:

|

−

A. DIAGRAMv4_iSNPs_XXX_1000G_KKK_TTT_YYY_ZZZ.txt

+

DIAGRAMv4_iSNPs_XXX_1000G_KKK_TTT_YYY_ZZZ.txt

−

~~B. DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_TTT_YYY_ZZZ.txt~~

+

|-

|

−

[http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#C._Analysis_of_low_frequency_variants_using_Firth_bias-corrected_logistic_regression 2.  Analysis of low frequency variants using Firth bias-corrected logistic regression]

+

[http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#2._Analysis_of_low_frequency_variants_using_Firth_bias-corrected_logistic_regression B.  Analysis of low frequency variants using Firth bias-corrected logistic regression]

−

~~(with or without BMI)~~

+

|

Line 204: Line 204:

|

−

A. DIAGRAMv4_iSNPs_XXX_1000G_KKK_FBC_YYY_ZZZ.epacts.gz

+

DIAGRAMv4_iSNPs_XXX_1000G_KKK_FBC_YYY_ZZZ.epacts.gz

−

~~B. DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_FBC_YYY_ZZZ.epacts.gz~~

+

|}

Line 224: Line 224:

−

=== Analysis for QC ===

+

=== Analysis for QC [please submit as soon as the analysis is complete] ===

For quality control, please run an additional analysis using EPACTS on all SNPs for chromosome 20 only using the '''SCORE''' test without BMI adjustment.  These results will be used to compare with results from the primary analyses, to ensure the new EPACTS software has been run correctly.

Line 238: Line 238:

| align="center" | '''Output Filename Format'''

|-

−

| [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#3._Analysis_of_chromosome_20_using_logistic_regression_score_test 3.  Analysis of chromosome 20 using logistic regression score test] (without BMI)

+

| [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#3._Analysis_of_chromosome_20_using_logistic_regression_score_test C.  Analysis of chromosome 20 using logistic regression score test] (without BMI)

| Score test

|

Line 248: Line 248:

| A. DIAGRAMv4_iSNPs_XXX_1000G_KKK_SCR_YYY_ZZZ.epacts.gz

|}

+

+

=== Secondary analyses (with BMI) [please submit as soon as the analysis is complete] ===

+

There are '''2''' secondary analyses '''adjusting for BMI'''.

+

{| width="1650" border="1" align="left" cellpadding="1" cellspacing="1"

+

|-

+

! scope="col" |

+

Association Analysis

+

! scope="col" |

+

Statistical Test

+

! scope="col" |

+

Subset of SNPs

+

! scope="col" |

+

Output File Type

+

! scope="col" |

+

Output Filename Format

+

|-

+

|

+

[http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#4._Typical_DIAGRAM_analysis_using_existing_association_pipeline_.28with_BMI.29 D.  Typical DIAGRAM analysis using existing association pipeline (with BMI)]

+

+

|

+

Wald or likelihood ratio

+

|

+

All SNPs with

+

MAC >= 1

+

|

+

Custom file

+

based on DIAGRAM format

+

|

+

DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_TTT_YYY_ZZZ.txt

+

+

|-

+

|

+

[http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#5._Analysis_of_low_frequency_variants_using_Firth_bias-corrected_logistic_regression_.28with_BMI.29 E.  Analysis of low frequency variants using Firth bias-corrected logistic regression (with BMI)]

+

+

|

+

Firth bias-corrected

+

|

+

SNPs with

+

200 >= MAC >= 1

+

|

+

EPACTS output file

+

|

+

DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_FBC_YYY_ZZZ.epacts.gz

+

+

|}

+

+

+

+

+

+

+

+

'''Please send the 2 Primary analyses and the QC analysis when complete.'''

Line 273: Line 360:

Hence, we only ask for Firth test results for SNPs with MAC <= 200.

−

~~=== 1. Typical DIAGRAM analysis using existing association pipeline~~ ~~===~~

+

+

----

−

~~This is the typical~~ DIAGRAM analysis using ~~your current~~ association pipeline ~~and software.   [[Image:1000Genomes march2012 imputation analysis plan 08312012.pdf]]~~

+

=== A. Typical DIAGRAM analysis using existing association pipeline ===

−

=== 2. Analysis of low frequency variants using Firth bias-corrected logistic regression ===

+

This is the typical DIAGRAM analysis using your current association pipeline and software.   [[Image:1000Genomes_march2012_imputation_analysis_plan_08312012_v2.pdf]] (Updated Dec 14, 2012)

+

For frequently asked questions regarding the file format, please see:  [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications]

+

==== Alternative:  Analyze VCF and PED files using the Wald test with the EPACTS software: ====

+

As preparation for the Firth test analysis, we encourage you to analyze the data using the Wald test first, since it is computationally much faster.  This will be a good way to check if your VCF and PED files for every chromosome are correctly formatted for EPACTS and resolve any problems you may have with your imputation or input files.

+

<pre>EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \

+

-test b.wald -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -field EC -run 10

+

</pre>

+

'''Important:''' To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!

+

=== B. Analysis of low frequency variants using Firth bias-corrected logistic regression ===

The Firth bias-corrected test has a well-controlled type I error rate and good power for analysis of balanced and unbalanced studies.  However, it is more computationally intensive.  We only run Firth on the subset of variants with 1 <= MAC <= 200.

To run the Firth test using the EPACTS software:

−

<pre>~~epacts2~~.1/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \

+

<pre>EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \

-test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -max-mac 200 -field EC -run 10

</pre>

−

~~ ~~'''Important:'''  To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option.  Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!~~ ~~

+

'''Important:'''  To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option.  Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!
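The MAC window used for the Firth analysis (1 <= MAC <= 200) can be illustrated with a small shell sketch. It assumes that the minor allele count of an imputed variant is the smaller of the summed dosages and 2N minus that sum, where N is the number of samples; this definition and the function names mac_from_dosages and in_firth_window are illustrative assumptions, not part of EPACTS, which applies the thresholds itself through -min-mac and -max-mac.

```shell
#!/bin/sh
# Sketch only: assumes MAC = min(sum of dosages, 2N - sum of dosages).
# EPACTS applies -min-mac / -max-mac internally; this is for illustration.

# Print the (assumed) minor allele count for a list of per-sample dosages.
mac_from_dosages() {
  printf '%s\n' "$@" | awk '
    { sum += $1; n++ }
    END {
      mac = (sum < 2 * n - sum) ? sum : 2 * n - sum
      printf "%.4f\n", mac
    }'
}

# Succeed (exit 0) when the given MAC lies in the Firth window 1 <= MAC <= 200.
in_firth_window() {
  awk -v mac="$1" 'BEGIN { exit !(mac >= 1 && mac <= 200) }'
}

# Hypothetical dosages for four samples at one variant.
mac=$(mac_from_dosages 2.0000 0.0000 1.0000 0.1000)
if in_firth_window "$mac"; then
  echo "MAC=$mac: include in Firth analysis"
else
  echo "MAC=$mac: outside the Firth window"
fi
```

In a real run no such pre-filtering is needed, since the options -min-mac 1 -max-mac 200 in the command above select the same window.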

−

=== 3. Analysis of chromosome 20 using logistic regression score test ===

+

=== C. Analysis of chromosome 20 using logistic regression score test ===

The score test has a well-controlled type I error rate and good power for meta-analysis of balanced (equal numbers of cases and controls) studies.  It is also very computationally efficient.  Please run the score test using the EPACTS software.

The EPACTS command for the score test analysis of chromosome 20 is:

−

<pre>~~epacts2~~.1/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \

+

<pre>EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \

-test b.score -pheno DISEASE -cov AGE -chr 20 -anno -min-mac 1 -field EC -run 10

</pre>

−

This command will run single variant analysis using the score test logistic regression on the DISEASE phenotype adjusting for AGE. Add the relevant additional covariates with additional "-cov" options. This assumes that the VCF files are separated by chromosomes (option -sepchr). All variants with at least one minor allele count will be analyzed (option -min-mac 1). It will annotate results by functional category (option -anno) and run the analysis on 10 parallel CPUs (option -run 10).

+

This command will run single variant analysis using the score test logistic regression on the DISEASE phenotype adjusting for AGE. Add the relevant additional covariates with additional "-cov" options. This assumes that the VCF files are separated by chromosomes (option -sepchr). All variants with at least one minor allele count will be analyzed (option -min-mac 1). It will annotate results by functional category (option -anno) and run the analysis on 10 parallel CPUs (option -run 10).
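Adding covariates with repeated "-cov" options can be sketched as follows. The helper build_epacts_cmd is hypothetical and only prints the command for inspection; the input file names and the SEX covariate are placeholders, and the remaining options are copied from the score test command above.

```shell
#!/bin/sh
# Sketch only: assemble (but do not execute) an EPACTS single-variant
# command with one "-cov" option per covariate.
# Usage: build_epacts_cmd TEST VCF PED OUT_PREFIX [COVARIATE ...]
build_epacts_cmd() {
  bec_test=$1
  bec_vcf=$2
  bec_ped=$3
  bec_out=$4
  shift 4
  bec_cmd="EPACTS-3.0.0/epacts single -vcf $bec_vcf -ped $bec_ped -out $bec_out -test $bec_test -pheno DISEASE"
  # One -cov option for each remaining argument.
  for bec_cov in "$@"; do
    bec_cmd="$bec_cmd -cov $bec_cov"
  done
  # Remaining options as in the chromosome 20 score test command.
  bec_cmd="$bec_cmd -chr 20 -anno -min-mac 1 -field EC -run 10"
  printf '%s\n' "$bec_cmd"
}

# Dry run with placeholder file names and a hypothetical SEX covariate.
build_epacts_cmd b.score input.vcf.gz input.ped out_prefix AGE SEX
```

Printing the command first makes it easy to confirm that every covariate column named here also appears in the PED file before launching the analysis.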

+

=== D. Typical DIAGRAM analysis using existing association pipeline (with BMI) ===

+

This is the typical DIAGRAM analysis using your current association pipeline and software including BMI adjustment.

+

'''Alternative:  Analyze VCF and PED files using the Wald test with the EPACTS software:'''

+

−

== 5. ~~ Report EPACTS results ~~ ==

+

=== E. Analysis of low frequency variants using Firth bias-corrected logistic regression (with BMI) ===

−

For analyses 2 and 3, please upload the ~~three .~~epacts.gz files to the FTP server:

+

Again, use the Firth test in EPACTS for the analysis with BMI adjustment.

+

== 5.  Report results ==

+

For '''analysis 1''', please follow these results file guidelines:   [[Image:1000Genomes_march2012_imputation_analysis_plan_08312012_v2.pdf]] (Updated Dec 14, 2012)

+

For frequently asked questions regarding the file format, please see:  [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications]

+

For '''analyses 2 and 3''', please upload the two epacts.gz files to the FTP server:

#'''Firth test (no BMI): ''' DIAGRAMv4_iSNPs_XXX_1000G_KKK_FBC_YYY_ZZZ.epacts.gz

−

~~#'''Firth test (with BMI):'''  DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_FBC_YYY_ZZZ.epacts.gz~~

#'''Score test (chr 20, no BMI):  '''DIAGRAMv4_iSNPs_XXX_1000G_KKK_SCR_YYY_ZZZ.epacts.gz

+

The FTP hostname is:  '''ftp.broadinstitute.org'''.  Please place your files into the /incoming/ directory.

+

Here's an example score test .epacts file

Line 332: Line 452:

*PVALUE : P-value

−

The remaining columns vary by statistical test. For example, in the b.score test, SCORE is the score test statistic, N.CASE and N.CTRL are the case/control counts, and AF.CASE and AF.CTRL are the case/control allele frequencies.

+

The remaining columns vary by statistical test. For example, in the b.score test, SCORE is the score test statistic, N.CASE and N.CTRL are the case/control counts, and AF.CASE and AF.CTRL are the case/control allele frequencies.

−

= Troubleshooting Common Issues =

−

== EPACTS installation errors ==

+

== EPACTS installation errors ==

−

== Errors when running EPACTS ==

+

== Errors when running EPACTS ==

=== Rscript execution error: No such file or directory ===

Line 368: Line 484:

If you can find Rscript (e.g. /usr/bin/Rscript, /usr/local/bin/Rscript), or if you can re-install the full Rscript, you can simply avoid the problem by setting your environment variable.

−

Otherwise, Hyun will modify EPACTS to not require this (so you can run R CMD BATCH instead of Rscript).

+

Otherwise, Hyun will modify EPACTS to not require this (so you can run R CMD BATCH instead of Rscript).
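Locating Rscript and exposing it through the environment might look like the sketch below. The text above does not say which environment variable to set, so using PATH is an assumption; find_rscript_dir is a hypothetical helper, and the two directories are the examples mentioned above.

```shell
#!/bin/sh
# Sketch only: assumes that making Rscript visible on PATH is sufficient.

# Print the first directory in the argument list that contains an
# executable Rscript; return non-zero if none does.
find_rscript_dir() {
  for frd_dir in "$@"; do
    if [ -x "$frd_dir/Rscript" ]; then
      printf '%s\n' "$frd_dir"
      return 0
    fi
  done
  return 1
}

# Check the example locations and prepend the first match to PATH.
if rscript_dir=$(find_rscript_dir /usr/bin /usr/local/bin); then
  PATH="$rscript_dir:$PATH"
  export PATH
  echo "Rscript found in $rscript_dir"
else
  echo "Rscript not found in the directories checked" >&2
fi
```

To make the setting persistent, the PATH assignment would typically go into the shell startup file for the account that runs EPACTS.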

=== ERROR: No overlapping IDs between VCF and PED file. Cannot proceed. ===

Line 398: Line 514:

</pre>

−

The genotype information has FORMAT "GT:EC".  For the first SNP (chr11:180567) and individual A001, the genotype is 1/1 and dosage is 2.0000.  To access the dosages, you must specify the option "-field EC"

+

The genotype information has FORMAT "GT:EC".  For the first SNP (chr11:180567) and individual A001, the genotype is 1/1 and dosage is 2.0000.  To access the dosages, you must specify the option "-field EC".
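How the FORMAT string maps onto a per-sample entry can be sketched with a short awk helper. The function name vcf_field is hypothetical, and the sample entry 1/1:2.0000 is reconstructed from the genotype and dosage quoted above for individual A001.

```shell
#!/bin/sh
# Sketch only: look up one FORMAT key in a single per-sample VCF entry.
# $1 = FORMAT string (e.g. GT:EC)
# $2 = sample entry  (e.g. 1/1:2.0000)
# $3 = key to extract (defaults to EC)
vcf_field() {
  awk -v fmt="$1" -v val="$2" -v key="${3:-EC}" 'BEGIN {
    n = split(fmt, keys, ":")
    split(val, vals, ":")
    for (i = 1; i <= n; i++)
      if (keys[i] == key) { print vals[i]; exit 0 }
    exit 1   # key not present in FORMAT
  }'
}

vcf_field GT:EC 1/1:2.0000 EC   # dosage field, used with "-field EC"
vcf_field GT:EC 1/1:2.0000 GT   # hard genotype, the default field
```

The same positional lookup explains why omitting "-field EC" falls back to the hard genotypes: GT is simply a different key in the same colon-separated entry.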

+

== Results FIle Clarifications ==

+

=== 1. How do I code the INDEL variant names and alleles? ===

+

Please use the variant name and the allele name directly from IMPUTE or minimac. Please do NOT recode variant names or alleles. We will do this step in the analysis for consistency.

+

ACTION IF YOU HAVE UPLOADED YOUR FILE: If you have recoded your INDEL alleles, please tell us so we can remove your file, and let us know when you can re-upload with the original variant and allele names.

+

=== 2. The document asks for the number of homozygotes and heterozygotes in case and control. How do I get this from my data? Is this relevant for imputed data? ===

+

These numbers are relevant to genotyped data but not to imputed data. We didn't intend to ask for this. To keep the file format consistent between results already submitted and those still to be submitted, please retain the columns and fill them with "." as the value.

+

ACTION IF YOU HAVE UPLOADED YOUR FILE: No action. You do not need to redo the file. We will skip these columns.

+

=== 3. For the "Imputed" variable, what does imputed mean in the context of the data output from MACH and IMPUTE? ===

+

This is a holdover from the last round of analysis, where we asked for results separately from genotyped SNPs and imputed SNPs and wanted to distinguish between the two. We will use r2_hat or info measures to estimate the accuracy of the genotypes. This column will be retained for consistency with files already submitted but should be filled in with "." or "1". It will not be used in the analysis.

+

ACTION IF YOU HAVE UPLOADED YOUR FILE: No action. You do not need to redo the file. We will skip this column.

Ppwhite

96

edits

Changes

EPACTS for DIAGRAM (view source)

Revision as of 10:05, 2 February 2017
