Difference between revisions of "EPACTS for DIAGRAM"
Clement Ma (talk | contribs) |
|||
(113 intermediate revisions by 4 users not shown) | |||
Line 1: | Line 1: | ||
= Motivation and Rationale = | = Motivation and Rationale = | ||
− | [[EPACTS|EPACTS]] is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data. The main motivation for using EPACTS is to use a consistent analysis framework for association analysis in the DIAGRAM consortium. In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), standard logistic regression Wald or likelihood ratio tests found in existing association software are conservative or anti-conservative respectively. We implemented two statistical tests | + | [[EPACTS|EPACTS]] is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data. The main motivation for using EPACTS is to use a consistent analysis framework for association analysis in the DIAGRAM consortium. In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), standard logistic regression Wald or likelihood ratio tests found in existing association software are conservative or anti-conservative respectively. We implemented two statistical tests we will use for for analysis of low frequency variants: (1) logistic regresion-based score test and (2) Firth bias-corrected logistic regression [http://www.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf (Firth, 1993)]. For analysis of common variants, any asyptotic logistic regression test has well-controlled type I error rates and asymptotically equivalent power. For simplicity and consistency, we propose the use of both score and Firth tests for testing all allele frequencies. |
= Outline of analysis protocol = | = Outline of analysis protocol = | ||
Line 8: | Line 8: | ||
#Download and install EPACTS | #Download and install EPACTS | ||
− | #Prepare VCF file with genotypes / dosages | + | #Prepare VCF file with genotypes / dosages |
#Prepare PED file with phenotypes and covariates | #Prepare PED file with phenotypes and covariates | ||
#Run EPACTS association pipeline | #Run EPACTS association pipeline | ||
Line 19: | Line 19: | ||
For external users, follow the instruction at [[EPACTS]] page, summarized below. | For external users, follow the instruction at [[EPACTS]] page, summarized below. | ||
− | * | + | *Please download the latest version of EPACTS here: http://csg.sph.umich.edu//kang/epacts/download/ |
− | *Uncompress EPACTS package to the directory you would like to install | + | *Uncompress EPACTS package to the directory you would like to install and then type the following commands |
− | + | > tar xzvf EPACTS-3.0.0.tar.gz | |
− | + | > cd EPACTS-3.0.0 | |
− | + | > ./configure --prefix [INSTALL DIRECTORY] | |
− | + | > make | |
− | + | > make install | |
− | ./ | + | |
*Perform a test run by running the following command | *Perform a test run by running the following command | ||
− | + | EPACTS-3.0.0/example/test_run_epacts.sh | |
=== For Local Users in CSG === | === For Local Users in CSG === | ||
Line 40: | Line 40: | ||
Once installed, test out the software by running a quick example using the test data provided in the "example" directory. The example VCF and PED files are: | Once installed, test out the software by running a quick example using the test data provided in the "example" directory. The example VCF and PED files are: | ||
− | <pre>$ | + | <pre>$ EPACTS-3.0.0/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz |
− | $ | + | $ EPACTS-3.0.0/example/1000G_dummy_pheno.ped |
</pre> | </pre> | ||
<br> Run the single variant score test on the example data using this command: | <br> Run the single variant score test on the example data using this command: | ||
− | <pre>$ | + | <pre>$ EPACTS-3.0.0/epacts single |
--vcf epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz | --vcf epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz | ||
--ped epacts2.1/example/1000G_dummy_pheno.ped | --ped epacts2.1/example/1000G_dummy_pheno.ped | ||
Line 53: | Line 53: | ||
This command will run the single variant test on the input VCF and PED files, with a minimum MAF threshold of 0.001. The phenotype is "DISEASE" and we are adjusting the analysis with covariates AGE and SEX. The output file directory prefix is {OUTPUT_DIR}/test. Finally, EPACTS will run the analysis in parallel on 2 CPUs. | This command will run the single variant test on the input VCF and PED files, with a minimum MAF threshold of 0.001. The phenotype is "DISEASE" and we are adjusting the analysis with covariates AGE and SEX. The output file directory prefix is {OUTPUT_DIR}/test. Finally, EPACTS will run the analysis in parallel on 2 CPUs. | ||
− | A more detailed description of the example can be found [http://genome.sph.umich.edu/wiki/Test_EPACTS_for_DIAGRAM here]. | + | A more detailed description of the example can be found [http://genome.sph.umich.edu/wiki/Test_EPACTS_for_DIAGRAM here]. |
− | == 2. Prepare VCF file with genotypes / dosages == | + | == 2. Prepare VCF file with genotypes / dosages == |
− | EPACTS requires input genotype / | + | EPACTS requires input genotype / dosage information in VCF format. From minimac or Impute2, you wil start with your imputed dosage file. |
− | === A. Convert | + | === A. Convert dosage file into VCF format === |
− | Use the wrapper program "dose2vcf" to convert your doseage output to VCF format. Download | + | Use the wrapper program "dose2vcf" to convert your doseage output to pseudo VCF format. Download the tool from [http://csg.sph.umich.edu//cfuchsb/dose2vcf_v0.5.gz here]. If you used rs numbers during imputation, you can find mapping tables ready for dose2vcf [http://csg.sph.umich.edu//cfuchsb/mapping_rs_ALL.GIANT.phase1_release_v3.20101123.tgz here (214 Mb) ] |
+ | |||
+ | <br> | ||
To run the wrapper program, use the following command | To run the wrapper program, use the following command | ||
Line 69: | Line 71: | ||
<pre>> dose2vcf/dose2vcf --dose FUSION.GWAS.1KG.imp.chr1.dose --info FUSION.GWAS.1KG.imp.chr1.info --out out/FUSION.GWAS.1KG.imp.chr1</pre> | <pre>> dose2vcf/dose2vcf --dose FUSION.GWAS.1KG.imp.chr1.dose --info FUSION.GWAS.1KG.imp.chr1.info --out out/FUSION.GWAS.1KG.imp.chr1</pre> | ||
The expected output file is: | The expected output file is: | ||
− | <pre>out/FUSION.GWAS.1KG.imp.chr1.vcf</pre> | + | <pre>out/FUSION.GWAS.1KG.imp.chr1.vcf |
− | === B. bgzip and tabix VCF files === | + | |
+ | </pre> | ||
+ | Note that for longer chromosomes, the program is quite memory intensive. In this case, please convert dosages in shorter sections of the chromosome. For example, if the imputation was performed by sections, then convert these sections to vcf first, and then merge the vcf files together using vcftools [http://vcftools.sourceforge.net/docs.html#concat vcf-concat]: | ||
+ | |||
+ | === B. bgzip and tabix VCF files === | ||
− | Input VCF file must be bgzipped and tabixed before running association to allow efficient random access of the file. Below is an example command to conver plain VCF into bgzipped and tabixed VCF | + | Input VCF file must be bgzipped and tabixed before running association to allow efficient random access of the file. Below is an example command to conver plain VCF into bgzipped and tabixed VCF |
<pre>bgzip input.vcf ## this command will produce input.vcf.gz | <pre>bgzip input.vcf ## this command will produce input.vcf.gz | ||
tabix -pvcf -f input.vcf.gz ## this command will produce input.vcf.gz.tbi | tabix -pvcf -f input.vcf.gz ## this command will produce input.vcf.gz.tbi | ||
− | </pre> | + | </pre> |
− | If the VCF file is separated by chromosome, the VCF file must contain the string "chr1" in the chromosome 1 file, and corresponding chromosome name for other chromosomes.<br>Sample IDs in the VCF file must be consistent to those from PED file | + | If the VCF file is separated by chromosome, the VCF file must contain the string "chr1" in the chromosome 1 file, and corresponding chromosome name for other chromosomes.<br>Sample IDs in the VCF file must be consistent to those from PED file |
== 3. Prepare PED file for phenotypes and covariates == | == 3. Prepare PED file for phenotypes and covariates == | ||
Line 95: | Line 101: | ||
etc. | etc. | ||
+ | |||
+ | '''Note:''' Categorical covariates must be coded as dummy variables! For example, Sex cannot be coded as "M" or "F". | ||
An example PED file with a header is as follows. Note that the header must start with a "#" symbol. | An example PED file with a header is as follows. Note that the header must start with a "#" symbol. | ||
Line 121: | Line 129: | ||
</pre> | </pre> | ||
The corresponding DAT file is: | The corresponding DAT file is: | ||
− | <pre> | + | <pre>A DISEASE |
− | + | T QT | |
− | + | C AGE</pre> | |
− | </pre> | + | Key: A = binary trait; T = quantitative trait; C = covariate<br> |
== 4. Run EPACTS association pipeline == | == 4. Run EPACTS association pipeline == | ||
− | + | For detailed description of options, use: | |
+ | <pre>EPACTS-3.0.0/epacts single -man | ||
+ | </pre> | ||
+ | <br> | ||
− | {| width=" | + | === Primary analyses (without BMI) [please submit as soon as the analysis is complete] === |
+ | |||
+ | There are '''2''' separate association analyses to be completed '''without adjusting for BMI'''. | ||
+ | |||
+ | {| width="1650" border="1" align="left" cellpadding="1" cellspacing="1" | ||
|- | |- | ||
− | ! scope="col" | Association Analysis | + | ! scope="col" | |
− | ! scope="col" | Statistical Test | + | Association Analysis |
− | ! scope="col" | Subset of SNPs | + | |
− | ! scope="col" | Output File Type | + | ! scope="col" | |
− | ! scope="col" | Output Filename Format | + | Statistical Test |
+ | |||
+ | ! scope="col" | | ||
+ | Subset of SNPs | ||
+ | |||
+ | ! scope="col" | | ||
+ | Output File Type | ||
+ | |||
+ | ! scope="col" | | ||
+ | Output Filename Format | ||
+ | |||
|- | |- | ||
− | | A. Typical DIAGRAM analysis using existing association pipeline | + | | |
− | | Wald or likelihood ratio | + | [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#1._Typical_DIAGRAM_analysis_using_existing_association_pipeline A. Typical DIAGRAM analysis using existing association pipeline] |
− | | All SNPs with | + | |
− | | Custom file based on DIAGRAM format | + | <br> |
− | | | + | |
+ | | | ||
+ | Wald or likelihood ratio | ||
+ | |||
+ | | | ||
+ | All SNPs with | ||
+ | |||
+ | MAC >= 1 | ||
+ | |||
+ | | | ||
+ | Custom file | ||
+ | |||
+ | based on DIAGRAM format | ||
+ | |||
+ | | | ||
+ | DIAGRAMv4_iSNPs_XXX_1000G_KKK_TTT_YYY_ZZZ.txt | ||
+ | |||
+ | <br> | ||
+ | |||
|- | |- | ||
− | | B. Analysis of all SNPs using logistic regression score test | + | | |
− | | Score test | + | [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#2._Analysis_of_low_frequency_variants_using_Firth_bias-corrected_logistic_regression B. Analysis of low frequency variants using Firth bias-corrected logistic regression] |
− | | | + | |
− | | EPACTS output file | + | <br> |
− | | | + | |
+ | | | ||
+ | Firth bias-corrected | ||
+ | |||
+ | | | ||
+ | SNPs with | ||
+ | |||
+ | 200 >= MAC >= 1 | ||
+ | |||
+ | | | ||
+ | EPACTS output file | ||
+ | |||
+ | | | ||
+ | DIAGRAMv4_iSNPs_XXX_1000G_KKK_FBC_YYY_ZZZ.epacts.gz | ||
+ | |||
+ | <br> | ||
+ | |||
+ | |} | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | === Analysis for QC [please submit as soon as the analysis is complete] === | ||
+ | |||
+ | For quality control, please run an additional analysis using EPACTS on all SNPs for chromosome 20 only using the '''SCORE''' test without BMI adjustment. These results will be used to compare with results from the primary analyses, to ensure the new EPACTS software has been run correctly. | ||
+ | |||
+ | <br> | ||
+ | |||
+ | {| width="1650" border="1" cellpadding="1" cellspacing="1" align="center" | ||
+ | |- | ||
+ | | align="center" | '''Association Analysis''' | ||
+ | | align="center" | '''Statistical Test''' | ||
+ | | align="center" | '''Subset of SNPs''' | ||
+ | | align="center" | '''Output File Type''' | ||
+ | | align="center" | '''Output Filename Format''' | ||
+ | |- | ||
+ | | [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#3._Analysis_of_chromosome_20_using_logistic_regression_score_test C. Analysis of chromosome 20 using logistic regression score test] (without BMI) | ||
+ | | Score test | ||
+ | | | ||
+ | Chromosome 20 SNPs | ||
+ | |||
+ | with MAC >= 1 | ||
+ | |||
+ | | EPACTS output file | ||
+ | | A. DIAGRAMv4_iSNPs_XXX_1000G_KKK_SCR_YYY_ZZZ.epacts.gz | ||
+ | |} | ||
+ | |||
+ | <br> | ||
+ | |||
+ | === Secondary analyses (with BMI) [please submit as soon as the analysis is complete] === | ||
+ | |||
+ | There are '''2''' secondary analyses''' adjusting for BMI'''. | ||
+ | |||
+ | {| width="1650" border="1" align="left" cellpadding="1" cellspacing="1" | ||
+ | |- | ||
+ | ! scope="col" | | ||
+ | Association Analysis | ||
+ | |||
+ | ! scope="col" | | ||
+ | Statistical Test | ||
+ | |||
+ | ! scope="col" | | ||
+ | Subset of SNPs | ||
+ | |||
+ | ! scope="col" | | ||
+ | Output File Type | ||
+ | |||
+ | ! scope="col" | | ||
+ | Output Filename Format | ||
+ | |||
+ | |- | ||
+ | | | ||
+ | [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#4._Typical_DIAGRAM_analysis_using_existing_association_pipeline_.28with_BMI.29 D. Typical DIAGRAM analysis using existing association pipeline (with BMI)] | ||
+ | |||
+ | <br> | ||
+ | |||
+ | | | ||
+ | Wald or likelihood ratio | ||
+ | |||
+ | | | ||
+ | All SNPs with | ||
+ | |||
+ | MAC >= 1 | ||
+ | |||
+ | | | ||
+ | Custom file | ||
+ | |||
+ | based on DIAGRAM format | ||
+ | |||
+ | | | ||
+ | DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_TTT_YYY_ZZZ.txt | ||
+ | |||
+ | <br> | ||
+ | |||
|- | |- | ||
− | | | + | | |
− | | Firth bias-corrected | + | [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#5._Analysis_of_low_frequency_variants_using_Firth_bias-corrected_logistic_regression_.28with_BMI.29 E. Analysis of low frequency variants using Firth bias-corrected logistic regression (with BMI)] |
− | | SNPs with 200 >= MAC >= 1 | + | |
− | | EPACTS output file | + | <br> |
− | | | + | |
+ | | | ||
+ | Firth bias-corrected | ||
+ | |||
+ | | | ||
+ | SNPs with | ||
+ | |||
+ | 200 >= MAC >= 1 | ||
+ | |||
+ | | | ||
+ | EPACTS output file | ||
+ | |||
+ | | | ||
+ | DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_FBC_YYY_ZZZ.epacts.gz | ||
+ | |||
+ | <br> | ||
+ | |||
|} | |} | ||
+ | <br> | ||
+ | <br> | ||
+ | <br> | ||
+ | <br> | ||
+ | <br> | ||
+ | <br> | ||
+ | <br> | ||
+ | '''Please send the 2 Primary analyses and the QC analysis when complete.''' | ||
− | + | <br> | |
− | + | ==== Filename key: ==== | |
− | + | '''iSNPs''' indicates the file contains imputed SNPs only<br>'''XXX '''indicates a uniquely identifiable STUDY NAME: (e.g.WTCCC, DGI, DGDG, FUSION, ERGO, DUNDEE, NHS,FHS, TYROL, EUROSPAN etc.)<br>'''“_adjBMI”''' – should be present for the model with BMI adjustment<br>'''KKK '''indicates date of the 1000Genomes map, Month and year (defined as MMMYY) of the 1000Genomes map that was used e.g. MAR12<br>'''TTT '''indicates the test used to evaluate significance | |
− | + | *SCR = Score | |
+ | *WLD = Wald | ||
+ | *LHR = Likelihood ratio | ||
+ | *FBC = Firth Bias Corrected Logistic Regression | ||
− | + | '''YYY '''indicates the DATE of file generation (MMDDYY format, e.g. 021710 – apologies in advance to our European colleagues)<br>'''ZZZ '''indicates the name + other initials of the uploader (e.g., BFV, LJS, ABC, etc.) | |
− | <pre> | + | |
− | -test b. | + | ==== Comparative runtimes for Firth and score tests. ==== |
+ | |||
+ | Firth test is relatively computationally intensive compared to the score test. However, Firth has better test properties (i.e. type I error rates and power) compare to score test when analyzing low frequency variants in studies with unequal numbers of cases and controls. | ||
+ | |||
+ | For example, using a single CPU, analyzing 2333 FUSION individuals, chromosome 20, 22755 SNPs (adjusting for age, sex and 11 dummy variables for birth province): | ||
+ | |||
+ | *Firth test: 56 minutes (3360 seconds) | ||
+ | *Score test: 32 seconds | ||
+ | |||
+ | <br> Hence, we only ask for Firth test results for SNPs with MAC <= 200. | ||
+ | |||
+ | <br> | ||
+ | |||
+ | ---- | ||
+ | |||
+ | === A. Typical DIAGRAM analysis using existing association pipeline<br> === | ||
+ | |||
+ | This is the typical DIAGRAM analysis using your current association pipeline and software. [[Image:1000Genomes_march2012_imputation_analysis_plan_08312012_v2.pdf]] (Updated Dec 14, 2012) | ||
+ | |||
+ | For frequently asked questions regarding the file format, please see: [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications] | ||
+ | |||
+ | ==== Alternative: Analyze VCF and PED files using the Wald test with the EPACTS software: ==== | ||
+ | |||
+ | As preparation for the Firth test analysis, we encourage you to analyze the data using the Wald test first, since it is computationally much faster. This will be a good way to check if your VCF and PED files for every chromosome are correctly formatted for EPACTS and resolve any problems you may have with your imputation or input files. | ||
+ | <pre>EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \ | ||
+ | -test b.wald -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -field EC -run 10 | ||
</pre> | </pre> | ||
− | + | '''Important:''' To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")! | |
− | === | + | === B. Analysis of low frequency variants using Firth bias-corrected logistic regression === |
− | The Firth bias-corrected test has well-controlled type I error rate and good power for analysis of balanced and unbalanced studies. However, it is more computationally intensive. We only run Firth on the subset of variants with 1<= MAC <= 200. | + | The Firth bias-corrected test has well-controlled type I error rate and good power for analysis of balanced and unbalanced studies. However, it is more computationally intensive. We only run Firth on the subset of variants with 1<= MAC <= 200. |
− | To run the Firth test using the EPACTS software: | + | To run the Firth test using the EPACTS software: |
− | <pre> | + | <pre>EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \ |
− | -test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -max-mac 200 -run 10 | + | -test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -max-mac 200 -field EC -run 10 |
</pre> | </pre> | ||
+ | '''Important:''' To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")! | ||
+ | === C. Analysis of chromosome 20 using logistic regression score test === | ||
− | For | + | The score test has well-controlled type I error rate and good power for meta-analysis of balanced (equal numbers of cases and controls) studies. It is also very computationally efficient. Please run the score test using the EPACTS software. |
− | <pre>epacts2.1/epacts single - | + | |
− | </pre> | + | The EPACTS command for the score test analysis of chromosome 20 is: |
+ | <pre>EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \ | ||
+ | -test b.score -pheno DISEASE -cov AGE -chr 20 -anno -min-mac 1 -field EC -run 10 | ||
+ | </pre> | ||
+ | This command will run single variant analysis using the score test logistic regression on the DISEASE phenotype adjusting for AGE. Add the relevant additional covariates with additional "-cov" options. This assumes that the VCF files are separated by chromosomes (option -sepchr). All variants with at least one minor allele count will be analyzed (option -min-mac 1). It will annotate results by functional category (option -anno) and run the analysis on 10 parallel CPUs (option -run 10). | ||
+ | |||
+ | === D. Typical DIAGRAM analysis using existing association pipeline (with BMI)<br> === | ||
+ | |||
+ | This is the typical DIAGRAM analysis using your current association pipeline and software including BMI adjustment. | ||
+ | |||
+ | '''Alternative: Analyze VCF and PED files using the Wald test with the EPACTS software:''' | ||
+ | |||
+ | <br> | ||
+ | |||
+ | === E. Analysis of low frequency variants using Firth bias-corrected logistic regression (with BMI) === | ||
+ | |||
+ | Again use the Firth test on EPACTS for your analysis with BMI | ||
+ | |||
+ | == 5. Report results<br> == | ||
+ | For '''analysis 1''', please follow the following results file guidelines: [[Image:1000Genomes_march2012_imputation_analysis_plan_08312012_v2.pdf]] (Updated Dec 14, 2012) | ||
+ | |||
+ | For frequently asked questions regarding the file format, please see: [http://genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications] | ||
+ | |||
+ | |||
+ | For '''analyses 2 and 3''', please upload the two epacts.gz files to the FTP server: | ||
+ | |||
+ | #'''Firth test (no BMI): ''' DIAGRAMv4_iSNPs_XXX_1000G_KKK_FBC_YYY_ZZZ.epacts.gz | ||
+ | #'''Score test (chr 20, no BMI): '''DIAGRAMv4_iSNPs_XXX_1000G_KKK_SCR_YYY_ZZZ.epacts.gz | ||
+ | |||
+ | <br> | ||
+ | |||
+ | The FTP hostname is: '''ftp.broadinstitute.org'''. Please place your files into to the /incoming/ directory. | ||
+ | |||
+ | |||
+ | |||
+ | Here's an example score test .epacts file | ||
+ | <pre>$ head test.epacts | ||
+ | #CHROM BEGIN END MARKER_ID NS AC CALLRATE MAF PVALUE SCORE N.CASE N.CTRL AF.CASE AF.CTRL | ||
+ | 20 68303 68303 20:68303_A/G_Upstream:DEFB125 266 1 1 0.0018797 NA NA NA NA NA NA | ||
+ | 20 68319 68319 20:68319_C/A_Upstream:DEFB125 266 0 1 0 NA NA NA NA NA NA | ||
+ | 20 68396 68396 20:68396_C/T_Nonsynonymous:DEFB125 266 1 1 0.0018797 NA NA NA NA NA NA | ||
+ | 20 76635 76635 20:76635_A/T_Intron:DEFB125 266 0 1 0 NA NA NA NA NA NA | ||
+ | 20 76689 76689 20:76689_T/C_Synonymous:DEFB125 266 0 1 0 NA NA NA NA NA NA | ||
+ | 20 76690 76690 20:76690_T/C_Nonsynonymous:DEFB125 266 1 1 0.0018797 NA NA NA NA NA NA | ||
+ | 20 76700 76700 20:76700_G/A_Nonsynonymous:DEFB125 266 0 1 0 NA NA NA NA NA NA | ||
+ | 20 76726 76726 20:76726_C/G_Nonsynonymous:DEFB125 266 0 1 0 NA NA NA NA NA NA | ||
+ | 20 76771 76771 20:76771_C/T_Nonsynonymous:DEFB125 266 3 1 0.0056391 0.68484 0.40587 145 121 0.013793 0.0082645 | ||
+ | |||
+ | </pre> | ||
+ | Each column represents | ||
+ | |||
+ | *CHROM : Chromosome Name | ||
+ | *BEGIN, END : Base position of the variant on each side | ||
+ | *MARKER_ID : [CHROM]:[POS]_[REF]/[ALT]_[ANNOTATION] formatted marker ID. [ANNOTATION] information will be available only with --anno option | ||
+ | *NS : Number of samples with non-missing genotypes | ||
+ | *AC : Non-reference allele count | ||
+ | *CALLRATE : Genotype call rate | ||
+ | *MAF : Minor allele frequency | ||
+ | *PVALUE : P-value | ||
+ | |||
+ | The rest of columns varies by statistical tests. For example, in b.score test, SCORE represents score test statistics, N.CASE and N.CTRL represents the case/control counts, and AF.CASE and AF.CTRL represents the case/control allele frequencies. | ||
+ | |||
+ | = Troubleshooting Common Issues = | ||
+ | |||
+ | == EPACTS installation errors == | ||
+ | |||
+ | == Errors when running EPACTS == | ||
+ | |||
+ | === Rscript execution error: No such file or directory === | ||
+ | <pre>epacts2.1/epacts single \ | ||
+ | |||
+ | --vcf epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz \ | ||
+ | --ped epacts2.1/example/1000G_dummy_pheno.ped \ | ||
+ | --min-maf 0.001 --chr 20 --pheno DISEASE --cov AGE --cov SEX --test b.score --anno \ | ||
+ | --out out/test --run 1 & | ||
+ | [4] 13569 | ||
+ | |||
+ | $ Detected phenotypes with 2 unique values - 1 and 2 - considering them as binary phenotypes... re-encoding them into 1 and 2 | ||
+ | Successfully written phenotypes and 2 covariates across 266 individuals | ||
+ | Processing chromosome 20... | ||
+ | Finished generating EPACTS Makefile | ||
+ | Running 1 parallel jobs of EPACTS | ||
+ | epacts2.1/bin/..//bin/make -f out/test.Makefile -j 1 | ||
+ | |||
+ | Rscript execution error: No such file or directory | ||
+ | |||
+ | |||
+ | </pre> | ||
+ | The issue seems to be the absence with Rscript, which is usually installed with R. | ||
+ | |||
+ | If you can find Rscript (e.g. /usr/bin/Rscript, /usr/local/bin/Rscript), or if you can re-install the full Rscript, you can simply avoid the problem by setting your environment variable. | ||
+ | |||
+ | Otherwise, Hyun will modify EPACTS to not requiring this (so you can run R CMD BATCH instead of Rscript). | ||
+ | |||
+ | === ERROR: No overlapping IDs between VCF and PED file. Cannot proceed. === | ||
+ | |||
+ | Check that your individual ID's in your PED file are the same as those in your VCF file. | ||
+ | |||
+ | For example, if your VCF individual ID's include the family ID's (i.e. ABCD->ABCD001), the individual ID's in the PED file must match it exactly. | ||
+ | |||
+ | <br> | ||
+ | |||
+ | === Estimated allele frequencies and analysis results do not exactly match results from my existing association software === | ||
+ | |||
+ | Check that you have included the same set of covariates (with categorical variables encoded as dummy variables). | ||
+ | |||
+ | Check that you have the same number of cases and controls analyzed. | ||
+ | |||
+ | FInally, check that you used dosages by adding the appropriate "-field" option. For example, suppose your VCF is: | ||
− | + | <br> | |
+ | <pre>#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A001 B001 C001 | ||
+ | 11 180567 11:180567 C G 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 | ||
+ | 11 186458 11:186458 G A 0 PASS . GT:EC 1/1:1.9850 1/1:1.9750 1/1:1.9840 | ||
+ | 11 186462 11:186462 C A 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 | ||
+ | 11 192958 11:192958 G T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 | ||
+ | 11 192995 11:192995 C T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 | ||
+ | 11 193065 11:193065 G A 0 PASS . GT:EC 1/1:1.9980 1/1:1.9990 1/1:1.9960 | ||
+ | 11 193096 11:193096 C T 0 PASS . GT:EC 0/1:0.7840 0/1:0.6280 1/1:1.6550 | ||
+ | 11 193146 11:193146 G A 0 PASS . GT:EC 1/1:1.8550 1/1:1.8460 1/1:1.7940 | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
</pre> | </pre> | ||
− | + | <br> The genotype information has FORMAT "GT:EC". For the first SNP (chr11:180567) and individual A001, the genotype is 1/1 and dosage is 2.0000. To access the dosages, you must specify the option "-field EC". | |
+ | |||
+ | |||
+ | |||
+ | == Results FIle Clarifications == | ||
+ | |||
+ | === 1. How do I code the INDEL variant names and alleles? === | ||
+ | |||
+ | Please use the variant name and the allele name directly from IMPUTE or minimac. Please do NOT recode variant names or alleles. We will do this step in the analysis for consistency. | ||
+ | |||
+ | ACTION IF YOU HAVE UPLOADED YOUR FILE: If you have recoded your INDEL alleles, please tell us so we can remove your file and let us know when you can reupload with the original variable and allele names. | ||
+ | |||
+ | === <br>2. The document asks for the number of homozygotes and heterozygotes in case and control. How do I get this from my data? Is this relevant for imputed data? === | ||
+ | |||
+ | These numbers were relevant to genotyped data but not for imputed data. We didn't intend to ask for this. To retain the same file format between results already submitted and those to be submitted please retain the columns with a "." for the value. | ||
+ | |||
+ | ACTION IF YOU HAVE UPLOADED YOUR FILE: No action. You do not need to redo the file. We will skip these columns. | ||
+ | |||
+ | === 3. For the "Imputed" variable, what does imputed mean in the context of the data output from MACH and IMPUTE? === | ||
+ | |||
+ | This is a hold over from the last round of analysis where we asked for results separately from genotyped SNPs and imputed SNPs and wanted to distinguish between the two. We will use r2_hat or info measures to estimate the accuracy of the genotypes. This column will be retained for consistency with files already submitted but should be filled in with "." or "1". It will not be used in the analysis. | ||
+ | |||
+ | ACTION IF YOU HAVE UPLOADED YOUR FILE: No action. You do not need to redo the file. We will skip this column. |
Latest revision as of 10:05, 2 February 2017
Motivation and Rationale
EPACTS is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data. The main motivation for using EPACTS is to use a consistent analysis framework for association analysis in the DIAGRAM consortium. In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), standard logistic regression Wald or likelihood ratio tests found in existing association software are conservative or anti-conservative respectively. We implemented two statistical tests we will use for for analysis of low frequency variants: (1) logistic regresion-based score test and (2) Firth bias-corrected logistic regression (Firth, 1993). For analysis of common variants, any asyptotic logistic regression test has well-controlled type I error rates and asymptotically equivalent power. For simplicity and consistency, we propose the use of both score and Firth tests for testing all allele frequencies.
Outline of analysis protocol
This is an overview of the analysis protocol for analyzing imputed DIAGRAM datasets using the EPACTS pipeline. We assume that your dataset has been imputed using minimac or Impute2. Starting with minimac or impute2 output:
- Download and install EPACTS
- Prepare VCF file with genotypes / dosages
- Prepare PED file with phenotypes and covariates
- Run EPACTS association pipeline
- Report association results in appropriate format
1. Download and install EPACTS
For External Users
For external users, follow the instruction at EPACTS page, summarized below.
- Please download the latest version of EPACTS here: http://csg.sph.umich.edu//kang/epacts/download/
- Uncompress EPACTS package to the directory you would like to install and then type the following commands
> tar xzvf EPACTS-3.0.0.tar.gz > cd EPACTS-3.0.0 > ./configure --prefix [INSTALL DIRECTORY] > make > make install
- Perform a test run by running the following command
EPACTS-3.0.0/example/test_run_epacts.sh
For Local Users in CSG
For users in CSG, EPACTS can be found here:
/net/fantasia/home/hmkang/sw/epacts2.1/
Getting started with an example
Once installed, test out the software by running a quick example using the test data provided in the "example" directory. The example VCF and PED files are:
$ EPACTS-3.0.0/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz $ EPACTS-3.0.0/example/1000G_dummy_pheno.ped
Run the single variant score test on the example data using this command:
$ EPACTS-3.0.0/epacts single --vcf epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz --ped epacts2.1/example/1000G_dummy_pheno.ped --min-maf 0.001 --chr 20 --pheno DISEASE --cov AGE --cov SEX --test b.score --anno --out {OUTPUT_DIR}/test --run 2 &
This command will run the single variant test on the input VCF and PED files, with a minimum MAF threshold of 0.001. The phenotype is "DISEASE" and we are adjusting the analysis with covariates AGE and SEX. The output file directory prefix is {OUTPUT_DIR}/test. Finally, EPACTS will run the analysis in parallel on 2 CPUs.
A more detailed description of the example can be found here.
2. Prepare VCF file with genotypes / dosages
EPACTS requires input genotype / dosage information in VCF format. From minimac or Impute2, you wil start with your imputed dosage file.
A. Convert dosage file into VCF format
Use the wrapper program "dose2vcf" to convert your doseage output to pseudo VCF format. Download the tool from here. If you used rs numbers during imputation, you can find mapping tables ready for dose2vcf here (214 Mb)
To run the wrapper program, use the following command
> dose2vcf/dose2vcf --dose [ .dose file ] --info [ .info file ] --out [ output file prefix ]
For example, for imputed chromosome 1 data from FUSION:
> dose2vcf/dose2vcf --dose FUSION.GWAS.1KG.imp.chr1.dose --info FUSION.GWAS.1KG.imp.chr1.info --out out/FUSION.GWAS.1KG.imp.chr1
The expected output file is:
out/FUSION.GWAS.1KG.imp.chr1.vcf
Note that for longer chromosomes, the program is quite memory intensive. In this case, please convert dosages in shorter sections of the chromosome. For example, if the imputation was performed by sections, then convert these sections to vcf first, and then merge the vcf files together using vcftools vcf-concat:
B. bgzip and tabix VCF files
Input VCF file must be bgzipped and tabixed before running association to allow efficient random access of the file. Below is an example command to conver plain VCF into bgzipped and tabixed VCF
bgzip input.vcf ## this command will produce input.vcf.gz tabix -pvcf -f input.vcf.gz ## this command will produce input.vcf.gz.tbi
If the VCF file is separated by chromosome, the VCF file must contain the string "chr1" in the chromosome 1 file, and corresponding chromosome name for other chromosomes.
Sample IDs in the VCF file must be consistent to those from PED file
3. Prepare PED file for phenotypes and covariates
EPACTS accepts the PED format supported by MERLIN or PLINK to represent the phenotypes and covariates. You may prepare either (1) a PED file without column headers + accompanying DAT file, or (2) a PED file with column headers. The standard PED format has 6 mandatory columns:
- Family ID
- Individual ID
- Paternal ID
- Maternal ID
- Sex (1=male; 2=female; other=unknown)
- Phenotype (1 = control; 2 = case)
Columns 7 and onwards are additonal covariates and or phenotypes. For example
- QT
- AGE
etc.
Note: Categorical covariates must be coded as dummy variables! For example, Sex cannot be coded as "M" or "F".
An example PED file with a header is as follows. Note that the header must start with a "#" symbol.
#FAM_ID IND_ID FAT_ID MOT_ID SEX DISEASE QT AGE 13281 NA12344 NA12347 NA12348 1 1 94.17 66.1 13281 NA12347 0 0 1 1 109.54 44.0 13281 NA12348 0 0 2 2 119.40 46.6 1328 NA06984 0 0 1 2 87.72 39.3 1328 NA06989 0 0 2 1 100.60 41.7 1328 NA12329 NA06984 NA06989 2 1 100.85 46.4 13291 NA06986 0 0 1 2 91.94 61.9 13291 NA06995 NA07435 NA07037 1 2 104.36 57.4 13291 NA06997 NA06986 NA07045 2 2 107.53 53.1
Alternatively, you can prepare a PED file without a header, and include a corresponding DAT file describing the column headers.
13281 NA12344 NA12347 NA12348 1 1 94.17 66.1 13281 NA12347 0 0 1 1 109.54 44.0 13281 NA12348 0 0 2 2 119.40 46.6 1328 NA06984 0 0 1 2 87.72 39.3 1328 NA06989 0 0 2 1 100.60 41.7 1328 NA12329 NA06984 NA06989 2 1 100.85 46.4 13291 NA06986 0 0 1 2 91.94 61.9 13291 NA06995 NA07435 NA07037 1 2 104.36 57.4 13291 NA06997 NA06986 NA07045 2 2 107.53 53.1
The corresponding DAT file is:
A DISEASE T QT C AGE
Key: A = binary trait; T = quantitative trait; C = covariate
4. Run EPACTS association pipeline
For detailed description of options, use:
EPACTS-3.0.0/epacts single -man
Primary analyses (without BMI) [please submit as soon as the analysis is complete]
There are 2 separate association analyses to be completed without adjusting for BMI.
Association Analysis |
Statistical Test |
Subset of SNPs |
Output File Type |
Output Filename Format |
---|---|---|---|---|
A. Typical DIAGRAM analysis using existing association pipeline
|
Wald or likelihood ratio |
All SNPs with MAC >= 1 |
Custom file based on DIAGRAM format |
DIAGRAMv4_iSNPs_XXX_1000G_KKK_TTT_YYY_ZZZ.txt
|
B. Analysis of low frequency variants using Firth bias-corrected logistic regression
|
Firth bias-corrected |
SNPs with 200 >= MAC >= 1 |
EPACTS output file |
DIAGRAMv4_iSNPs_XXX_1000G_KKK_FBC_YYY_ZZZ.epacts.gz
|
Analysis for QC [please submit as soon as the analysis is complete]
For quality control, please run an additional analysis using EPACTS on all SNPs for chromosome 20 only using the SCORE test without BMI adjustment. These results will be used to compare with results from the primary analyses, to ensure the new EPACTS software has been run correctly.
Association Analysis | Statistical Test | Subset of SNPs | Output File Type | Output Filename Format |
C. Analysis of chromosome 20 using logistic regression score test (without BMI) | Score test |
Chromosome 20 SNPs with MAC >= 1 |
EPACTS output file | A. DIAGRAMv4_iSNPs_XXX_1000G_KKK_SCR_YYY_ZZZ.epacts.gz |
Secondary analyses (with BMI) [please submit as soon as the analysis is complete]
There are 2 secondary analyses adjusting for BMI.
Association Analysis |
Statistical Test |
Subset of SNPs |
Output File Type |
Output Filename Format |
---|---|---|---|---|
D. Typical DIAGRAM analysis using existing association pipeline (with BMI)
|
Wald or likelihood ratio |
All SNPs with MAC >= 1 |
Custom file based on DIAGRAM format |
DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_TTT_YYY_ZZZ.txt
|
E. Analysis of low frequency variants using Firth bias-corrected logistic regression (with BMI)
|
Firth bias-corrected |
SNPs with 200 >= MAC >= 1 |
EPACTS output file |
DIAGRAMv4_iSNPs_XXX_adjBMI_1000G_KKK_FBC_YYY_ZZZ.epacts.gz
|
Please send the 2 Primary analyses and the QC analysis when complete.
Filename key:
iSNPs indicates the file contains imputed SNPs only
XXX indicates a uniquely identifiable STUDY NAME: (e.g.WTCCC, DGI, DGDG, FUSION, ERGO, DUNDEE, NHS,FHS, TYROL, EUROSPAN etc.)
“_adjBMI” – should be present for the model with BMI adjustment
KKK indicates date of the 1000Genomes map, Month and year (defined as MMMYY) of the 1000Genomes map that was used e.g. MAR12
TTT indicates the test used to evaluate significance
- SCR = Score
- WLD = Wald
- LHR = Likelihood ratio
- FBC = Firth Bias Corrected Logistic Regression
YYY indicates the DATE of file generation (MMDDYY format, e.g. 021710 – apologies in advance to our European colleagues)
ZZZ indicates the name + other initials of the uploader (e.g., BFV, LJS, ABC, etc.)
Comparative runtimes for Firth and score tests.
Firth test is relatively computationally intensive compared to the score test. However, Firth has better test properties (i.e. type I error rates and power) compare to score test when analyzing low frequency variants in studies with unequal numbers of cases and controls.
For example, using a single CPU, analyzing 2333 FUSION individuals, chromosome 20, 22755 SNPs (adjusting for age, sex and 11 dummy variables for birth province):
- Firth test: 56 minutes (3360 seconds)
- Score test: 32 seconds
Hence, we only ask for Firth test results for SNPs with MAC <= 200.
A. Typical DIAGRAM analysis using existing association pipeline
This is the typical DIAGRAM analysis using your current association pipeline and software. File:1000Genomes march2012 imputation analysis plan 08312012 v2.pdf (Updated Dec 14, 2012)
For frequently asked questions regarding the file format, please see: genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications
Alternative: Analyze VCF and PED files using the Wald test with the EPACTS software:
As preparation for the Firth test analysis, we encourage you to analyze the data using the Wald test first, since it is computationally much faster. This will be a good way to check if your VCF and PED files for every chromosome are correctly formatted for EPACTS and resolve any problems you may have with your imputation or input files.
EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \ -test b.wald -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -field EC -run 10
Important: To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!
B. Analysis of low frequency variants using Firth bias-corrected logistic regression
The Firth bias-corrected test has well-controlled type I error rate and good power for analysis of balanced and unbalanced studies. However, it is more computationally intensive. We only run Firth on the subset of variants with 1<= MAC <= 200.
To run the Firth test using the EPACTS software:
EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \ -test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -max-mac 200 -field EC -run 10
Important: To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!
C. Analysis of chromosome 20 using logistic regression score test
The score test has well-controlled type I error rate and good power for meta-analysis of balanced (equal numbers of cases and controls) studies. It is also very computationally efficient. Please run the score test using the EPACTS software.
The EPACTS command for the score test analysis of chromosome 20 is:
EPACTS-3.0.0/epacts single -vcf [INPUT VCF FILENAME] -ped [INPUT PED FILENAME] -out [OUTPUT FILENAME PREFIX] \ -test b.score -pheno DISEASE -cov AGE -chr 20 -anno -min-mac 1 -field EC -run 10
This command will run single variant analysis using the score test logistic regression on the DISEASE phenotype adjusting for AGE. Add the relevant additional covariates with additional "-cov" options. This assumes that the VCF files are separated by chromosomes (option -sepchr). All variants with at least one minor allele count will be analyzed (option -min-mac 1). It will annotate results by functional category (option -anno) and run the analysis on 10 parallel CPUs (option -run 10).
D. Typical DIAGRAM analysis using existing association pipeline (with BMI)
This is the typical DIAGRAM analysis using your current association pipeline and software including BMI adjustment.
Alternative: Analyze VCF and PED files using the Wald test with the EPACTS software:
E. Analysis of low frequency variants using Firth bias-corrected logistic regression (with BMI)
Again use the Firth test on EPACTS for your analysis with BMI
5. Report results
For analysis 1, please follow the following results file guidelines: File:1000Genomes march2012 imputation analysis plan 08312012 v2.pdf (Updated Dec 14, 2012)
For frequently asked questions regarding the file format, please see: genome.sph.umich.edu/wiki/EPACTS_for_DIAGRAM#Results_FIle_Clarifications
For analyses 2 and 3, please upload the two epacts.gz files to the FTP server:
- Firth test (no BMI): DIAGRAMv4_iSNPs_XXX_1000G_KKK_FBC_YYY_ZZZ.epacts.gz
- Score test (chr 20, no BMI): DIAGRAMv4_iSNPs_XXX_1000G_KKK_SCR_YYY_ZZZ.epacts.gz
The FTP hostname is: ftp.broadinstitute.org. Please place your files into to the /incoming/ directory.
Here's an example score test .epacts file
$ head test.epacts #CHROM BEGIN END MARKER_ID NS AC CALLRATE MAF PVALUE SCORE N.CASE N.CTRL AF.CASE AF.CTRL 20 68303 68303 20:68303_A/G_Upstream:DEFB125 266 1 1 0.0018797 NA NA NA NA NA NA 20 68319 68319 20:68319_C/A_Upstream:DEFB125 266 0 1 0 NA NA NA NA NA NA 20 68396 68396 20:68396_C/T_Nonsynonymous:DEFB125 266 1 1 0.0018797 NA NA NA NA NA NA 20 76635 76635 20:76635_A/T_Intron:DEFB125 266 0 1 0 NA NA NA NA NA NA 20 76689 76689 20:76689_T/C_Synonymous:DEFB125 266 0 1 0 NA NA NA NA NA NA 20 76690 76690 20:76690_T/C_Nonsynonymous:DEFB125 266 1 1 0.0018797 NA NA NA NA NA NA 20 76700 76700 20:76700_G/A_Nonsynonymous:DEFB125 266 0 1 0 NA NA NA NA NA NA 20 76726 76726 20:76726_C/G_Nonsynonymous:DEFB125 266 0 1 0 NA NA NA NA NA NA 20 76771 76771 20:76771_C/T_Nonsynonymous:DEFB125 266 3 1 0.0056391 0.68484 0.40587 145 121 0.013793 0.0082645
Each column represents
- CHROM : Chromosome Name
- BEGIN, END : Base position of the variant on each side
- MARKER_ID : [CHROM]:[POS]_[REF]/[ALT]_[ANNOTATION] formatted marker ID. [ANNOTATION] information will be available only with --anno option
- NS : Number of samples with non-missing genotypes
- AC : Non-reference allele count
- CALLRATE : Genotype call rate
- MAF : Minor allele frequency
- PVALUE : P-value
The rest of columns varies by statistical tests. For example, in b.score test, SCORE represents score test statistics, N.CASE and N.CTRL represents the case/control counts, and AF.CASE and AF.CTRL represents the case/control allele frequencies.
Troubleshooting Common Issues
EPACTS installation errors
Errors when running EPACTS
Rscript execution error: No such file or directory
epacts2.1/epacts single \ --vcf epacts2.1/example/1000G_exome_chr20_example_softFiltered.calls.vcf.gz \ --ped epacts2.1/example/1000G_dummy_pheno.ped \ --min-maf 0.001 --chr 20 --pheno DISEASE --cov AGE --cov SEX --test b.score --anno \ --out out/test --run 1 & [4] 13569 $ Detected phenotypes with 2 unique values - 1 and 2 - considering them as binary phenotypes... re-encoding them into 1 and 2 Successfully written phenotypes and 2 covariates across 266 individuals Processing chromosome 20... Finished generating EPACTS Makefile Running 1 parallel jobs of EPACTS epacts2.1/bin/..//bin/make -f out/test.Makefile -j 1 Rscript execution error: No such file or directory
The issue seems to be the absence with Rscript, which is usually installed with R.
If you can find Rscript (e.g. /usr/bin/Rscript, /usr/local/bin/Rscript), or if you can re-install the full Rscript, you can simply avoid the problem by setting your environment variable.
Otherwise, Hyun will modify EPACTS to not requiring this (so you can run R CMD BATCH instead of Rscript).
ERROR: No overlapping IDs between VCF and PED file. Cannot proceed.
Check that your individual ID's in your PED file are the same as those in your VCF file.
For example, if your VCF individual ID's include the family ID's (i.e. ABCD->ABCD001), the individual ID's in the PED file must match it exactly.
Estimated allele frequencies and analysis results do not exactly match results from my existing association software
Check that you have included the same set of covariates (with categorical variables encoded as dummy variables).
Check that you have the same number of cases and controls analyzed.
FInally, check that you used dosages by adding the appropriate "-field" option. For example, suppose your VCF is:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A001 B001 C001 11 180567 11:180567 C G 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 11 186458 11:186458 G A 0 PASS . GT:EC 1/1:1.9850 1/1:1.9750 1/1:1.9840 11 186462 11:186462 C A 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 11 192958 11:192958 G T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 11 192995 11:192995 C T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 11 193065 11:193065 G A 0 PASS . GT:EC 1/1:1.9980 1/1:1.9990 1/1:1.9960 11 193096 11:193096 C T 0 PASS . GT:EC 0/1:0.7840 0/1:0.6280 1/1:1.6550 11 193146 11:193146 G A 0 PASS . GT:EC 1/1:1.8550 1/1:1.8460 1/1:1.7940
The genotype information has FORMAT "GT:EC". For the first SNP (chr11:180567) and individual A001, the genotype is 1/1 and dosage is 2.0000. To access the dosages, you must specify the option "-field EC".
Results FIle Clarifications
1. How do I code the INDEL variant names and alleles?
Please use the variant name and the allele name directly from IMPUTE or minimac. Please do NOT recode variant names or alleles. We will do this step in the analysis for consistency.
ACTION IF YOU HAVE UPLOADED YOUR FILE: If you have recoded your INDEL alleles, please tell us so we can remove your file and let us know when you can reupload with the original variable and allele names.
2. The document asks for the number of homozygotes and heterozygotes in case and control. How do I get this from my data? Is this relevant for imputed data?
These numbers were relevant to genotyped data but not for imputed data. We didn't intend to ask for this. To retain the same file format between results already submitted and those to be submitted please retain the columns with a "." for the value.
ACTION IF YOU HAVE UPLOADED YOUR FILE: No action. You do not need to redo the file. We will skip these columns.
3. For the "Imputed" variable, what does imputed mean in the context of the data output from MACH and IMPUTE?
This is a hold over from the last round of analysis where we asked for results separately from genotyped SNPs and imputed SNPs and wanted to distinguish between the two. We will use r2_hat or info measures to estimate the accuracy of the genotypes. This column will be retained for consistency with files already submitted but should be filled in with "." or "1". It will not be used in the analysis.
ACTION IF YOU HAVE UPLOADED YOUR FILE: No action. You do not need to redo the file. We will skip this column.