Line 1: |
Line 1: |
| = Motivation and Rationale = | | = Motivation and Rationale = |
| | | |
− | [[EPACTS|EPACTS]] is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data. The main motivation for using EPACTS is to use a consistent analysis framework for association analysis in the DIAGRAM consortium. In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), standard logistic regression Wald or likelihood ratio tests found in existing association software are conservative or anti-conservative, respectively. We implemented two statistical tests that we will use for the analysis of low frequency variants: (1) logistic regression-based score test and (2) Firth bias-corrected logistic regression [http://www.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf (Firth, 1993)]. For analysis of common variants, any asymptotic logistic regression test has well-controlled type I error rates and asymptotically equivalent power. For simplicity and consistency, we propose the use of both score and Firth tests for testing variants at all allele frequencies. | + | [[EPACTS|EPACTS]] is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data. The main motivation for using EPACTS is to use a consistent analysis framework for association analysis in the DIAGRAM consortium. In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), standard logistic regression Wald or likelihood ratio tests found in existing association software are conservative or anti-conservative, respectively. We implemented two statistical tests that we will use for the analysis of low frequency variants: (1) logistic regression-based score test and (2) Firth bias-corrected logistic regression [http://www.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf (Firth, 1993)]. 
For analysis of common variants, any asymptotic logistic regression test has well-controlled type I error rates and asymptotically equivalent power. For simplicity and consistency, we propose the use of both score and Firth tests for testing variants at all allele frequencies. |
| | | |
| = Outline of analysis protocol = | | = Outline of analysis protocol = |
Line 126: |
Line 126: |
| M QT | | M QT |
| M AGE | | M AGE |
− | </pre> | + | </pre> |
− | | |
| == 4. Run EPACTS association pipeline == | | == 4. Run EPACTS association pipeline == |
| | | |
Line 265: |
Line 264: |
| *Score test: 32 seconds | | *Score test: 32 seconds |
| | | |
− | <br> Hence, we only ask for Firth test results for SNPs with MAC <= 200. | + | <br> Hence, we only ask for Firth test results for SNPs with MAC <= 200. |
| | | |
| === 1. Typical DIAGRAM analysis using existing association pipeline<br> === | | === 1. Typical DIAGRAM analysis using existing association pipeline<br> === |
Line 279: |
Line 278: |
| -test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -max-mac 200 -field EC -run 10 | | -test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -max-mac 200 -field EC -run 10 |
| </pre> | | </pre> |
− | <br>'''Important:''' To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!<br> | + | <br>'''Important:''' To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!<br> |
| | | |
| === 3. Analysis of chromosome 20 using logistic regression score test === | | === 3. Analysis of chromosome 20 using logistic regression score test === |
Line 289: |
Line 288: |
| -test b.score -pheno DISEASE -cov AGE -chr 20 -anno -min-mac 1 -field EC -run 10 | | -test b.score -pheno DISEASE -cov AGE -chr 20 -anno -min-mac 1 -field EC -run 10 |
| </pre> | | </pre> |
− | This command will run single variant analysis using the logistic regression score test on the DISEASE phenotype, adjusting for AGE. Add any further covariates with additional "-cov" options. This assumes that the VCF files are separated by chromosomes (option -sepchr). All variants with a minor allele count of at least one will be analyzed (option -min-mac 1). It will annotate results by functional category (option -anno) and run the analysis on 10 parallel CPUs (option -run 10). | + | This command will run single variant analysis using the logistic regression score test on the DISEASE phenotype, adjusting for AGE. Add any further covariates with additional "-cov" options. This assumes that the VCF files are separated by chromosomes (option -sepchr). All variants with a minor allele count of at least one will be analyzed (option -min-mac 1). It will annotate results by functional category (option -anno) and run the analysis on 10 parallel CPUs (option -run 10). |
| | | |
| == 5. Report EPACTS results<br> == | | == 5. Report EPACTS results<br> == |
Line 326: |
Line 325: |
| *PVALUE : P-value | | *PVALUE : P-value |
| | | |
− | The remaining columns vary by statistical test. For example, for the b.score test, SCORE is the score test statistic, N.CASE and N.CTRL are the case/control counts, and AF.CASE and AF.CTRL are the case/control allele frequencies. | + | The remaining columns vary by statistical test. For example, for the b.score test, SCORE is the score test statistic, N.CASE and N.CTRL are the case/control counts, and AF.CASE and AF.CTRL are the case/control allele frequencies. |
| + | |
| + | |
| + | |
| + | = Troubleshooting Common Issues = |
| + | |
| + | == ERROR: No overlapping IDs between VCF and PED file. Cannot proceed. == |
| + | |
| + | Check that the individual IDs in your PED file are the same as those in your VCF file.
| + | |
| + | For example, if the individual IDs in your VCF include the family ID (e.g. ABCD->ABCD001), the individual IDs in the PED file must match them exactly.
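The ID comparison described above can be sketched in a few lines of Python. This is only an illustrative check, not part of EPACTS: the function names are hypothetical, the VCF header is assumed to be tab-delimited, and the PED file is assumed to be whitespace-delimited with the individual ID in the second column (adjust `id_column` if your layout differs).

```python
def vcf_sample_ids(vcf_lines):
    """Return the sample IDs listed on the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed VCF fields; sample IDs follow.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def ped_individual_ids(ped_lines, id_column=1):
    """Return individual IDs from PED lines.

    Assumption: whitespace-delimited, individual ID in the second column
    (index 1); header lines starting with '#' and blank lines are skipped.
    """
    ids = []
    for line in ped_lines:
        if not line.strip() or line.startswith("#"):
            continue
        ids.append(line.split()[id_column])
    return ids


def compare_ids(vcf_ids, ped_ids):
    """Report which IDs are shared and which appear in only one file."""
    vcf_set, ped_set = set(vcf_ids), set(ped_ids)
    return {
        "shared": sorted(vcf_set & ped_set),
        "vcf_only": sorted(vcf_set - ped_set),
        "ped_only": sorted(ped_set - vcf_set),
    }


if __name__ == "__main__":
    # Toy input: the VCF IDs carry the family prefix, one PED ID does not.
    vcf = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tABCD001\tABCD002\n"]
    ped = ["ABCD 001 0 0 1 1 50", "ABCD ABCD002 0 0 2 0 61"]
    print(compare_ids(vcf_sample_ids(vcf), ped_individual_ids(ped)))
```

An empty "shared" list corresponds to the error in the heading above; the "vcf_only" and "ped_only" lists usually make the naming mismatch (such as a missing family prefix) easy to spot.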
| + | |
| + | |
| + | |
| + | == Estimated allele frequencies and analysis results do not exactly match results from my existing association software == |
| + | |
| + | Check that you have included the same set of covariates (with categorical variables encoded as dummy variables). |
| + | |
| + | Check that you have the same number of cases and controls analyzed. |
| + | |
| + | Finally, check that you used dosages by adding the appropriate "-field" option. For example, suppose your VCF is:
| + | |
| + | <br> |
| + | <pre>#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A001 B001 C001 |
| + | 11 180567 11:180567 C G 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 |
| + | 11 186458 11:186458 G A 0 PASS . GT:EC 1/1:1.9850 1/1:1.9750 1/1:1.9840 |
| + | 11 186462 11:186462 C A 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 |
| + | 11 192958 11:192958 G T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 |
| + | 11 192995 11:192995 C T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000 |
| + | 11 193065 11:193065 G A 0 PASS . GT:EC 1/1:1.9980 1/1:1.9990 1/1:1.9960 |
| + | 11 193096 11:193096 C T 0 PASS . GT:EC 0/1:0.7840 0/1:0.6280 1/1:1.6550 |
| + | 11 193146 11:193146 G A 0 PASS . GT:EC 1/1:1.8550 1/1:1.8460 1/1:1.7940 |
| + | |
| + | </pre> |
| + | |
| + | |
| + | The genotype information has FORMAT "GT:EC". For the first SNP (chr11:180567) and individual A001, the genotype is 1/1 and the dosage is 2.0000. To access the dosages, you must specify the option "-field EC".
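To see why the choice of field changes the results, the sketch below pulls the GT and EC values out of one VCF record and compares the allele counts they imply. It is a minimal illustration, not EPACTS code: the function names are hypothetical, the record is assumed to be tab-delimited, and the genotype helper simply counts non-reference alleles (missing alleles count as zero).

```python
def extract_format_field(vcf_record, field):
    """Return the per-sample values of one FORMAT key (e.g. 'GT' or 'EC')."""
    columns = vcf_record.rstrip("\n").split("\t")
    format_keys = columns[8].split(":")  # column 9 is FORMAT, e.g. "GT:EC"
    if field not in format_keys:
        raise KeyError(f"field {field!r} not in FORMAT {columns[8]!r}")
    index = format_keys.index(field)
    return [sample.split(":")[index] for sample in columns[9:]]


def alt_allele_count(genotype):
    """Count non-reference alleles in a hard genotype such as '0/1'."""
    alleles = genotype.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


if __name__ == "__main__":
    # The chr11:193096 row from the example VCF, rebuilt with tab separators.
    record = "\t".join([
        "11", "193096", "11:193096", "C", "T", "0", "PASS", ".", "GT:EC",
        "0/1:0.7840", "0/1:0.6280", "1/1:1.6550",
    ])
    hard_calls = extract_format_field(record, "GT")
    dosages = [float(value) for value in extract_format_field(record, "EC")]
    print("allele count from GT:", sum(alt_allele_count(g) for g in hard_calls))
    print("allele count from EC:", sum(dosages))
```

For this record the hard genotypes imply 4 alternate alleles across the three individuals, while the dosages sum to roughly 3.07, so allele frequencies and test statistics computed from the two fields will not agree.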