Changes

From Genome Analysis Wiki
1,671 bytes added ,  14:26, 17 October 2012
no edit summary
Line 1: Line 1:  
= Motivation and Rationale  =
 
= Motivation and Rationale  =
   −
[[EPACTS|EPACTS]] is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data.  The main motivation for using EPACTS is to provide a consistent framework for association analysis in the DIAGRAM consortium.  In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), the standard logistic regression Wald and likelihood ratio tests found in existing association software are conservative and anti-conservative, respectively.  We implemented two statistical tests that we will use for analysis of low frequency variants: (1) a logistic regression-based score test and (2) Firth bias-corrected logistic regression [http://www.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf (Firth, 1993)].  For analysis of common variants, any asymptotic logistic regression test has well-controlled type I error rates and asymptotically equivalent power.  For simplicity and consistency, we propose the use of both score and Firth tests for variants of all allele frequencies.
+
[[EPACTS|EPACTS]] is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data.  The main motivation for using EPACTS is to provide a consistent framework for association analysis in the DIAGRAM consortium.  In addition, for analysis of low frequency variants (minor allele frequency [MAF] < 5%), the standard logistic regression Wald and likelihood ratio tests found in existing association software are conservative and anti-conservative, respectively.  We implemented two statistical tests that we will use for analysis of low frequency variants: (1) a logistic regression-based score test and (2) Firth bias-corrected logistic regression [http://www.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf (Firth, 1993)].  For analysis of common variants, any asymptotic logistic regression test has well-controlled type I error rates and asymptotically equivalent power.  For simplicity and consistency, we propose the use of both score and Firth tests for variants of all allele frequencies.
    
= Outline of analysis protocol  =
 
= Outline of analysis protocol  =
Line 126: Line 126:  
M QT
 
M QT
 
M AGE
 
M AGE
</pre>
+
</pre>  
 
   
== 4. &nbsp;Run EPACTS association pipeline  ==
 
== 4. &nbsp;Run EPACTS association pipeline  ==
   Line 265: Line 264:  
*Score test: &nbsp;32 seconds
 
*Score test: &nbsp;32 seconds
   −
<br> Hence, we only ask for Firth test results for SNPs with MAC &lt;= 200.
+
<br> Hence, we only ask for Firth test results for SNPs with MAC &lt;= 200.  
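As a rough illustration of the MAC cutoff above, the Python sketch below computes a minor allele count from per-sample alternate-allele dosages and checks it against the range 1 to 200. The function names <code>minor_allele_count</code> and <code>needs_firth</code> are hypothetical, and the definition of MAC as the smaller of the alternate and reference allele totals over diploid samples is an assumption; EPACTS's own computation may differ.

```python
def minor_allele_count(dosages):
    """Return the minor allele count for one variant.

    Assumption: samples are diploid and each dosage counts ALT alleles (0..2).
    """
    alt_total = sum(dosages)
    ref_total = 2 * len(dosages) - alt_total
    return min(alt_total, ref_total)


def needs_firth(dosages, min_mac=1, max_mac=200):
    """True if the variant's MAC lies within [min_mac, max_mac]."""
    mac = minor_allele_count(dosages)
    return min_mac <= mac <= max_mac


# Toy dosages for four samples (illustrative values only).
toy = [0, 1, 2, 2]
print(minor_allele_count(toy))
print(needs_firth(toy))
```

The default bounds mirror the "-min-mac 1 -max-mac 200" settings used for the Firth test run; a monomorphic variant (MAC of 0) falls outside the range.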
    
=== 1. Typical DIAGRAM analysis using existing association pipeline<br>  ===
 
=== 1. Typical DIAGRAM analysis using existing association pipeline<br>  ===
Line 279: Line 278:  
-test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -max-mac 200  -field EC -run 10
 
-test b.firth -pheno DISEASE -cov AGE -sepchr -anno -min-mac 1 -max-mac 200  -field EC -run 10
 
</pre>  
 
</pre>  
<br>'''Important:''' &nbsp;To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. &nbsp;Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!<br>
+
<br>'''Important:''' &nbsp;To analyze dosages (not genotypes), you must specify the dosage field with the "--field EC" option. &nbsp;Without this option, you will be analyzing the hard genotypes (i.e. --field option defaults to "GT" or "genotypes")!<br>  
    
=== 3. Analysis of chromosome 20 using logistic regression score test  ===
 
=== 3. Analysis of chromosome 20 using logistic regression score test  ===
Line 289: Line 288:  
-test b.score -pheno DISEASE -cov AGE -chr 20 -anno -min-mac 1 -field EC -run 10
 
-test b.score -pheno DISEASE -cov AGE -chr 20 -anno -min-mac 1 -field EC -run 10
 
</pre>  
 
</pre>  
This command runs single variant analysis using the logistic regression score test on the DISEASE phenotype, adjusting for AGE. Add further covariates with additional "-cov" options. This assumes that the VCF files are separated by chromosome (option -sepchr). All variants with a minor allele count of at least 1 will be analyzed (option -min-mac 1). Results are annotated by functional category (option -anno) and the analysis runs on 10 parallel CPUs (option -run 10).
+
This command runs single variant analysis using the logistic regression score test on the DISEASE phenotype, adjusting for AGE. Add further covariates with additional "-cov" options. This assumes that the VCF files are separated by chromosome (option -sepchr). All variants with a minor allele count of at least 1 will be analyzed (option -min-mac 1). Results are annotated by functional category (option -anno) and the analysis runs on 10 parallel CPUs (option -run 10).
    
== 5. &nbsp;Report EPACTS results<br>  ==
 
== 5. &nbsp;Report EPACTS results<br>  ==
Line 326: Line 325:  
*PVALUE&nbsp;: P-value
 
*PVALUE&nbsp;: P-value
   −
The remaining columns vary by statistical test. For example, for the b.score test, SCORE is the score test statistic, N.CASE and N.CTRL are the case and control counts, and AF.CASE and AF.CTRL are the case and control allele frequencies.
+
The remaining columns vary by statistical test. For example, for the b.score test, SCORE is the score test statistic, N.CASE and N.CTRL are the case and control counts, and AF.CASE and AF.CTRL are the case and control allele frequencies.
 +
 
 +
 
 +
 
 +
= Troubleshooting Common Issues =
 +
 
 +
== ERROR: No overlapping IDs between VCF and PED file. Cannot proceed. ==
 +
 
 +
Check that the individual IDs in your PED&nbsp;file are the same as those in your VCF file.
 +
 
 +
For example, if the individual IDs in your VCF include the family IDs (e.g.&nbsp;ABCD-&gt;ABCD001), the individual IDs in the PED file must match them exactly.
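One way to locate the mismatch is to list the IDs that appear in only one of the two files. The Python sketch below does this on small in-memory examples. It assumes that the PED file is whitespace-delimited, that lines starting with "#" are header lines, and that the individual ID sits in the second column; the helper names and the toy PED rows are hypothetical, so adjust them to your own files.

```python
def vcf_sample_ids(vcf_text):
    """Return sample IDs from the #CHROM header line of a VCF."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            # The first nine columns are fixed VCF fields; samples follow.
            return line.split()[9:]
    return []


def ped_individual_ids(ped_text, id_column=1):
    """Return individual IDs from PED text.

    Assumption: '#' marks header lines and the individual ID is in
    column index 1 (the second column).
    """
    ids = []
    for line in ped_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        ids.append(line.split()[id_column])
    return ids


def compare_ids(vcf_text, ped_text):
    """Return (shared, vcf_only, ped_only) as sorted lists."""
    vcf_ids = set(vcf_sample_ids(vcf_text))
    ped_ids = set(ped_individual_ids(ped_text))
    return (sorted(vcf_ids & ped_ids),
            sorted(vcf_ids - ped_ids),
            sorted(ped_ids - vcf_ids))


# Toy inputs: the PED rows are made up for illustration.
vcf = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A001 B001 C001\n"
ped = ("#FAM_ID IND_ID FAT_ID MOT_ID SEX DISEASE AGE\n"
       "A 001 0 0 1 2 54\n"
       "B B001 0 0 2 1 61\n")
shared, vcf_only, ped_only = compare_ids(vcf, ped)
print("shared:", shared)
print("only in VCF:", vcf_only)
print("only in PED:", ped_only)
```

In this toy case the PED row with individual ID "001" does not match the VCF sample "A001", which is the kind of family-ID prefix mismatch described above.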
 +
 
 +
 
 +
 
 +
== Estimated allele frequencies and analysis results do not exactly match results from my existing association software ==
 +
 
 +
Check that you have included the same set of covariates (with categorical variables encoded as dummy variables).
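For the covariate check, the Python sketch below shows one common way to turn a categorical covariate into 0/1 dummy columns, dropping one reference level. The function name <code>dummy_encode</code> and the study-site covariate are hypothetical; this is only meant to show what "encoded as dummy variables" looks like, not how any particular software does it.

```python
def dummy_encode(values, reference=None):
    """Encode a categorical covariate as 0/1 indicator columns.

    One level (the reference) is dropped so that the indicator columns
    are not collinear with the intercept.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    kept = [level for level in levels if level != reference]
    return {level: [1 if v == level else 0 for v in values] for level in kept}


# Hypothetical covariate: recruitment site for four individuals.
sites = ["site_a", "site_b", "site_c", "site_a"]
for level, column in dummy_encode(sites).items():
    print(level, column)
```

Each resulting column would then be supplied as its own covariate, and the same reference level should be used in both analyses being compared.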
 +
 
 +
Check that you have the same number of cases and controls analyzed.
 +
 
 +
Finally, check that you used dosages by adding the appropriate "-field" option. &nbsp;For example, suppose your VCF is:
 +
 
 +
<br>
 +
<pre>#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A001 B001 C001
 +
11 180567 11:180567 C G 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000
 +
11 186458 11:186458 G A 0 PASS . GT:EC 1/1:1.9850 1/1:1.9750 1/1:1.9840
 +
11 186462 11:186462 C A 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000
 +
11 192958 11:192958 G T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000
 +
11 192995 11:192995 C T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000
 +
11 193065 11:193065 G A 0 PASS . GT:EC 1/1:1.9980 1/1:1.9990 1/1:1.9960
 +
11 193096 11:193096 C T 0 PASS . GT:EC 0/1:0.7840 0/1:0.6280 1/1:1.6550
 +
11 193146 11:193146 G A 0 PASS . GT:EC 1/1:1.8550 1/1:1.8460 1/1:1.7940
 +
 
 +
</pre>
 +
 
 +
 
 +
The genotype information has FORMAT "GT:EC". &nbsp;For the first SNP (chr11:180567) and individual A001, the genotype is 1/1 and the dosage is 2.0000. &nbsp;To access the dosages, you must specify the option "-field EC".
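To see why hard genotypes and dosages can give different allele frequencies, the Python sketch below pulls both fields out of one line of the example VCF (chr11:193096) and compares them. The helper names are hypothetical, the line is split on whitespace as displayed above (real VCF files are tab-delimited, which <code>split()</code> also handles), and the genotype parser assumes biallelic sites.

```python
def extract_field(vcf_line, field):
    """Return the per-sample values of one FORMAT field as strings."""
    columns = vcf_line.split()
    format_keys = columns[8].split(":")
    index = format_keys.index(field)  # raises ValueError if the field is absent
    return [sample.split(":")[index] for sample in columns[9:]]


def hard_call_count(gt):
    """Count ALT alleles in a biallelic genotype such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele == "1")


line = "11 193096 11:193096 C T 0 PASS . GT:EC 0/1:0.7840 0/1:0.6280 1/1:1.6550"
dosages = [float(x) for x in extract_field(line, "EC")]
hard_calls = [hard_call_count(g) for g in extract_field(line, "GT")]
print(dosages)     # [0.784, 0.628, 1.655]
print(hard_calls)  # [1, 1, 2]

# Allele frequency estimates from each field (diploid samples assumed).
n_alleles = 2 * len(dosages)
print("AF from dosages:", sum(dosages) / n_alleles)
print("AF from hard calls:", sum(hard_calls) / n_alleles)
```

For this SNP the hard calls sum to 4 alternate alleles while the dosages sum to about 3.07, so the two estimates differ noticeably even with only three individuals.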