Difference between revisions of "Bar Harbor Statistical Genetics Workshop"

From Genome Analysis Wiki

Revision as of 09:11, 22 July 2010

Summary

This workshop was prepared for the 51st Mammalian Genetics Short Course, at the Jackson Laboratory in Bar Harbor. It illustrates several simple analyses of genetic association studies (we are necessarily limited by time!).

Example Data Set

We will examine three short regions that show evidence for association with glucose levels (Chen et al, Journal of Clinical Investigation, 2008; Prokopenko et al, Nature Genetics, 2009). These regions map near the MTNR1B, GCK and G6PC2 genes. Markers within several hundred kilobases of these regions were genotyped or imputed in the FUSION (Scott et al, Science, 2007), DGI (Diabetes Genetics Initiative, Science, 2007), and SardiNIA (Scuteri et al, PLoS Genetics, 2007) genomewide association studies, and the association results are tabulated in the three files below:

DGI_three_regions.txt
FUSION_three_regions.txt
SARDINIA_three_regions.txt

For convenience, a zip archive that includes all three files is also available.

Generating a Q-Q Plot

One of the most common analyses in a genomewide association study is to generate a Q-Q plot, which compares the observed test statistics with those expected to occur by chance when a similar number of markers is examined. For this exercise, pick one of the three datasets above and load it into an appropriate statistical package (like R). If no suitable statistical package is available, Microsoft Excel will do.

Here are the basic steps you will need to carry out to generate a Q-Q plot:

  • Count the number of markers being examined. Let's call this number M.
  • Sort observed p-values, from smallest to largest.
  • Match each observed p-value with an expected p-value, which is simply its rank divided by M.
  • Plot expected p-values along the X axis and observed p-values along the Y axis.
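
The steps above can be sketched in a few lines of code. The workshop suggests R or Excel; the version below uses plain Python instead, purely as an illustration. The function names (qq_points, to_minus_log10) and the toy p-values are hypothetical and do not come from the workshop files. The -log10 transform is not part of the steps above; it is a common convention for displaying Q-Q plots and is included here only as an optional extra.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) p-value pairs for a Q-Q plot."""
    m = len(p_values)            # M: the number of markers being examined
    observed = sorted(p_values)  # observed p-values, smallest to largest
    # The observed p-value of rank r (1-based) is paired with r / M.
    expected = [rank / m for rank in range(1, m + 1)]
    return list(zip(expected, observed))


def to_minus_log10(points):
    """Optional: put both axes on the -log10 scale (a display convention)."""
    return [(-math.log10(e), -math.log10(o)) for e, o in points]


if __name__ == "__main__":
    # Made-up p-values for illustration only; not taken from the datasets.
    toy_p_values = [0.20, 0.0004, 0.75, 0.03, 0.51]
    for expected, observed in qq_points(toy_p_values):
        print(f"expected={expected:.2f}  observed={observed:.4f}")
```

The first element of each pair goes on the X axis and the second on the Y axis; any plotting tool can draw the pairs as a scatter plot, usually with the line Y = X added for reference.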

Compare your generated Q-Q plot with the three exemplar plots below:

[Image:QQ-plot-null.png] [Image:QQ-plot-problem.png] [Image:QQ-plot-ideal.png]