Bar Harbor Statistical Genetics Workshop
This workshop was prepared for the 51st Mammalian Genetics Short Course, at the Jackson Laboratory in Bar Harbor. It illustrates several simple analyses of genetic association studies (we are necessarily limited by time!).
Glucose Data Set
This example examines evidence for association between fasting glucose levels and genetic markers in the G6PC2 (chr. 2), GCK (chr. 7) and MTNR1B (chr. 11) regions. It uses results from 3 genomewide association studies: FUSION, SardiNIA and DGI. Genetic variants in the three loci impact fasting glucose levels and, in the case of MTNR1B, also impact the risk of type 2 diabetes.
We will examine evidence for association in three short regions that show evidence for association with glucose levels (Chen et al, Journal of Clinical Investigation, 2008; Prokopenko et al, Nature Genetics, 2009). These regions map near the G6PC2 (chr. 2), GCK (chr. 7) and MTNR1B (chr. 11) genes. Markers within several hundred kilobases of these regions were genotyped or imputed in the FUSION (Scott et al, Science, 2007), DGI (Diabetes Genetics Initiative, Science, 2007), and SardiNIA (Scuteri et al, PLoS Genetics, 2007) genomewide association studies, and the association results are tabulated in the three files below:
For convenience, a zip archive that includes all three files is also available.
Generating a Q-Q Plot
One of the most common analyses in a genomewide association study is to generate a Q-Q plot, which compares observed test statistics with those expected to occur by chance when a similar number of markers is examined. The approach works because in a GWAS we expect that the vast majority of variants will show no evidence of association with the trait of interest. For this exercise, you should pick one of the three datasets above and load it into an appropriate statistical package (like R). If no suitable statistical package is available, Microsoft Excel will do.
Here are the basic steps you will need to carry out to generate a Q-Q plot:
- Count the number of markers being examined. Let's call this number M.
- Sort observed p-values, from smallest to largest.
- Match each observed p-value with an expected p-value, which is simply its rank divided by M+1.
- Because we are most interested in the smallest p-values, transform observed and expected p-values using the -log() function (base 10 is the usual convention).
- Plot expected p-values along the X axis and observed p-values along the Y axis.
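The steps above can be sketched in a few lines of code. The example below uses Python rather than R or Excel, purely as an illustration; the function name `qq_points` and the example p-values are made up for this sketch and are not part of the workshop files. It assumes base-10 logarithms and that no p-value is exactly zero (the logarithm is undefined there).

```python
import math

def qq_points(pvalues):
    """Return (expected, observed) pairs on the -log10 scale for a Q-Q plot."""
    m = len(pvalues)                       # M: number of markers examined
    observed = sorted(pvalues)             # smallest p-value first
    points = []
    for rank, p_obs in enumerate(observed, start=1):
        p_exp = rank / (m + 1)             # expected p-value for this rank
        # Assumes p_obs > 0; math.log10(0) would raise an error.
        points.append((-math.log10(p_exp), -math.log10(p_obs)))
    return points

# Made-up p-values, for illustration only (not taken from the workshop files).
example = [0.50, 0.04, 0.90, 0.0002, 0.31]
for x, y in qq_points(example):
    print(f"expected={x:.3f}  observed={y:.3f}")
```

Each pair gives the X coordinate (expected) and Y coordinate (observed) of one marker; the pairs can be handed to any plotting tool. Under the null, the points should fall close to the diagonal line Y = X.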
Compare your generated Q-Q plot with the three exemplar plots below:
This first plot represents a null setting, where test statistics appear well behaved, but there is no evidence for association.
This second plot represents a problematic situation, where it is very clear that test statistics don't match null expectations.
This final plot represents an ideal situation, where the majority of markers reassuringly fit null expectations and a small number of markers (perhaps tagging elusive loci impacting the trait of interest) show evidence for association.
Specific Questions to Consider
- Which of the Q-Q plots above best matches actual data from the study you picked?
- Can you propose possible explanations for the patterns you observe in the Q-Q plot?
Interpreting Regional Association Results
Another common task that geneticists encounter is the interpretation of study results in the context of nearby genes and other variants. There are many useful resources to help interpretation of study results, including:
- the NHGRI catalog of genomewide association study results at http://genome.gov/gwastudies
- the UCSC genome browser at http://genome.ucsc.edu
- the NCBI database collection at http://www.ncbi.nlm.nih.gov
- the Swiss Army knife of the early 21st century at http://www.google.com
To learn about genes that flank variants with tentative evidence of association in each of the three regions we are investigating, select the SNP showing the strongest evidence of association in each region and explore what you can learn from each of the resources above.
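Picking the strongest signal can be done by eye in a spreadsheet, but a short script works too. The sketch below is in Python and assumes a whitespace-delimited table with a header row containing SNP and PVALUE columns (the column names given in the LocusZoom instructions on this page). The function name `strongest_snp` and the sample rows are hypothetical. How to restrict the rows to a single region depends on columns not described here, so the function simply takes whatever rows you pass it.

```python
def strongest_snp(lines, snp_col="SNP", p_col="PVALUE"):
    """Return (marker name, p-value) for the row with the smallest p-value."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)                    # first non-empty line is the header
    snp_idx, p_idx = header.index(snp_col), header.index(p_col)
    best = None
    for fields in rows:
        try:
            p = float(fields[p_idx])
        except (ValueError, IndexError):
            continue                       # skip missing or non-numeric p-values
        if p != p:
            continue                       # skip NaN values
        if best is None or p < best[1]:
            best = (fields[snp_idx], p)
    return best

# Made-up rows, for illustration only (not taken from the workshop files).
sample = """SNP PVALUE
rs_example_1 0.21
rs_example_2 0.0003
rs_example_3 NA
""".splitlines()
print(strongest_snp(sample))
```

To use it on a real file, pass an open file handle in place of `sample`, since a file handle also yields one line at a time.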
Plotting Regional Association Results
The ability to display association study results in a visual manner can also be extremely useful. There are now several automated tools that facilitate the process of generating high quality visual displays of association study results and regional linkage disequilibrium patterns, including SNAP, CandiSNPer and LocusZoom.
Here, we will use LocusZoom to generate displays of association study results.
To do this, you should:
- Go to the LocusZoom website at http://csg.sph.umich.edu/locuszoom/ and select Plot Using Your Data
- Upload your data (using the Browse ... button)
- Provide key descriptors in your data, including header information for:
  - P-value column (should be PVALUE)
  - Marker name column (should be SNP)
  - Column delimiter (should be WhiteSpace)
- Select a region to plot (for example, you could request plots in the regions surrounding G6PC2, MTNR1B or GCK)
Questions to Consider
- What did you learn from each of the web resources?
- What are the limitations of using web resources to study overlap between results?
- Do the boundaries of the association signal match the recombination map in each region?
- Do the association signals clearly point to one gene?
Are the Glucose Associated Variants Also Associated with Other Traits?
If you have gotten this far and don't feel like twiddling your thumbs, you could proceed to investigate another question of interest. Specifically, do you have evidence that these same variants / loci are associated with other traits?