Bar Harbor Statistical Genetics Workshop


This workshop was prepared for the 51st Mammalian Genetics Short Course, at the Jackson Laboratory in Bar Harbor. It illustrates several simple analyses of genetic association studies (we are necessarily limited by time!).

Glucose Data Set

This example examines evidence for association between fasting glucose levels and genetic markers in the G6PC2 (chr. 2), GCK (chr. 7) and MTNR1B (chr. 11) regions. It uses results from three genomewide association studies: FUSION, SardiNIA and DGI. Genetic variants in the three loci affect fasting glucose levels and, in the case of MTNR1B, also affect the risk of type 2 diabetes.

We will examine three short regions that show evidence for association with glucose levels (Chen et al, Journal of Clinical Investigation, 2008; Prokopenko et al, Nature Genetics, 2009). These regions map near the G6PC2 (chr. 2), GCK (chr. 7) and MTNR1B (chr. 11) genes. Markers within several hundred kilobases of these regions were genotyped or imputed in the FUSION (Scott et al, Science, 2007), DGI (Diabetes Genetics Initiative, Science, 2007), and SardiNIA (Scuteri et al, PLoS Genetics, 2007) genomewide association studies, and the results are tabulated in the three files below:


For convenience, a zip archive that includes all three files is also available.

Generating a Q-Q Plot


One of the most common analyses in a genomewide association study is to generate a Q-Q plot, which compares observed test statistics with those expected to occur by chance when a similar number of markers is examined. The approach works because in a GWAS we expect that the vast majority of variants will show no evidence of association with the trait of interest. For this exercise, you should pick one of the three datasets above and load it into an appropriate statistical package (like R). If no suitable statistical package is available, Microsoft Excel will do.

Your Tasks

Here are the basic steps you will need to carry out to generate a Q-Q plot:

  • Count the number of markers being examined. Let's call this number M.
  • Sort observed p-values, from smallest to largest.
  • Match each observed p-value with an expected p-value, which is simply its rank divided by M+1.
  • Because we are most interested in the smallest p-values, transform observed and expected p-values using the -log() function (base 10 is the usual convention).
  • Plot expected p-values along the X axis and observed p-values along the Y axis.
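
The steps above can be sketched in a few lines of Python. This is only an illustration, not the workshop's own solution: the function names `qq_points` and `plot_qq` and the example p-values are made up, base-10 logarithms are assumed, and the plotting helper assumes the matplotlib library is installed. Every p-value is assumed to be greater than zero.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs on the -log10 scale.

    Assumes every p-value is greater than 0 (log10 of 0 is undefined).
    """
    m = len(pvalues)              # M, the number of markers examined
    observed = sorted(pvalues)    # smallest to largest
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = rank / (m + 1)     # expected p-value for this rank
        points.append((-math.log10(expected), -math.log10(p)))
    return points


def plot_qq(points, filename="qqplot.png"):
    """Draw the Q-Q plot to a file (hypothetical helper, needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")         # render to a file, no display required
    import matplotlib.pyplot as plt

    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    limit = max(xs + ys)
    plt.scatter(xs, ys, s=8)
    # Diagonal line: where points fall if observed matches expected.
    plt.plot([0, limit], [0, limit], linestyle="--")
    plt.xlabel("Expected -log10(p)")
    plt.ylabel("Observed -log10(p)")
    plt.savefig(filename)
    plt.close()


# Made-up p-values, for illustration only.
example = [0.5, 0.001, 0.25, 0.75]
points = qq_points(example)
```

With real data, `points` would hold one pair per marker, and calling `plot_qq(points)` would write the figure to disk. Points that rise well above the diagonal at the right-hand end of the plot correspond to the smallest observed p-values.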

Interpret Results

Compare your generated Q-Q plot with the three exemplar plots below:


This first plot represents a null setting, where test statistics appear well behaved, but there is no evidence for association.


This second plot represents a problematic situation, where it is very clear that test statistics don't match null expectations.


This final plot represents an ideal situation, where the majority of markers reassuringly fit null expectations and a small number of markers (perhaps tagging elusive loci impacting the trait of interest) show evidence for association.

Specific Questions to Consider

  • Which of the QQ plots above best matches actual data from the study you picked?
  • Can you propose possible explanations for the patterns you observe in the QQ plot?

Interpreting Regional Association Results

Web Resources

Another common task that geneticists encounter is the interpretation of study results in the context of nearby genes and other variants. There are many useful resources to help interpretation of study results, including:

To learn about genes that flank variants with tentative evidence of association in each of the three regions we are investigating, select the SNP showing the strongest evidence of association in each region and explore what you can learn from each of the resources above.
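
Picking the SNP with the strongest evidence of association in each region amounts to finding the smallest p-value per region. The sketch below shows one way to do that, assuming the results have already been read into a list of dictionaries. The `SNP` and `PVALUE` keys mirror the column names used later on this page; the `REGION` key, the marker names and the p-values are hypothetical and only serve the illustration.

```python
def top_snp_per_region(rows):
    """Return, for each region, the row with the smallest p-value."""
    best = {}
    for row in rows:
        region = row["REGION"]    # hypothetical column naming the region
        if region not in best or row["PVALUE"] < best[region]["PVALUE"]:
            best[region] = row
    return best


# Made-up results, for illustration only (not taken from the workshop files).
rows = [
    {"SNP": "marker_a", "REGION": "G6PC2", "PVALUE": 0.02},
    {"SNP": "marker_b", "REGION": "G6PC2", "PVALUE": 0.0004},
    {"SNP": "marker_c", "REGION": "GCK", "PVALUE": 0.3},
    {"SNP": "marker_d", "REGION": "MTNR1B", "PVALUE": 0.01},
    {"SNP": "marker_e", "REGION": "MTNR1B", "PVALUE": 0.2},
]

top = top_snp_per_region(rows)
for region, row in top.items():
    print(region, row["SNP"], row["PVALUE"])
```

The marker names returned this way are the ones to look up in the web resources.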

Plotting Regional Association Results

The ability to display association study results in a visual manner can also be extremely useful. There are now several automated tools that facilitate the process of generating high-quality visual displays of association study results and regional linkage disequilibrium patterns, including SNAP, CandiSNPer and LocusZoom.

Here, we will use LocusZoom to generate displays of association study results.

To do this, you should:

  1. Go to the LocusZoom website and select Plot Using Your Data
  2. Upload your data (using the Browse ... button)
  3. Provide key descriptors in your data, including header information for:
    1. P-value column (should be PVALUE)
    2. Marker name column (should be SNP)
    3. Column delimiter (should be WhiteSpace)
  4. Select a region to plot (for example, you could request plots in the regions surrounding G6PC2, MTNR1B or GCK)
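
If your results are not already in a file with the layout described above, a short script can produce one. The sketch below writes a whitespace-delimited table with a header line naming the `SNP` and `PVALUE` columns. The function name `format_for_locuszoom`, the marker names and the p-values are hypothetical, and the workshop files may already be in a suitable format, in which case this step is unnecessary.

```python
def format_for_locuszoom(results):
    """Build a whitespace-delimited table with SNP and PVALUE columns.

    `results` is an iterable of (marker_name, p_value) pairs.
    """
    lines = ["SNP\tPVALUE"]               # header naming the two columns
    for snp, pvalue in results:
        # A tab counts as whitespace; 6 significant digits keep small
        # p-values readable in scientific notation.
        lines.append(f"{snp}\t{pvalue:.6g}")
    return "\n".join(lines) + "\n"


# Made-up results, for illustration only.
results = [("marker_a", 0.03), ("marker_b", 1.2e-08)]
text = format_for_locuszoom(results)
print(text)

# To save the table for upload (the file name is an assumption):
# with open("my_results.txt", "w") as handle:
#     handle.write(text)
```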

Questions to Consider

  • What did you learn from each of the web resources?
  • What are the limitations of using web resources to study overlap between results?
  • Do the boundaries of the association signal match the recombination map in each region?
  • Do the association signals clearly point to one gene?

Are the Glucose Associated Variants Also Associated with Other Traits?

If you have gotten this far and don't feel like twiddling your thumbs, you could proceed to investigate another question of interest. Specifically, do you have evidence that these same variants / loci are associated with other traits?

One place to start would be to examine the results of genomewide association studies for related traits, such as those available in LocusZoom or the NHGRI GWAS Catalog.