EMADS Primary Analysis Plan

This Page is No Longer Supported. Please Visit http://gscan.sph.umich.edu

Exome Meta-Analysis of Drinking and Smoking (EMADS) Analysis Plan


All samples have some version of the Exome Chip or exome/whole genome sequences. Individual studies will provide information about the manufacturer and version of the exome chip, or sequencing platform, they are using.

Inclusion Criteria

For our first analysis, samples must be between ages 18 and 70 (inclusive) and be of European ancestry. We will extend analysis to other ancestral groups in the future.

Quality Control

We leave calling algorithms, marker filters, and sample filters to the discretion of local sites, although we will evaluate the possibility of batch effects (where batch might be a study) during the meta-analysis step.

For reference, four currently participating studies have used Illumina chips and Illumina’s genotype caller in Genome Studio (Gencall). Some studies also performed manual curation, reclustering the intensity data for ~1500 markers.

Strand Orientation

Chip TOP allele annotations (typical output from Gencall) need to be updated to the forward strand of build 37.

The strand file for exome chip version 12v1_A is available at: http://www.well.ox.ac.uk/~wrayner/strand/HumanExome-12v1_A-b37-strand.zip

Usage instructions, including scripts, are available here: http://www.well.ox.ac.uk/~wrayner/strand/

Future strand files will also be available at that site.


(1) Average cigarettes smoked per day, either as a current smoker or former smoker

Individuals who either never smoked, or on whom we have no data (e.g., someone was a former smoker but former smoking was never assessed) will be excluded from analysis. Only cigarettes will be included in the estimate. If preferable, repeated measures designs (longitudinal data) can use all assessments by scaling and correcting for covariates within waves of assessment, then averaging across assessments.

For studies that collect a quantitative measure of CPD, where the respondent is free to provide any integer (e.g., 13 CPD), we will bin responses into the following bins: 1-10, 11-20, 21-30, 31+. If some study collected binned responses from the outset, and those bins happen to differ from ours (e.g., 1-5, 6-15, etc.), then we will simply use whatever bins the study has collected. Please contact Scott if your study does something completely different.

In analysis, it is likely easiest to consider the bins to correspond to the following numerical values.

  • 1 = 1-10
  • 2 = 11-20
  • 3 = 21-30
  • 4 = 31+

Please note, however, that when we report descriptive statistics about our phenotypes we will want to report the original participant responses. Even though we'll bin the data for analysis, we'll still report quantitative CPD (when possible) when we describe each study's phenotype in eventual publications.
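
The binning described above can be sketched as follows. This is a minimal illustration, not a required implementation: the function name bin_cpd is hypothetical, and treating values of zero or less as missing is an assumption based on the exclusion of never-smokers.

```python
def bin_cpd(cpd):
    """Map quantitative cigarettes per day to the plan's bin codes 1-4."""
    # Assumption: missing values and values <= 0 are treated as excluded.
    if cpd is None or cpd <= 0:
        return None
    if cpd <= 10:
        return 1  # 1-10 CPD
    if cpd <= 20:
        return 2  # 11-20 CPD
    if cpd <= 30:
        return 3  # 21-30 CPD
    return 4      # 31+ CPD


# Keep the original responses alongside the bins, since descriptive
# statistics are reported on the quantitative scale.
responses = [13, 5, 40, None]
binned = [bin_cpd(value) for value in responses]
```

Studies that collected their own bins from the outset would skip this step and use their bins as collected.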

(2) Smoking Initiation

This is a binary phenotype. Code "1" for everyone in the study who reports ever being a regular smoker in their life (current or former). Code a "0" for everyone who denies ever being a regular smoker in their life.

Every study had some usable measure of whether a respondent has ever regularly smoked. Almost all asked directly. Some have necessary information to code this variable (e.g., 100 cigs lifetime? Ever smoked every day for 2 weeks straight?).

Note that we’re among the first groups conducting such meta-analyses, and our analysis pipeline is currently restricted to continuous traits. Until methods are developed for binary traits, it is proposed that we analyze smoking initiation as a continuous trait.

(3) Pack Years

Number of cigarettes per day, divided by 20, then multiplied by the number of years the person has smoked. For this measure please use the quantitative CPD, and not the binned responses discussed above under the CPD heading. If your study collected binned responses from the outset, please use the midpoint of the range in calculating Pack Years. For example, individuals stating they smoked 11-20 CPD would be assumed to have smoked 15.5 CPD on average.
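
As a worked sketch of the Pack Years calculation, including the midpoint rule for binned responses (the function names range_midpoint and pack_years are hypothetical, chosen only for illustration):

```python
def range_midpoint(low, high):
    """Midpoint of a reported range, e.g. 11-20 CPD gives 15.5."""
    return (low + high) / 2


def pack_years(cpd, years_smoked):
    """Cigarettes per day, divided by 20 (one pack), times years smoked."""
    return cpd * years_smoked / 20


# Quantitative response: 20 CPD for 10 years is 10 pack years.
quantitative = pack_years(20, 10)

# Binned response: 11-20 CPD for 20 years uses the midpoint of 15.5 CPD.
from_bin = pack_years(range_midpoint(11, 20), 20)
```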

(4) Age of Initiation of Smoking

The age at which an individual first became a regular smoker. Please check for obvious outliers (e.g., a reported age of 4 years or younger) and remove them.

(5) Average drinks per week, either as a current drinker or former drinker

The average number of drinks a subject reports drinking each week. Most studies asked this question directly. Other studies have converted to grams per day, or grams per week. The latter are fine to analyze directly for our purposes.

Individuals who either never drank, or on whom we have no data (e.g., someone was a former drinker but former drinking was not assessed) will be excluded from analysis. Please combine all types of liquor in the total estimate. If preferable, repeated measures designs (longitudinal data) can use all assessments by scaling and correcting for covariates within waves of assessment, then averaging across assessments.

If your study forced the respondent to report ranges (e.g., 1-5, 6-10, 11-15, 16-20, etc.) please simply use the midpoint of the range. For example, if one range is 1-5 DPW, we assume they drink 3 DPW on average. Then use these midpoints in all subsequent analysis.

Covariate Correction (to be done after left-anchoring and log transformation)

For CPD we will consider the binned responses to be on a quantitative scale from 1-4 (see above under the CPD phenotype description). CPD therefore will not require transformation prior to covariate correction.

For the other three quantitative phenotypes (Pack Years, Age of Initiation, Drinks Per Week) please left-anchor the distribution at 1 and log-transform it. Left-anchoring, such that no value is less than 1, prevents the log-transform from returning nonsensical values like negative infinity. Then apply the covariate correction to the transformed phenotypes.
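
One way to read the left-anchoring step is sketched below. Two assumptions are made here that the plan does not state explicitly: left-anchoring is implemented by shifting the whole distribution so that its minimum equals 1, and the log is the natural log. The function name is hypothetical.

```python
import math


def left_anchor_and_log(values):
    """Shift values so the observed minimum is 1, then take the natural log."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    # Assumption: anchoring is a constant shift applied to everyone.
    shift = 1 - min(observed)
    return [None if v is None else math.log(v + shift) for v in values]


# Drinks per week with one missing response; the smallest value maps to log(1) = 0.
transformed = left_anchor_and_log([0.5, 3.0, None, 14.0])
```

Because the smallest shifted value is exactly 1, the transformed phenotype has a minimum of 0 and no value is undefined.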

No transformations are necessary for the binary smoking initiation phenotype, but we will still correct for covariates for smoking initiation (recall that we are treating this binary phenotype in our analysis as if it were a continuous trait).

Appropriate covariates can often be study-specific. We will depend on local investigators to determine the most appropriate covariates. We list here some covariates that will likely be necessary.

Main Effects

  • Age
    • At assessment in current smokers/drinkers
    • For former smokers/drinkers, age could instead be taken as age at quitting
    • At assessment for Pack Years, Smoking Initiation, and Age of Initiation, regardless of current/former smoking status
  • Age squared
  • Sex
  • Date of birth (or year, or range)
  • Cohort
  • Genetic principal components (alternatively, could use empirical kinships in Rare-Metal-Worker)
  • Adolescence versus adulthood (e.g., < 21 years of age versus >=21). Only consider using this covariate if you have a large number of adolescents in your study.
  • Date of assessment (e.g., the calendar year of the assessment)?
  • Current versus former smoker for smoking phenotypes. This would be a binary covariate.
  • Current versus former drinker for drinking phenotypes. This would be a binary covariate.
  • For the drinking phenotype, consider Height, weight, and/or BMI (the idea is that a similar amount of alcohol has different effects on a 200 lb person versus a 100 lb person)


These covariates may not be necessary, but we list them for local analysts to consider.

  • Sex X Adolescence interaction
  • Sex X Age interaction
  • Sex X Weight/Height/BMI interaction
  • Age X Adolescence interaction

Analysis of Covariate-Corrected Phenotypes

The basic analysis is two-stage. In the first stage, local investigators produce, for each phenotype, a set of single-variant summary statistics using a tool developed at the University of Michigan. In the second stage, these summary statistics are pooled for meta-analysis. All single-variant and gene-based (‘burden’) tests can be conducted from the summary statistics.

These two stages are now described in more detail.

Stage 1: Local Sites Produce Summary Statistics Using Rare-Metal-Worker

The meta-analysis step (stage 2) requires a very specific set of summary statistics, which includes single-variant test statistics and p-values, as well as the test statistic covariance matrix within a sliding window (default: 1Mb). Shuang Feng, Dajiang Liu, and Goncalo Abecasis at the University of Michigan have developed software specifically for this purpose, called Rare-Metal-Worker. Software and usage instructions for generating the necessary single-variant statistics are available at Rare-Metal-Worker. If there are installation problems, please let Scott know.

Rare-Metal-Worker works best, IMHO, when the genotype files are coded as VCF. There are several ways to convert to VCF, including PLINK/SEQ and also WDIST (https://www.cog-genomics.org/wdist/).

NOTE: It is essential that analysis proceeds in the following order. For CPD, please bin quantitative responses and correct for covariates to obtain residuals. For Pack Years, Age of Initiation, and Drinks Per Week, please left-anchor responses at 1, log-transform, and then correct for covariates to obtain residuals. In this way we will obtain residualized phenotypes ready for analysis with Rare-Metal-Worker. These steps are probably easier to do in your software of choice.
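
The covariate-correction step amounts to an ordinary least squares regression of the (binned or transformed) phenotype on the covariates, keeping the residuals. The sketch below uses only the Python standard library so that it is self-contained; in practice most analysts would use the regression routine in their statistics package of choice. The function name and the example covariates (age, sex) are illustrative assumptions.

```python
def ols_residuals(y, covariates):
    """Residuals from regressing y on an intercept plus the covariate columns.

    y: list of phenotype values; covariates: one row of covariate values per person.
    """
    X = [[1.0] + list(row) for row in covariates]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("covariates are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][p] for i in range(p)]
    return [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]


# Hypothetical example: log-transformed phenotype corrected for age and sex.
covs = [[20, 0], [30, 1], [40, 0], [50, 1], [60, 0]]
pheno = [1.2, 1.9, 2.1, 3.3, 2.8]
residuals = ols_residuals(pheno, covs)
```

The residuals sum to zero and are uncorrelated with each covariate, which is what makes them suitable as the residualized phenotype in the .ped file.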

Now, using the residualized phenotypes in a .ped file, please run Rare-Metal-Worker with the --makeResiduals and --inverseNormalize options. These will correct for the intercept and then inverse-normalize the phenotype prior to conducting association tests.
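
For intuition only, a rank-based inverse-normal transformation can be sketched as below. Rare-Metal-Worker performs this step internally, so local analysts do not need to run anything like this; the (rank - 0.5) / n offset and the averaging of tied ranks are assumptions of this sketch and may differ from what the software does.

```python
from statistics import NormalDist


def inverse_normalize(values):
    """Replace each value with the normal quantile of its (tie-averaged) rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over any run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank, averaged over ties
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    # Assumption: quantiles are taken at (rank - 0.5) / n.
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


normalized = inverse_normalize([0.8, -1.3, 0.1, 2.4, -0.2])
```

The transformation keeps the ordering of individuals but forces the phenotype onto a standard normal scale, which limits the influence of extreme residuals.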

Marker Grid for Fast-LMM Empirical Kinship

If you plan to use the Fast-LMM mixed model capability in Rare-Metal-Worker, it is likely preferable that you construct your kinship matrix either 1) with genome-wide markers from a GWAS panel (or 2nd generation exome chip) or 2) a subset of selected markers from the exome chip array. A list of markers can be obtained from Scott. There are many common markers on the first version of the exome chip, and many were selected for fine mapping (of MHC) or because of prior GWAS signals. These latter markers would ideally be excluded from the set of markers used to construct the empirical kinship matrix.

Running Times

Run times depend heavily on the type of analysis. If all samples are unrelated, and no kinship matrix is used, then run times should be relatively fast (tens of minutes). If a mixed model is used, for example with an empirical kinship matrix, then in samples of a few thousand Rare-Metal-Worker should take less than 20 minutes to complete. In larger samples, especially of related individuals (~10,000 or more with phenotype data), it can take several days to complete an exome-chip-wide scan.

Submitting Results for Meta-Analysis

All output files from Rare-Metal-Worker can then be uploaded to an sftp server at the University of Michigan for central analysis -- please email Scott Vrieze for the hostname, username, and password. One site used Aspera to transmit results, which worked well.

Stage 2: Single-Variant and Gene-Based Meta-Analysis

Single-Variant Tests

We will do meta-analysis of score statistics for individual variants weighting by sample size using Rare-Metal. Details are provided at that site.
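
To illustrate what weighting by sample size means, here is the textbook sample-size-weighted combination of per-study z-scores for a single variant. This is a generic sketch with a hypothetical function name; the exact computation performed by Rare-Metal on the score statistics may differ.

```python
import math


def sample_size_weighted_z(z_scores, sample_sizes):
    """Combine per-study z-scores using weights proportional to sqrt(N)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


# Three hypothetical studies reporting a z-score for the same variant.
combined = sample_size_weighted_z([1.2, 2.0, -0.3], [1500, 6000, 900])
```

Larger studies contribute more to the combined statistic, and all z-scores must refer to the same effect allele, which is why the strand alignment described earlier matters.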

Gene-Based Tests

Gene-based tests can be conducted centrally by Scott using output from Rare-Metal-Worker.

We will implement three burden tests.

  1. First, a Variable Threshold Combined Multivariate and Collapsing count method (VTCMC), where the number of rare alleles is counted in each gene, then the gene is tested for association. The threshold for what variants are considered "rare" (MAF < .05? MAF < .01?) is set adaptively such that the result minimizes the p-value obtained.
  2. Second, we will use the sequence kernel association test (SKAT) for all rare variants (MAF < .05) within a gene. SKAT allows for variants with opposite directions of effect within the same gene, whereas the variable threshold combined multivariate and collapsing method does not.
  3. Third, we will use a burden test developed by Madsen and Browning (M-B) where the number of rare alleles is counted in each gene, then the gene is tested for association, but alleles in the count are weighted by the inverse of the MAF. Thus rarer alleles are given more weight than more common alleles.
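
The idea behind a weighted burden score can be sketched at the genotype level as below. This follows the simplified description above (weights equal to the inverse of the MAF); the published Madsen and Browning weights are related but not identical, and in this project the tests are computed centrally from summary statistics rather than from individual genotypes. The function name and the 0.05 default threshold are illustrative assumptions.

```python
def weighted_burden(genotypes, mafs, maf_threshold=0.05):
    """Per-person burden score for one gene.

    genotypes: one list per person of rare-allele counts (0, 1, or 2) per variant.
    mafs: minor allele frequency of each variant, in the same order.
    """
    # Keep only variants that are polymorphic and below the rarity threshold.
    keep = [j for j, maf in enumerate(mafs) if 0 < maf < maf_threshold]
    # Assumption: each rare allele is weighted by 1 / MAF.
    return [sum(person[j] / mafs[j] for j in keep) for person in genotypes]


# Two people, three variants; the third variant is too common to be counted.
scores = weighted_burden([[1, 0, 2], [0, 1, 0]], [0.01, 0.02, 0.20])
```

The resulting per-person score would then be tested for association with the residualized phenotype, one gene at a time.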

Genotype Annotation

Gene-based burden tests can be augmented with genotype annotation. We currently plan to use only nonsynonymous variants from ANNO-generated annotations relative to GENCODE transcripts. All annotation can be done centrally at the meta-analysis stage to ensure consistency across sites.

Multivariate Test

We will pursue development of a multivariate test for drinking and smoking jointly. This could be as simple as, on a per-marker or per-gene basis, averaging effect sizes or p-values across the meta-analytic CPD and DPW results.

Further Downstream Analysis

To be determined. Will depend on results from the main analysis above.

We more than welcome individual sites to propose additional analysis, as well as to take the lead on additional projects related to the primary aims of this meta-analysis.

Descriptive Phenotype Information

When it comes time to publish our results we'll need descriptive information about our phenotypes. In anticipation of this Scott has sent around some draft tables. The tables will contain descriptive information about your study and phenotypes. For each phenotype we need:

  • sample size of non-missing observations
  • mean, standard deviation, range for quantitative phenotypes (including quantitative CPD, before binning)
  • Counts for smoking initiation, a binary phenotype
  • The 5x5 correlation matrix between residualized phenotypes, as well as the sample size contributing to each correlation.
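
Because each phenotype has its own exclusions (for example, never-smokers have no CPD), each cell of the correlation matrix is based on a different set of people. A minimal sketch of a pairwise-complete correlation matrix that also returns the contributing sample size is given below; the function names and the use of None for missing values are assumptions of this example.

```python
import math


def pairwise_correlation(x, y):
    """Pearson correlation over people observed on both phenotypes, plus their count."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None, n
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None, n  # correlation is undefined for a constant phenotype
    return sxy / math.sqrt(sxx * syy), n


def correlation_matrix(phenotypes):
    """phenotypes: dict mapping phenotype name to a list of values aligned by person."""
    names = list(phenotypes)
    return {(a, b): pairwise_correlation(phenotypes[a], phenotypes[b])
            for a in names for b in names}


# Hypothetical residualized phenotypes for four people (None = not analyzed).
example = correlation_matrix({
    "CPD": [0.3, -0.1, None, 0.8],
    "DPW": [0.5, -0.4, 0.2, 0.9],
})
```

With all five phenotypes supplied, the dictionary holds the 5x5 matrix, and each entry is a (correlation, sample size) pair.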