RAREMETALWORKER METHOD

Brief Introduction

RAREMETALWORKER generates single-variant association test statistics for a single study prior to meta-analysis. This page briefly describes the statistics that RAREMETALWORKER calculates, together with the key formulae.

Key Statistics for Analysis of a Single Study

NOTATION

We use the following notation to describe our methods:

${\displaystyle \mathbf {y} }$ is the vector of observed quantitative trait values

${\displaystyle \mathbf {X} }$ is the design matrix

${\displaystyle \mathbf {G_{i}} }$ is the genotype vector of the ${\displaystyle i^{th}}$ variant

${\displaystyle {\bar {\mathbf {G_{i}} }}}$ is the vector in which every element is the average genotype of the ${\displaystyle i^{th}}$ variant

${\displaystyle {\boldsymbol {\beta _{c}}}}$ is the vector of covariate effects

${\displaystyle \beta _{i}}$ is the scalar fixed genetic effect of the ${\displaystyle i^{th}}$ variant

${\displaystyle \mathbf {g} }$ is the vector of random genetic effects

${\displaystyle {\boldsymbol {\varepsilon }}}$ is the vector of non-shared environmental effects

${\displaystyle {\hat {\boldsymbol {\Omega }}}}$ is the estimated covariance matrix of ${\displaystyle \mathbf {y} }$

${\displaystyle \mathbf {K} }$ is the kinship matrix

${\displaystyle \sigma _{g}^{2}}$ is the genetic variance component

${\displaystyle {{\sigma _{g}}_{X}}^{2}}$ is the genetic variance component for markers on chromosome X

${\displaystyle \sigma _{e}^{2}}$ is the non-shared environmental variance component

SINGLE VARIANT SCORE TEST

We use the following model for the trait:

${\displaystyle \mathbf {y} =\mathbf {X} {\boldsymbol {\beta _{c}}}+\beta _{i}(\mathbf {G_{i}} -{\bar {\mathbf {G_{i}} }})+\mathbf {g} +{\boldsymbol {\varepsilon }}}$.

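Here, ${\displaystyle \mathbf {X} {\boldsymbol {\beta _{c}}}}$ captures the covariate effects, ${\displaystyle \beta _{i}(\mathbf {G_{i}} -{\bar {\mathbf {G_{i}} }})}$ is the fixed effect of the centered genotype of the ${\displaystyle i^{th}}$ variant, ${\displaystyle \mathbf {g} }$ is the vector of random genetic effects that accounts for relatedness among individuals, and ${\displaystyle {\boldsymbol {\varepsilon }}}$ is the vector of non-shared environmental effects.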

In this model, ${\displaystyle \beta _{i}}$ measures the additive genetic effect of the ${\displaystyle i^{th}}$ variant. As usual, the score statistic for testing ${\displaystyle H_{0}:\beta _{i}=0}$ is:

${\displaystyle U_{i}=(\mathbf {G_{i}} -{\bar {\mathbf {G_{i}} }})^{T}{\hat {\boldsymbol {\Omega }}}^{-1}(\mathbf {y} -\mathbf {X} {\hat {\boldsymbol {\beta }}})}$,

where ${\displaystyle {\hat {\boldsymbol {\beta }}}}$ is the vector of covariate effects estimated under the null model.

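Writing ${\displaystyle \mathbf {G} }$ for the genotype matrix whose ${\displaystyle i^{th}}$ column is ${\displaystyle \mathbf {G_{i}} }$, and ${\displaystyle {\bar {\mathbf {G} }}}$ for the matrix of the corresponding average genotypes, we further derive the variance-covariance matrix of these statistics as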

${\displaystyle \mathbf {V} =(\mathbf {G} -{\bar {\mathbf {G} }})^{T}({\hat {\boldsymbol {\Omega }}}^{-1}-{\hat {\boldsymbol {\Omega }}}^{-1}\mathbf {X} (\mathbf {X^{T}} {\hat {\boldsymbol {\Omega }}}^{-1}\mathbf {X} )^{-1}\mathbf {X^{T}} {\hat {\boldsymbol {\Omega }}}^{-1})(\mathbf {G} -{\bar {\mathbf {G} }})}$.

The score test statistic, ${\displaystyle T_{i}=(U_{i}^{2})/V_{ii}}$, is asymptotically distributed as chi-squared with one degree of freedom under the null hypothesis. RAREMETALWORKER reports the score test p-value.
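The formulae in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not RAREMETALWORKER's implementation: the function name `single_variant_score_tests` is hypothetical, the covariate effects are taken to be the generalized least squares estimate given the estimated covariance matrix, the average genotype is taken to be the plain sample mean, and the matrix is inverted directly, which is only practical for small samples.

```python
import math
import numpy as np

def single_variant_score_tests(y, X, G, omega_hat):
    """Score tests for every column of G (individuals x variants).

    Returns (U, V, T, p): score statistics, their covariance matrix,
    the test statistics U_i^2 / V_ii, and chi-squared (1 df) p-values.
    """
    omega_inv = np.linalg.inv(omega_hat)
    oi_x = omega_inv @ X                      # Omega^-1 X
    xt_oi_x = X.T @ oi_x                      # X^T Omega^-1 X
    # Assumption: covariate effects under the null are the GLS estimate given omega_hat.
    beta_hat = np.linalg.solve(xt_oi_x, oi_x.T @ y)
    resid = y - X @ beta_hat
    Gc = G - G.mean(axis=0)                   # G - G_bar, column by column
    U = Gc.T @ omega_inv @ resid
    # Omega^-1 - Omega^-1 X (X^T Omega^-1 X)^-1 X^T Omega^-1
    P = omega_inv - oi_x @ np.linalg.solve(xt_oi_x, oi_x.T)
    V = Gc.T @ P @ Gc
    T = U ** 2 / np.diag(V)
    # Upper tail of chi-squared with 1 df: P(Z^2 > t) = erfc(sqrt(t / 2)).
    p = np.array([math.erfc(math.sqrt(t / 2.0)) for t in T])
    return U, V, T, p

# Toy data: four unrelated individuals, intercept only, one variant.
y = np.array([1.0, 2.0, 3.0, 4.0])
X = np.ones((4, 1))
G = np.array([[0.0], [1.0], [1.0], [2.0]])
U, V, T, p = single_variant_score_tests(y, X, G, np.eye(4))
print(U, V, T, p)
```

With an identity covariance matrix and an intercept-only design, the statistics reduce to the centered genotype times the centered trait, which makes the toy example easy to check by hand.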

SUMMARY STATISTICS AND COVARIANCE MATRICES

RAREMETALWORKER automatically stores the score statistic for each marker (${\displaystyle U_{i}}$) together with quality information for that marker, including HWE p-value, call rate, and allele counts.

RAREMETALWORKER also stores the covariance matrices (${\displaystyle \mathbf {V} }$) of the score statistics of markers within a window, the size of which can be specified on the command line.
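To illustrate what storing covariances "within a window" can look like, the sketch below keeps, for each marker, only its covariances with itself and with later markers whose position lies within the window. The function name `windowed_covariances`, the dictionary layout, and the assumption that positions and window size share the same units are all hypothetical; the actual RAREMETALWORKER output format is not described on this page.

```python
def windowed_covariances(positions, V, window):
    """For each marker i, keep V[i][j] for j >= i with
    positions[j] - positions[i] <= window.

    Assumes `positions` is sorted in ascending order and that V is a
    square matrix (nested lists or an array) indexed like `positions`.
    """
    stored = {}
    n = len(positions)
    for i in range(n):
        row = []
        j = i
        while j < n and positions[j] - positions[i] <= window:
            row.append(float(V[i][j]))
            j += 1
        stored[i] = row
    return stored

# Toy example: three markers, the third lies outside the window of the first two.
positions = [100, 150, 900]
V = [[2.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 4.0]]
stored = windowed_covariances(positions, V, window=100)
print(stored)
```

Because the covariance matrix is symmetric, keeping only pairs with j >= i loses no information inside the window.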

MODELING RELATEDNESS

We use a variance component model to handle familial relationships. We estimate the variance components under the null model:

${\displaystyle \mathbf {y} =\mathbf {X} {\boldsymbol {\beta }}+\mathbf {g} +{\boldsymbol {\varepsilon }}}$

We assume that genetic effects are normally distributed, with mean ${\displaystyle \mathbf {0} }$ and covariance ${\displaystyle \mathbf {K} \sigma _{g}^{2}}$ where the matrix ${\displaystyle \mathbf {K} }$ summarizes kinship coefficients between sampled individuals and ${\displaystyle \sigma _{g}^{2}}$ is a positive scalar describing the genetic contribution to the overall variance. We assume that non-shared environmental effects are normally distributed with mean ${\displaystyle \mathbf {0} }$ and covariance ${\displaystyle \mathbf {I} \sigma _{e}^{2}}$, where ${\displaystyle \mathbf {I} }$ is the identity matrix.
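Under these assumptions, and taking the genetic and environmental effects to be independent, the trait vector is multivariate normal with mean ${\displaystyle \mathbf {X} {\boldsymbol {\beta }}}$ and covariance ${\displaystyle \mathbf {K} \sigma _{g}^{2}+\mathbf {I} \sigma _{e}^{2}}$. The sketch below evaluates the corresponding log-likelihood for given variance components. The function name `null_log_likelihood` is hypothetical, and the code evaluates the likelihood directly rather than with any efficient algorithm.

```python
import numpy as np

def null_log_likelihood(y, X, K, sigma_g2, sigma_e2):
    """Multivariate normal log-likelihood of the null model.

    For the given variance components, the fixed effects are set to
    their generalized least squares estimate. Returns (loglik, beta).
    """
    n = len(y)
    omega = sigma_g2 * K + sigma_e2 * np.eye(n)
    oi_x = np.linalg.solve(omega, X)              # Omega^-1 X
    beta = np.linalg.solve(X.T @ oi_x, oi_x.T @ y)
    resid = y - X @ beta
    _, logdet = np.linalg.slogdet(omega)          # log |Omega|
    quad = resid @ np.linalg.solve(omega, resid)  # r^T Omega^-1 r
    loglik = -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)
    return loglik, beta

# Toy example: three unrelated individuals, intercept only.
loglik, beta = null_log_likelihood(
    np.array([1.0, 2.0, 3.0]), np.ones((3, 1)), np.eye(3), 0.5, 0.5
)
print(loglik, beta)
```

Maximizing this function over the two variance components, for example with a grid search or a general-purpose optimizer, would give maximum likelihood estimates for small samples.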

We either define ${\displaystyle \mathbf {K} }$ from known pedigree structure or estimate it with the empirical estimator

${\displaystyle \mathbf {K} ={\frac {1}{l}}\sum _{i=1}^{l}{(G_{i}-2f_{i}\mathbf {1} )(G_{i}-2f_{i}\mathbf {1} )^{T} \over 4f_{i}(1-f_{i})}}$,

where ${\displaystyle l}$ is the number of variants, and ${\displaystyle G_{i}}$ and ${\displaystyle f_{i}}$ are the genotype vector and estimated allele frequency for the ${\displaystyle i^{th}}$ variant, respectively. Each element in ${\displaystyle G_{i}}$ encodes the minor allele count for one individual. The model parameters ${\displaystyle {\hat {\boldsymbol {\beta }}}}$, ${\displaystyle {\hat {\sigma _{g}^{2}}}}$ and ${\displaystyle {\hat {\sigma _{e}^{2}}}}$ are estimated by maximum likelihood, using the efficient algorithm described in Lippert et al. For convenience, let the estimated covariance matrix of ${\displaystyle \mathbf {y} }$ be ${\displaystyle {\hat {\boldsymbol {\Omega }}}={\hat {\sigma _{g}^{2}}}\mathbf {K} +{\hat {\sigma _{e}^{2}}}\mathbf {I} }$.
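The empirical kinship estimator and the resulting covariance matrix can be sketched as follows. The function names are hypothetical, the allele frequency is assumed to be estimated as half the mean genotype, and monomorphic variants are dropped because their denominator would be zero; none of these choices is stated on this page.

```python
import numpy as np

def empirical_kinship(G):
    """Empirical kinship from G (individuals x variants, minor allele counts)."""
    G = np.asarray(G, dtype=float)
    f = G.mean(axis=0) / 2.0            # assumed allele frequency estimate
    keep = (f > 0.0) & (f < 1.0)        # assumption: skip monomorphic variants
    G, f = G[:, keep], f[keep]
    Z = G - 2.0 * f                     # G_i - 2 f_i 1, one column per variant
    weights = 4.0 * f * (1.0 - f)       # 4 f_i (1 - f_i)
    # Sum over variants of z_i z_i^T / weights_i, averaged over the variants used.
    return (Z / weights) @ Z.T / G.shape[1]

def covariance_of_y(K, sigma_g2, sigma_e2):
    """Omega = sigma_g^2 K + sigma_e^2 I."""
    return sigma_g2 * K + sigma_e2 * np.eye(K.shape[0])

# Toy example: three individuals, two variants.
G = np.array([[0, 2], [1, 1], [2, 0]])
K = empirical_kinship(G)
omega_hat = covariance_of_y(K, sigma_g2=0.4, sigma_e2=0.6)
print(K)
print(omega_hat)
```

Dividing the centered genotype columns by their weights before the matrix product computes the whole sum of outer products in one step, without an explicit loop over variants.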

Chromosome X

To analyze markers on chromosome X, we fit an extra variance component, ${\displaystyle {{\sigma _{g}}_{X}}^{2}}$, to model the variance explained by chromosome X. A kinship matrix for chromosome X, ${\displaystyle \mathbf {K_{X}} }$, can be estimated either from a pedigree or from genotypes of markers on chromosome X. The estimated covariance matrix can then be written as ${\displaystyle {\hat {\boldsymbol {\Omega }}}={\hat {\sigma _{g}^{2}}}\mathbf {K} +{\hat {{\sigma _{g}}_{X}^{2}}}\mathbf {K_{X}} +{\hat {\sigma _{e}^{2}}}\mathbf {I} }$.
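As a small illustration, the three-component covariance matrix can be assembled as below. The function name `covariance_with_chr_x` and the toy kinship values are hypothetical; how the chromosome X kinship matrix is actually computed is not covered by this sketch.

```python
import numpy as np

def covariance_with_chr_x(K, K_x, sigma_g2, sigma_gx2, sigma_e2):
    """Omega = sigma_g^2 K + sigma_gX^2 K_X + sigma_e^2 I."""
    n = K.shape[0]
    return sigma_g2 * K + sigma_gx2 * K_x + sigma_e2 * np.eye(n)

# Toy kinship matrices for two individuals (illustrative values only).
K = np.array([[0.5, 0.25], [0.25, 0.5]])
K_x = np.array([[1.0, 0.5], [0.5, 1.0]])
omega_x = covariance_with_chr_x(K, K_x, sigma_g2=0.4, sigma_gx2=0.1, sigma_e2=0.5)
print(omega_x)
```

Setting the chromosome X variance component to zero recovers the two-component covariance matrix used for autosomal markers.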