Difference between revisions of "EMMAX"

From Genome Analysis Wiki
Latest revision as of 09:58, 2 February 2017

EMMAX Overview

EMMAX is a statistical test for large-scale human or model organism association mapping that accounts for sample structure. In addition to the computational efficiency of the EMMA algorithm, EMMAX takes advantage of the fact that each locus explains only a small fraction of the variance of a complex trait. This makes it possible to avoid repeating the variance component estimation for every marker, which substantially reduces the computational time of mixed-model association mapping.
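As a rough sketch of the model behind this description, the variance component formulation can be written as follows. The symbols X (covariates), \beta (fixed effects), u (random genetic effect), \epsilon (residual) and K (kinship matrix) are notation assumed here for illustration; \sigma_g^2, \sigma_e^2 and \delta match the quantities reported in the REML output described further down.

```latex
\begin{align}
y &= X\beta + u + \epsilon, \qquad
\mathrm{Var}(u) = \sigma_g^2 K, \qquad
\mathrm{Var}(\epsilon) = \sigma_e^2 I \\
\mathrm{Var}(y) &= \sigma_g^2 K + \sigma_e^2 I
  = \sigma_g^2 \left( K + \delta I \right), \qquad
\delta = \sigma_e^2 / \sigma_g^2
\end{align}
```

Under this reading, the variance parameters are estimated once and then reused when each marker is tested, rather than being re-estimated per marker.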

Download EMMAX

The latest release of EMMAX can be downloaded from the EMMAX Download Page (http://csg.sph.umich.edu//kang/emmax/download/index.html).

Key Instructions

These instructions are based on the latest Intel binary release of EMMAX. See http://genetics.cs.ucla.edu/emmax/ for the documentation of the original version of the binary.

Preparing Input Genotype Files

Use PLINK to transpose your genotype files (bed or ped format) to tped/tfam format by running

 % plink --bfile [bed_prefix] (or --file [ped_prefix]) --recode12 --output-missing-genotype 0 --transpose --out [tped_prefix]

Preparing Input Phenotype Files

Reformat the phenotype file so that its rows are in the same order as the .tfam file. Each line has three entries: FAMID, INDID, and the phenotype value. Missing phenotype values should be represented as "NA". It is simpler to regress out the covariates when generating the phenotypes, but it is also possible to adjust for covariates simultaneously. Sample lines of a phenotype file (tab or space delimited):

859811	859811	0.609109817670387 
862311	862311	-0.0735227335684144 
864111	864111	-0.210247209814720
865211	865211	-0.154258680369780
875511	875511	0.239822160194388
880111	880111	0.287436401143001
880811	880811	NA
881511	881511	0.114872064616424
88211	88211	-0.0165529689285573
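Because the row order has to match the .tfam file, it can help to check it before running EMMAX. The following is a minimal sketch, not part of EMMAX: the function names are hypothetical, and the .tfam example rows (parent, sex and phenotype columns) are invented for illustration. It only compares the first two columns (FAMID and INDID) of both files.

```python
from typing import List, Optional, Tuple


def read_ids(lines: List[str]) -> List[Tuple[str, str]]:
    """Collect (FAMID, INDID) from the first two whitespace-delimited fields."""
    ids = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            ids.append((fields[0], fields[1]))
    return ids


def first_order_mismatch(tfam_lines: List[str],
                         pheno_lines: List[str]) -> Optional[int]:
    """Return the 0-based row index of the first ID mismatch, or None if orders agree."""
    tfam_ids = read_ids(tfam_lines)
    pheno_ids = read_ids(pheno_lines)
    for i, (tfam_id, pheno_id) in enumerate(zip(tfam_ids, pheno_ids)):
        if tfam_id != pheno_id:
            return i
    if len(tfam_ids) != len(pheno_ids):
        # One file has extra rows; report the first unmatched row.
        return min(len(tfam_ids), len(pheno_ids))
    return None


# Tiny in-memory example; with real files, pass open(path).read().splitlines().
tfam = ["859811 859811 0 0 1 -9", "862311 862311 0 0 2 -9"]  # invented tfam rows
pheno_ok = ["859811\t859811\t0.609109817670387",
            "862311\t862311\t-0.0735227335684144"]
pheno_swapped = list(reversed(pheno_ok))

print(first_order_mismatch(tfam, pheno_ok))       # orders agree
print(first_order_mismatch(tfam, pheno_swapped))  # first row differs
```

A return value of None means the two files list individuals in the same order; any integer points at the first row to inspect.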

Creating Marker-Based Kinship Matrix

Create the kinship matrix (IBS or BN; BN is preferred) using emmax-kin. Make sure that both the .tped and .tfam files exist with the same prefix. The Intel implementation of emmax-kin should be orders of magnitude faster than the previous implementation.

IBS matrix (will generate [tped_prefix].aIBS.kinf)
 % emmax-kin-intel64 -v -s -d 10 [tped_prefix]

BN (Balding-Nichols) matrix (will generate [tped_prefix].aBN.kinf)
 % emmax-kin-intel64 -v -d 10 [tped_prefix]

Run EMMAX Association

Run EMMAX with the phenotype, tped/tfam files, and the kinship files as follows.

% emmax -v -d 10 -t [tped_prefix] -p [pheno_file] -k [kin_file] -o [out_prefix]

This will generate the following files:

  • [out_prefix].reml : REML output with 6 lines, where each line represents
    1. Log-likelihood with variance component
    2. Log-likelihood without variance component
    3. \delta = \sigma_e^2 / \sigma_g^2 (ratio between variance parameters)
    4. \sigma_g^2 (genetic variance parameter)
    5. \sigma_e^2 (residual variance parameter)
    6. The pseudo-heritability estimate (variance explained by the kinship matrix)
  • [out_prefix].ps : Each line consists of
    1. SNP ID
    2. Beta (1 is effect allele)
    3. SE(beta)
    4. p-value
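The two output files described above can be read with a few lines of code. This is a minimal sketch under stated assumptions: each .reml line holds a single number in the order listed, each .ps line holds the four whitespace-delimited columns listed, and all numeric fields parse as floats. The function and field names are hypothetical, and the example values are made up for illustration, not real EMMAX output.

```python
from typing import Dict, List, Tuple

# Hypothetical field names, in the order of the six .reml lines listed above.
REML_FIELDS = [
    "loglik_with_vc",
    "loglik_without_vc",
    "delta",
    "sigma_g2",
    "sigma_e2",
    "pseudo_heritability",
]


def parse_reml(text: str) -> Dict[str, float]:
    """Map the six numbers of a .reml file to named fields."""
    values = [float(token) for token in text.split()]
    if len(values) != len(REML_FIELDS):
        raise ValueError(f"expected {len(REML_FIELDS)} values, found {len(values)}")
    return dict(zip(REML_FIELDS, values))


def parse_ps_sorted(lines: List[str]) -> List[Tuple[str, float, float, float]]:
    """Return (snp_id, beta, se_beta, p_value) records sorted by ascending p-value."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:  # skip blank or short lines
            continue
        records.append((fields[0], float(fields[1]), float(fields[2]), float(fields[3])))
    return sorted(records, key=lambda record: record[3])


# Made-up example values, chosen so that delta = sigma_e2 / sigma_g2 holds.
reml_text = "-120.5\n-125.0\n0.5\n0.5\n0.25\n0.6\n"
ps_lines = ["rs1 0.10 0.05 0.04", "rs2 -0.30 0.06 1e-6"]

reml = parse_reml(reml_text)
print(reml["delta"], reml["sigma_e2"] / reml["sigma_g2"])
print(parse_ps_sorted(ps_lines)[0][0])  # SNP with the smallest p-value
```

Comparing the reported \delta with \sigma_e^2 / \sigma_g^2 is a cheap sanity check that the lines were read in the expected order.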

Incorporating Covariates

To adjust for covariates simultaneously, add the -c [cov_file] option to the above run. The covariate file is similar to the phenotype file but allows multiple columns (> 3). Note that the intercept has to be included: the third column is recommended to always be 1, and the covariates start from the fourth column. The order of the individual IDs should conform to the .tfam file, as with the phenotype file. Sample lines of a covariate file:

100211 100211 1 2
100611 100611 1 2
100711 100711 1 3
100811 100811 1 4
101611 101611 1 2
101711 101711 1 2
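A covariate file in this layout can be generated from the .tfam order so that the intercept column and the row order are correct by construction. The sketch below is illustrative only: the function name and the in-memory covariate dictionary are hypothetical, and the .tfam rows are invented apart from the IDs taken from the sample above.

```python
from typing import Dict, List, Sequence, Tuple


def build_covariate_lines(tfam_lines: List[str],
                          covariates: Dict[Tuple[str, str], Sequence[float]]) -> List[str]:
    """Emit 'FAMID INDID 1 cov1 cov2 ...' rows in .tfam order.

    Raises KeyError if an individual in the .tfam file has no covariates.
    """
    out = []
    for line in tfam_lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank lines
            continue
        key = (fields[0], fields[1])
        values = covariates[key]
        # Third column is the intercept (always 1); covariates start at column four.
        out.append(" ".join([key[0], key[1], "1"] + [str(v) for v in values]))
    return out


# Invented .tfam rows; only the first two columns (FAMID, INDID) are used.
tfam = ["100211 100211 0 0 1 -9", "100611 100611 0 0 2 -9", "100711 100711 0 0 1 -9"]
# Covariates keyed by (FAMID, INDID), deliberately in a different order.
covs = {("100711", "100711"): [3], ("100211", "100211"): [2], ("100611", "100611"): [2]}

for row in build_covariate_lines(tfam, covs):
    print(row)
```

Raising an error for a missing individual is a deliberate choice here, since silently skipping a row would shift every later row out of alignment with the .tfam file.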

Running EMMAX with dosages

If you add the -Z option, EMMAX will accept a .tped file in which each individual is represented by one dosage value (ranging from 0 to 2) instead of two genotype columns (this is one of the standard PLINK dosage formats). You can then run the EMMAX association test on dosages. However, dosage-based genotypes cannot be used when creating the kinship matrix.

Frequently Asked Questions

Effect allele

Q. Which allele is the effect allele?

A. EMMAX simply follows the encoding scheme of the .tped file under the additive model, so whichever allele is encoded as 2 in the .tped file is the effect allele (usually the major allele).

Support for VCF format and Gene-level Burden Test

Q. Is there a version supporting VCF format and gene-level burden test?

A. Use the EPACTS software pipeline to run EMMAX with VCF files. It also includes implementations of gene-level burden tests.

Encoding Case-control Phenotypes

Q. I would like to run EMMAX for case-control phenotypes. How can I encode the phenotypes?

A. If you encode case/control as 2/1, you will be able to run a case-control analysis. Because EMMAX is based on a linear mixed model rather than a generalized linear mixed model, the effect size (beta) is not directly meaningful, but the p-values should be reliable (unless case/control counts are highly imbalanced).
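As a small illustration of this encoding, the sketch below writes phenotype rows with cases as 2, controls as 1, and unknown status as "NA" (the missing-value convention from the phenotype file section). The function name and the case/control statuses attached to the sample IDs are hypothetical.

```python
from typing import List, Optional, Tuple


def encode_case_control(samples: List[Tuple[str, str, Optional[bool]]]) -> List[str]:
    """Turn (FAMID, INDID, is_case) into tab-delimited phenotype rows.

    is_case: True -> 2 (case), False -> 1 (control), None -> NA (missing).
    """
    lines = []
    for famid, indid, is_case in samples:
        if is_case is None:
            value = "NA"
        else:
            value = "2" if is_case else "1"
        lines.append(f"{famid}\t{indid}\t{value}")
    return lines


# Invented statuses; the rows must still follow the .tfam order.
samples = [("859811", "859811", True),
           ("862311", "862311", False),
           ("880811", "880811", None)]

for row in encode_case_control(samples):
    print(row)
```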

IBS matrix or BN matrix?

Q. Between the IBS and BN matrices, which one is preferred?

A. We believe that the BN matrix is the more robust choice for constructing the empirical kinship matrix. We also recommend applying a call rate threshold of 95% and a MAF threshold of 0.01 when preparing the data.

NaN in the kinship file

Q. I am observing a series of -nan in the kinship matrix. What is the problem?

A. Most likely, monomorphic SNPs are causing the problem. Typically, a MAF threshold such as 1% is used to exclude them.
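To see which markers are monomorphic or fall below a MAF threshold, the .tped file can be scanned directly. This is a minimal sketch, assuming the file was produced with the PLINK options from the genotype section (alleles coded 1/2, missing coded 0) and that the first four columns are chromosome, SNP ID, genetic distance and position. The function names are hypothetical, and the example lines are invented.

```python
from typing import List, Tuple


def marker_maf(tped_line: str) -> Tuple[str, float]:
    """Return (snp_id, minor allele frequency) for one 1/2-coded .tped line."""
    fields = tped_line.split()
    snp_id = fields[1]
    # Columns 5 onward are alleles, two per individual; "0" marks a missing allele.
    alleles = [a for a in fields[4:] if a != "0"]
    if not alleles:
        return snp_id, 0.0  # nothing called: treat as uninformative
    freq_1 = alleles.count("1") / len(alleles)
    return snp_id, min(freq_1, 1.0 - freq_1)


def low_maf_markers(tped_lines: List[str], threshold: float = 0.01) -> List[str]:
    """List SNP IDs whose MAF is below the threshold (monomorphic SNPs have MAF 0)."""
    flagged = []
    for line in tped_lines:
        if not line.strip():
            continue
        snp_id, maf = marker_maf(line)
        if maf < threshold:
            flagged.append(snp_id)
    return flagged


# Invented example: rs1 is monomorphic, rs2 is polymorphic with one missing genotype.
tped = ["1 rs1 0 1000 1 1 1 1 1 1",
        "1 rs2 0 2000 1 2 2 2 0 0"]

print(low_maf_markers(tped))
print(marker_maf(tped[1]))
```

Markers flagged this way can then be excluded before the kinship matrix is built.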

Citing EMMAX

Q. How can I cite EMMAX if I used it in my research?

A. Please see http://www.ncbi.nlm.nih.gov/pubmed/20208533, or copy the line below.

Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42:348-54

Acknowledgements

This research was supported by National Science Foundation grants 0513612, 0731455 and 0729049, and National Institutes of Health (NIH) grants 1K25HL080079 and U01-DA024417. N.A.Z. is supported by the Microsoft Research Fellowship. H.M.K. is supported by the Samsung Scholarship, National Human Genome Research Institute grant HG00521401, National Institute for Mental Health grant NH084698 and GlaxoSmithKline. C.S. is partially supported by NIH grants GM053275-14, HL087679-01, P30 1MH083268, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. N.B.F. and S.K.S. are supported by NIH grants HL087679-03, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. This research was supported in part by the University of California, Los Angeles subcontract of contract N01-ES-45530 from the National Toxicology Program and National Institute of Environmental Health Sciences to Perlegen Sciences.