RAREMETAL Documentation
Revision as of 20:34, 7 August 2013
rareMETAL is a tool for gene-based meta-analysis based on summary statistics generated from individual-level data by Rare Metal Worker. An implementation of the same methods as an R package is available as RareMetals.
If you have any questions, please contact: sfengsph at umich dot edu
- 1 Useful Wiki Pages
- 2 Key Features
- 3 Brief Description
- 4 Approach
- 5 Download and Installation
- 6 Basic Usage Instructions
- 7 Additional Analysis Options
- 7.1 Input Files
- 7.2 Output Files
- 8 Example Usage
- 9 TUTORIAL
- 10 Change Log
Useful Wiki Pages
There are a few pages in this Wiki that may be useful to rareMETAL users. Here are links to key pages:
- The rareMETAL FAQ
rareMETAL has the following features:
- rareMETAL performs gene-based or region-based meta-analysis using burden tests with the following methods: CMC_counts, Madsen-Browning, SKAT, and Variable Threshold.
- rareMETAL performs single variant meta-analysis by default.
- rareMETAL allows customized groups of variants to be tested.
- rareMETAL allows conditional analysis to be performed in both gene-level and single variant meta-analysis.
- rareMETAL generates QQ plots and Manhattan plots by default.
rareMETAL is a tool for meta-analysis of rare variants from genotype arrays and sequencing. It combines summary statistics from individual studies generated by rareMetalWorker, and provides a convenient approach for both single variant and gene-level meta-analysis of rare variants across studies when joint analysis of the raw data is restricted.
The key idea behind meta-analysis with rareMETAL is that various gene-level test statistics can be reconstructed from single variant score statistics and that, when the linkage disequilibrium relationships between variants are known, the distribution of these gene-level statistics can be derived and used to evaluate significance. Single variant statistics are calculated using the Cochran-Mantel-Haenszel method. The main formulae are tabulated in the following:
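Independently of the tabulated formulae, the idea of rebuilding a gene-level statistic from single variant score statistics can be illustrated with a small sketch. It assumes each study contributes a vector of score statistics and their covariance matrix for the same ordered set of variants, pools them by summation, and forms a weighted burden statistic. The function names, the weighting scheme, and the toy numbers are illustrative assumptions, not rareMETAL's actual code.

```python
import math

def pool_studies(study_scores, study_covs):
    """Sum score vectors and covariance matrices across studies.

    Assumes every study reports the same variants in the same order.
    """
    m = len(study_scores[0])
    scores = [sum(s[i] for s in study_scores) for i in range(m)]
    cov = [[sum(c[i][j] for c in study_covs) for j in range(m)]
           for i in range(m)]
    return scores, cov

def burden_from_scores(scores, cov, weights=None):
    """Return (z, two-sided p-value) for a weighted burden test."""
    m = len(scores)
    if weights is None:
        weights = [1.0] * m  # equal weights: an illustrative choice
    # Weighted sum of the single variant score statistics.
    numerator = sum(w * u for w, u in zip(weights, scores))
    # Its variance; the off-diagonal terms carry the LD between variants.
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(m) for j in range(m))
    z = numerator / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Toy numbers for two studies and two variants (made up for illustration).
scores, cov = pool_studies(
    [[1.5, 0.5], [0.5, 0.5]],
    [[[3.0, 0.5], [0.5, 1.0]], [[1.0, 0.5], [0.5, 1.0]]],
)
z, p = burden_from_scores(scores, cov)
print(z, p)
```

Because only score statistics and covariance matrices enter the calculation, no individual-level genotypes need to be shared between studies.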
Download and Installation
- University of Michigan CSG users can go to the following:
Where to Download
- The software package for Linux and Mac (source code included) can be downloaded here: software package download
How to Compile
- Save it to your local path and decompress using the following command:
tar xvzf raremetal.0.0.1.tgz
- Go to raremetal_0.0.1/raremetal/src and type the following command to compile:
How to Execute
- Go to raremetal_0.0.1/raremetal/bin and use the following:
- For example usage, please refer to [example command lines]
Basic Usage Instructions
rareMETAL is a command line tool. It is typically run from a Linux or Unix prompt by invoking the command raremetal. An example is included at the bottom of this page.
A detailed TUTORIAL with toy data is also available.
List of Studies
- The --studyName option, which specifies the study list file, is required. If it is omitted, Rare Metal reports a FATAL ERROR and stops.
- The file should contain the path and prefix of each study you want to include.
- To exclude one or more studies without creating a new file, put a "#" at the start of the corresponding line. Rare Metal will automatically exclude that study from the meta-analysis.
- An example file is in the following:
- The above example study name file directs Rare Metal to look for the following files as input (note that the second study has been opted out of the meta-analysis because of the "#" at the start of its line):
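The "#" convention for excluding a study can be sketched as follows. This is a minimal illustration that assumes one path-and-prefix entry per line; the function name and the example paths are hypothetical and are not taken from the tool.

```python
def read_study_list(lines):
    """Return the study prefixes that are not commented out with '#'."""
    studies = []
    for line in lines:
        entry = line.strip()
        # Skip blank lines and studies opted out with a leading '#'.
        if not entry or entry.startswith("#"):
            continue
        studies.append(entry)
    return studies

# Hypothetical study list: the second study is excluded.
example_lines = [
    "/data/studyA/studyA",
    "#/data/studyB/studyB",
    "/data/studyC/studyC",
]
kept = read_study_list(example_lines)
print(kept)
```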
Grouping from a Group File
- Grouping methods are only necessary when doing gene-based or group-based burden tests in meta-analysis.
- If no grouping method is specified, only single variant meta-analysis is performed.
- With the --groupFile option, you can specify particular sets of variants to be grouped for burden tests.
- The group file must be a tab or space delimited file in the following format:
GROUP_ID MARKER1_ID MARKER2_ID MARKER3_ID ...
- MARKER_ID must be in the following format: CHR:POS:REF:ALT
- An example group file is:
PLEKHN1 1:901922:G:A 1:901923:C:A 1:902088:G:A 1:902128:C:T 1:902133:C:G 1:902176:C:T 1:905669:C:G
HES4 1:934735:A:C 1:934770:G:A 1:934801:C:T 1:935085:G:A 1:935089:C:G
ISG15 1:949422:G:A 1:949491:G:A 1:949502:C:T 1:949608:G:A 1:949802:G:A 1:949832:G:A
AGRN 1:970687:C:T 1:976963:A:G 1:977028:G:T 1:977356:C:T 1:977396:G:A 1:978628:C:T 1:978645:G:A
C1orf159 1:1021285:G:T 1:1021302:T:C 1:1021315:A:C 1:1021386:G:A 1:1022534:C:T 1:1025751:C:T 1:1026913:C:T
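A group file in this layout can be checked before running the analysis. The sketch below is an illustration only: it assumes one group per line and marker IDs shaped like the examples above (chromosome, position, reference allele, alternative allele separated by colons). The regular expression and function name are assumptions, not part of rareMETAL.

```python
import re

# Assumed marker shape, inferred from the example group file.
MARKER_RE = re.compile(r"[^:\s]+:\d+:[ACGTN]+:[ACGTN]+")

def parse_group_file(lines):
    """Map each GROUP_ID to its list of marker IDs, validating the format."""
    groups = {}
    for line in lines:
        fields = line.split()  # handles both tab and space delimiters
        if not fields:
            continue
        group_id, markers = fields[0], fields[1:]
        bad = [m for m in markers if not MARKER_RE.fullmatch(m)]
        if bad:
            raise ValueError(f"{group_id}: malformed marker IDs {bad}")
        groups[group_id] = markers
    return groups

group_lines = [
    "HES4 1:934735:A:C 1:934770:G:A 1:934801:C:T 1:935085:G:A 1:935089:C:G",
    "KLHL17\t1:897285:A:G\t1:898869:C:T",
]
groups = parse_group_file(group_lines)
print({name: len(markers) for name, markers in groups.items()})
```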
Grouping from an Annotated VCF File
- If the --groupFile option is not specified, Rare Metal will look for an annotated VCF file as the blueprint for grouping variants.
- The annotated VCF file should be specified using --annotatedVcf option.
- --annotation should be used together with --annotatedVcf when only specific categories of functional variants are to be grouped. For example, to group nonsynonymous and splicing variants, include the following in the command line:
--annotatedVcf your.annotated.vcf --annotation nonsyn/splicing
Note: this groups variants whose annotation starts with nonsyn or splicing (not case-sensitive).
- The annotated VCF file requires a special format: all annotation information should be coded in the INFO field of the VCF file, starting with the key "ANNO=". An example annotated VCF file is shown below:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 19208194 . G A 100 PASS AC=3;ANNO=nonsynonymous:ALDH4A1:NM_170726:exon8:c.C866T:p.P289L,ALDH4A1:NM_001161504:exon8:c.C686T:p.P229L,ALDH4A1:NM_003748:exon8:c.C866T:p.P289L,; ANNO=splicing:ALDH4A1
1 19208293 . G C 100 PASS AC=7;STUDIES=5;MAC=7;MAF=0.001;DESIGN=TBD_ASSAY;DSCORE=1.00; ANNO=nonsynonymous:ALDH4A1:NM_170726:exon8:c.C767G:p.P256R,ALDH4A1:NM_001161504:exon8:c.C587G:p.P196R,ALDH4A1:NM_003748:exon8:c.C767G:p.P256R,
- Notice that each variant may have more than one annotation, but each annotation should start with a new key "ANNO=" followed by annotation:genename:other transcript information.
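The way "ANNO=" entries relate to the --annotation value can be illustrated with a short sketch. It assumes the INFO field is split on ";" and that an annotation is selected when its category starts with one of the "/"-separated prefixes, ignoring case, as described above. The function names are hypothetical, and the tool's actual parsing may differ.

```python
def anno_entries(info):
    """Extract (category, gene) pairs from the ANNO= keys of an INFO string."""
    entries = []
    for item in info.split(";"):
        item = item.strip()
        if item.startswith("ANNO="):
            parts = item[len("ANNO="):].split(":")
            if len(parts) >= 2:
                entries.append((parts[0], parts[1]))
    return entries

def genes_matching(info, annotation):
    """Genes with at least one annotation starting with a requested prefix."""
    prefixes = tuple(p.lower() for p in annotation.split("/") if p)
    return sorted({gene for category, gene in anno_entries(info)
                   if category.lower().startswith(prefixes)})

info = ("AC=3;ANNO=nonsynonymous:ALDH4A1:NM_170726:exon8:c.C866T:p.P289L,;"
        "ANNO=splicing:ALDH4A1")
print(anno_entries(info))
print(genes_matching(info, "nonsyn/splicing"))
```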
Generate a VCF File to Annotate Outside of Rare Metal
- --writeVcf allows users to write a VCF file of the pooled single variants from all studies. Users can then annotate this VCF file with their favorite annotation tool and use the annotated file as input to Rare Metal for further gene-based or region-based meta-analysis.
- The output VCF file will be named yourPrefix.pooled.variants.vcf. An example output VCF file is shown below:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 115658497 115658497 G A . . ALT_AF=0.380906;
2 74688884 74688884 G A . . ALT_AF=8.33611e-05;
3 121414217 121414217 C A . . ALT_AF=0.0747833;
- Rare Metal allows variants from individual studies to be filtered by their HWE p-value and call rate, both of which are part of the output from Rare Metal Worker.
- To filter by HWE p-value, use the --hwe option. The default is 0.0, which means no variants are filtered.
- To filter by call rate, use the --callRate option. The default is 0.0, which means no variants are filtered.
- Currently, four methods are provided in Rare Metal: CMC type burden test, Madsen-Browning burden test, Variable Threshold burden test, and SKAT.
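The --hwe and --callRate cutoffs described above can be pictured as a simple per-variant check. The sketch below assumes that a variant is dropped when its HWE p-value or its call rate falls below the corresponding cutoff, which is consistent with the default of 0.0 meaning no filtering. The exact comparison used by the tool is not stated here, and the function name is hypothetical.

```python
def passes_qc(hwe_pvalue, call_rate, hwe_cutoff=0.0, call_rate_cutoff=0.0):
    """Return True if a variant survives both QC cutoffs (assumed semantics)."""
    return hwe_pvalue >= hwe_cutoff and call_rate >= call_rate_cutoff

# With the defaults, nothing is filtered.
print(passes_qc(1e-8, 0.50))
# With cutoffs like those in the QC example command (1e-06 and 0.98).
print(passes_qc(1e-8, 0.99, hwe_cutoff=1e-6, call_rate_cutoff=0.98))
print(passes_qc(0.5, 0.99, hwe_cutoff=1e-6, call_rate_cutoff=0.98))
```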
Additional Analysis Options
- --prefix allows customized prefix for output files.
- --maf specifies the minor allele frequency cutoff when doing gene-based or group-based burden tests. The default is maf<0.05.
- --longOutput allows users to output not only burden test results but also the single variant results (allele frequencies, effect sizes, and p-values) for the variants being grouped together. Please refer to the output files section for detailed explanation and examples.
- --tabulateHits works together with --hitsCutoff to generate reports for genes whose p-value from burden tests or SKAT is below the specified cutoff. The default p-value cutoff for reported genes is 1.0e-06, which can be changed with the --hitsCutoff option. For more explanations and examples, please go to [Tabulated Top Hits].
Rare Metal needs the following as input:
List of Studies
- A file with the path and name of files containing summary statistics generated by raremetalworker should be specified.
- If no such file is provided, Rare Metal will stop and report FATAL ERROR.
- Please go to [example input for study names] for detailed explanation and examples.
Groups of Variants
To perform gene-based or group-based burden test, groups of variants need to be provided. There are two options to provide such information:
From Group File
- A group file contains the list of groups or genes with the variants to be included in your burden tests.
- Please refer to the instruction of --groupFile option for formats and examples.
From Annotated VCF File
- rareMETAL allows users to supply an annotated VCF file for grouping variants, as an alternative to the group file described above.
- rareMETAL can also generate a VCF file from the pooled information of the individual studies. Users can then annotate this VCF file with their favorite annotation tool, writing the annotations into the INFO field. Currently, rareMETAL supports only a limited set of annotated VCF formats.
- A more flexible and recommended way is to generate a group file from the customized annotated VCF file and use that as input to rareMETAL.
- For the annotated VCF formats that rareMETAL currently supports, please refer to the following annotated VCF:
NOTE: if no grouping method is provided, then only single variant meta analysis will be performed.
Single Variant Meta Analysis Output
- Single variant meta analysis output has the following components: header and results.
- Header lines starting with "##" show a summary of the meta-analysis, including the method used, the number of studies, and the total sample size.
- The header line starting with "#" gives the column headers for the results table.
- An example single variant meta analysis output is shown below:
##Method=SinglevarScore
##STUDY_NUM=2
##TotalSampleSize=14308
#CHROM POS REF ALT POOLED_ALT_AF EFFECT_SIZE DIRECTION_BY_STUDY PVALUE
1 115658497 G A 0.380906 0.00954332 ++ 0.45828
2 74688884 G A 8.33611e-05 -0.196387 -! 0.845372
3 121414217 C A 0.0747833 0.0216982 -+ 0.34453
6 137245814 G C 0.000803746 0.105693 ++ 0.601805
- Each column is explained below:
CHROM: Chromosome name
POS: Variant position
REF: Reference allele label
ALT: Alternative allele label
POOLED_ALT_AF: Pooled alternative allele frequency
EFFECT_SIZE: Alternative allele effect size
DIRECTION_BY_STUDY: Effect size direction of the alternative allele in each study. The order of studies is consistent with the order of studies listed in the input file for the --studyName option. "?" means the variant is not observed or is monomorphic in that study. "!" means the variant observed in that study has different alleles from those in the first study.
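The header conventions above ("##" for summary lines, "#" for column names) make the output straightforward to read programmatically. The following sketch assumes whitespace-delimited columns as in the example; the function name is hypothetical.

```python
def parse_single_variant_output(lines):
    """Split output into a summary dict and a list of per-variant dicts."""
    summary, columns, rows = {}, [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Summary lines look like ##KEY=VALUE.
            key, _, value = line[2:].partition("=")
            summary[key] = value
        elif line.startswith("#"):
            # The single '#' line names the result columns.
            columns = line[1:].split()
        else:
            rows.append(dict(zip(columns, line.split())))
    return summary, rows

output_lines = [
    "##Method=SinglevarScore",
    "##STUDY_NUM=2",
    "##TotalSampleSize=14308",
    "#CHROM POS REF ALT POOLED_ALT_AF EFFECT_SIZE DIRECTION_BY_STUDY PVALUE",
    "1 115658497 G A 0.380906 0.00954332 ++ 0.45828",
    "2 74688884 G A 8.33611e-05 -0.196387 -! 0.845372",
]
summary, rows = parse_single_variant_output(output_lines)
# Variants whose alleles differ from the first study carry a '!' flag.
flagged = [r["POS"] for r in rows if "!" in r["DIRECTION_BY_STUDY"]]
print(summary, flagged)
```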
Burden Tests Meta Analysis Output
When --longOutput is specified, output includes both burden test results of genes and single variant results of the variants included in burden tests. Otherwise, single variant results of variants included in burden tests will not be included in the output.
Long Output Format
- Here is an example output file from SKAT when --longOutput is specified.
##Method=Burden
##STUDY_NUM=2
##TotalSampleSize=14308
#GROUPNAME NUM_VAR VARs MAFs SINGLEVAR_EFFECTs SINGLEVAR_PVALUEs AVG_AF MIN_AF MAX_AF EFFECT_SIZE PVALUE
NOC2L 7 1:880502:C:T;1:881918:G:A;1:887799:C:T;1:888659:T:C;1:889238:G:A;1:891591:C:T;1:892380:G:A 0.000166722,0.0242172,0.0109203,0.0355845,0.0333729,0.00700233,0.00200067 -0.183575,-0.00228307,-0.0598337,0.0220595,0.0229464,-0.0302768,-0.0200417 0.790161,0.953446,0.515806,0.503548,0.499251,0.791773,0.926625 0.0161807 0.000166722 0.0355845 0.00667875 0.662531
KLHL17 2 1:897285:A:G;1:898869:C:T 0.0148408,0.00108369 -0.0502034,-0.0256403 0.528269,0.934606 0.00796222 0.00108369 0.0148408 -0.0484494 0.528878
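In the long format, the per-variant fields are packed into single columns: variants are separated by ";" and the matching frequencies, effects, and p-values by ",". A sketch of unpacking one row, assuming that layout and using a hypothetical function name:

```python
def expand_long_row(columns, line):
    """Turn one long-format group row into one record per variant."""
    record = dict(zip(columns, line.split()))
    variants = record["VARs"].split(";")
    mafs = [float(x) for x in record["MAFs"].split(",")]
    effects = [float(x) for x in record["SINGLEVAR_EFFECTs"].split(",")]
    pvalues = [float(x) for x in record["SINGLEVAR_PVALUEs"].split(",")]
    if not (len(variants) == len(mafs) == len(effects) == len(pvalues)):
        raise ValueError("per-variant fields have inconsistent lengths")
    return [
        {"group": record["GROUPNAME"], "variant": v, "maf": m,
         "effect": e, "pvalue": p}
        for v, m, e, p in zip(variants, mafs, effects, pvalues)
    ]

long_columns = ("GROUPNAME NUM_VAR VARs MAFs SINGLEVAR_EFFECTs "
                "SINGLEVAR_PVALUEs AVG_AF MIN_AF MAX_AF EFFECT_SIZE PVALUE").split()
klhl17 = ("KLHL17 2 1:897285:A:G;1:898869:C:T 0.0148408,0.00108369 "
          "-0.0502034,-0.0256403 0.528269,0.934606 "
          "0.00796222 0.00108369 0.0148408 -0.0484494 0.528878")
per_variant = expand_long_row(long_columns, klhl17)
for item in per_variant:
    print(item)
```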
Short Output Format
- Here is an example output file from SKAT when --longOutput is not specified.
##Method=Burden
##STUDY_NUM=2
##TotalSampleSize=14308
#GROUPNAME NUM_VAR VARs AVG_AF MIN_AF MAX_AF EFFECT_SIZE PVALUE
NOC2L 7 1:880502:C:T;1:881918:G:A;1:887799:C:T;1:888659:T:C;1:889238:G:A;1:891591:C:T;1:892380:G:A 0.0161807 0.000166722 0.0355845 0.00667875 0.662531
KLHL17 2 1:897285:A:G;1:898869:C:T 0.00796222 0.00108369 0.0148408 -0.0484494 0.528878
Tabulated Top Hits
- When --tabulateHits is specified, top hits from burden tests are tabulated, with one file generated per method. The purpose of this file is to list burden test results of top hits together with the single variant results of the variants grouped in each burden test. Unlike the standard long-format burden test output, each row of this file represents a single variant included in the gene for the burden test. This format allows easy sorting on the user's end.
- Tabulated top hits are saved in the file:
yourPrefix.meta.tophits.yourMethod.tbl
(example file names: TG.meta.tophits.burden.tbl, LDL.meta.tophits.SKAT.tbl)
- The following items are tabulated in the output:
GENE: Gene name.
METHOD: Burden test used.
GENE_PVALUE: P-value from gene-based burden tests.
MAF_CUTOFF: MAF cutoff used when doing gene-based tests.
ACTUAL_CUTOFF: Actual MAF cutoff used. (This will be different from MAF_CUTOFF only for Variable Threshold method. Otherwise, it will be the same as MAF_CUTOFF.)
VAR: Variant name in CHR:POS:REF:ALT format.
MAF: Single variant pooled MAF from all samples.
EFFSIZE: Effect size from single variant meta analysis.
PVALUE: P-value from single variant meta analysis.
- An example of tabulated hits from a standard burden test with maf<0.05 as criterion is shown in the following:
GENE METHOD GENE_PVALUE MAF_CUTOFF ACTUAL_CUTOFF VARS MAFS EFFSIZES PVALUES
PCSK9 BURDEN_0.050 7.54587e-11 0.05 0.05 1:55505647:G:T 0.0396631 -0.442192 2.10159e-46
PCSK9 BURDEN_0.050 7.54587e-11 0.05 0.05 1:55518371:G:A 0.0237138 0.0548733 0.430246
PCSK9 BURDEN_0.050 7.54587e-11 0.05 0.05 1:55529187:G:A 0.0433324 0.0946321 0.00129942
APOE BURDEN_0.050 2.83457e-72 0.05 0.05 19:45412079:C:T 0.0413056 -0.554561 2.83457e-72
- According to the example above, PCSK9 had a p-value of 7.54587e-11 from the gene-based burden test, where three variants from this gene were included. Another hit from this meta analysis is APOE, where only one variant was included in the burden test.
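Because each row of the tabulated file is a single variant, per-gene summaries are easy to derive. The sketch below is one illustrative use, assuming whitespace-delimited columns as in the example: it finds the variant with the smallest single variant p-value in each gene. The function name is hypothetical.

```python
def top_variant_per_gene(lines):
    """Map each gene to (variant, p-value) for its most significant variant."""
    header = lines[0].split()
    best = {}
    for line in lines[1:]:
        if not line.strip():
            continue
        row = dict(zip(header, line.split()))
        gene, pvalue = row["GENE"], float(row["PVALUES"])
        if gene not in best or pvalue < best[gene][1]:
            best[gene] = (row["VARS"], pvalue)
    return best

hits = [
    "GENE METHOD GENE_PVALUE MAF_CUTOFF ACTUAL_CUTOFF VARS MAFS EFFSIZES PVALUES",
    "PCSK9 BURDEN_0.050 7.54587e-11 0.05 0.05 1:55505647:G:T 0.0396631 -0.442192 2.10159e-46",
    "PCSK9 BURDEN_0.050 7.54587e-11 0.05 0.05 1:55518371:G:A 0.0237138 0.0548733 0.430246",
    "PCSK9 BURDEN_0.050 7.54587e-11 0.05 0.05 1:55529187:G:A 0.0433324 0.0946321 0.00129942",
    "APOE BURDEN_0.050 2.83457e-72 0.05 0.05 19:45412079:C:T 0.0413056 -0.554561 2.83457e-72",
]
best = top_variant_per_gene(hits)
print(best)
```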
- Rare Metal automatically generates a log file that records the parameters in effect. An example is shown below:
The following parameters are in effect:
List of Studies:
============================
--studyName [studyName.SardiNia]
Grouping Methods:
============================
--groupFile
--annotatedVcf [../../groupvcf/bin/debug/nonsynonymous.vcf]
--annotation
--writeVcf [OFF]
QC Options:
============================
--hwe
--callRate
Association Methods:
============================
--burden [true]
--MB [false]
--SKAT [false]
--VT [false]
Other Options:
============================
--prefix [test]
--maf [0.05]
--longOutput [false]
--tabulateHits [false]
--hitsCutoff [1e-06]
- Here is an example command line to do single variant meta analysis only:
./raremetal --studyName your.studyName.file --prefix yourPrefix
- When you want to do all burden tests using a group file to specify which variants to group:
./raremetal --studyName your.studyName.file --groupFile your.groupfile --burden --MB --SKAT --VT --maf 0.01 --prefix yourPrefix
(NOTE: this will generate single variant meta analysis result and the short format output for burden test results.)
- Here is how to do SKAT meta-analysis using a group file and request long format output together with tabulated hits:
./raremetal --studyName your.studyName.file --groupFile your.groupfile --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --prefix yourPrefix
- Here is an example of adding QC filters to variants when doing meta analysis.
./raremetal --studyName your.studyName.file --groupFile your.groupfile --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --hwe 1e-06 --callRate 0.98 --prefix yourPrefix
- Here is how to do the same thing but reading grouping information from an annotated VCF file:
./raremetal --studyName your.studyName.file --annotatedVcf your.annotated.vcf --annotation nonsyn/stop/splicing --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --hwe 1e-06 --callRate 0.98 --prefix yourPrefix
- If you want to write a VCF file of pooled variants from all studies, annotate them using your favorite annotation program, and then come back to Rare Metal with the annotated VCF file to do burden tests:
First, use the following command to write the VCF file:
./raremetal --studyName your.studyName.file --writeVcf --prefix yourPrefix
Second, annotate the VCF file using your favorite annotation program. (The annotated VCF file has to follow the format described here: [annotated VCF format])
Third, use the following command to do meta analysis:
./raremetal --studyName your.studyName.file --annotatedVcf your.annotated.vcf --annotation nonsyn/splicing/stop --burden --MB --SKAT --VT --maf 0.01 --prefix yourPrefix
- For a comprehensive tutorial of RareMetalWorker and RareMETAL using example data sets, please go to the following:
- Version 0.0.1 released to U of M CSG group. (2/13/2013)
- Version 0.0.1 released to public. (2/24/2013)
- Version 0.1.0 released to public after fixing a few bugs, adding conditional analysis and automatic graphing to the tool. (8/5/2013)