Open main menu

Genome Analysis Wiki β

Difference between revisions of "RAREMETAL Documentation"

(Created page with ''''Rare-Metal-Worker''' is a tool for generating summary statistics for rare variants and gene level meta analyses using Rare-Metal. It handles both related individuals and unrel…')
 
(Update contact address)
 
(442 intermediate revisions by 6 users not shown)
Line 1: Line 1:
'''Rare-Metal-Worker''' is a tool for generating summary statistics for rare variants and gene level meta analyses using Rare-Metal. It handles both related individuals and unrelated individuals.
+
[[Category:RAREMETAL]]
 +
== Useful Wiki Pages ==
  
If you have any questions, please contact: sfengsph at umich dot edu
+
* Git hub page: https://github.com/statgen/Raremetal
  
== Change Log ==
+
* The [[RAREMETAL_Change_Log | Change Log]]
* Version 0.0.1 was released on 11/13/2012.
+
 
* Modified Rare-Metal-Worker to let it output LD matrix by a sliding window. (11/14/2012)
+
* The [[RAREMETAL_DOWNLOAD_%26_BUILD | DOWNLOAD page]]
* Uploaded to public wiki. (11/16/2012)
+
 
* Enabled writing log file by defalut. (11/18/2012)
+
* The [[Tutorial:_RAREMETAL|RAREMETAL Quick Start Tutorial]]
* Forced sample IDs to be matched when reading in kinship from a file. Perform a sanity check before reading in kinship file. If a sample of interest is not included in kinship file, then fatal error will occur. (11/19/2012)
+
* The [[RAREMETAL METHOD]]
* Added HWE pvalue and call rate in summary statistics output. (11/27/2012)
+
 
* Bugs fixed to solve compiling errors on some machines (Thank you Mary Kate!). Version 0.0.2 released. (11/30/2012)
+
* The [[RAREMETAL FAQ]]
* Updated output format. Version 0.0.3 released. (12/3/2012)
+
 
* More messages coded into log file. (12/4/2012)
+
* The [[RAREMETAL Command Reference]]
* Version 0.0.4 released. (12/5/2012)
+
 
* Bug fixed for empirical kinship calculation when genotypes are read from VCF file. Version 0.0.5 released. (12/6/2012)
+
* The [[RAREMETALWORKER|RAREMETALWORKER documentation]]
* Version 0.0.6 released. (12/6/2012)
+
 
* Updated output format for monomorphic sites. (12/7/2012)
+
The [http://genome.sph.umich.edu/wiki/Rvtests '''rvtests'''] tool for rare-variant association analysis can also generate output compatible with RAREMETAL.
* Changed executable name into bin/raremetalworker. Version 0.0.7 released. (12/10/2012)
+
 
* Fixed a bug when reading genotypes from vcf file. (2/5/2013)
+
== Brief Description ==
 +
 
 +
'''RAREMETAL''' is a computationally efficient tool for meta-analysis of rare variants using sequencing or genotyping array data. It takes summary statistics and LD matrices generated by [[Rare-Metal-Worker|'''RAREMETALWORKER''']] or [http://genome.sph.umich.edu/wiki/Rvtests '''rvtests'''], handles related and unrelated individuals, and supports both single variant and burden meta-analysis. It generates high quality plots by default and has options that allow users to build reports at different levels.
 +
 
 +
'''RAREMETAL''' is developed by Shuang Feng, Dajiang Liu and Gonçalo Abecasis. A R-package written by Dajiang Liu using the same methodology is [[RareMetals|'''available''']].
  
 
== Key Features ==
 
== Key Features ==
Rare-Metal-Worker has the following features:
+
'''RAREMETAL''' has the following features:
* Takes genotypes from either PED file or VCF file.
+
* Performs gene-based or region-based meta analysis using Burden tests with the following methods: CMC_counts, Madsen-Browning, SKAT, and Variable Threshold.
* Generates summary statistics for both related and unrelated individuals.
+
* Performs single variant metal-analysis by default.
* Generates linkage disequilibrium matrices summarizing covariance between single marker statistics using an adjustable sliding window.  
+
* Allows customized groups of variants to be tested.
* Optionally handles related individuals using a kinship matrix derived from either pedigree or genotype data.
+
* Allows conditional analysis to be performed in both gene-level meta-analysis and single variants meta-analysis.
* Has the option of fitting shared environment.
+
* Generate QQ plots and manhattan plots by default.
* Can handle variants on Chromosome X.
+
 
 +
== Approach ==
 +
 
 +
The key idea behind meta-analysis with RAREMETAL is that various gene-level test statistics can be reconstructed from single variant score statistics and that, when the linkage disequilibrium relationships between variants are known, the distribution of these gene-level statistics can be derived and used to evaluate signifi-cance. Single variant statistics are calculated using the Cochran-Mantel-Haenszel method. Our method has been published in [http://www.nature.com/ng/journal/v46/n2/abs/ng.2852.html '''Liu et. al'''] in Nature Genetics. Please go to [http://genome.sph.umich.edu/wiki/RAREMETAL_method '''method'''] for details.
 +
 
 +
== Download and Installation ==
 +
 
 +
We have tested compilation using our source code on several platforms including Linux, and Mac OS X.
 +
 
 +
For source code and executables together with instructions of building from source, please go to [[RAREMETAL_DOWNLOAD_%26_BUILD |'''DOWNLOAD source and executables''']].
 +
 
 +
For questions about compilation, please go to [[RAREMETAL_FAQ | '''FAQ''']].
 +
 
 +
== Basic Usage Instructions ==
 +
 
 +
'''RAREMETAL''' is a command line tool. It is typically run from a Linux or Unix prompt by invoking the command <code>raremetal</code>. In the following are descriptions of basic usage for meta analysis. A detailed [[Tutorial:_RareMETAL|'''TUTORIAL''']] with toy data are also available.
 +
 
 +
==== Prepare Input Files====
 +
'''RAREMETAL''' requires the following basic input files: summary statistics and covariance matrices of score statistics generated by '''RAREMETALWORKER''' or [http://genome.sph.umich.edu/wiki/Rvtests '''rvtests'''], a file with list of studies to be included and a group file if gene-level meta-analysis is expected.  
 +
 
 +
 
 +
=====Summary Statistics=====
 +
Files containing summary statistics and LD matrices generated by '''RAREMETALWORKER''' should be compressed and [http://samtools.sourceforge.net/tabix.shtml '''tabix'''] indexed using the following commands (Note in RAREMETALWORKER, if --zip is specified, these .gz and .tbi files will be automatically generated):
 +
 
 +
bgzip study1.singlevar.score.txt
 +
tabix -s 1 -b 2 -e 2 -c "#" study1.singlevar.score.txt.gz
 +
bgzip study1.singlevar.cov.txt
 +
tabix -s 1 -b 2 -e 2 -c "#" study1.singlevar.cov.txt.gz
 +
 
 +
Files containing summary statistics and LD matrices generated by '''rvtests''' should be compressed and [http://samtools.sourceforge.net/tabix.shtml '''tabix'''] indexed using the following commands:
  
== Software Download and Installation ==
+
bgzip study1.MetaScore.assoc
 +
tabix -s 1 -b 2 -e 2 -S 1 study1.MetaScore.assoc.gz
 +
tabix -s 1 -b 2 -e 2 -S 1 study1.MetaCov.assoc.gz
  
=== Where to Download ===
+
=====List of Studies=====
 +
* --summaryFiles option is crucial for '''RAREMETAL''' to work. Ignoring this option would lead to FATAL ERROR and '''RAREMETAL''' would stop.
 +
* The file should contain the path and prefix of the studies you want to include.
 +
* If there is one or more studies that you want to excluded from your list, but want to save some effort of generating a new file, you can put a "#" in front of the line of record. '''RAREMETAL''' would automatically exclude that study from meta analysis. An example list of summary file is in the following:
  
* The source package for Linux and Mac can be downloaded here: [[Media:RareMetalWorker.0.0.8.tgz|software package download]]
+
  /net/fantasia/home/sfengsph/prj/raremetal/raremetal/bin/META/TwinsUK/TwinsUK.TG.singlevar.score.txt.gz
* Save it to your local path and decompress using the following command:
+
   #/net/fantasia/home/sfengsph/prj/raremetal/raremetal/bin/META/HUNT/RareMetalWorker/HUNT_MI_case.TG.singlevar.score.txt.gz
   tar xvzf RareMetalWorker.0.0.8.tgz
 
* For UM CSG cluster users, no installation is needed. It is available at /net/fantasia/home/sfengsph/code/Rare-Metal/RareMetalWorker/bin/raremetalworker
 
  
=== How to Compile ===
+
* When gene-level analysis is requested, --covFiles option should be used to specify the covariance files. An example file is:
  
* Go to /RareMetalWorker_0.0.8/RareMetalWorker/src and use the following command:
+
  /net/fantasia/home/sfengsph/prj/raremetal/raremetal/bin/META/TwinsUK/TwinsUK.TG.singlevar.cov.txt.gz
+
  #/net/fantasia/home/sfengsph/prj/raremetal/raremetal/bin/META/HUNT/RareMetalWorker/HUNT_MI_case.TG.singlevar.cov.txt.gz
  make all
 
  
=== How to Execute ===
+
* The above example study name file guides '''RAREMETAL''' to look for summary statistics from TwinsUK study only, because "HUNT" study is commented out. The following two files are needed for '''RAREMETAL''' to perform further analysis together with their tabix index file are needed.
  
* To execute the program, go to /RareMetalWorker_0.0.6/RareMetalWorker/bin, then the program can be executed by ./Rare-Metal-Worker.
+
* Please sepcify --dosage option if input files were generated from dosage instead of genotype.
* An example command line for a related sample when you have genotype info saved in VCF file is as following:
 
  ./raremetalworker --ped your.pheno.ped --dat your.pheno.dat --vcf your.geno.vcf.gz --useCovariates --inverseNormal --prefix your.study
 
* An example command line for a related sample when you have genotype info saved in PED/DAT file is as following:
 
  ./raremetalworker --ped your.ped --dat your.dat --useCovariates --inverseNormal --prefix your.study
 
* An example command line for an unrelated sample when you have genotype info saved in PED/DAT file is as following:
 
  ./raremetalworker --ped your.ped --dat your.dat --useCovariates --inverseNormal --prefix your.study
 
* An example command line for an unrelated sample when you have genotype info saved in VCF file is as following:
 
  ./raremetalworker --ped your.pheno.ped --dat your.pheno.dat --vcf your.geno.vcf.gz --useCovariates --inverseNormal --prefix your.study
 
* An example command line to use when you have genotype info saved in VCF file and you want to adjust covariates first and then inverse normalize residuals is as following:
 
  ./raremetalworker --ped your.pheno.ped --dat your.pheno.dat --vcf your.geno.vcf.gz --makeResiduals --useCovariates --inverseNormal --prefix your.study
 
* For more examples, please go to [[Examples]].
 
  
== Software Specifications ==
+
=====Group Rare Variants=====
  
=== Input Files ===
+
====== From a Group File ======
Rare-Metal-Worker needs the following files as input: PED and DAT file in Merlin format, '''AND/OR''' a VCF file. When genotypes are stored in PED and DAT file, the VCF file is not needed. However, even if genotypes are saved in a VCF file, PED and DAT files are still needed for carrying covariate and trait information.
+
* Grouping methods are only necessary when doing gene-based or group-based burden tests in meta-analysis.
 +
* If none of the grouping method is specified, then only single variant meta-analysis will be performed.
 +
* With --groupFile option, you can specify particular set of variants to be grouped for burden tests.
 +
* The group file must be a tab or space delimited file in the following format:
 +
  GROUP_ID MARKER1_ID MARKER2_ID MARKER3_ID ...  
 +
* MARKER_ID must be in the following format:
 +
  CHR:POS:REF:ALT
 +
* An example group file is:
 +
  PLEKHN1 1:901922:G:A    1:901923:C:A    1:902088:G:A    1:902128:C:T    1:902133:C:G    1:902176:C:T    1:905669:C:G       
 +
  HES4    1:934735:A:C    1:934770:G:A    1:934801:C:T    1:935085:G:A    1:935089:C:G
 +
  ISG15  1:949422:G:A    1:949491:G:A    1:949502:C:T    1:949608:G:A    1:949802:G:A    1:949832:G:A
 +
  AGRN    1:970687:C:T    1:976963:A:G    1:977028:G:T    1:977356:C:T    1:977396:G:A    1:978628:C:T    1:978645:G:A           
 +
  C1orf159        1:1021285:G:T  1:1021302:T:C  1:1021315:A:C  1:1021386:G:A  1:1022534:C:T  1:1025751:C:T  1:1026913:C:T
  
==== PED and DAT Files ====
+
====== From an Annotated VCF File ======
* When PED file has genotypes saved, there is no need for a VCF file as input.
+
If --groupFile option is '''NOT''' specified, '''RAREMETAL''' will look for an annotated vcf file as blue print for variants to group. Users are also allowed to generate a vcf file based on the superset of variants from pooled samples, and annotate outside RAREMETAL. Then, annotated vcf file can be used as input for RAREMETAL for gene-level meta-analysis, or group files can be generated based on the annotated vcf file. Detailed description of these options are [[Rare-Metal#Group_Rare_Variants_from_Annotated_VCF|'''available''']]. There are also [[Rare-Metal#Example_Command_lines|'''examples''']] of this usage at the bottom of this page.
* Rare-Metal-Worker takes PED/DAT file in Merlin format. Please refer to [[http://www.sph.umich.edu/csg/abecasis/merlin/tour/input_files.html PED/DAT format description]] for details.
 
* An example PED file is in the following:
 
    1 1 0 0 1 1.5 1 23 A A A A A A A A A A
 
    2 1 0 0 1 1.0 1 34 A C A C A C A C A C
 
    3 1 0 0 2 0.4 1 43 A A A A A A A A A A
 
    4 1 0 0 2 0.9 1 13 A C A C A C A C A C
 
* The matching DAT file is in the following:
 
  T YourTraitName
 
  C SEX
 
  C AGE
 
  M 1:123456
 
  M 1:234567
 
  M 2:111111
 
  M 2:222222
 
  M X:12345
 
* DAT file must have variant names in the following format "M chr:pos".
 
* Orders of labels in DAT file have to match the order of fields in PED file.
 
* '''Markers in PED and DAT file must be sorted by chromosome and position.'''
 
  
* Covariate and trait values are saved in PED file. Covariate and trait descriptions are saved in DAT file.
+
==== QC Options ====
 +
* '''RAREMETAL''' allows filtering of variants from individual studies by their HWE pvalue and call rate, which are generated as part of the output from '''RAREMETALWORKER''' or [http://genome.sph.umich.edu/wiki/Rvtests '''rvtests'''].
 +
* To filter by HWE p-values, --hwe option should be used. The default is 0.0, which means not filtering any of the variants.
 +
* To filter by call rate, --callRate option can be specified. The default is 0.0, which allows no filtering utilized.
  
==== VCF File ====
+
==== Association Options====
* Another option is to use VCF as input. Please refer to the following link for VCF file specification: [[http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 1000 genome wiki VCF specs]]
+
* Currently, CMC type burden test, Madsen-Browning burden test, Variable Threshold burden test and SKAT are provided in '''RAREMETAL''', by specifying --burden, --MB, --VT and --SKAT.
* VCF file should be compressed by bgzip and indexed by tabix, using the following command:
+
* --maf specifies the minor allele frequency cutoff when doing gene-based or group-based burden tests. Variants with maf '''above''' this threshold will be ignored. The default is maf<0.05.
  bgzip input.vcf    ## this command will produce input.vcf.gz
+
* In '''a single study''' of sample size N, if a site is monomorphic or not reported in vcf/ped, it is considered that the sample size of this study is not large enough to sample the rare allele. Thus, this study contributes 2*N reference alleles and 0 alternative allele towards meta-analysis. To let such studies contribute no alleles towards pooled allele frequency, specify --altMAF.
  tabix -pvcf -f input.vcf.gz  ## this command will produce input.vcf.gz.tbi
 
* Even with the presence of VCF file, PED/DAT files are still needed for covariates and phenotypes.
 
  
=== Software Options ===
+
==== Conditional Analysis====
The following options are currently available in Rare-Metal-Worker:
+
* To decide whether a signal is caused by shadowing a significant common variant nearby, '''RAREMETAL''' also enables conditional analysis with a list of variants to be conditioned upon provided in a file as input for --condition option. An example input file should be space or tab delimited as in the following. When alleles do not match the ref and alt alleles from samples, the variant will be skipped from conditional analysis.
  
    Options:
+
1:861349:C:T 1:905901:G:A 20:986998:G:C 22:3670691:A:G
      Input Files : --ped [Exomechip.pheno.ped], --dat [Exomechip.pheno.dat], --vcf []
 
      Output Files : --prefix [], --LDwindow [1000000]
 
        VC Options : --vcShared, --vcX, --useCovariates [ON]
 
    Trait Options : --makeResiduals, --inverseNormal [ON], --traitName []
 
    Kinship Source : --kinPedigree [ON], --kinGeno, --kinFile [], --kinSave
 
  Kinship Options : --kinMaf [0.05], --kinMiss [0.05]
 
      Chromosome X : --xLabel [X], --xStart [2699520], --xEnd [154931044]
 
  
 +
== Additional Analysis Options ==
  
==== Input Files ====
+
=== Generate a VCF File to Annotate Outside RAREMETAL ===
* When genotypes are saved in a VCF file, PED and DAT files are used for specifying pedigree structure, covariate and trait information. An example command line might look like this:
+
* --writeVCF allows user to write a VCF file including pooled single variants from all studies. Then users can use their favorite annotation tool to annotate the VCF file. After annotating the VCF file, users can use that file as input for '''RAREMETAL''' for further gene-based or region-based meta analysis.
   --ped input.ped --dat input.dat --vcf input.vcf.gz
+
* The output vcf file will be name as: yourPrefix.pooled.variants.vcf. An example output vcf file is in the following:
* When genotypes are saved in the PED file, the VCF file is not needed. An example command line might look like this:
+
   #CHROM    POS    ID      REF    ALT    QUAL    FILTER  INFO
  --ped input.ped --dat input.dat
+
  1      115658497      115658497      G      A      .      .      ALT_AF=0.380906;
 +
  2      74688884        74688884        G      A      .      .      ALT_AF=8.33611e-05;
 +
  3      121414217      121414217      C      A      .       .       ALT_AF=0.0747833;
 +
===Annotation===
 +
* RAREMETAL automatically recognizes the annotation format generated by [[TabAnno | '''ANNO''']] or [[EPACTS#Annotating_VCF_file_using_EPACTS | '''EPACTS''']].
 +
* To annotate a the VCF generated in previous step, you can use the following command:
 +
./anno --in your.in.vcf.gz --out your.out.vcf.gz
  
==== Output Files ====
+
=== Group Rare Variants from Annotated VCF ===
* --prefix is optional.
+
* If --groupFile option is '''NOT''' specified, '''RAREMETAL''' will look for an annotated vcf file as blue print for variants to group.
* If --prefix is not specified, the output file names will be:
+
* The annotated VCF file should be specified using --annotatedVcf option.  
  traitname.singlevar.score.txt
+
* --annotation should be used with --annotatedVcf together when specific category of functional variants are of interest to be grouped. For example, if grouping nonsynonymous and splicing variants are of interests, the following should be included in command line:
  traitname.singlevar.cov.txt
+
* (! only available after v4.13.8) when --annotation is not specified, raremetal groups all non-intergenic variants.
* If --prefix prefix is specified, then the output file names are:
 
  prefix.traitname.singlevar.score.txt
 
  prefix.traitname.singlevar.cov.txt
 
* --LDwindow specifies the length of the window that LD Matrix should be generated upon each variant. The default is 1MB.
 
  
==== VC Options ====
+
  --annotatedVcf your.annotated.vcf --annotation nonsyn/splicing
* When --vcShared and --vcX are specified, Rare-Metal-Worker knows that you want to fit shared environment and/or chromosome X variance component together with genetic component and non-shared environment.
+
  Note: this allows you to group variants that are annotated starting with nonsyn or splicing (not case-sensitive).
* When --useCovariates is specified, Rare-Metal-Worker understands covariates should be read from PED file. Covariates are modeled as fixed effects.
 
==== Trait Options ====
 
* --makeResiduals can be combined with --useCovariates to generate residuals from a simple linear regressions before analysis. If the --inverseNormal option is also used, then the residuals will be quantile normalized before fitting variance component model.
 
** An example Command line requesting pre-adjustment for covariates before fitting a variance component follows:
 
  --useCovariates --makeResiduals --inverseNormal
 
** An example command line requesting joint modeling of fixed effects and variance components follows:
 
  --useCovariates --inverseNormal
 
* If --inverseNormal is used WITHOUT --makeResiduals, then trait values are inverse normalized before any model fitting.
 
* --traitName is created for situations when you have many traits saved in your PED and DAT file, but you are interested in one or a few of them. It can read a file ending with .txt with each trait of interest in a separate line, or trait names separated with "/". An example to handle one trait or multiple traits is in the following:
 
  --traitName LDL
 
  --traitName LDL/HDL/TG
 
  --traitName traitsOfInterest.txt
 
* If --traitName is not used, all traits in PED/DAT file will be analyzed.
 
  
==== Kinship Source ====
+
* Special format for the annotated VCF file is required: all annotation information should be coded in INFO field in VCF file, starting with the key "ANNO=". An example annotated VCF file is in the following:
* --kinPedigree allows Rare-Metal-Worker to generate kinship matrix from pedigree, when pedigree information is available. This option is on by default.
 
* --kinGeno informs Rare-Metal-Worker to generate kinship matrix from all available variants that pass the criteria, specified in --kinMaf and --kinMiss options. The default will take variants with MAF>0.05 and genotype missing rate <0.05.
 
* --kinFile let Rare-Metal-Worker read in a kinship matrix from a file. The first row of the kinship file has to be the sample IDs included in the kinship file. If a sample of interest is not included in the kinship file, fatal error will occur and the program will be terminated. A sample of interest is a sample that is phenotyped and has all covariates measured when --useCovariates is specified.
 
* --kinSave allows you to save the kinship matrix.
 
  
==== Kinship Options ====
+
  #CHROM    POS    ID      REF    ALT    QUAL    FILTER  INFO
* --kinMiss and --kinMaf should be used with --kinGeno together.  
+
  1      19208194        .      G      A      100    PASS     
* --kinMiss specifies the maximum genotype missing rate when calculating kinship from genotypes. The default is 0.05.
+
  AC=3;'''ANNO='''nonsynonymous:ALDH4A1:NM_170726:exon8:c.C866T:p.P289L,ALDH4A1:NM_001161504:exon8:c.C686T:p.P229L,ALDH4A1:NM_003748:exon8:c.C866T:p.P289L,;
* --kinMaf specifies the minimum minor allele frequency used when calculating kinship from genotypes. The default is 0.05.
+
  '''ANNO='''splicing:ALDH4A1
 +
  1      19208293        .      G      C      100    PASS    AC=7;STUDIES=5;MAC=7;MAF=0.001;DESIGN=TBD_ASSAY;DSCORE=1.00;
 +
  '''ANNO='''nonsynonymous:ALDH4A1:NM_170726:exon8:c.C767G:p.P256R,ALDH4A1:NM_001161504:exon8:c.C587G:p.P196R,ALDH4A1:NM_003748:exon8:c.C767G:p.P256R,
  
==== Chromosome X ====
+
* Notice that each variant is allowed to have more than one annotations; but each annotation should start with a new key "ANNO=" followed by annotation:genename:other transcript information.
* --xLabel should have a value of a string which specifies how variants on chromosome X are coded. The default is "X".
+
* Generated group file will be named test.groupfile under your running directory.
* --xStart and --xEnd specifies the start and end of non-pseudo-autosomal regions on chromosome X. These options should be specified when --vcX is used.
 
* The default for --xStart is 2699520 and default for --xEnd is 154931044, according to NCBI genome build 37.
 
  
=== Handling Unrelated Individuals ===
+
===Options for Report Generation===  
* To let Rare-Metal-Worker handle unrelated individuals, we just have to code the individuals as unrelated in PED file, or each individual belongs to a unique family. Then Rare-Metal-Worker will take care of the rest.
+
* --correctGC generates QQ plots and manhattan plots with pvalues corrected using genomic control.
* However, when --kinGenotype is also used, Rare-Metal-Worker will consider them as related and generate kinship matrix from genotypes.
+
* --prefix allows customized prefix for output files.
* An example is shown as following (header is included for illustration purpose, not in real PED file):
+
* --longOutput allows users to output not only burden test results but also the single variant results (allele frequencies, effect sizes, and p-values) for the variants being grouped together. Please refer to the output files section for detailed explanation and examples.
 +
* --tabulateHits works with --hitsCutoff together to generate reports for genes that have p-value less than specified cutoff from burden tests or SKAT. The default cutoff of p-value for genes to be reported is 1.0e-06, which can be specified by --hitsCutoff option. For more explanations and examples, please go to [[Rare-Metal#TABULATED_HITS| Tabulated Hits]].
  
  famid pid fid mid sex age trait
+
===Miscellaneous Options===
  1    1.1  0  0  1  10  -0.3
+
* --tabix allows rapid analysis when number of groups/genes of interests are small. Currently, when number of groups is less than 100, --tabix option is automatically turned on.
  2    2.1  0  0  1  56  0.0
 
  3    3.1  0  0  2  31  0.4
 
  4    4.1  0  0  2  23  0.008
 
  5    5.1  0  0  2  34  2.35
 
  
== Output Formats ==
+
== Reports Generated by RAREMETAL ==
 +
=== Single Variant Meta Analysis Output ===
  
* There are three files generated automatically by default:
+
==== TABLES ====
  
  prefix.traitName.singlevar.score.txt
+
* Single variant meta analysis output has the following components: header, results and footnote.  
  prefix.traitName.singlevar.cov.txt
+
* Header lines start with "##" shows summary of the meta analysis including method used, number of studies, and total sample size.  
  prefix.singlevar.log
+
* Header line starts with "#" are column headers for results table.
+
* Footnote also starts with "#", where genomic controls from each study and the overall sample are reported.
* prefix.traitName.singlevar.score.txt contains summary statistics that are needed by Rare-Metal. An example is shown in below:
+
* An example single variant meta analysis output is shown below:
  
   LDL mean= -0.00, variance= 1.00, heritability= 34.30
+
   ##Method=SinglevarScore
   CHR     POS     REF_ALLELE      ALT_ALLELE      INFORMATIVE_N   FOUNDER_AF      ALL_AF  INFORMATIVE_AC  HWE_PVALUE      STAT    ALT_ALLELE_EFFSIZE     PVALUE
+
  ##STUDY_NUM=2
   10  45410002        G      A      6103    0.0341589      0.0341589      410    0.165893       126.205 0.309798        4.03074e-10
+
  ##TotalSampleSize=14308
   19  45412079        G       A      6103    0.0368124       0.0368124       434    0.714645        -265.84 -0.587356      7.87851e-36
+
   #CHROM  POS     REF    ALT     POOLED_ALT_AF   EFFECT_SIZE    DIRECTION_BY_STUDY     PVALUE
   19  45414451        G      A       6103    0.444989        0.444989       5312    0.0759271      -26.1212        -0.00837122    0.640058
+
   1      115658497      G      A      0.380906        0.00954332      ++      0.45828
 +
  2      74688884       G      A      8.33611e-05    -0.196387      -!      0.845372
 +
   3      121414217      C       A      0.0747833       0.0216982       -+      0.34453
 +
   6      137245814      G      C       0.000803746    0.105693       ++      0.601805
 +
* A detailed explanation of each column is in the following:
  
* pvalues from the above output are from the family-based single variant score test.
+
  CHROM:              Chromosome Name
 +
  POS:                Variant Position
 +
  REF:                Reference Allele Label
 +
  ALT:                Alternative Allele Label
 +
  POOLED_ALT_AF:      Pooled Alternative Allele Frequency
 +
  EFFECT_SIZE:        Alternative Allele Effect Size
 +
  DIRECTION_BY_STUDY: Effect size direction of alternative allele from each study.
 +
                      The order of study is consistent with the order of studies listed in the input file for option --summaryFiles.
 +
                      "?" means the variant is not observed or monomorphic from the study.
 +
                      "!" means the variant observed from this study has different alleles from those in the first study.
  
* prefix.traitName.singlevar.cov.txt contains the LD matrix among a variant and the adjacent markers within a prefixed-sized window. The default window size is 1MB. It has the following format:
+
==== PLOTS====
    
+
 
   CHR   POS        VAR_POS_IN_WINDOW                            LD_MATRIX
+
'''RAREMETAL''' generates QQ plots and manhattan plots from single variant meta-analysis by default. Three QQ plots are generated, one with all variants included, one of variants with maf<0.05 and one of variants with maf<0.01. All plots are saved in a pdf file named yourPrefix.meta.plots.pdf. Genomic controls are also reported in the title of plots. When --correctGC option is specified, GC corrected plots are also generated.
   1   762320    762320,865628,865665,878744,879381,1560000    0.0359084,-0.000242112,-0.00125797,-0.000993422,-0.000344509,-0.00017077,
+
{| border="1" cellpadding="5" cellspacing="0" align="center"
  1  865628    865628,865665,878744,879381,1560000,1864659   0.419804,-0.0103663,-0.00635265,0.0594056,0.0534505,-0.00462183,
+
|-
   1  878744    878744,879381,1560000,1864659,1877659        0.000404537,-0.000235215,-1.4455e-05,-8.69137e-06,-3.1027e-05,
+
| align="center" width="100" | [[File:QQ.png]]
 +
|-
 +
| align="center" width="200" | [[File:Single_var_manhattan.png]]
 +
|}
 +
 
 +
=== Gene-level Tests Meta-Analysis Output ===
 +
 
 +
 
 +
==== LONG TABLES ====
 +
When --longOutput is used, output includes both burden test results of genes and single variant results of the variants included in burden tests. Here is an example of output file from SKAT when --longOutput is specified.  
 +
   ##Method=Burden
 +
  ##STUDY_NUM=2
 +
  ##TotalSampleSize=14308
 +
   #GROUPNAME      NUM_VAR VARs    MAFs   SINGLEVAR_EFFECTs      SINGLEVAR_PVALUEs      AVG_AF  MIN_AF  MAX_AF  EFFECT_SIZE    PVALUE
 +
   NOC2L  7      1:880502:C:T;1:881918:G:A;1:887799:C:T;1:888659:T:C;1:889238:G:A;1:891591:C:T;1:892380:G:A        0.000166722,0.0242172,0.0109203,0.0355845,0.0333729,0.00700233,0.00200067      -0.183575,-0.00228307,-0.0598337,0.0220595,0.0229464,-0.0302768,-0.0200417      0.790161,0.953446,0.515806,0.503548,0.499251,0.791773,0.926625  0.0161807      0.000166722    0.0355845      0.00667875      0.662531
 +
   KLHL17  2      1:897285:A:G;1:898869:C:T      0.0148408,0.00108369    -0.0502034,-0.0256403  0.528269,0.934606      0.00796222      0.00108369      0.0148408      -0.0484494      0.528878
 +
 
 +
==== SHORT TABLES ====
 +
Otherwise, single variant results of variants included in burden tests will not be included in the output. Here is an example of output file from SKAT when --longOutput is not specified.
 +
 
 +
  ##Method=Burden
 +
  ##STUDY_NUM=2
 +
  ##TotalSampleSize=14308
 +
  #GROUPNAME      NUM_VAR VARs    AVG_AF  MIN_AF  MAX_AF  EFFECT_SIZE    PVALUE
 +
   NOC2L  7      1:880502:C:T;1:881918:G:A;1:887799:C:T;1:888659:T:C;1:889238:G:A;1:891591:C:T;1:892380:G:A      0.0161807      0.000166722    0.0355845      0.00667875      0.662531
 +
   KLHL17  2      1:897285:A:G;1:898869:C:T      0.00796222      0.00108369      0.0148408      -0.0484494      0.528878
 +
 
 +
==== TABULATED HITS ====
 +
* When --tabulateHits is specified, top hits from Burden tests will be generated. Each method will have an individual tabulated file generated. The purpose of this tabulated file is to list burden test results of top hits together with single variant results from variants being grouped in burden tests. The difference between this file and the standard long-format output file from burden test is that each row of the file represents a single variant that is included in the gene for burden test. This format allows each sorting on users end.
 +
 
 +
* Tabulated top hits are saved in the file:
 +
  yourPrefix.meta.tophits.youMethod.tbl (example files names: TG.meta.tophits.burden.tbl, LDL.meta.tophits.SKAT.tbl)
 +
 
 +
* The following items are tabulated in the output:
 +
  GENE: Gene name.
 +
  METHOD: Burden test used.
 +
  GENE_PVALUE: P-value from gene-based burden tests.
 +
  MAF_CUTOFF: MAF cutoff used when doing gene-based tests.
 +
  ACTUAL_CUTOFF: Actual MAF cutoff used. (This will be different from MAF_CUTOFF only for Variable Threshold method.
 +
                Otherwise, it will be the same as MAF_CUTOFF.)
 +
  VAR: Variant name in CHR:POS:REF:ALT format.
 +
  MAF: Single variant pooled MAF from all samples.
 +
  EFFSIZE: Effect size from single variant meta analysis.
 +
  PVALUE: Pvalue from single variant meta analysis.
 +
 
 +
* An example of tabulated hits from a standard burden test with maf<0.05 as criterion is shown in the following:
 +
 
 +
  GENE    METHOD  GENE_PVALUE    MAF_CUTOFF      ACTUAL_CUTOFF  VARS    MAFS    EFFSIZES        PVALUES
 +
  PCSK9  BURDEN_0.050    7.54587e-11    0.05    0.05    1:55505647:G:T  0.0396631      -0.442192      2.10159e-46
 +
  PCSK9  BURDEN_0.050    7.54587e-11    0.05    0.05    1:55518371:G:A  0.0237138      0.0548733      0.430246
 +
  PCSK9  BURDEN_0.050    7.54587e-11    0.05    0.05    1:55529187:G:A  0.0433324      0.0946321      0.00129942
 +
  APOE    BURDEN_0.050    2.83457e-72    0.05   0.05    19:45412079:C:T 0.0413056      -0.554561      2.83457e-72
 +
 
 +
* According to the example above, PCSK9 had a p-value of 7.54587e-11 from the gene-based burden test, where three variants from this gene were included. Another hit from this meta analysis is APOE, where only one variant was included in the burden test.
 +
 
 +
==== PLOTS ====
 +
'''RAREMETAL''' generates QQ plots and manhattan plots from single variant and gene-level meta-analysis by default. Example QQ plots and manhattan plots are:
 +
{| border="1" cellpadding="5" cellspacing="0" align="center"
 +
|-
 +
| align="center" width="200" | [[File:manhattan.png]]
 +
|}
 +
 
 +
==== LOG ====
 +
 
 +
* A log file is automatically generated by '''RAREMETAL''' to save the parameters in effect. An example is in the following:
  
* An example log file is in the following:
 
  Summary statistics for trait LDL have been saved in LDL.singlevar.score.txt.
 
  LD matrices for trait LDL have been saved in LDL.singlevar.cov.txt.
 
 
 
  Rare-Metal-Worker handled all individuals as related.
 
 
 
 
   The following parameters are in effect:
 
   The following parameters are in effect:
 
    
 
    
   Input Files:
+
   List of Studies:
 
   ============================
 
   ============================
   --ped [APOE.ped]
+
   --studyName [studyName.SardiNia]
  --dat [APOE.dat]
 
  --vcf []
 
 
    
 
    
   Output Files:
+
   Grouping Methods:
 
   ============================
 
   ============================
   --prefix []
+
   --groupFile [genes.file]
   --LDwindow [1000000]
+
   --annotatedVcf []
 +
  --annotation []
 +
  --writeVcf [OFF]
 
    
 
    
   VC Options:
+
   QC Options:
 
   ============================
 
   ============================
   --vcShared [false]
+
   --hwe [0]
   --vcX [false]
+
   --callRate [0]  
  --useCovariates [false]
 
 
    
 
    
   Trait Options:
+
   Association Methods:
 
   ============================
 
   ============================
   --makeResiduals [true]
+
   --burden [true]
   --inverseNormal [true]
+
   --MB [false]
   --traitName [LDL]
+
  --SKAT [false]
 +
  --VT [false]
 +
   --condition [condition.file]
 
    
 
    
   Kinship Source:
+
   Other Options:
 
   ============================
 
   ============================
   --kinPedigree [true]
+
   --tabix [OFF]
   --kinGeno [false]
+
   --correctGC [ON]
   --kinFile []
+
   --prefix [test]
   --kinSave [false]
+
   --maf [0.05]
 
+
   --longOutput [false]
  Kinship Options:
+
   --tabulateHits [false]
  ============================
+
   --hitsCutoff [1e-06]
   --kinMaf [0.05]
+
   --dosage [false]
   --kinMiss [0.05]
+
   --altMAF [false]
    
 
  Chromosome X:
 
  ============================
 
  xLabel [X]
 
   xStart [2699520]
 
   xEnd [154931044]
 
  
== Examples ==
+
==Example Command lines==
  
=== Related individuals ===
+
* Here is an example command line to do single variant meta analysis only:
* When you have genotype stored in ped file and dat file, and want to use pedigree kinship and inverse normalize trait values before adjusting any covariates and doing analysis:
+
  ./raremetal --summaryFiles your.list.of.summary.files --prefix yourPrefix
  
   /bin/raremetalworker --ped yourInput.ped --dat yourInput.dat --traitName LDL --inverseNormal --useCovariates
+
* When you want to do all burden tests using a group file to specify which variants to group:
 +
   ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --groupFile your.groupfile --burden --MB --SKAT --VT --maf 0.01 --prefix yourPrefix
 +
  (NOTE: this will generate single variant meta analysis result and the short format output for burden test results.)
  
* When you have genotype stored in ped file and dat file, and want to use pedigree kinship and adjust covariates before inverse normalizing the residuals and doing further analysis:
+
* Here is how to do all SKAT meta analysis using a group file and request a long format output together with tabulated hits:
 
+
   ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --groupFile your.groupfile --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --prefix yourPrefix
   /bin/raremetalworker --ped yourInput.ped --dat yourInput.dat --traitName LDL --useCovariates --makeResiduals --inverseNormal
 
 
 
* When you have genotype stored in ped file and dat file, and want to use kinship generated from genotypes:
 
 
 
  /bin/raremetalworker --ped yourInput.ped --dat yourInput.dat --kinGeno --kinSave --traitName LDL (--kinSave allows you to save kinship matrix for future use; it is optional.)
 
  
* When you have genotype stored in vcf file and want to use pedigree kinship:
+
* Here is an example of adding QC filters to variants when doing meta analysis.
 +
  ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --groupFile your.groupfile --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --hwe 1e-06 --callRate 0.98 --prefix yourPrefix
  
   /bin/raremetalworker --ped yourInput.ped --dat yourInput.dat --vcf yourInput.vcf.gz
+
* Here is how to do the same thing but reading grouping information from an annotated VCF file:
 +
   ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --annotatedVcf your.annotated.vcf --annotation nonsyn/stop/splicing --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --hwe 1e-06 --callRate 0.98 --prefix yourPrefix
  
* When you have genotype stored in vcf file and want to use kinship generated from genotype:  
+
* If you want to write a VCF file of pooled variants from all studies, annotate them using your favorite annotation program, and then come back to '''RAREMETAL''' with the annotate VCF file to do burden tests:
 +
  First, use the following command to write the VCF file:
 +
  ./raremetal --summaryFiles your.list.of.summary.files --writeVcf --prefix yourPrefix
 +
  Second, annotate the VCF file using your favorite annotation program. (Annotated VCF file has to follow the format described here: [[Rare-Metal#Group_Rare_Variants_from_Annotated_VCF|annotated VCF format]])
 +
  Third, use the following command to do meta analysis:
 +
  ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --annotatedVcf your.annotated.vcf --annotation nonsyn/splicing/stop --burden --MB --SKAT --VT --maf 0.01 --prefix yourPrefix
  
  /bin/raremetalworker --ped yourInput.ped --dat yourInput.dat --vcf yourInput.vcf.gz --kinGeno --kinSave (--kinSave allows you to save kinship matrix for future use.)
+
==Other Useful Info==
  
=== Unrelated individuals ===
+
* Summary specs can be found [[Summary Files Specification for RAREMETAL]]
  
* Commands are the same as in above example, except each individual has to have a distinct family ID in PED file, and their father and mother ids should be "0".
+
==TUTORIAL==
* When you have genotypes from ped and marker information from dat file, and assuming no relatedness in the sample:
+
* For a comprehensive tutorial of RAREMETALWORKER and RAREMETAL using example data sets, please go to the following:
  
   ./raremetalworker --ped yours.ped --dat yours.dat
+
   [http://genome.sph.umich.edu/wiki/Tutorial:_RareMETAL '''RAREMETAL Tutorial''']
  
* When you have genotypes from vcf and covariates and trait information saved in ped and dat file, assuming there is no relatedness in the sample, you should use the following:
+
* For a brief tutorial of rvtests, please go to:
  
   ./raremetalworker --ped yours.ped --dat yours.dat --vcf yours.vcf.gz
+
   [http://genome.sph.umich.edu/wiki/Rvtests '''rvtests''']
  
* When you have genotypes from vcf and covariates and trait information saved in ped and dat file, assuming there is cryptic relatedness in the sample, you should use the following:
+
==CONTACT==
  
  ./raremetalworker --ped yours.ped --dat yours.dat --vcf yours.vcf.gz --kinGeno (# this will handle individuals as related, and generate kinship matrix from genotype.)
+
Please email Andy Boughton (abought at umich dot edu) for questions.
  
== Q & A ==
+
Also check  [[Raremetal Incoming updates | '''Known issues and incoming update in next version''']] to see if your problem has been reported before

Latest revision as of 13:22, 20 May 2019

Contents

Useful Wiki Pages

The rvtests tool for rare-variant association analysis can also generate output compatible with RAREMETAL.

Brief Description

RAREMETAL is a computationally efficient tool for meta-analysis of rare variants using sequencing or genotyping array data. It takes summary statistics and LD matrices generated by RAREMETALWORKER or rvtests, handles related and unrelated individuals, and supports both single variant and burden meta-analysis. It generates high quality plots by default and has options that allow users to build reports at different levels.

RAREMETAL is developed by Shuang Feng, Dajiang Liu and Gonçalo Abecasis. A R-package written by Dajiang Liu using the same methodology is available.

Key Features

RAREMETAL has the following features:

  • Performs gene-based or region-based meta analysis using Burden tests with the following methods: CMC_counts, Madsen-Browning, SKAT, and Variable Threshold.
  • Performs single variant metal-analysis by default.
  • Allows customized groups of variants to be tested.
  • Allows conditional analysis to be performed in both gene-level meta-analysis and single variants meta-analysis.
  • Generate QQ plots and manhattan plots by default.

Approach

The key idea behind meta-analysis with RAREMETAL is that various gene-level test statistics can be reconstructed from single variant score statistics and that, when the linkage disequilibrium relationships between variants are known, the distribution of these gene-level statistics can be derived and used to evaluate signifi-cance. Single variant statistics are calculated using the Cochran-Mantel-Haenszel method. Our method has been published in Liu et. al in Nature Genetics. Please go to method for details.

Download and Installation

We have tested compilation using our source code on several platforms including Linux, and Mac OS X.

For source code and executables together with instructions of building from source, please go to DOWNLOAD source and executables.

For questions about compilation, please go to FAQ.

Basic Usage Instructions

RAREMETAL is a command line tool. It is typically run from a Linux or Unix prompt by invoking the command raremetal. In the following are descriptions of basic usage for meta analysis. A detailed TUTORIAL with toy data are also available.

Prepare Input Files

RAREMETAL requires the following basic input files: summary statistics and covariance matrices of score statistics generated by RAREMETALWORKER or rvtests, a file with list of studies to be included and a group file if gene-level meta-analysis is expected.


Summary Statistics

Files containing summary statistics and LD matrices generated by RAREMETALWORKER should be compressed and tabix indexed using the following commands (Note in RAREMETALWORKER, if --zip is specified, these .gz and .tbi files will be automatically generated):

bgzip study1.singlevar.score.txt
tabix -s 1 -b 2 -e 2 -c "#" study1.singlevar.score.txt.gz
bgzip study1.singlevar.cov.txt
tabix -s 1 -b 2 -e 2 -c "#" study1.singlevar.cov.txt.gz

Files containing summary statistics and LD matrices generated by rvtests should be compressed and tabix indexed using the following commands:

bgzip study1.MetaScore.assoc
tabix -s 1 -b 2 -e 2 -S 1 study1.MetaScore.assoc.gz
tabix -s 1 -b 2 -e 2 -S 1 study1.MetaCov.assoc.gz
List of Studies
  • --summaryFiles option is crucial for RAREMETAL to work. Ignoring this option would lead to FATAL ERROR and RAREMETAL would stop.
  • The file should contain the path and prefix of the studies you want to include.
  • If there is one or more studies that you want to excluded from your list, but want to save some effort of generating a new file, you can put a "#" in front of the line of record. RAREMETAL would automatically exclude that study from meta analysis. An example list of summary file is in the following:
 /net/fantasia/home/sfengsph/prj/raremetal/raremetal/bin/META/TwinsUK/TwinsUK.TG.singlevar.score.txt.gz
 #/net/fantasia/home/sfengsph/prj/raremetal/raremetal/bin/META/HUNT/RareMetalWorker/HUNT_MI_case.TG.singlevar.score.txt.gz
  • When gene-level analysis is requested, --covFiles option should be used to specify the covariance files. An example file is:
 /net/fantasia/home/sfengsph/prj/raremetal/raremetal/bin/META/TwinsUK/TwinsUK.TG.singlevar.cov.txt.gz
 #/net/fantasia/home/sfengsph/prj/raremetal/raremetal/bin/META/HUNT/RareMetalWorker/HUNT_MI_case.TG.singlevar.cov.txt.gz
  • The above example study name file guides RAREMETAL to look for summary statistics from TwinsUK study only, because "HUNT" study is commented out. The following two files are needed for RAREMETAL to perform further analysis together with their tabix index file are needed.
  • Please sepcify --dosage option if input files were generated from dosage instead of genotype.
Group Rare Variants
From a Group File
  • Grouping methods are only necessary when doing gene-based or group-based burden tests in meta-analysis.
  • If none of the grouping method is specified, then only single variant meta-analysis will be performed.
  • With --groupFile option, you can specify particular set of variants to be grouped for burden tests.
  • The group file must be a tab or space delimited file in the following format:
 GROUP_ID MARKER1_ID MARKER2_ID MARKER3_ID ... 
  • MARKER_ID must be in the following format:
 CHR:POS:REF:ALT
  • An example group file is:
 PLEKHN1 1:901922:G:A    1:901923:C:A    1:902088:G:A    1:902128:C:T    1:902133:C:G    1:902176:C:T    1:905669:C:G        
 HES4    1:934735:A:C    1:934770:G:A    1:934801:C:T    1:935085:G:A    1:935089:C:G
 ISG15   1:949422:G:A    1:949491:G:A    1:949502:C:T    1:949608:G:A    1:949802:G:A    1:949832:G:A
 AGRN    1:970687:C:T    1:976963:A:G    1:977028:G:T    1:977356:C:T    1:977396:G:A    1:978628:C:T    1:978645:G:A             
 C1orf159        1:1021285:G:T   1:1021302:T:C   1:1021315:A:C   1:1021386:G:A   1:1022534:C:T   1:1025751:C:T   1:1026913:C:T
From an Annotated VCF File

If --groupFile option is NOT specified, RAREMETAL will look for an annotated vcf file as blue print for variants to group. Users are also allowed to generate a vcf file based on the superset of variants from pooled samples, and annotate outside RAREMETAL. Then, annotated vcf file can be used as input for RAREMETAL for gene-level meta-analysis, or group files can be generated based on the annotated vcf file. Detailed description of these options are available. There are also examples of this usage at the bottom of this page.

QC Options

  • RAREMETAL allows filtering of variants from individual studies by their HWE pvalue and call rate, which are generated as part of the output from RAREMETALWORKER or rvtests.
  • To filter by HWE p-values, --hwe option should be used. The default is 0.0, which means not filtering any of the variants.
  • To filter by call rate, --callRate option can be specified. The default is 0.0, which allows no filtering utilized.

Association Options

  • Currently, CMC type burden test, Madsen-Browning burden test, Variable Threshold burden test and SKAT are provided in RAREMETAL, by specifying --burden, --MB, --VT and --SKAT.
  • --maf specifies the minor allele frequency cutoff when doing gene-based or group-based burden tests. Variants with maf above this threshold will be ignored. The default is maf<0.05.
  • In a single study of sample size N, if a site is monomorphic or not reported in vcf/ped, it is considered that the sample size of this study is not large enough to sample the rare allele. Thus, this study contributes 2*N reference alleles and 0 alternative allele towards meta-analysis. To let such studies contribute no alleles towards pooled allele frequency, specify --altMAF.

Conditional Analysis

  • To decide whether a signal is caused by shadowing a significant common variant nearby, RAREMETAL also enables conditional analysis with a list of variants to be conditioned upon provided in a file as input for --condition option. An example input file should be space or tab delimited as in the following. When alleles do not match the ref and alt alleles from samples, the variant will be skipped from conditional analysis.
1:861349:C:T 1:905901:G:A 20:986998:G:C 22:3670691:A:G

Additional Analysis Options

Generate a VCF File to Annotate Outside RAREMETAL

  • --writeVCF allows user to write a VCF file including pooled single variants from all studies. Then users can use their favorite annotation tool to annotate the VCF file. After annotating the VCF file, users can use that file as input for RAREMETAL for further gene-based or region-based meta analysis.
  • The output vcf file will be name as: yourPrefix.pooled.variants.vcf. An example output vcf file is in the following:
 #CHROM    POS     ID      REF     ALT     QUAL    FILTER  INFO
 1       115658497       115658497       G       A       .       .       ALT_AF=0.380906;
 2       74688884        74688884        G       A       .       .       ALT_AF=8.33611e-05;
 3       121414217       121414217       C       A       .       .       ALT_AF=0.0747833;

Annotation

  • RAREMETAL automatically recognizes the annotation format generated by ANNO or EPACTS.
  • To annotate a the VCF generated in previous step, you can use the following command:
./anno --in your.in.vcf.gz --out your.out.vcf.gz

Group Rare Variants from Annotated VCF

  • If --groupFile option is NOT specified, RAREMETAL will look for an annotated vcf file as blue print for variants to group.
  • The annotated VCF file should be specified using --annotatedVcf option.
  • --annotation should be used with --annotatedVcf together when specific category of functional variants are of interest to be grouped. For example, if grouping nonsynonymous and splicing variants are of interests, the following should be included in command line:
  • (! only available after v4.13.8) when --annotation is not specified, raremetal groups all non-intergenic variants.
 --annotatedVcf your.annotated.vcf --annotation nonsyn/splicing
 Note: this allows you to group variants that are annotated starting with nonsyn or splicing (not case-sensitive).
  • Special format for the annotated VCF file is required: all annotation information should be coded in INFO field in VCF file, starting with the key "ANNO=". An example annotated VCF file is in the following:
 #CHROM    POS     ID      REF     ALT     QUAL    FILTER  INFO
 1       19208194        .       G       A       100     PASS      
 AC=3;ANNO=nonsynonymous:ALDH4A1:NM_170726:exon8:c.C866T:p.P289L,ALDH4A1:NM_001161504:exon8:c.C686T:p.P229L,ALDH4A1:NM_003748:exon8:c.C866T:p.P289L,;
 ANNO=splicing:ALDH4A1
 1       19208293        .       G       C       100     PASS    AC=7;STUDIES=5;MAC=7;MAF=0.001;DESIGN=TBD_ASSAY;DSCORE=1.00;
 ANNO=nonsynonymous:ALDH4A1:NM_170726:exon8:c.C767G:p.P256R,ALDH4A1:NM_001161504:exon8:c.C587G:p.P196R,ALDH4A1:NM_003748:exon8:c.C767G:p.P256R,
  • Notice that each variant is allowed to have more than one annotations; but each annotation should start with a new key "ANNO=" followed by annotation:genename:other transcript information.
  • Generated group file will be named test.groupfile under your running directory.

Options for Report Generation

  • --correctGC generates QQ plots and manhattan plots with pvalues corrected using genomic control.
  • --prefix allows customized prefix for output files.
  • --longOutput allows users to output not only burden test results but also the single variant results (allele frequencies, effect sizes, and p-values) for the variants being grouped together. Please refer to the output files section for detailed explanation and examples.
  • --tabulateHits works with --hitsCutoff together to generate reports for genes that have p-value less than specified cutoff from burden tests or SKAT. The default cutoff of p-value for genes to be reported is 1.0e-06, which can be specified by --hitsCutoff option. For more explanations and examples, please go to Tabulated Hits.

Miscellaneous Options

  • --tabix allows rapid analysis when number of groups/genes of interests are small. Currently, when number of groups is less than 100, --tabix option is automatically turned on.

Reports Generated by RAREMETAL

Single Variant Meta Analysis Output

TABLES

  • Single variant meta analysis output has the following components: header, results and footnote.
  • Header lines start with "##" shows summary of the meta analysis including method used, number of studies, and total sample size.
  • Header line starts with "#" are column headers for results table.
  • Footnote also starts with "#", where genomic controls from each study and the overall sample are reported.
  • An example single variant meta analysis output is shown below:
 ##Method=SinglevarScore
 ##STUDY_NUM=2
 ##TotalSampleSize=14308
 #CHROM  POS     REF     ALT     POOLED_ALT_AF   EFFECT_SIZE     DIRECTION_BY_STUDY      PVALUE
 1       115658497       G       A       0.380906        0.00954332      ++      0.45828
 2       74688884        G       A       8.33611e-05     -0.196387       -!      0.845372
 3       121414217       C       A       0.0747833       0.0216982       -+      0.34453
 6       137245814       G       C       0.000803746     0.105693        ++      0.601805
  • A detailed explanation of each column is in the following:
 CHROM:              Chromosome Name
 POS:                Variant Position
 REF:                Reference Allele Label
 ALT:                Alternative Allele Label
 POOLED_ALT_AF:      Pooled Alternative Allele Frequency
 EFFECT_SIZE:        Alternative Allele Effect Size
 DIRECTION_BY_STUDY: Effect size direction of alternative allele from each study. 
                     The order of study is consistent with the order of studies listed in the input file for option --summaryFiles. 
                     "?" means the variant is not observed or monomorphic from the study. 
                     "!" means the variant observed from this study has different alleles from those in the first study.

PLOTS

RAREMETAL generates QQ plots and manhattan plots from single variant meta-analysis by default. Three QQ plots are generated, one with all variants included, one of variants with maf<0.05 and one of variants with maf<0.01. All plots are saved in a pdf file named yourPrefix.meta.plots.pdf. Genomic controls are also reported in the title of plots. When --correctGC option is specified, GC corrected plots are also generated.

QQ.png
Single var manhattan.png

Gene-level Tests Meta-Analysis Output

LONG TABLES

When --longOutput is used, output includes both burden test results of genes and single variant results of the variants included in burden tests. Here is an example of output file from SKAT when --longOutput is specified.

 ##Method=Burden
 ##STUDY_NUM=2
 ##TotalSampleSize=14308
 #GROUPNAME      NUM_VAR VARs    MAFs    SINGLEVAR_EFFECTs       SINGLEVAR_PVALUEs       AVG_AF  MIN_AF  MAX_AF  EFFECT_SIZE     PVALUE
 NOC2L   7       1:880502:C:T;1:881918:G:A;1:887799:C:T;1:888659:T:C;1:889238:G:A;1:891591:C:T;1:892380:G:A        0.000166722,0.0242172,0.0109203,0.0355845,0.0333729,0.00700233,0.00200067       -0.183575,-0.00228307,-0.0598337,0.0220595,0.0229464,-0.0302768,-0.0200417      0.790161,0.953446,0.515806,0.503548,0.499251,0.791773,0.926625  0.0161807       0.000166722     0.0355845       0.00667875      0.662531
 KLHL17  2       1:897285:A:G;1:898869:C:T       0.0148408,0.00108369    -0.0502034,-0.0256403   0.528269,0.934606       0.00796222      0.00108369      0.0148408       -0.0484494      0.528878

SHORT TABLES

Otherwise, single variant results of variants included in burden tests will not be included in the output. Here is an example of output file from SKAT when --longOutput is not specified.

 ##Method=Burden
 ##STUDY_NUM=2
 ##TotalSampleSize=14308
 #GROUPNAME      NUM_VAR VARs    AVG_AF  MIN_AF  MAX_AF  EFFECT_SIZE     PVALUE
 NOC2L   7       1:880502:C:T;1:881918:G:A;1:887799:C:T;1:888659:T:C;1:889238:G:A;1:891591:C:T;1:892380:G:A      0.0161807       0.000166722     0.0355845       0.00667875      0.662531
 KLHL17  2       1:897285:A:G;1:898869:C:T       0.00796222      0.00108369      0.0148408       -0.0484494      0.528878

TABULATED HITS

  • When --tabulateHits is specified, top hits from Burden tests will be generated. Each method will have an individual tabulated file generated. The purpose of this tabulated file is to list burden test results of top hits together with single variant results from variants being grouped in burden tests. The difference between this file and the standard long-format output file from burden test is that each row of the file represents a single variant that is included in the gene for burden test. This format allows each sorting on users end.
  • Tabulated top hits are saved in the file:
 yourPrefix.meta.tophits.youMethod.tbl (example files names: TG.meta.tophits.burden.tbl, LDL.meta.tophits.SKAT.tbl)
  • The following items are tabulated in the output:
 GENE: Gene name.
 METHOD: Burden test used.
 GENE_PVALUE: P-value from gene-based burden tests.
 MAF_CUTOFF: MAF cutoff used when doing gene-based tests.
 ACTUAL_CUTOFF: Actual MAF cutoff used. (This will be different from MAF_CUTOFF only for Variable Threshold method.
                Otherwise, it will be the same as MAF_CUTOFF.)
 VAR: Variant name in CHR:POS:REF:ALT format.
 MAF: Single variant pooled MAF from all samples.
 EFFSIZE: Effect size from single variant meta analysis. 
 PVALUE: Pvalue from single variant meta analysis.
  • An example of tabulated hits from a standard burden test with maf<0.05 as criterion is shown in the following:
 GENE    METHOD  GENE_PVALUE     MAF_CUTOFF      ACTUAL_CUTOFF   VARS    MAFS    EFFSIZES        PVALUES
 PCSK9   BURDEN_0.050    7.54587e-11     0.05    0.05    1:55505647:G:T  0.0396631       -0.442192       2.10159e-46
 PCSK9   BURDEN_0.050    7.54587e-11     0.05    0.05    1:55518371:G:A  0.0237138       0.0548733       0.430246
 PCSK9   BURDEN_0.050    7.54587e-11     0.05    0.05    1:55529187:G:A  0.0433324       0.0946321       0.00129942
 APOE    BURDEN_0.050    2.83457e-72     0.05    0.05    19:45412079:C:T 0.0413056       -0.554561       2.83457e-72
  • According to the example above, PCSK9 had a p-value of 7.54587e-11 from the gene-based burden test, where three variants from this gene were included. Another hit from this meta analysis is APOE, where only one variant was included in the burden test.

PLOTS

RAREMETAL generates QQ plots and manhattan plots from single variant and gene-level meta-analysis by default. Example QQ plots and manhattan plots are:

Manhattan.png

LOG

  • A log file is automatically generated by RAREMETAL to save the parameters in effect. An example is in the following:
 The following parameters are in effect:
 
 List of Studies:
 ============================
 --studyName [studyName.SardiNia]
 
 Grouping Methods:
 ============================
 --groupFile [genes.file]
 --annotatedVcf []
 --annotation []
 --writeVcf [OFF]
 
 QC Options:
 ============================
 --hwe [0]
 --callRate [0] 
 
 Association Methods:
 ============================
 --burden [true]
 --MB [false]
 --SKAT [false]
 --VT [false]
 --condition [condition.file]
 
 Other Options:
 ============================
 --tabix [OFF]
 --correctGC [ON]
 --prefix [test]
 --maf [0.05]
 --longOutput [false]
 --tabulateHits [false]
 --hitsCutoff [1e-06]
 --dosage [false]
 --altMAF [false]

Example Command lines

  • Here is an example command line to do single variant meta analysis only:
 ./raremetal --summaryFiles your.list.of.summary.files --prefix yourPrefix 
  • When you want to do all burden tests using a group file to specify which variants to group:
 ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --groupFile your.groupfile --burden --MB --SKAT --VT --maf 0.01 --prefix yourPrefix
 (NOTE: this will generate single variant meta analysis result and the short format output for burden test results.)
  • Here is how to do all SKAT meta analysis using a group file and request a long format output together with tabulated hits:
 ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --groupFile your.groupfile --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --prefix yourPrefix
  • Here is an example of adding QC filters to variants when doing meta analysis.
 ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --groupFile your.groupfile --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --hwe 1e-06 --callRate 0.98 --prefix yourPrefix
  • Here is how to do the same thing but reading grouping information from an annotated VCF file:
 ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --annotatedVcf your.annotated.vcf --annotation nonsyn/stop/splicing --SKAT --longOutput --tabulateHits --hitsCutoff 1.0e-07 --hwe 1e-06 --callRate 0.98 --prefix yourPrefix
  • If you want to write a VCF file of pooled variants from all studies, annotate them using your favorite annotation program, and then come back to RAREMETAL with the annotate VCF file to do burden tests:
 First, use the following command to write the VCF file:
 ./raremetal --summaryFiles your.list.of.summary.files --writeVcf --prefix yourPrefix
 Second, annotate the VCF file using your favorite annotation program. (Annotated VCF file has to follow the format described here: annotated VCF format)
 Third, use the following command to do meta analysis:
 ./raremetal --summaryFiles your.list.of.summary.files --covFiles your.list.of.cov.files --annotatedVcf your.annotated.vcf --annotation nonsyn/splicing/stop --burden --MB --SKAT --VT --maf 0.01 --prefix yourPrefix

Other Useful Info

TUTORIAL

  • For a comprehensive tutorial of RAREMETALWORKER and RAREMETAL using example data sets, please go to the following:
 RAREMETAL Tutorial
  • For a brief tutorial of rvtests, please go to:
 rvtests

CONTACT

Please email Andy Boughton (abought at umich dot edu) for questions.

Also check Known issues and incoming update in next version to see if your problem has been reported before