Changes

From Genome Analysis Wiki
Jump to navigationJump to search
23,807 bytes added ,  20:49, 2 March 2013
Created page with ''''Table of Contents''' <!--*generated with [DocToc](http://doctoc.herokuapp.com/)*--> * Introduction * Download * [[#quick-tutorial|Quick Tutori…'
'''Table of Contents''' <!--*generated with [DocToc](http://doctoc.herokuapp.com/)*-->

* [[#introduction|Introduction]]
* [[#download|Download]]
* [[#quick-tutorial|Quick Tutorial]]
** [[#single-variant-tests|Single variant tests]]
** [[#groupwise-tests|Groupwise tests]]
** [[#related-individual-tests|Related individual tests]]
** [[#meta-analysis-tests|Meta-analysis tests]]
* [[#input-files|Input files]]
** [[#genotype-file-vcf|Genotype file (VCF)]]
** [[#phenotype-file|Phenotype file]]
** [[#covariate-file|Covariate file]]
** [[#trait-transformation|Trait transformation]]
* [[#models|Models]]
** [[#single-variant-tests-1|Single variant tests]]
** [[#burden-tests|Burden tests]]
** [[#variable-threshold-models|Variable threshold models]]
** [[#kernel-models|Kernel models]]
** [[#utility-models|Utility models]]
* [[#association-test-options|Association test options]]
** [[#sample-inclusionexclusion|Sample inclusion/exclusion]]
** [[#variant-site-filters|Variant site filters]]
** [[#genotyep-filters|Genotyep filters]]
** [[#handle-missing-genotypes-and-phenotypes|Handle missing genotypes and phenotypes]]
** [[#specify-groups-eg-burden-unit|Specify groups (e.g burden unit)]]
* [[#contact|Contact]]

= Introduction =

Rvtests, which stands for Rare Variant tests, is a flexible software package for genetic association studies. It is designed to support unrealted individual or related (family-based) individuals. Both quantitative trait and binary trait are supported. It includes a variety of association tests (e.g. single variant score test, burden test, variable threshold test, SKAT test, fast linear mixed model score test). It takes [http://www.1000genomes.com/ VCF] format as genotype input file and takes PLINK format phenotype file and covariate file. From our practice, it is capable to analyze 8,000 related individuals using less than 400 Mb memory.

= Download =

Source file can be downloaded from [https://github.com/zhanxw/rvtests/archive/master.zip github] or [https://github.com/zhanxw/rvtests github page]. Executable binary file can be downloaded from [http://www.sph.umich.edu/csg/zhanxw/software/rvtests/rvtests-latest.tar.gz here].

= Quick Tutorial =

Here is a quick example of how to use ''rvtests'' software in typical use cases.

== Single variant tests ==

<pre>rvtests --inVcf input.vcf --pheno phenotype.ped --out output --single wald,score</pre>
This specifies single variant Wald and score test for association tests for every variant in the <code>input.vcf</code> file. The 6th column of the phenotype file, <code>phenotype.ped</code>, which is in PLINK format, is used. Rvtests will automatically check whether the phenotype is binary trait or quantitative trait.

For other types of association tests, you can refer to [[#Models|Models]]

== Groupwise tests ==

Groupwise tests includes three major kinds of tests.

* Burden tests: group variants, which are usually less than 1% or 5% rare variants, for association tests. The category includes: CMC test, Zeggini test, Madsen-Browning test, CMAT test, and rare-cover test.
* Variable threshold tests: group variants under different frequency thresholds.
* Kernel methods: suitable to tests rare variants having different directions of effects. These includes SKAT test and KBAC test.

All above tests requires to group variants into a unit. The simplist case is to use gene as grouping unit. For different grouping method, see [[#Grouping|Grouping]].

To perform rare variant tests by gene, you need to use <code>--geneFile</code> to specify the gene range in a refFlat format. We provided different gene definitions in the [[#Resources|Resources]] section. You can use <code>--gene</code> to specify which gene(s) to test. For example, specify <code>--gene CFH,ARMS2</code> will perform association tests on CFH and ARMS2 genes. If there is no providing <code>--gene</code> option, all genes will be tests.

The following command line demonstrate how to use CMC method, variable threshold method(proposed by Price) and kernel based method (SKAT by Shawn Lee and KBAC by Dajiang Liu) to test every gene listed in ''refFlat_hg19_uniq_gene.txt.gz''.

<pre>rvtests --inVcf input.vcf --pheno phenotype.ped --out output --geneFile refFlat_hg19_uniq_gene.txt.gz --burden cmc --vt price --kernel skat,kbac</pre>
== Related individual tests ==

To test related individuals, you will need to first create a kinship matrix:

<pre>vcf2kinship --inVcf input.vcf --bn --out output</pre>
The option <code>--bn</code> means calculating empirical kinship using Balding-Nicols method. You can specifiy <code>--ibs</code> to obtain IBS kinship or use <code>--pedigree input.ped</code> to calculate kinship from known pedigree information.

Then you can use linear mixed model based association tests such as Fast-LMM score test, Fast-LMM LRT test and Grammar-gamma tests. An exemplar command is shown:

<pre>rvtests --inVcf input.vcf --pheno phenotype.ped --out output --kinship output.kinship --single famScore,famLRT,famGrammarGamma</pre>
== Meta-analysis tests ==

The meta-analysis models outputs association test results and genotype covariance matrix. These statistics can be used in rare variant association analysis. We provide single variant score test and generate genotype covariance matrix. You can use command:

<pre>rvtests --inVcf input.vcf --pheno phenotype.ped --covar example.covar --covar-name age,bmi --inverseNormal --useResidualAsPhenotype --meta score,cov --out output</pre>
Here the <code>--covar</code> specify a covariate file, and <code>--covar-name</code> specify which covariates can used in the analysis. Covariate file format can be found [[#Covariate%20file|here]]. <code>--inverseNormal --useResidualAsPhenotype</code> specifies trait transformation method. That means first fit a regression model of the phenotype on covariates (intercept automatically added), then the residuals are inverse normalized. Trait transformation details can be found [[#Trait%20transformation|here]].

= Input files =

== Genotype file (VCF) ==

Rvtests supports VCF (Variant Call Format) files. Files in both plain txt format or gzipped format are supported. To use group-based rare variant tests, indexed the VCF files using [http://samtools.sourceforge.net/tabix.shtml tabix] are required.

Here are the commands to convert plain text format to bgzipped VCF format:

<pre>(grep ^&quot;#&quot; $your_old_vcf; grep -v ^&quot;#&quot; $your_old_vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c &gt; $your_vcf_file
tabix -f -p vcf $your_vcf_file</pre>
The above commands will (1) remove the <code>chr</code> prefix from chromosome names; (2) sort VCF files by chromosome first, then by chromosomal positions; (3) compress using bgzip; (4) create tabix index.

== Phenotype file ==

You can use <code>--mpheno $phenoypeColumnNumber</code> or <code>--pheno-name</code> to specify a given phenotype.

An example phenotype file, (<code>example.pheno</code>), has the following format:

<pre>fid iid fatid matid sex y1 y2 y3 y4
P1 P1 0 0 0 1.7642934435605 -0.733862638327895 -0.980843608339726 1
P2 P2 0 0 0 0.457111744989746 0.623297281416372 -2.24266162284447 0
P3 P3 0 0 0 0.566689682543218 1.44136462889459 -1.6490100777089 0
P4 P4 0 0 0 0.350528353203767 -1.79533911725537 -1.11916876241804 0
P5 P5 0 0 1 2.72675074738545 -1.05487747371158 -0.33586430010589 1</pre>
Phenotype file is specified by the option <code>--pheno example.pheno</code> . The default phenotype column header is “<code>y1</code>”. If you want to use alternative columns as phenotype for association analysis (e.g the column with header y2), you may specific the header names using either

* --mpheno 2
* --pheno-name y2

'''NOTE:''' to use “<code>--pheno-name</code>”, the header line must starts with “<code>fid iid</code>” as PLINK requires.

In phenotype file, missing values can be denoted by NA or any non-numeric values. Individuals with missing phenotypes will be automatically dropped from subsequent association analysis. For each missing phenotype value, a warning will be generated and recorded in the log file.

== Covariate file ==

You can use <code>--covar</code> and <code>--covar-name</code> to specify covariates that will be used for single variant association analysis. This is an optional parameter. If you do not have covariate in the data, this option can be ignored.

The covariate file, (e.g. <code>example.covar</code>) has a similar format as the phenotype file:

<pre>fid iid age bmi pc1 pc2 pc3
P1 P1 23 24.2 1.7642934435605 -0.733862638327895 -0.980843608339726
P2 P2 32 29.0 0.457111744989746 0.623297281416372 -2.24266162284447
P3 P3 44 32.4 0.566689682543218 1.44136462889459 -1.6490100777089
P4 P4 25 28.2 0.350528353203767 -1.79533911725537 -1.11916876241804
P5 P5 30 19.8 2.72675074738545 -1.05487747371158 -0.33586430010589</pre>
The covariate file is specified by the <code>--covar</code> option (e.g. <code>--covar example.covar</code>). To specify covariates that will be used in the association analysis, the option <code>--covar-name</code> can be used. For example, when age, bmi and 3 PCs are used for association analysis, the following option can be specified for the rvtest program, i.e. <code>--covar example.covar --covar-name age,bmi,pc1,pc2,pc3</code>.

Note: Missing data in the covariate file can be labeled by any non-numeric value (e.g. NA). They will be automatically imputed to the mean value in the data file.

== Trait transformation ==

In this meta-analysis, we use inversed normal transformed residuals in the association analysis, which is achieved by using a combination of <code>--inverseNormal</code> and <code>--useResidualAsPhenotype</code>. Specifically, we first fit the null model by regressing phenotype on covariates. The residuals are then inverse normal transformed (see Appendix A more detailed formulae for transformation). Transformed residuals will be used to obtain score statistics.

In meta analysis, an exemplar command for using rvtest looks like the following:

<pre>./rvtest --inVcf $vcf --pheno $example.pheno --covar example.covar --covar-name age,bmi --inverseNormal --useResidualAsPhenotype --meta score,cov --out $output_prefix </pre>
= Models =

Rvtests support various association models.

== Single variant tests ==

<table>
<thead>
<tr class="header">
<th align="left">Single variant</th>
<th align="center">Model(*)</th>
<th align="center">Traits(#)</th>
<th align="center">Covariates</th>
<th align="center">Related / unrelated</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Score test</td>
<td align="center">score</td>
<td align="center">B, Q</td>
<td align="center">Y</td>
<td align="center">U</td>
<td align="left">Only null model is used to performed the test</td>
</tr>
<tr class="even">
<td align="left">Wald test</td>
<td align="center">wald</td>
<td align="center">B, Q</td>
<td align="center">Y</td>
<td align="center">U</td>
<td align="left">Only fit alternative model, and effect size will be estimated</td>
</tr>
<tr class="odd">
<td align="left">Exact test</td>
<td align="center">exact</td>
<td align="center">B</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Fisher's test</td>
</tr>
<tr class="even">
<td align="left">Fam LRT</td>
<td align="center">famLRT</td>
<td align="center">Q</td>
<td align="center">Y</td>
<td align="center">R, U</td>
<td align="left">Fast-LMM model</td>
</tr>
<tr class="odd">
<td align="left">Fam Score</td>
<td align="center">famScore</td>
<td align="center">Q</td>
<td align="center">Y</td>
<td align="center">R, U</td>
<td align="left">Fast-LMM model style likelihood ratio test</td>
</tr>
<tr class="even">
<td align="left">Grammar-gamma</td>
<td align="center">famGrammarGamma</td>
<td align="center">Q</td>
<td align="center">Y</td>
<td align="center">R, U</td>
<td align="left">Grammar-gamma method</td>
</tr>
</tbody>
</table>

(*) Model columns list the regconized names in rvtests. For example, use <code>--single score</code> will apply score test.

# In trait column, B and Q stand for binary, quantitiave trait.

== Burden tests ==

<table>
<thead>
<tr class="header">
<th align="left">Burden tests</th>
<th align="center">Model(*)</th>
<th align="center">Traits(#)</th>
<th align="center">Covariates</th>
<th align="center">Related / unrelated</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">CMC</td>
<td align="center">cmc</td>
<td align="center">B, Q</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Collapsing and combine rare variants by Bingshan Li.</td>
</tr>
<tr class="even">
<td align="left">Zeggini</td>
<td align="center">zeggini</td>
<td align="center">B, Q</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Aggregate counts of rare variants by Morris Zeggini.</td>
</tr>
<tr class="odd">
<td align="left">Madsen-Browning</td>
<td align="center">mb</td>
<td align="center">B</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Up-weight rare variant using inverse frequency from controls by Madsen.</td>
</tr>
<tr class="even">
<td align="left">Fp</td>
<td align="center">fp</td>
<td align="center">B</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Up-weight rare variant using inverse frequency from controls by Danyu Lin.</td>
</tr>
<tr class="odd">
<td align="left">Exact CMC</td>
<td align="center">exactCMC</td>
<td align="center">B</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Collapsing and combine rare variants, then pefore Fisher's exact test.</td>
</tr>
<tr class="even">
<td align="left">RareCover</td>
<td align="center">rarecover</td>
<td align="center">B</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Find optimal grouping unit for rare variant tests by Thomas Hoffman.</td>
</tr>
<tr class="odd">
<td align="left">CMAT</td>
<td align="center">cmat</td>
<td align="center">B</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Test non-coding variants by Matt Z.</td>
</tr>
<tr class="even">
<td align="left">CMC Wald</td>
<td align="center">cmcWald</td>
<td align="center">B, Q</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Collapsing and combine rare variants, then pefore Wald test.</td>
</tr>
</tbody>
</table>

(*) Model columns list the regconized names in rvtests. For example, use <code>--burden cmc</code> will apply CMC test.

# In trait column, B and Q stand for binary, quantitiave trait.

== Variable threshold models ==

<table>
<thead>
<tr class="header">
<th align="left">Single variant</th>
<th align="center">Model(*)</th>
<th align="center">Traits(#)</th>
<th align="center">Covariates</th>
<th align="center">Related / unrelated</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Variable threshold model</td>
<td align="center">vt</td>
<td align="center">B, Q</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Every rare-variant frequency cutoffs are tests by Alkes Price.</td>
</tr>
<tr class="even">
<td align="left">Variable threshold CMC</td>
<td align="center">cmc</td>
<td align="center">B, Q</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">This models is natiive so that it output CMC test statistics under all possible frequency cutoffs.</td>
</tr>
</tbody>
</table>

(*) Model columns list the regconized names in rvtests. For example, use <code>--vt price</code> will apply score test.

# In trait column, B and Q stand for binary, quantitiave trait.

== Kernel models ==

<table>
<thead>
<tr class="header">
<th align="left">Kernel</th>
<th align="center">Model(*)</th>
<th align="center">Traits(#)</th>
<th align="center">Covariates</th>
<th align="center">Related / unrelated</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">SKAT</td>
<td align="center">skat</td>
<td align="center">B, Q</td>
<td align="center">Y</td>
<td align="center">U</td>
<td align="left">Sequencing kernel association test by Shawn Lee.</td>
</tr>
<tr class="even">
<td align="left">KBAC</td>
<td align="center">kbac</td>
<td align="center">B</td>
<td align="center">N</td>
<td align="center">U</td>
<td align="left">Kernel-based adaptive clustering model by Dajiang Liu.</td>
</tr>
</tbody>
</table>

(*) Model columns list the regconized names in rvtests. For example, use <code>--kernel skat</code> will apply SKAT test.

# In trait column, B and Q stand for binary, quantitiave trait.

== Utility models ==

Rvtests has an usually option <code>--outputRaw</code>. When specify this, rvtests can output genotypes, phenotype, covariates(if any) and collapsed genotype to tabular files. These files can be imported into other software (e.g. R) for further analysis.

= Association test options =

== Sample inclusion/exclusion ==

Rvtests can flexibly specify which sample(s) to include or exclude:

<pre> --peopleIncludeID : give IDs of people that will be included in study
--peopleIncludeFile : from given file, set IDs of people that will be included in study
--peopleExcludeID : give IDs of people that will be included in study
--peopleExcludeFile : from given file, set IDs of people that will be included in study</pre>
<code>--peopleIncludeID</code> and <code>--peopleExcludeID</code> are used to include/exclude samples from command line. For example, specify <code>--peopleIncludeID A,B,C</code> will include A, B and C sample from the VCF files if they exists. <code>--peopleIncludeID</code> and <code>--peopleExcludeID</code> followed by a file name will include or exclude the IDs in the file. So to include sample A, B and C, you can provide a file, <code>people.txt</code>, looks like:

<pre>A
B
C</pre>
Then use <code>--peopleIncludeFile people.txt</code> to include them in the analysis.

== Variant site filters ==

It is common that different frequency cutoffs are applied in rare-variant analysis. Therefore, rvtests specify frequency cutoffs.

Frequency Cutoff

<pre> --freqUpper : Specify upper frequency bound to be included in analysis
--freqLower : Specify lower frequency bound to be included in analysis</pre>
Similar to sample inclusion/exclusion options, you can specify a range of variants to be included by specifying <code>--rangeList</code> option. For example <code>--rangeList 1:100-200</code> will include the chromosome 1 position 100bp to 200bp region. Alternatively, use a separate file, <code>range.txt</code>, and <code>--rangeFile range.txt</code> to speicify association tests range.

<pre> --rangeList : Specify some ranges to use, please use chr:begin-end format.
--rangeFile : Specify the file containing ranges, please use chr:begin-end format.
--siteFile : Specify the file containing sites to include, please use &quot;chr pos&quot; format.</pre>
It is supported to filter variant site by site depth, minor allele count or annotation (annotated VCF file is needed).

<pre> --siteDepthMin : Specify minimum depth(inclusive) to be incluced in analysis
--siteDepthMax : Specify maximum depth(inclusive) to be incluced in analysis
--siteMACMin : Specify minimum Minor Allele Count(inclusive) to be incluced in analysis
--annoType : Specify annotation type that is follwed by ANNO= in the VCF INFO field, regular expression is allowed</pre>
''NOTE'': <code>--annoType Nonsynonymous</code> will only analyze nonsynonymous variants where they have <code>ANNO=Nonsynonymous</code> in the INFO field. VCF with annotatino information are called annotated VCF here. And to annotate a VCF file, you can use [https://github.com/zhanxw/anno ANNO], a fast and accurate annotation software.

== Genotyep filters ==

Genotype with low depth or low quality can be filtered out by these options:

<pre> --indvDepthMin : Specify minimum depth(inclusive) of a sample to be incluced in analysis
--indvDepthMax : Specify maximum depth(inclusive) of a sample to be incluced in analysis
--indvQualMin : Specify minimum depth(inclusive) of a sample to be incluced in analysis</pre>
When genotypes are filtered, they are marked as missing genotypes. Consequently, samples with missing genotype may or may not be included in the analysis. That means samples with genotypes may be dropped (<code>--impute drop</code>) or may still be included (<code>--impute mean</code> or <code>--impute hwe</code>). By default, genotypes are imputed to its means. See next section about how you like to handle missing genotypes.

== Handle missing genotypes and phenotypes ==

When genotypes are missing (e.g. genotype = &quot;./.&quot;) or gentoypes are filtered out, there are three options to handle them: (1) impute to its mean(default option); (2) impute by HWE equilibrium; (3) remove from the model. Use <code>--impute [mean|hwe|drop]</code> to specify which option to use.

When quantitative phenotypes are missing, for example, some samples have gneotype files, but not phenotypes, rvtests can impute missing phenotype to its mean.

''NOTE:'' Do not use <code>--imputePheno</code> for binary trait.

In summary, the following two options can be used:

<pre> --impute : Specify either of mean, hwe, and drop
--imputePheno : Impute phenotype to mean by those have genotypes but no
phenotpyes

</pre>
== Specify groups (e.g burden unit) ==

Rare variants association tests are usually performed in gruops of variants. The natural grouping unit is gene. Rvtests can read gene definition file in <code>refFlat</code> format, and perform association for each gene. Use <code>--geneFile</code> option to specify the gene file name. For example, <code>--geneFile refFlat_hg19.txt.gz</code> will use <code>refFlat_hg19.txt.gz</code> as gene definition file, and then perform association tests for every gene. Use <code>--gene</code> to specify a subset of genes to test. For example, <code>--gene CFH</code> will only test CFH gene.

Alternative grouping unit can be specified as ''set''. These ''sets'' are treated similar to gene. You can thus use <code>--setFile</code> to define sets (similar to <code>--geneFile</code> option), and use <code>--set</code> to define a specific set (similar to <code>--gene</code> option). Additionally, use <code>--setList</code> can speicify a set to test from command line.

The format of a set file is: (1) set names; (2) ranges (e.g. chrom:begin-end); For example, you have a set file, <code>example.set</code>, like this:

<pre>set1 1:100-200,1:250-300
set2 2:500-600</pre>
You can specify <code>--setFile example.set --set set2</code> to group variants within chromosome 2, position 500 to 600bp. If you want to test a particular region, for example, chromosome 2, position 500 to 550bp, but do not want to make another file, you can use <code>--setList 2:500-600</code>.

In summary, options related to ''Grouping Unit'' are listed below:

<pre> --geneFile : specify a gene file (for burden tests)
--gene : specify which genes to test
--setList : specify a list to test (for burden tests)
--setFile : specify a list file (for burden tests, first two columns:
setName chr:beg-end)
--set : specify which set to test (1st column)</pre>
= Contact =

Questions and requests can be sent to Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])

Rvtests is a collaborative effort by Youna Hu, Bingshan Li, Dajiang Liu.
255

edits

Navigation menu