Difference between revisions of "RareSimu"

From Genome Analysis Wiki
Latest revision as of 10:11, 2 February 2017

Genetic Model-based Simulator [GMS] is an efficient C++ program for simulating case-control data sets based on genetic models. The input is a pool of haplotypes and a text file specifying the model. The output is a set of simulated data sets in Merlin ped file format.

Basic Usage Example

In a typical command line, a few options need to be specified together with the input files. Here is an example of a GMS invocation:

./GMS --hapfile test.hap --snplist test.lst --model model.heter.txt --f0 0.01 --nrep 100 --ncase 250 --nctrl 250 --causal --prefix tmp
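When many parameter combinations need to be simulated, the command line above can be assembled programmatically. The following is a minimal Python sketch that uses only the options listed on this page. The helper name `build_gms_command` is hypothetical and not part of GMS, and the binary path `./GMS` is simply taken from the example above.

```python
import shlex


def build_gms_command(hapfile, snplist, model, f0, nrep, ncase, nctrl,
                      prefix, causal=False, seed=None, binary="./GMS"):
    """Return the GMS argument list (hypothetical helper, not part of GMS)."""
    cmd = [
        binary,
        "--hapfile", hapfile,
        "--snplist", snplist,
        "--model", model,
        "--f0", str(f0),
        "--nrep", str(nrep),
        "--ncase", str(ncase),
        "--nctrl", str(nctrl),
    ]
    if seed is not None:
        # --seed is optional; only pass it when a value was given
        cmd += ["--seed", str(seed)]
    if causal:
        # --causal is a flag without a value
        cmd.append("--causal")
    cmd += ["--prefix", prefix]
    return cmd


cmd = build_gms_command("test.hap", "test.lst", "model.heter.txt",
                        f0=0.01, nrep=100, ncase=250, nctrl=250,
                        prefix="tmp", causal=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the simulation, assuming the GMS binary has been built in the current directory.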

Command Line Options

Basic Output Options

 --hapfile       a pool of simulated or real haplotypes, one chromosome per row
 --snplist       SNP names in the order of haplotypes in hapfile, one SNP per row
 --model         a model file specifying genetic models, see below for details
 --nrep          the number of replications
 --seed          seed for random number generator
 --ncase         the number of cases in each replicate
 --nctrl         the number of controls in each replicate
 --f0            overall baseline prevalence
 --prefix        prefix of output files (e.g. prefix.rep1.ped, prefix.rep2.ped)
 --causal        write only causal SNPs to the output pedigree file


Model File Annotation

The model file consists of one header line followed by one or more rows. Each row corresponds to a set of SNPs with a desired frequency range and relative risk (RR) or odds ratio (OR). The header takes one of the following forms:

1. Heterogeneity Model

a) COUNT FREQ_MIN FREQ_MAX RR1 RR2

b) FRACTION FREQ_MIN FREQ_MAX RR1 RR2

2. Logistic Model

a) COUNT FREQ_MIN FREQ_MAX OR1 OR2

b) FRACTION FREQ_MIN FREQ_MAX OR1 OR2
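To make the header forms above concrete, here is a hedged Python sketch that reads such a file into a list of rows. It assumes whitespace-separated numeric columns, which this page does not state explicitly, and the example values are made up for illustration. The function name `read_model_file` is hypothetical.

```python
import io


def read_model_file(handle):
    """Parse a header line plus data rows into a list of dicts.

    Assumes whitespace-separated numeric columns; this layout is a guess,
    not the documented GMS format.
    """
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            # skip blank lines
            continue
        if len(fields) != len(header):
            raise ValueError(
                f"expected {len(header)} fields, got {len(fields)}")
        row = dict(zip(header, (float(v) for v in fields)))
        if row["FREQ_MIN"] > row["FREQ_MAX"]:
            raise ValueError("FREQ_MIN exceeds FREQ_MAX")
        rows.append(row)
    return header, rows


# Hypothetical heterogeneity model file using header form (a);
# the numbers below are illustrative only.
example = io.StringIO(
    "COUNT FREQ_MIN FREQ_MAX RR1 RR2\n"
    "5 0.001 0.01 2.0 4.0\n"
    "10 0.01 0.05 1.5 2.25\n"
)
header, rows = read_model_file(example)
print(header)
print(rows[0])
```

A check like the frequency-range test above can catch malformed rows before a long simulation is started.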

How It Works

There are two underlying models. In both, disease status follows a Bernoulli distribution with probability P, which each model defines differently.

1. Heterogeneity Model

 P(D \mid (AA, AA, \ldots, AA)) = f_0

 P = \sum_{i=1}^{N} P(D \mid x_i)


2. Logistic Model

 logit(P) = \beta_0 + \sum_{i=1}^{N}\beta_i\times x_i

 P = \frac{e^{\beta_0 + \sum_{i=1}^{N}\beta_i\times x_i}}{1+e^{\beta_0 + \sum_{i=1}^{N}\beta_i\times x_i}}
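The logistic model can be illustrated with a short Python sketch that evaluates P for one individual and draws a Bernoulli disease status. Several choices here are assumptions rather than documented GMS behaviour: the intercept is set to \beta_0 = logit(f_0) so that a non-carrier has risk f_0, each \beta_i is taken as the log of an odds ratio, and x_i is coded as the number of risk alleles (0, 1 or 2). The odds ratios and genotypes are made-up values.

```python
import math
import random


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def logistic_risk(beta0, betas, genotypes):
    """P = exp(eta) / (1 + exp(eta)), eta = beta0 + sum(beta_i * x_i)."""
    eta = beta0 + sum(b * x for b, x in zip(betas, genotypes))
    return 1.0 / (1.0 + math.exp(-eta))


# Assumptions (not confirmed by the GMS documentation):
#   beta0 = logit(f0), beta_i = log(OR_i), x_i = risk allele count.
f0 = 0.01
odds_ratios = [2.0, 1.5, 3.0]          # illustrative values
beta0 = logit(f0)
betas = [math.log(o) for o in odds_ratios]

genotypes = [1, 0, 2]                   # one hypothetical individual
p = logistic_risk(beta0, betas, genotypes)
baseline = logistic_risk(beta0, betas, [0, 0, 0])

# Disease status is a single Bernoulli draw with success probability p.
rng = random.Random(1)
status = 1 if rng.random() < p else 0
print(f"baseline risk = {baseline:.4f}, carrier risk = {p:.4f}, status = {status}")
```

Under these assumptions, an individual carrying no risk alleles has risk equal to f_0, which mirrors the baseline condition stated for the heterogeneity model.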

Download

The current version is available for download from http://csg.sph.umich.edu//weich/GMS.tar.gz

TODO

1. Support quantitative traits.

2. Support family structures.

3. Support more "reasonable" models.