RelativeFinder

From Genome Analysis Wiki
Revision as of 04:17, 23 November 2010 by Goncalo (talk | contribs) (→‎Current Limitations and Todo List)

relativeFinder is a program for checking relationships between pairs of individuals. Many excellent programs carry out similar tasks. Two features distinguish relativeFinder: its batch mode options, which allow large jobs to be divided into many smaller jobs (suitable for deployment on a compute cluster), and the flexibility of the underlying Merlin engine, which allows relativeFinder to handle large pedigrees and to consider a variety of alternate relationships, including relationships specified by the user on the fly.

How It Works

relativeFinder examines every pair of genotyped individuals in a pedigree. For each pair, it calculates the likelihood of the observed genotype data twice over: once using the relationship specified in the pedigree, and once for each of a series of alternate relationships (described in a template pedigree file). If any of these alternate relationships makes the observed genotype data more likely, the pair is flagged.
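The flagging rule described above can be illustrated with a minimal Python sketch. It is not relativeFinder's own code: the function name `flag_pair`, the relationship labels and the log-likelihood values are all hypothetical, and the likelihood calculation itself (carried out by the Merlin engine) is assumed to have been done already.

```python
def flag_pair(specified, log_likelihoods):
    """Compare the specified relationship against all candidates.

    `log_likelihoods` maps a relationship label to the log-likelihood of
    the pair's genotype data under that relationship (hypothetical input).
    Returns (flagged, best_relationship, delta), where delta is the
    log-likelihood gain of the best relationship over the specified one.
    """
    best = max(log_likelihoods, key=log_likelihoods.get)
    delta = log_likelihoods[best] - log_likelihoods[specified]
    # A pair is flagged only when some alternative is strictly more likely.
    return delta > 0, best, delta


# Made-up values for a pair recorded as siblings in the pedigree.
example = {"SIBS": -1250.4, "HALFSIBS": -1231.9, "UNRELATED": -1302.7}
flagged, best, delta = flag_pair("SIBS", example)
print(flagged, best, round(delta, 1))
```

In this made-up case the half-sibling relationship explains the data better than the recorded sibling relationship, so the pair would be flagged.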

Before evaluating relationship likelihoods, relativeFinder flags individuals with outlier patterns of heterozygosity or missing genotypes. To do this, it checks whether the fraction of heterozygous sites or the total number of available genotypes differs from the sample average by three standard deviations or more. Results for any flagged samples should be treated with caution, as problematic heterozygosity or missing data patterns can lead to odd results from the relativeFinder algorithm.
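As an illustration of the three-standard-deviation rule, here is a minimal Python sketch. It is an assumption-laden outline, not the program's implementation: the function name `flag_outliers` and the sample IDs are hypothetical, and the choice of the population standard deviation (rather than the sample standard deviation) is a guess, since the text does not say which is used.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=3.0):
    """Return IDs whose value lies `threshold` or more SDs from the mean.

    `values` maps a sample ID to one per-sample summary, such as the
    fraction of heterozygous sites or the number of available genotypes.
    The population standard deviation is used here (an assumption).
    """
    mu = mean(values.values())
    sd = pstdev(values.values())
    if sd == 0:
        return []  # all samples identical, so nothing can be an outlier
    return [s for s, v in values.items() if abs(v - mu) >= threshold * sd]


# Hypothetical heterozygosity values: twenty typical samples and one low one.
heterozygosity = {f"S{i}": 0.30 for i in range(1, 21)}
heterozygosity["S21"] = 0.10
print(flag_outliers(heterozygosity))
```

The same function could be applied separately to the heterozygosity fractions and to the genotype counts, flagging a sample if it is an outlier on either measure.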

Command Line Options

-d datafile -p pedigreefile -m mapfile
A set of required input files in Merlin format. These files list the individuals to be evaluated. It is recommended that a whole genome worth of data should be available.
-t testRelationshipData -q testRelationshipPedigree
An optional set of files, also in Merlin format, that describe relationships to be considered by RelativeFinder.
-f alleleFrequencyModel
Allele frequencies can be provided in a Merlin format file (-f filename) or can be calculated from the available pedigree data. In the latter case, recommended options are -fa (estimate frequencies from all available genotypes), -ff (estimate frequencies from founder genotypes only) or -fm (estimate frequencies using a maximum likelihood algorithm).
--perAllele errorRate
Specifies the genotyping error rate per allele; the default is 0.002.
--perGenotype errorRate
Specifies the genotyping error rate per genotype. This option and the --perAllele option are mutually exclusive.
--allPairs
Specifies that all possible pairings should be considered.
--withinFamily
Specifies that only within-family pairings should be considered.
--job k --of N
Specifies that the analysis should be divided into N parallel batches and that the current invocation corresponds to batch k.
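For cluster deployment, one command line per batch is needed. The Python sketch below builds such command lines from the options listed above. Several details are assumptions rather than documented behavior: the executable name `relativeFinder`, the input file names, and the convention that batches are numbered from 1 to N.

```python
import shlex


def batch_commands(n_jobs, datafile, pedfile, mapfile, program="relativeFinder"):
    """Build one command line per batch using the --job k --of N options.

    The executable name and 1-based batch numbering are assumptions.
    """
    commands = []
    for k in range(1, n_jobs + 1):
        args = [program, "-d", datafile, "-p", pedfile, "-m", mapfile,
                "--job", str(k), "--of", str(n_jobs)]
        # shlex.join quotes any argument that contains shell metacharacters.
        commands.append(shlex.join(args))
    return commands


# Hypothetical file names; each line could be submitted as one cluster job.
for command in batch_commands(3, "study.dat", "study.ped", "study.map"):
    print(command)
```

Each generated line can then be handed to whatever job scheduler the cluster uses.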

Alternative Relationships

The set of relationships to be considered by RelativeFinder can be specified in a pair of Merlin format data and pedigree files. Names for these files are given with the -t (data file) and -q (pedigree file) command line options. Within the test pedigree, each family should describe one potential relationship, and the positions of the two individuals being evaluated should be labeled "TEST1" and "TEST2".

If these files are not available, RelativeFinder will consider a basic set of relationships for each pair of individuals (siblings, half-siblings, identical twins, parent-offspring pairs, unrelated individuals, or an avuncular relationship).

Files specifying the relationships to be considered might look like this:

Example of testCases.dat file

Z zygosity

In this case, the data file simply indicates that the five canonical columns in the pedigree (family id, individual id, father, mother and sex) will be followed by a column indicating whether individuals are MZ twins.

Example of testCases.ped file

MZTWIN        1   0    0    1  0
MZTWIN        2   0    0    2  0
MZTWIN    TEST1   1    2    1  MZ
MZTWIN    TEST2   1    2    1  MZ

SIBS          1   0    0    1  0
SIBS          2   0    0    2  0
SIBS      TEST1   1    2    1  0
SIBS      TEST2   1    2    2  0

HALFSIBS      1   0    0    1  0
HALFSIBS      2   0    0    2  0
HALFSIBS      3   0    0    2  0
HALFSIBS  TEST1   1    2    1  0
HALFSIBS  TEST2   1    3    2  0

AVUNCULAR     1   0    0    1  0
AVUNCULAR     2   0    0    2  0
AVUNCULAR     3   1    2    1  0
AVUNCULAR TEST1   1    2    2  0
AVUNCULAR     4   0    0    2  0
AVUNCULAR TEST2   3    4    1  0


PARENT-OFFSPRING      1     0      0     1  0
PARENT-OFFSPRING  TEST1     0      0     2  0
PARENT-OFFSPRING  TEST2     1  TEST1     2  0

UNRELATED TEST1   0    0    1  0
UNRELATED TEST2   0    0    2  0

In the example above, the pedigree describes 6 alternate relationships that match the default relationships considered by RelativeFinder (identical twins, siblings, half-siblings, avuncular pairs, parent-offspring pairs and unrelated individuals). Note that, for each putative relationship, the placement of the individuals being evaluated is indicated by the TEST1 and TEST2 tags.
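Because every family in the test pedigree must contain both a TEST1 and a TEST2 individual, it can be useful to check a hand-written template before running an analysis. The following Python sketch does this. It is a convenience check only, not part of RelativeFinder: the function name `missing_test_tags` is hypothetical, and the parser assumes whitespace-separated columns with the family id first and the individual id second.

```python
from collections import defaultdict


def missing_test_tags(ped_text):
    """Map each family id to the set of TEST tags it is missing.

    Assumes whitespace-separated columns: family id, individual id,
    father, mother, sex, followed by any further columns.
    """
    members = defaultdict(set)
    for line in ped_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        family, individual = fields[0], fields[1]
        members[family].add(individual)
    return {fam: {"TEST1", "TEST2"} - ids for fam, ids in members.items()}


template = """\
SIBS          1   0    0    1  0
SIBS          2   0    0    2  0
SIBS      TEST1   1    2    1  0
SIBS      TEST2   1    2    2  0

UNRELATED TEST1   0    0    1  0
"""
for family, missing in missing_test_tags(template).items():
    print(family, "OK" if not missing else "missing " + ", ".join(sorted(missing)))
```

In this deliberately incomplete template, the SIBS family passes and the UNRELATED family is reported as missing TEST2.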

Download

A source package is available for download from here.

Current Limitations and Todo List

The current implementation does not include support for X linked markers and should only be used with autosomal markers.

The current implementation simply reports the most likely relationship and the difference in log-likelihood between this relationship and the originally specified relationship. It would be better to use an E-M algorithm to calculate a prior probability for each relationship and to report as problematic only those pairs where the posterior probability of a mis-specified relationship is high.
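One way the proposed E-M approach might look is sketched below in Python. This is a generic mixture-proportion E-M, offered only as an illustration of the idea. It is not implemented in RelativeFinder, and the function names, the fixed iteration count and the example log-likelihoods are all hypothetical.

```python
import math


def posterior(log_lik, priors):
    """Posterior probability of each relationship for one pair."""
    shift = max(log_lik.values())  # subtract the maximum to avoid underflow
    weights = {r: priors[r] * math.exp(ll - shift) for r, ll in log_lik.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def em_relationship_priors(log_liks, n_iter=50):
    """Estimate prior probabilities of relationships across many pairs.

    `log_liks` is a list with one dict per pair, mapping each candidate
    relationship to the log-likelihood of that pair's genotype data.
    Returns (priors, posteriors), with one posterior dict per pair.
    """
    relationships = sorted(log_liks[0])
    priors = {r: 1.0 / len(relationships) for r in relationships}
    for _ in range(n_iter):
        # E-step: posterior relationship probabilities under current priors.
        posts = [posterior(ll, priors) for ll in log_liks]
        # M-step: each prior becomes the average posterior across pairs.
        priors = {r: sum(p[r] for p in posts) / len(posts) for r in relationships}
    return priors, [posterior(ll, priors) for ll in log_liks]


# Hypothetical log-likelihoods for three pairs and two candidate relationships.
pairs = [
    {"SIBS": -10.0, "UNRELATED": -30.0},
    {"SIBS": -12.0, "UNRELATED": -40.0},
    {"SIBS": -50.0, "UNRELATED": -20.0},
]
priors, posteriors = em_relationship_priors(pairs)
print({r: round(p, 3) for r, p in priors.items()})
```

A pair would then be reported as problematic only when the posterior probability of its originally specified relationship is low, rather than whenever any alternative has a higher likelihood.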