RelativeFinder

From Genome Analysis Wiki

relativeFinder is a program for checking relationships between pairs of individuals. Many excellent programs carry out similar tasks. Two features distinguish relativeFinder: its batch mode options, which allow a large job to be divided into many smaller jobs (suitable for deployment on a compute cluster), and the flexibility of the underlying Merlin engine, which allows relativeFinder to handle large pedigrees and to consider a variety of alternate relationships, including relationships specified by the user at run time.

How It Works

relativeFinder examines every pair of genotyped individuals in a pedigree. For each pair, it calculates the likelihood of the observed genotype data under the relationship specified in the pedigree and under a series of alternate relationships (described in a template pedigree file). If any of these alternate relationships makes the observed genotype data more likely, the pair is flagged.
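The flagging rule above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the program's actual code: the function name `flag_pair` and the log-likelihood values are made up for the example, and the likelihoods themselves would come from the Merlin engine.

```python
def flag_pair(specified_loglik, alternative_logliks):
    """Return (relationship, gain) if an alternate relationship explains the
    genotype data better than the pedigree-specified one, otherwise None.

    specified_loglik    -- log-likelihood under the relationship in the pedigree
    alternative_logliks -- dict mapping template name to log-likelihood
    """
    best_name, best_loglik = max(alternative_logliks.items(), key=lambda kv: kv[1])
    gain = best_loglik - specified_loglik
    return (best_name, gain) if gain > 0 else None


# Hypothetical numbers: a pair recorded as full siblings in the pedigree.
alternatives = {"SIBS": -1250.0, "HALFSIBS": -1238.5, "UNRELATED": -1301.2}
print(flag_pair(-1250.0, alternatives))
```

With these invented values the pair would be flagged, because the half-sibling template gives a higher log-likelihood than the specified relationship.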

Command Line Options

-d datafile -p pedigreefile -m mapfile
A set of required input files in Merlin format. These files list the individuals to be evaluated. It is recommended that a whole genome's worth of data be available.
-f alleleFrequencyModel
Allele frequencies can be provided in a Merlin format file (-f filename) or can be calculated from the available pedigree data. In the latter case, the recommended options are -fa (estimate frequencies from all available genotypes), -ff (estimate frequencies from founder genotypes only) or -fm (estimate frequencies using a maximum likelihood algorithm).
--perAllele errorRate
Specifies the genotyping error rate per allele; the default is 0.002.
--perGenotype errorRate
Specifies the genotyping error rate per genotype. This option and the --perAllele option are mutually exclusive.
--allPairs
Specifies that all possible pairings should be considered.
--withinFamily
Specifies that only within-family pairings should be considered.
--job k --of N
Specifies that the analysis should be divided into N parallel batches and that the current invocation corresponds to batch k.
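To illustrate the batch options, the sketch below shows one way a set of pairs could be divided into N batches, and how the N command lines might be assembled for a cluster scheduler. Several things here are assumptions: the round-robin assignment (the way relativeFinder actually divides the work is not described on this page), batch numbering starting at 1, the executable name `relativeFinder`, and the file names.

```python
from itertools import combinations


def pairs_for_job(individuals, k, n):
    """Pairs assigned to batch k of n (k assumed to start at 1), round-robin."""
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    all_pairs = combinations(individuals, 2)
    return [pair for index, pair in enumerate(all_pairs) if index % n == k - 1]


def batch_commands(n, datafile, pedfile, mapfile):
    """One argument list per batch, using only the options documented above."""
    return [
        ["relativeFinder", "-d", datafile, "-p", pedfile, "-m", mapfile,
         "--job", str(k), "--of", str(n)]
        for k in range(1, n + 1)
    ]


people = ["A", "B", "C", "D"]          # hypothetical individual ids
for k in range(1, 4):
    print(k, pairs_for_job(people, k, 3))

for command in batch_commands(3, "study.dat", "study.ped", "study.map"):
    print(" ".join(command))
```

Every pair lands in exactly one batch, so the outputs of the N jobs can simply be concatenated.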

Alternative Relationships

The set of relationships to be considered by relativeFinder is described in a pair of Merlin format data and pedigree files. Here is what they might look like:

Example of testCases.dat file

Z zygosity

In this case, the data file simply indicates that the five canonical columns in the pedigree (family id, individual id, father, mother and sex) will be followed by a column indicating whether individuals are MZ twins.

Example of testCases.ped file

MZTWIN        1   0    0    1  0
MZTWIN        2   0    0    2  0
MZTWIN    TEST1   1    2    1  MZ
MZTWIN    TEST2   1    2    2  MZ

SIBS          1   0    0    1  0
SIBS          2   0    0    2  0
SIBS      TEST1   1    2    1  0
SIBS      TEST2   1    2    2  0

HALFSIBS      1   0    0    1  0
HALFSIBS      2   0    0    2  0
HALFSIBS      3   0    0    2  0
HALFSIBS  TEST1   1    2    1  0
HALFSIBS  TEST2   1    3    2  0

AVUNCULAR     1   0    0    1  0
AVUNCULAR     2   0    0    2  0
AVUNCULAR     3   1    2    1  0
AVUNCULAR TEST1   1    2    2  0
AVUNCULAR     4   0    0    2  0
AVUNCULAR TEST2   3    4    1  0


PARENT-OFFSPRING      1     0      0     1  0
PARENT-OFFSPRING  TEST1     0      0     2  0
PARENT-OFFSPRING  TEST2     1  TEST1     2  0

UNRELATED TEST1   0    0    1  0
UNRELATED TEST2   0    0    2  0

In this case, the pedigree describes six alternate relationships (identical twins, siblings, half-siblings, avuncular pairs, parent-offspring pairs and unrelated individuals). In each putative relationship, the placement of the two individuals being evaluated is indicated by the TEST1 and TEST2 tags.
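When writing your own template families, a quick consistency check can catch simple mistakes before running an analysis. The Python sketch below is a hypothetical helper, not part of relativeFinder: it reads rows in the six-column layout shown above and reports families that lack a TEST1 or TEST2 tag or that name a parent who is not listed in the same family. The BROKEN family is invented to show what a failed check looks like.

```python
from collections import defaultdict

# Two families copied from the example above, plus a deliberately
# incomplete, made-up family called BROKEN.
TEMPLATE_TEXT = """
HALFSIBS      1   0    0    1  0
HALFSIBS      2   0    0    2  0
HALFSIBS      3   0    0    2  0
HALFSIBS  TEST1   1    2    1  0
HALFSIBS  TEST2   1    3    2  0

UNRELATED TEST1   0    0    1  0
UNRELATED TEST2   0    0    2  0

BROKEN    TEST1   1    2    1  0
"""


def read_templates(text):
    """Map family id -> {individual id: (father, mother, sex, zygosity)}."""
    families = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if not fields:              # skip blank separator lines
            continue
        family, person, father, mother, sex, zygosity = fields
        families[family][person] = (father, mother, sex, zygosity)
    return dict(families)


def check_template(members):
    """Return a list of problems found in one template family."""
    problems = [f"missing {tag}" for tag in ("TEST1", "TEST2") if tag not in members]
    for person, (father, mother, _sex, _zygosity) in members.items():
        for parent in (father, mother):
            if parent != "0" and parent not in members:
                problems.append(f"{person}: parent {parent} not in family")
    return problems


templates = read_templates(TEMPLATE_TEXT)
for family, members in templates.items():
    print(family, check_template(members) or "ok")
```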

Download

A source package is available for download from here.

Current Limitations and Todo List

The current implementation does not include support for X linked markers and should only be used with autosomal markers.

The current implementation looks for a pair of files called testCases.dat and testCases.ped in the current working directory. This pair of files specifies alternate relationships to be considered.

The current implementation simply reports the most likely relationship and the difference in likelihood between this relationship and the originally specified relationship. It would be better to use an EM (expectation-maximization) algorithm to calculate a prior probability for each relationship and to report as problematic only those pairs for which the posterior probability of a mis-specified relationship is high.
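The proposed improvement could look roughly like the Python sketch below. It is not implemented in relativeFinder and is only one possible reading of the idea: a single set of prior probabilities shared by all pairs is estimated by EM from per-pair log-likelihoods, and each pair then receives a posterior probability for every relationship. The function names and the log-likelihood values (natural log assumed) are invented for the example.

```python
import math


def posteriors(logliks, priors):
    """Posterior probability of each relationship for one pair."""
    peak = max(logliks)                       # subtract the peak to avoid underflow
    weights = [p * math.exp(l - peak) for p, l in zip(priors, logliks)]
    total = sum(weights)
    return [w / total for w in weights]


def estimate_priors(all_logliks, iterations=50):
    """EM estimate of the prior probability of each relationship."""
    n_rel = len(all_logliks[0])
    priors = [1.0 / n_rel] * n_rel            # start from a uniform prior
    for _ in range(iterations):
        # E-step: posterior for every pair under the current priors
        post = [posteriors(row, priors) for row in all_logliks]
        # M-step: new prior = average posterior across pairs
        priors = [sum(p[r] for p in post) / len(post) for r in range(n_rel)]
    return priors


relationships = ["SIBS", "HALFSIBS", "UNRELATED"]
# Hypothetical log-likelihoods, one row per pair, one column per relationship.
data = [
    [-100.0, -112.0, -140.0],
    [-105.0, -118.0, -150.0],
    [-98.0, -110.0, -139.0],
    [-130.0, -115.0, -128.0],
]
priors = estimate_priors(data)
last_pair = posteriors(data[3], priors)
print(dict(zip(relationships, priors)))
print(dict(zip(relationships, last_pair)))
```

If the last pair were recorded as full siblings in the pedigree, its posterior probability of being mis-specified would be one minus the SIBS entry of `last_pair`.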