RelativeFinder
relativeFinder is a program for checking relationships between pairs of individuals. There are many excellent programs that carry out similar tasks. The distinguishing features of relativeFinder are its batch mode options, which allow large jobs to be divided into many smaller jobs (suitable for deployment on a compute cluster), and the flexibility of the underlying Merlin engine, which allows relativeFinder to handle large pedigrees and consider a variety of alternate relationships, including potential relationships specified by the user on the fly.
How It Works
relativeFinder examines every pair of genotyped individuals in a pedigree. For each pair, it calculates the likelihood of the observed genotype data using the relationship specified in the pedigree and using a series of alternate relationships (described in a template pedigree file). If any of these alternate relationships makes the observed genotype data more likely, the pair is flagged.
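The flagging rule above can be illustrated with a minimal sketch. It assumes the per-relationship log-likelihoods have already been computed elsewhere; the function name `flag_pair` and the dictionary layout are hypothetical and are not part of relativeFinder.

```python
def flag_pair(loglik_by_relationship, specified):
    """Compare the specified relationship against all candidates.

    loglik_by_relationship: dict mapping a relationship label to the
        log-likelihood of the pair's genotype data (hypothetical input).
    specified: label of the relationship given in the pedigree.

    Returns (best_label, loglik_difference) when some alternative makes
    the data more likely than the specified relationship, else None.
    """
    baseline = loglik_by_relationship[specified]
    best = max(loglik_by_relationship, key=loglik_by_relationship.get)
    difference = loglik_by_relationship[best] - baseline
    if difference > 0:
        return best, difference
    return None
```

A pair recorded as unrelated whose data fit a sibling relationship better would be returned together with the size of the log-likelihood gap; a pair whose specified relationship is already the best fit returns `None`.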
Before evaluating relationship likelihoods, relativeFinder flags individuals with outlier patterns of heterozygosity or missing genotypes. To do this, relativeFinder checks whether the fraction of heterozygous sites or the total number of available genotypes differs from the sample average by three standard deviations or more. Results for any flagged samples should be treated with caution, as problematic heterozygosity or missing data patterns can lead to odd results from the relativeFinder algorithm.
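The three-standard-deviation check can be sketched as follows. This is an illustration only: the function name `flag_outliers` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator relativeFinder uses.

```python
from statistics import mean, pstdev

def flag_outliers(value_by_sample, threshold=3.0):
    """Return sample ids whose value lies `threshold` or more standard
    deviations from the sample average.

    value_by_sample: dict mapping a sample id to one summary statistic,
        e.g. the fraction of heterozygous sites or the number of
        available genotypes.
    Assumption: population standard deviation (pstdev) is used.
    """
    values = list(value_by_sample.values())
    average = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all samples identical, nothing to flag
    return [
        sample
        for sample, value in value_by_sample.items()
        if abs(value - average) >= threshold * spread
    ]
```

The same function would be applied twice, once to heterozygosity fractions and once to genotype counts.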
Command Line Options
- -d datafile -p pedigreefile -m mapfile
- A set of required input files in Merlin format. These files list the individuals to be evaluated. Genome-wide data is recommended.
- -f alleleFrequencyModel
- Allele frequencies can be provided in a Merlin format file (-f filename) or can be calculated from the available pedigree data. In the latter case, the recommended options are -fa (estimate frequencies from all available genotypes), -ff (estimate frequencies from founder genotypes only) or -fm (estimate frequencies using a maximum likelihood algorithm).
- --perAllele errorRate
- Specifies the genotyping error rate per allele; the default is 0.002.
- --perGenotype errorRate
- Specifies the genotyping error rate per genotype. This option and the --perAllele option are mutually exclusive.
- --allPairs
- Specifies that all possible pairings should be considered.
- --withinFamily
- Specifies that only within-family pairings should be considered.
- --job k --of N
- Specifies that the analysis should be divided into N parallel batches and that the current invocation corresponds to batch k.
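To illustrate the idea behind --job k --of N, the sketch below splits the list of all pairs into N batches and returns the pairs belonging to batch k. The round-robin assignment, the 1-based numbering of k, and the function name `pairs_for_job` are all assumptions made for illustration; the text does not describe how relativeFinder actually partitions the work.

```python
from itertools import combinations

def pairs_for_job(individuals, k, n):
    """Return the pairs assigned to batch k of n.

    Assumptions (not taken from relativeFinder): batches are numbered
    from 1, and pairs are dealt out round-robin in enumeration order.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return [
        pair
        for index, pair in enumerate(combinations(individuals, 2))
        if index % n == k - 1
    ]
```

Because every pair lands in exactly one batch, running all N jobs covers the same pairs as a single unbatched run.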
Alternative Relationships
The set of relationships to be considered by relativeFinder is described in a pair of Merlin format data and pedigree files. Here is what they might look like:
Example of testCases.dat file
Z zygosity
In this case, the data file simply indicates that the five canonical columns in the pedigree (family id, individual id, father, mother and sex) will be followed by a column indicating whether individuals are MZ twins.
Example of testCases.ped file
MZTWIN 1 0 0 1 0
MZTWIN 2 0 0 2 0
MZTWIN TEST1 1 2 1 MZ
MZTWIN TEST2 1 2 2 MZ
SIBS 1 0 0 1 0
SIBS 2 0 0 2 0
SIBS TEST1 1 2 1 0
SIBS TEST2 1 2 2 0
HALFSIBS 1 0 0 1 0
HALFSIBS 2 0 0 2 0
HALFSIBS 3 0 0 2 0
HALFSIBS TEST1 1 2 1 0
HALFSIBS TEST2 1 3 2 0
AVUNCULAR 1 0 0 1 0
AVUNCULAR 2 0 0 2 0
AVUNCULAR 3 1 2 1 0
AVUNCULAR TEST1 1 2 2 0
AVUNCULAR 4 0 0 2 0
AVUNCULAR TEST2 3 4 1 0
PARENT-OFFSPRING 1 0 0 1 0
PARENT-OFFSPRING TEST1 0 0 2 0
PARENT-OFFSPRING TEST2 1 TEST1 2 0
UNRELATED TEST1 0 0 1 0
UNRELATED TEST2 0 0 2 0
In this case, the pedigree describes 6 alternate relationships (identical twins, siblings, half-siblings, avuncular pairs, parent-offspring pairs and unrelated individuals). In each putative relationship, the placement of the individuals being evaluated is indicated by the TEST1 and TEST2 tags.
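As a rough illustration of the template layout, the sketch below groups the lines of a testCases.ped style file by family id and checks that each candidate relationship places both TEST1 and TEST2 exactly once. The helper names `read_template` and `has_test_pair` are hypothetical; relativeFinder itself reads these files through the Merlin engine.

```python
from collections import defaultdict

def read_template(lines):
    """Group template pedigree lines by family id.

    Each line holds the five canonical columns (family id, individual
    id, father, mother, sex) followed by the zygosity column.
    """
    families = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        family, person, father, mother, sex = fields[:5]
        families[family].append({
            "id": person,
            "father": father,
            "mother": mother,
            "sex": sex,
            "zygosity": fields[5] if len(fields) > 5 else "0",
        })
    return dict(families)

def has_test_pair(members):
    """True when the family contains exactly one TEST1 and one TEST2."""
    ids = [member["id"] for member in members]
    return ids.count("TEST1") == 1 and ids.count("TEST2") == 1
```

Calling `read_template` on the example above would yield one entry per relationship label (MZTWIN, SIBS, and so on), each of which should pass `has_test_pair`.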
Download
A source package is available for download from here.
Current Limitations and Todo List
The current implementation does not support X-linked markers and should only be used with autosomal markers.
The current implementation looks for a pair of files called testCases.dat and testCases.ped in the current working directory. This pair of files specifies alternate relationships to be considered.
The current implementation simply reports the most likely relationship and the difference in likelihood between this relationship and the originally specified relationship. It would be better to use an E-M algorithm to calculate a prior probability for each relationship and to only report as problematic pairs where the posterior probability of a mis-specified relationship is high.