ExomePicks is a program that suggests which individuals to sequence in a large pedigree. ExomePicks assumes that a genotyping chip or another cost-effective method will be used to determine IBD sharing in the pedigree and that, subsequently, one would like to sequence a minimal number of individuals and use their sequences, together with the IBD information, to deduce the sequences of other individuals in the pedigree. We are currently using it in whole exome and whole genome sequencing studies to pick individuals to be sequenced from large family collections.
An extended version of the package, which produces output for visualization with ExomePicksViewer, is under development; for details contact Christian Fuchsberger.
The data file describes the contents of the pedigree file and should include, minimally, an entry to specify which individuals in the pedigree are genotyped. An additional entry to indicate which individuals are affected for a trait of interest can also be included, but is not required. Any other entries that are present will be safely ignored.
Here is an example of a minimal data file:
< contents of small.dat >
A dna
A disease
The pedigree file describes relationships among individuals and indicates which individuals have DNA available (and can thus be selected for sequencing) and which are affected (and, perhaps, more valuable).
Here is an example of a simple pedigree file:
< contents of small.ped >
1 I1   0    0    1 2 1
1 I2   0    0    2 2 1
1 I3   0    0    1 2 1
1 I4   0    0    2 2 1
1 II1  I1   I2   1 2 1
1 II2  I3   I4   2 2 1
1 III1 0    0    1 2 1
1 III2 II1  II2  2 2 1
1 III3 II1  II2  2 2 1
1 III4 II1  II2  2 2 1
1 III5 II1  II2  2 2 1
1 IV1  III1 III2 1 2 1
1 IV2  III1 III2 1 2 1
If you have pen and paper, you should be able to verify that the file describes a simple four generation pedigree. The first five columns denote family id, individual id, father id, mother id, and sex. The remaining two columns correspond, in order, to the dna and disease entries in the data file. DNA is available for all individuals (value 2 in DNA column) and all individuals are unaffected (value 1 in disease column).
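For readers without pen and paper, the same check can be done with a few lines of Python. This is only a minimal sketch: the column labels and the `read_pedigree` helper are names made up for this illustration and are not part of ExomePicks. It assumes the pedigree file is whitespace-delimited with exactly the seven columns shown in small.ped.

```python
import io

# Contents of small.ped from the example (whitespace-delimited).
SMALL_PED = """\
1 I1 0 0 1 2 1
1 I2 0 0 2 2 1
1 I3 0 0 1 2 1
1 I4 0 0 2 2 1
1 II1 I1 I2 1 2 1
1 II2 I3 I4 2 2 1
1 III1 0 0 1 2 1
1 III2 II1 II2 2 2 1
1 III3 II1 II2 2 2 1
1 III4 II1 II2 2 2 1
1 III5 II1 II2 2 2 1
1 IV1 III1 III2 1 2 1
1 IV2 III1 III2 1 2 1
"""

# Hypothetical labels for this sketch; "dna" and "disease" follow the
# order of the entries in small.dat.
COLUMNS = ("famid", "id", "father", "mother", "sex", "dna", "disease")


def read_pedigree(handle):
    """Return one dict per non-empty line of a whitespace-delimited pedigree."""
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError("expected %d columns: %r" % (len(COLUMNS), line))
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


people = read_pedigree(io.StringIO(SMALL_PED))
# Founders have both parents coded as 0.
founders = [p["id"] for p in people if p["father"] == "0" and p["mother"] == "0"]
with_dna = [p["id"] for p in people if p["dna"] == "2"]
unaffected = [p["id"] for p in people if p["disease"] == "1"]

print(len(people), "individuals,", len(with_dna), "with DNA,",
      len(unaffected), "unaffected")
print("founders:", " ".join(founders))
```

To run the sketch against a file on disk, pass an open file object to `read_pedigree` instead of the `io.StringIO` wrapper.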
ExomePicks currently ignores any information on twin status that may be present.
The only essential command line options are those that specify input file names, thus:
ExomePicks -d small.dat -p small.ped
The program loops through sibships, starting at the top of the pedigree, and suggests individuals for sequencing as it moves through. In pedigrees where DNA samples are available for everyone, it selects every founder (to identify all segregating chromosomes) plus at least one offspring per founder (to determine phase). When founder DNA is missing, it selects additional offspring for each founder couple (if possible) or in sibships internal to the pedigree (if, for example, no DNA sample is available for the offspring of a founder couple).
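The simplest case described above, where DNA is available for everyone, can be sketched in Python as follows. This is an illustration of the stated rule only, not the ExomePicks implementation: the function name `suggest_picks` is hypothetical, the sketch picks exactly one child per sibship that has a founder parent (taking the first child listed, which is an arbitrary choice), and it does not model the missing-DNA rules at all.

```python
# (id, father, mother) triples taken from small.ped; "0" marks an unknown parent.
PEDIGREE = [
    ("I1", "0", "0"), ("I2", "0", "0"), ("I3", "0", "0"), ("I4", "0", "0"),
    ("II1", "I1", "I2"), ("II2", "I3", "I4"),
    ("III1", "0", "0"),
    ("III2", "II1", "II2"), ("III3", "II1", "II2"),
    ("III4", "II1", "II2"), ("III5", "II1", "II2"),
    ("IV1", "III1", "III2"), ("IV2", "III1", "III2"),
]


def suggest_picks(pedigree):
    """Suggest founder parents plus one child for each sibship that has them.

    Assumes DNA is available for every individual.
    """
    parents = {pid: (father, mother) for pid, father, mother in pedigree}

    def is_founder(pid):
        return pid != "0" and parents[pid] == ("0", "0")

    # Group children into sibships keyed by parent couple, in file order.
    sibships = {}
    for pid, father, mother in pedigree:
        if not is_founder(pid):
            sibships.setdefault((father, mother), []).append(pid)

    picks = []
    for couple, children in sibships.items():
        founder_parents = [p for p in couple if is_founder(p)]
        if founder_parents:
            # Assumption: the first listed child supplies the phase information.
            picks.append(founder_parents + [children[0]])
    return picks


picks = suggest_picks(PEDIGREE)
for group in picks:
    print(" ".join(group))
print("individuals suggested:", sum(len(group) for group in picks))
```

On the example pedigree this sketch suggests two founder trios and one father-child pair, and skips the sibship whose parents (II1 and II2) are not founders.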
ExomePicks summarizes its suggestions in three files.
Per Nuclear Family
Because suggestions are made one nuclear family at a time, this file should be considered the primary output of the program. The default name for the file is perFamily-sequencing.txt. Here is an example:
FAMID ID1  ID2 ID3 TYPE   SEQ RETURN RATIO
1     I1   I2  II1 TRIO     3   5.50  1.83
1     I3   I4  II2 TRIO     3   5.50  1.83
1     III1 IV1 --  FATHER   2   2.00  1.00
In this particular pedigree, individuals for sequencing were selected from three nuclear families. In the first two nuclear families, a parent-offspring trio was selected (resulting in three sequenced individuals). In each of these cases sequencing 3 individuals will provide information on approximately 5.5 genomes. In the final nuclear family, a father-offspring pair was selected for sequencing. This particular pair requires 2 individuals to be sequenced and provides information on only 2 genomes (if other nuclear families in the pedigree are sequenced as suggested).
The program also summarizes its suggestions on a per individual basis. Although it is attractive to select one individual at a time for sequencing, it is important to note that some individuals (e.g. the offspring of a trio) don't contribute information on new genomes (their genome is contained in their parents' genomes) but do provide essential information about phase. In general, it is probably safer to select individuals for sequencing based on the per nuclear family output.
Here is an example:
FAMID ID   WHO    VALUE AFFVALUE
1     I1   FATHER  2.75     0.00
1     I2   MOTHER  2.75     0.00
1     II1  CHILD   0.00     0.00
1     I3   FATHER  2.75     0.00
1     I4   MOTHER  2.75     0.00
1     II2  CHILD   0.00     0.00
1     III1 FATHER  2.00     0.00
1     IV1  CHILD   0.00     0.00
One useful feature of this file is that it also includes information on how many disease genomes can be deduced from sequencing each individual.
A final output file provides summary statistics for each extended family. Here is an example:
FAMID DNA SEQ RETURN RATIO
1      13   8  13.00  1.62
In this case, sequencing 8 individuals would provide information on 13 genomes.
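The three example outputs are consistent with one another, which can be checked with a little arithmetic. The sketch below copies the numbers from the examples and assumes, based only on those numbers, that RATIO is RETURN divided by SEQ and that the per family totals add up to the extended family summary; the variable names are made up for this illustration.

```python
# (TYPE, SEQ, RETURN) rows copied from the per nuclear family example.
per_family = [("TRIO", 3, 5.50), ("TRIO", 3, 5.50), ("FATHER", 2, 2.00)]

# VALUE column copied from the per individual example.
per_individual_values = [2.75, 2.75, 0.00, 2.75, 2.75, 0.00, 2.00, 0.00]

# Assumption: RATIO = RETURN / SEQ for each nuclear family.
for kind, seq, ret in per_family:
    print(kind, seq, ret, "ratio = %.4f" % (ret / seq))

total_seq = sum(seq for _, seq, _ in per_family)
total_return = sum(ret for _, _, ret in per_family)
total_value = sum(per_individual_values)

print("sequenced individuals:", total_seq)
print("genomes recovered (per family sum):", total_return)
print("genomes recovered (per individual sum):", total_value)
print("overall ratio = %.4f" % (total_return / total_seq))
```

Both sums give 13 genomes for 8 sequenced individuals, matching the summary line, and 13 / 8 = 1.625 agrees with the 1.62 shown in the RATIO column.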
Improvements Under Consideration
- Change sorting so that the most valuable individuals for each pedigree are picked first. When resources are limited, it might not be affordable to sequence enough individuals to completely impute each pedigree. This scoring order would allow one to more easily focus on the highest value individuals in each pedigree.
- Implement the ability to evaluate only nuclear families, ignoring the possibility of imputing more distant relatives.
- Add the summed number of sequenced or imputed affecteds to the per family output.
- Unrelated single individuals (founders without offspring) are currently scored as zero value. These should more accurately be scored as single genomes.
Thanks to Weimin Chen and Serena Sanna for discussions that contributed to the initial version of this program, and to John Blangero at the Southwest Foundation for testing the initial version.