ExomePicks

ExomePicks is a program that suggests which individuals in a large pedigree should be sequenced. It assumes that a genotyping chip or another cost-effective method will be used to determine IBD sharing in the pedigree, and that one would then like to sequence a minimal number of individuals and use their sequences, together with the IBD information, to deduce the sequences of the other individuals in the pedigree. We are currently using it in whole exome and whole genome sequencing studies to pick individuals for sequencing from large family collections.

Download

A source code package can be downloaded from here. A Windows version can be found here.

An extended version of the package, which produces output for visualization with ExomePicksViewer, is under development; for details, contact Christian Fuchsberger.

Input Files

A pedigree and data file in Merlin format are required as input.

The data file describes the contents of the pedigree file and should include, minimally, an entry to specify which individuals in the pedigree are genotyped. An additional entry to indicate which individuals are affected for a trait of interest can also be included, but is not required. Any other entries that are present will be safely ignored.

Here is an example of a minimal data file:

 < contents of small.dat >
 A dna
 A disease

The pedigree file describes relationships among individuals. It also indicates the individuals for whom DNA is available (and who can thus be selected for sequencing) and the individuals who are affected (and thus, perhaps, more valuable to sequence).

Here is an example of a simple pedigree file:

 < contents of small.ped >
 1 I1   0    0    1  2 1
 1 I2   0    0    2  2 1
 1 I3   0    0    1  2 1
 1 I4   0    0    2  2 1
 1 II1  I1   I2   1  2 1
 1 II2  I3   I4   2  2 1
 1 III1 0    0    1  2 1
 1 III2 II1  II2  2  2 1
 1 III3 II1  II2  2  2 1
 1 III4 II1  II2  2  2 1
 1 III5 II1  II2  2  2 1
 1 IV1  III1 III2 1  2 1
 1 IV2  III1 III2 1  2 1 

If you have pen and paper, you should be able to verify that the file describes a simple four generation pedigree. The first five columns denote family id, individual id, father id, mother id, and sex. The remaining two columns follow the order of the entries in the data file: DNA availability, then disease status. In this example, DNA is available for all individuals (value 2 in the DNA column) and all individuals are unaffected (value 1 in the disease column).
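If pen and paper are not at hand, the same check can be sketched in a few lines of Python. This is only an illustration: the `Person` record and the `parse_pedigree` function are hypothetical helpers, not part of ExomePicks, and the column order is assumed to match the small.dat and small.ped examples (DNA, then disease).

```python
from typing import NamedTuple

# Hypothetical record type for one row of small.ped (not part of ExomePicks).
class Person(NamedTuple):
    famid: str
    pid: str
    father: str
    mother: str
    sex: int
    dna: int      # 2 = DNA available in the example file
    disease: int  # 1 = unaffected in the example file

SMALL_PED = """\
1 I1   0    0    1  2 1
1 I2   0    0    2  2 1
1 I3   0    0    1  2 1
1 I4   0    0    2  2 1
1 II1  I1   I2   1  2 1
1 II2  I3   I4   2  2 1
1 III1 0    0    1  2 1
1 III2 II1  II2  2  2 1
1 III3 II1  II2  2  2 1
1 III4 II1  II2  2  2 1
1 III5 II1  II2  2  2 1
1 IV1  III1 III2 1  2 1
1 IV2  III1 III2 1  2 1
"""

def parse_pedigree(text):
    """Read whitespace-delimited pedigree rows with seven columns."""
    people = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or incomplete lines
        famid, pid, father, mother = fields[:4]
        sex, dna, disease = (int(x) for x in fields[4:7])
        people.append(Person(famid, pid, father, mother, sex, dna, disease))
    return people

people = parse_pedigree(SMALL_PED)
# Founders are individuals whose parents are both coded as 0.
founders = [p.pid for p in people if p.father == "0" and p.mother == "0"]
with_dna = [p.pid for p in people if p.dna == 2]
print(len(people), founders, len(with_dna))
```

For the example file, this reports 13 individuals, of whom five (I1, I2, I3, I4 and III1) are founders, with DNA available for all 13.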

ExomePicks currently ignores any information on twin status that may be present.

Command Line

The only essential command line options are those that specify input file names, thus:

 ExomePicks -d small.dat -p small.ped

Algorithm

The program loops through sibships, starting at the top of the pedigree, and suggests individuals for sequencing as it moves through. In pedigrees where DNA samples are available for everyone, it selects every founder (to identify all segregating chromosomes) plus at least one offspring per founder (to determine phase). When founder DNA is missing, it selects additional offspring for each founder couple (if possible) or in sibships internal to the pedigree (for example, if no DNA sample is available for the offspring of a founder couple).
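The simplest case described above, where DNA is available for everyone, can be sketched as follows. This is not the ExomePicks source code, only a minimal illustration under stated assumptions: the function name `pick_fully_sampled` is hypothetical, the first child listed in each sibship is chosen arbitrarily, and missing DNA and founders with several spouses are not handled.

```python
from collections import defaultdict

# (individual id, father id, mother id) for the small.ped example.
PEDIGREE = [
    ("I1", "0", "0"), ("I2", "0", "0"), ("I3", "0", "0"), ("I4", "0", "0"),
    ("II1", "I1", "I2"), ("II2", "I3", "I4"),
    ("III1", "0", "0"),
    ("III2", "II1", "II2"), ("III3", "II1", "II2"),
    ("III4", "II1", "II2"), ("III5", "II1", "II2"),
    ("IV1", "III1", "III2"), ("IV2", "III1", "III2"),
]

def pick_fully_sampled(pedigree):
    """For each sibship with at least one founder parent, pick the founder
    parent(s) plus one offspring. Assumes DNA is available for everyone."""
    founders = {pid for pid, father, mother in pedigree
                if father == "0" and mother == "0"}
    # Group children by parental pair, keeping the order of the file.
    sibships = defaultdict(list)
    for pid, father, mother in pedigree:
        if father != "0" and mother != "0":
            sibships[(father, mother)].append(pid)
    picks = []
    for (father, mother), children in sibships.items():
        founder_parents = [p for p in (father, mother) if p in founders]
        if founder_parents:
            # Hypothetical tie-break: take the first child listed.
            picks.append((founder_parents, children[0]))
    return picks

for parents, child in pick_fully_sampled(PEDIGREE):
    print(parents, child)
```

On the example pedigree, the sketch picks two trios (I1, I2, II1 and I3, I4, II2) and one parent-offspring pair (III1, IV1), eight individuals in total. The sibship of II1 and II2 contributes no picks because neither parent is a founder.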

Outputs

ExomePicks summarizes its suggestions in three files.

Per Nuclear Family

Because suggestions are made one nuclear family at a time, this file should be considered the primary output of the program. The default name for the file is perFamily-sequencing.txt. Here is an example:

 FAMID   ID1     ID2     ID3     TYPE    SEQ     RETURN  RATIO
 1       I1      I2      II1     TRIO    3       5.50    1.83
 1       I3      I4      II2     TRIO    3       5.50    1.83
 1       III1    IV1     --      FATHER  2       2.00    1.00

In this particular pedigree, individuals for sequencing were selected from three nuclear families. In the first two nuclear families, a parent-offspring trio was selected (resulting in three sequenced individuals). In each of these cases, sequencing 3 individuals will provide information on approximately 5.5 genomes. In the final nuclear family, a father-offspring pair was selected for sequencing. This pair requires 2 individuals to be sequenced and provides information on only 2 genomes (assuming the other nuclear families in the pedigree are sequenced as suggested).
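The columns of this file are easy to post-process. The sketch below reads the example output and checks one assumption that the numbers above are consistent with, but that is not stated explicitly: that RATIO is RETURN divided by SEQ. The `read_table` helper is hypothetical and simply splits on whitespace.

```python
PER_NUCLEAR_FAMILY = """\
FAMID   ID1     ID2     ID3     TYPE    SEQ     RETURN  RATIO
1       I1      I2      II1     TRIO    3       5.50    1.83
1       I3      I4      II2     TRIO    3       5.50    1.83
1       III1    IV1     --      FATHER  2       2.00    1.00
"""

def read_table(text):
    """Split a whitespace-delimited table into one dict per data row."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]

rows = read_table(PER_NUCLEAR_FAMILY)

# Assumption: RATIO = RETURN / SEQ, reported to two decimal places.
ratios_match = all(
    abs(float(r["RETURN"]) / int(r["SEQ"]) - float(r["RATIO"])) < 0.01
    for r in rows
)
# Total number of individuals suggested for sequencing in this pedigree.
total_seq = sum(int(r["SEQ"]) for r in rows)
print(ratios_match, total_seq)
```

For the example, the ratios agree with the assumption and the suggested selections add up to 8 sequenced individuals.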

Per Individual

The program also summarizes its suggestions on a per individual basis. Although it is attractive to select one individual at a time for sequencing, it is important to note that some individuals (e.g. the offspring in a trio) don't contribute information on new genomes (their genome is contained in their parents' genomes) but do provide essential information about phase. In general, it is probably safer to select individuals for sequencing based on the per nuclear family output.

Here is an example:

 FAMID   ID      WHO     VALUE   AFFVALUE
 1       I1      FATHER  2.75    0.00
 1       I2      MOTHER  2.75    0.00
 1       II1     CHILD   0.00    0.00
 1       I3      FATHER  2.75    0.00
 1       I4      MOTHER  2.75    0.00
 1       II2     CHILD   0.00    0.00
 1       III1    FATHER  2.00    0.00
 1       IV1     CHILD   0.00    0.00

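One useful feature of this file is that it also reports, in the AFFVALUE column, how many genomes of affected individuals can be deduced from sequencing each individual. In this example the column is zero throughout because nobody in the pedigree is affected.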
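The per individual values can be totalled for each family, for instance to compare them with the per family summary. The sketch below does this for the example output; the `summarize` function is a hypothetical helper, and the idea that the VALUE column sums to the per family RETURN is an assumption that the example numbers are consistent with, not a documented guarantee.

```python
from collections import defaultdict

PER_INDIVIDUAL = """\
FAMID   ID      WHO     VALUE   AFFVALUE
1       I1      FATHER  2.75    0.00
1       I2      MOTHER  2.75    0.00
1       II1     CHILD   0.00    0.00
1       I3      FATHER  2.75    0.00
1       I4      MOTHER  2.75    0.00
1       II2     CHILD   0.00    0.00
1       III1    FATHER  2.00    0.00
1       IV1     CHILD   0.00    0.00
"""

def summarize(text):
    """Total the per individual rows for each family id."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    totals = defaultdict(lambda: {"SEQ": 0, "VALUE": 0.0, "AFFVALUE": 0.0})
    for row in rows:
        record = dict(zip(header, row))
        family = totals[record["FAMID"]]
        family["SEQ"] += 1  # one row per individual suggested for sequencing
        family["VALUE"] += float(record["VALUE"])
        family["AFFVALUE"] += float(record["AFFVALUE"])
    return dict(totals)

summary = summarize(PER_INDIVIDUAL)
for famid, totals in summary.items():
    print(famid, totals["SEQ"], totals["VALUE"], totals["AFFVALUE"])
```

For family 1, this gives 8 individuals with a summed VALUE of 13.0, and a summed AFFVALUE of 0.0 because no one in the example is affected.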

Per Family

A final output file provides summary statistics for each extended family. Here is an example:

 FAMID   DNA     SEQ     RETURN  RATIO
 1       13      8       13.00   1.62

In this case, sequencing 8 individuals would provide information on 13 genomes.

Improvements Under Consideration

  • Change sorting so that the most valuable individuals for each pedigree are picked first. When resources are limited, it might not be affordable to sequence enough individuals to completely impute each pedigree. This scoring order would allow one to more easily focus on the highest value individuals in each pedigree.
  • Implement the ability to evaluate only nuclear families, ignoring the possibility of imputing more distant relatives.
  • Add the summed number of sequenced or imputed affecteds to the per family output.
  • Unrelated single individuals (founders without offspring) are currently scored as zero value. These should more accurately be scored as single genomes.

Acknowledgements

We thank Weimin Chen and Serena Sanna for discussions that contributed to the initial version of this program, and John Blangero at the Southwest Foundation for testing the initial version.