Verifying Sample Identities - Implementation

From Genome Analysis Wiki
Jump to: navigation, search

Principle

We should be able to verify that the right sample has been sequenced by comparing base calls in a read to known genotypes for a sample. If the sample has been sequenced correctly, the base calls should match previously known genotypes. If the wrong sample has been sequenced, we will see quite a bit more mismatches.

Mathematical Details

For each sample, we would like to calculate the likelihood of a set of reads assuming that we sequenced the correct sample, assuming we sequenced a sample related to the correct sample, or assuming we sequenced an incorrect sample. We would then like to flag samples where it appears likely that the wrong sample has been sequenced.

If we have a list of bases that overlap a known genotype, we can will describe the probability of a matching of mismatching base using the following notation:

Notation
Variable Definition
A/A Previously known genotype; we only consider homozygous sites.
PA Frequency of allele A in the population
Pibd Probability that the sequenced sample and the target sample share a chromosome. This should be 1.0 when we have sequenced the correct sample and 0.0 if we sequence an unrelated sample. If we sequence a related sample (e.g. a parent or sibling of the target sample), we will see intermediate values.
\epsilon Estimate error rate for the current base in the sequence data.

Then, the probabilities of interest are:


P(match) = P_{ibd} (1 - \epsilon) + (1 - P_{ibd}) P_A


P(no match) = P_{ibd} \epsilon + (1 - P_{ibd}) (1 - P_A)

For now, genotypeIDcheck uses:


P(no match) = 1 - P(match)

If we also wish to use heterozygous sites, rather than limiting our comparison to reference homozygous sites, we could use:


P(match) = P(no match) = P_{ibd} 0.5 + (1.0 - P_{ibd}) P_A

For any given value of P_{ibd} we can calculate this quantity for each read that overlaps a site with a known homozygous genotype. In addition, we can take the product of this quantity across all sites examined -- because this product is likely to be very small, we actually sum the logs of the appropriate quantities rather than multiplying them together.

To decide if we have sequenced the correct sample, we should do the following:

a) Evaluate this log-sum assuming P_{ibd} = 1.0. This assumes that we have sequenced the target sample.

b) Evaluate this log-sum assuming P_{ibd} = 0.0. This assumes we sequenced a different sample, unrelated to the target.

c) Evaluate this log-sum assuming P_{ibd} = 0.5. This assumes we sequenced a sample that shares half the genome with the target sample, perhaps because it is a sibling or parent of the target sample.

d) If desired, evaluate the same log-sum for other intermediate values of P_{ibd}. It may be interesting to set P_{ibd} = 0.95 to allow for 5% of reads that are derived from a different sample, for example, due to contamination. It may be interesting to set P_{ibd} = 0.05 to consider more distant relatives.

Once the result of evaluating a), b), c) and d) are available, we can decide if the target sample has been sequenced. Sequencing the target sample will mean that the log-sum in a) is the largest. Sequencing a parent or offspring of the target sample will maximize c). Sequencing a completely incorrect sample will maximize b).

If all the log-sums are very similar, then we don't have enough information to make a clear cut decision. Typically, we thousands of genetic markers from a typical SNP chip and whole genome shotgun sequence data, most decisions should be very clear cut.

Implementation Details

After loading genotypes, we generate a genome mask for each position. There are three outcomes of interest:

Known Genotypes
These are sites where we have a previously observed a genotype call and where we will be evaluating match / mismatch rates to determine sample identity.
dbSNP sites
These are sites that are known to vary among individuals, but for which a known genotype is not available.
Background sites
These are all other sites and can be used to estimate the \epsilon error rate parameter.