Verifying Sample Identities - Implementation
Principle
We should be able to verify that the right sample has been sequenced by comparing base calls in a read to known genotypes for a sample. If the sample has been sequenced correctly, the base calls should match previously known genotypes. If the wrong sample has been sequenced, we will see quite a bit more mismatches.
Mathematical Details
For each sample, we would like to calculate the likelihood of a set of reads assuming that we sequenced the correct sample, assuming we sequenced a sample related to the correct sample, or assuming we sequenced an incorrect sample. We would then like to flag samples where it appears likely that the wrong sample has been sequenced.
If we have a list of bases that overlap a known genotype, we can will describe the probability of a matching of mismatching base using the following notation:
Variable | Definition |
A/A | Previously known genotype; we only consider homozygous sites. |
P_{A} | Frequency of allele A in the population |
P_{ibd} | Probability that the sequenced sample and the target sample share a chromosome. This should be 1.0 when we have sequenced the correct sample and 0.0 if we sequence an unrelated sample. If we sequence a related sample (e.g. a parent or sibling of the target sample), we will see intermediate values. |
Estimate error rate for the current base in the sequence data. |
Then, the probabilities of interest are:
For now, genotypeIDcheck uses:
If we also wish to use heterozygous sites, rather than limiting our comparison to reference homozygous sites, we could use:
For any given value of we can calculate this quantity for each read that overlaps a site with a known homozygous genotype. In addition, we can take the product of this quantity across all sites examined -- because this product is likely to be very small, we actually sum the s of the appropriate quantities rather than multiplying them together.
To decide if we have sequenced the correct sample, we should do the following:
a) Evaluate this log-sum assuming . This assumes that we have sequenced the target sample.
b) Evaluate this log-sum assuming . This assumes we sequenced a different sample, unrelated to the target.
c) Evaluate this log-sum assuming . This assumes we sequenced a sample that shares half the genome with the target sample, perhaps because it is a sibling or parent of the target sample.
d) If desired, evaluate the same log-sum for other intermediate values of . It may be interesting to set to allow for 5% of reads that are derived from a different sample, for example, due to contamination. It may be interesting to set to consider more distant relatives.
Once the result of evaluating a), b), c) and d) are available, we can decide if the target sample has been sequenced. Sequencing the target sample will mean that the log-sum in a) is the largest. Sequencing a parent or offspring of the target sample will maximize c). Sequencing a completely incorrect sample will maximize b).
If all the log-sums are very similar, then we don't have enough information to make a clear cut decision. Typically, we thousands of genetic markers from a typical SNP chip and whole genome shotgun sequence data, most decisions should be very clear cut.
Implementation Details
After loading genotypes, we generate a genome mask for each position. There are three outcomes of interest:
- Known Genotypes
- These are sites where we have a previously observed a genotype call and where we will be evaluating match / mismatch rates to determine sample identity.
- dbSNP sites
- These are sites that are known to vary among individuals, but for which a known genotype is not available.
- Background sites
- These are all other sites and can be used to estimate the error rate parameter.