Difference between revisions of "Verifying Sample Identities - Implementation"

From Genome Analysis Wiki

Revision as of 15:49, 13 April 2010

Principle

We should be able to verify that the right sample has been sequenced by comparing the base calls in each read to known genotypes for that sample. If the correct sample has been sequenced, the base calls should match the previously known genotypes, apart from sequencing errors. If the wrong sample has been sequenced, we will see substantially more mismatches.
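As a rough illustration of this comparison, the sketch below counts how often base calls disagree with the known homozygous allele at the same site. The function name `mismatch_fraction` and the input layout (two equal-length sequences, one base call and one known allele per overlapping site) are assumptions made for this example and are not taken from any existing tool.

```python
def mismatch_fraction(base_calls, known_alleles):
    """Return the fraction of base calls that differ from the known allele.

    base_calls    -- observed bases, one per read base overlapping a known site
    known_alleles -- the allele of the known homozygous genotype at each site
    """
    if len(base_calls) != len(known_alleles):
        raise ValueError("base_calls and known_alleles must have the same length")
    if len(base_calls) == 0:
        raise ValueError("at least one overlapping base is required")

    # Compare case-insensitively, since aligners may soft-mask bases in lower case.
    mismatches = sum(
        1 for call, allele in zip(base_calls, known_alleles)
        if call.upper() != allele.upper()
    )
    return mismatches / len(base_calls)
```

A fraction close to the sequencing error rate is what we would expect for the correct sample; a much larger fraction suggests a sample swap.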

Mathematical Details

For each sample, we would like to calculate the likelihood of a set of reads assuming that we sequenced the correct sample, assuming we sequenced a sample related to the correct sample, or assuming we sequenced an incorrect sample. We would then like to flag samples where it appears likely that the wrong sample has been sequenced.

If we have a list of bases that overlap a known genotype, we can calculate the probability of a match or mismatch at each base as:

{| width="100%" cellspacing="1" cellpadding="1" border="1" summary="Summary of Variables Used Below"
|+ Notation
|-
| Variable
| Definition
|-
| A/A
| Previously known genotype; we only consider homozygous sites.
|-
| <math>P_A</math>
| Frequency of allele A in the population.
|-
| <math>P_{ibd}</math>
| Probability that the sequenced sample and the target sample share a chromosome. This should be 1.0 when we have sequenced the correct sample and 0.0 if we sequence an unrelated sample. If we sequence a related sample (e.g. a parent or sibling of the target sample), we will see intermediate values.
|-
| <math>\epsilon</math>
| Estimated error rate for the current base in the sequence data.
|}
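This revision of the page does not yet give the match and mismatch formulas, so the sketch below shows only one plausible way to combine the variables above; it is an assumption, not the page's own model. It assumes that the chromosome sampled by a read carries allele A with probability P_ibd + (1 - P_ibd) * P_A, that a base is called correctly with probability 1 - ε, and that errors are spread evenly over the three other bases. The function names and the three example P_ibd values are hypothetical.

```python
import math


def match_probability(p_ibd, p_a, eps):
    """Probability that one base call matches the known A/A genotype.

    Assumed model (not from the source page): the sampled chromosome carries
    allele A with probability q = p_ibd + (1 - p_ibd) * p_a, and a non-A base
    is misread as A with probability eps / 3.
    """
    if not (0.0 <= p_ibd <= 1.0 and 0.0 <= p_a <= 1.0):
        raise ValueError("p_ibd and p_a must lie in [0, 1]")
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie strictly between 0 and 1")
    q = p_ibd + (1.0 - p_ibd) * p_a
    return q * (1.0 - eps) + (1.0 - q) * eps / 3.0


def log_likelihood(observations, p_ibd):
    """Sum of log probabilities over bases.

    observations -- iterable of (is_match, p_a, eps) tuples, one per base
    """
    total = 0.0
    for is_match, p_a, eps in observations:
        p = match_probability(p_ibd, p_a, eps)
        total += math.log(p if is_match else 1.0 - p)
    return total


def best_hypothesis(observations, candidates=(1.0, 0.5, 0.0)):
    """Return the candidate P_ibd with the highest log-likelihood.

    The default candidates stand for: correct sample, a related sample
    (illustrative intermediate value), and an unrelated sample.
    """
    observations = list(observations)
    return max(candidates, key=lambda p_ibd: log_likelihood(observations, p_ibd))
```

For example, `best_hypothesis([(True, 0.5, 0.01)] * 20)` favours P_ibd = 1.0, while a set of bases with many mismatches favours lower values, which is the situation we would want to flag.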