SNP Call Set Properties
We typically check a number of properties in SNP call sets. This page gives some useful pointers on what these quantities are and what to look for.
Proportion of dbSNPs
Most of the genetic variants in any one individual have been previously observed in other individuals. Thus, it is usually a good diagnostic to investigate what fraction of variants in an individual genome have been previously described in dbSNP.
The expected proportions of previously discovered SNPs (those already catalogued in dbSNP) and novel SNPs (those that haven't been previously discovered) will change over time. The dbSNP database is constantly being updated, so currently (mid-2010) we'd expect >90% of the variants in an individual genome to have been previously discovered.
If many individuals are sequenced, the vast majority of common variants (those shared among many individuals) are expected to be in dbSNP, whereas a smaller fraction of rare variants (those seen in only one or a few of the sequenced individuals) should be in dbSNP.
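As a rough illustration of this check, the sketch below computes the fraction of calls that fall at known sites. The function name `dbsnp_fraction`, the tuple layout `(chrom, pos, ref, alt)` and the toy data are all assumptions made for this example; matching by position only is a simplification, and a real comparison would normally also check alleles and use an actual dbSNP release.

```python
def dbsnp_fraction(variants, known_sites):
    """Return the fraction of variants whose (chrom, pos) appears in known_sites.

    variants    -- iterable of (chrom, pos, ref, alt) tuples (hypothetical layout)
    known_sites -- set of (chrom, pos) pairs standing in for a dbSNP lookup
    """
    variants = list(variants)
    if not variants:
        raise ValueError("no variants supplied")
    known = sum(1 for chrom, pos, _ref, _alt in variants
                if (chrom, pos) in known_sites)
    return known / len(variants)


# Toy data, invented purely for illustration.
calls = [
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "T"),
    ("2", 500, "G", "T"),
    ("2", 900, "C", "A"),
]
known_sites = {("1", 1000), ("1", 2000), ("2", 500)}

fraction = dbsnp_fraction(calls, known_sites)
print(f"{fraction:.0%} of calls are at known sites")
```

A fraction well below what is expected for the data set at hand is a hint that the call set may contain many false positives.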
Transition to Transversion Ratio
Human mutations don't occur uniformly at random. Although there are twice as many possible transversions as transitions, transitions (changes from A <-> G and C <-> T) are expected to occur twice as frequently as transversions (changes from A <-> C, A <-> T, G <-> C or G <-> T). Thus, another useful diagnostic is the ratio of transitions to transversions in a particular set of SNP calls. This ratio is often evaluated separately for previously discovered and novel SNPs.
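The ratio can be sketched in a few lines. The function name `ts_tv_ratio` and the `(ref, alt)` input layout are assumptions for this example, and the toy list is invented; entries that are not single-base substitutions are simply skipped.

```python
BASES = {"A", "C", "G", "T"}
# Transitions are purine <-> purine (A, G) or pyrimidine <-> pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Count transitions and transversions in an iterable of (ref, alt) pairs.

    Returns (ts, tv, ratio); ratio is NaN when there are no transversions.
    """
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a single-base substitution
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    ratio = ts / tv if tv else float("nan")
    return ts, tv, ratio


# Toy data, invented for illustration: 4 transitions and 2 transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ts_tv_ratio(toy_snps))
```

To evaluate known and novel SNPs separately, the same function can be called once on each subset.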
Across the entire genome the ratio of transitions to transversions is typically around 2. In protein coding regions, this ratio is typically higher, often a little above 3. The higher ratio occurs because, especially when they occur in the third base of a codon, transversions are much more likely to change the encoded amino acid. A refinement to this analysis, in protein coding regions, is to examine the transition to transversion ratio separately for non-degenerate, two-fold degenerate, three-fold degenerate and four-fold degenerate sites.
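One way to classify coding sites by degeneracy is sketched below, assuming the standard genetic code. The helper `site_degeneracy` is a hypothetical name; it counts how many of the four bases at a codon position leave the amino acid unchanged (1 = non-degenerate, 2, 3 or 4 = two-, three- or four-fold degenerate). Some analyses use slightly different conventions for edge cases, so treat this as a simple starting point.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASE_ORDER = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASE_ORDER, repeat=3), AMINO_ACIDS)
}


def site_degeneracy(codon, position):
    """Number of bases at `position` (0, 1 or 2) that encode the same amino acid.

    1 means non-degenerate; 2, 3 and 4 mean two-, three- and four-fold degenerate.
    """
    codon = codon.upper()
    amino_acid = CODON_TABLE[codon]
    same = 0
    for base in BASE_ORDER:
        alternative = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[alternative] == amino_acid:
            same += 1
    return same


# Third positions of glycine (GGA), isoleucine (ATT) and aspartate (GAT) codons,
# plus the middle position of the methionine codon (ATG).
for codon, position in [("GGA", 2), ("ATT", 2), ("GAT", 2), ("ATG", 1)]:
    print(codon, position, site_degeneracy(codon, position))
```

Each SNP in a coding region can then be binned by the degeneracy of its site before computing the transition to transversion ratio for each bin.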
Why Are Reciprocal Changes Not Equally Frequent?
One of the most surprising features of many variant lists in humans is that C->T changes (C reference, T variant) are more frequent than T->C changes. Likewise, G->A changes are more frequent than A->G changes.
At first, this might seem a bit puzzling, since we might expect the two counts to be very similar. However, the difference makes perfect biological sense, and the explanation below is due to Tom Blackwell.
The major mechanism for new mutations (in warm-blooded animals) is deamination of 5-methylcytosine to thymine, producing (C -> T) or, on the complementary strand, (G -> A). This was first studied for CpG dinucleotide sites, where cytosines are commonly methylated, but deamination also occurs at lower rates throughout the genome at any C, whether followed by G or not; an unmethylated C deaminates to uracil, which pairs like T if it is left unrepaired.
More often than not, we expect that the reference genome will include the most common allele, which is also likely to be the ancestral allele. Thus, if C->T mutations are more common than T->C mutations, we expect to see an imbalance of C->T versus T->C changes. Further, when comparing rare and common variants, we expect the imbalance to be stronger for lower frequency variants.
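The imbalance is easy to tabulate. In the sketch below, which uses a hypothetical helper `strand_collapsed_counts` and invented toy data, each change is counted in the reference to variant direction and strand complements are merged, so that G->A is counted together with C->T and A->G together with T->C.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def strand_collapsed_counts(snps):
    """Count ref->alt changes, merging each change with its strand complement.

    snps is an iterable of (ref, alt) single-base pairs. Changes with a purine
    reference (A or G) are rewritten on the opposite strand, so every key in
    the result has C or T as its reference base.
    """
    counts = Counter()
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
            continue  # not a single-base substitution
        if ref in ("A", "G"):
            ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        counts[f"{ref}->{alt}"] += 1
    return counts


# Toy data, invented for illustration only.
toy_snps = [("C", "T"), ("G", "A"), ("C", "T"), ("T", "C"), ("A", "G"), ("G", "A")]
counts = strand_collapsed_counts(toy_snps)
print(counts["C->T"], counts["T->C"])
```

Running the same tabulation separately on rare and common variants is one way to check whether the imbalance is stronger at lower frequencies.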