Changes

From Genome Analysis Wiki
In addition, multi-allelic or duplicated SNPs are counted separately at the very bottom.
== Understanding the columns of vcf-summary output ==
The vcf-summary output can be dense at first glance, but it contains useful information about the quality of SNP calls.

First, define transitions and transversions:
* '''Transition''' (Ts or Ti) refers to a bi-allelic SNP whose two alleles are A <-> G or C <-> T. Among true SNPs, transitions occur more frequently than transversions.
* '''Transversion''' (Tv) refers to any bi-allelic SNP that is not a transition.
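
The two definitions above can be sketched in a few lines of Python. This is only an illustration, not the code used by vcf-summary; the function names <code>classify_snp</code> and <code>ts_tv_ratio</code> are hypothetical.

```python
# Illustrative sketch only; names are hypothetical, not part of vcf-summary.
BASES = set("ACGT")
# Unordered allele pairs that count as transitions: A <-> G and C <-> T.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def classify_snp(ref, alt):
    """Return "Ts" for a transition and "Tv" for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("expected a bi-allelic SNP with two distinct bases")
    return "Ts" if frozenset((ref, alt)) in TRANSITIONS else "Tv"


def ts_tv_ratio(snps):
    """Compute Ts/Tv over (ref, alt) pairs; return None if there are no transversions."""
    ts = tv = 0
    for ref, alt in snps:
        if classify_snp(ref, alt) == "Ts":
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None
```

For example, <code>classify_snp("A", "G")</code> is a transition, while <code>classify_snp("A", "C")</code> is a transversion.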

Each column of the vcf-summary output shows the following:
* '''FILTER''' shows the groups of SNPs defined by the FILTER column in the VCF file. See the previous section for how SNPs are grouped.
* '''#SNPs''' shows the total number of SNPs that belong to each FILTER category.
* '''#dbSNPs''' shows the number of SNPs that appear in dbSNP. The dbSNP version depends on the GotCloud version (e.g. 129, 135, 137).
* '''%dbSNPs''' is simply [#dbSNPs] / [#SNPs], indicating the fraction of variants ''known'' to dbSNP. The remaining variants are considered ''novel''.
* '''%CpG Known''' is the fraction of known (in dbSNP) SNPs that are located at CpG sites, which are enriched for transitions.
* '''%CpG Novel''' is the fraction of novel (not in dbSNP) SNPs that are located at CpG sites, which are enriched for transitions.
* '''Known Ts/Tv''' is the Ts/Tv ratio among known (in dbSNP) SNPs.
* '''Novel Ts/Tv''' is the Ts/Tv ratio among novel (not in dbSNP) SNPs.
* '''nCpG-K Ts/Tv''' is the Ts/Tv ratio among known (in dbSNP) SNPs, restricted to non-CpG sites. Ts/Tv becomes more stable when CpG sites are excluded.
* '''nCpG-N Ts/Tv''' is the Ts/Tv ratio among novel (not in dbSNP) SNPs, restricted to non-CpG sites.
* '''%HM3 sens''' is the percentage of all HapMap3 SNPs that were rediscovered in the call set ( [# HapMap3-overlapping SNPs] / [# all HapMap3 SNPs] ).
* '''%HM3/SNP''' is the percentage of SNPs in the call set that are HapMap3 SNPs ( [# HapMap3-overlapping SNPs] / [# all SNPs in the call set] ).
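
As a rough illustration of how the ratio columns relate to the underlying counts, the following Python sketch derives a few of them for one FILTER category. The function <code>summarize_category</code> and its parameter names are hypothetical and do not reflect how vcf-summary is implemented; the %CpG columns are omitted for brevity.

```python
# Illustrative sketch only; function and parameter names are hypothetical.
def _ratio(numerator, denominator, scale=1.0):
    """Return scale * numerator / denominator, or None when the denominator is zero."""
    return scale * numerator / denominator if denominator else None


def summarize_category(n_snps, n_dbsnp, known_ts, known_tv,
                       novel_ts, novel_tv, n_hm3_overlap, n_hm3_total):
    """Derive ratio columns from raw counts for a single FILTER category."""
    return {
        "%dbSNP": _ratio(n_dbsnp, n_snps, 100.0),             # [#dbSNPs] / [#SNPs]
        "Known Ts/Tv": _ratio(known_ts, known_tv),            # among SNPs in dbSNP
        "Novel Ts/Tv": _ratio(novel_ts, novel_tv),            # among SNPs not in dbSNP
        "%HM3 sens": _ratio(n_hm3_overlap, n_hm3_total, 100.0),
        "%HM3/SNP": _ratio(n_hm3_overlap, n_snps, 100.0),
    }
```

Note that %HM3 sens and %HM3/SNP share the same numerator and differ only in the denominator: the first is divided by the size of HapMap3, the second by the size of the call set.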
== What should we expect from good SNP call sets? ==
If SNP filtering was effective, you should expect the following properties in the "PASS" category, but not in the other categories.

* Metrics for the '''Known''' and '''Novel''' categories should be similar. This indicates that novel SNPs are of comparable quality to known SNPs, which are enriched for true variants.
* The expected '''Ts/Tv''' for a whole-genome call set is around 2.2. The '''non-CpG Ts/Tv''' is usually around 1.8.
* For exome call sets, Ts/Tv is expected to be higher (around 3). Synonymous SNPs have a higher Ts/Tv (around 5.5) than non-synonymous SNPs (around 2.2).
* Because of natural selection, in exomes we expect a higher Ts/Tv in known SNPs than in novel SNPs. This is because novel SNPs are enriched for non-synonymous variants, which have a lower Ts/Tv. Please refer to [[SNP_Call_Set_Properties]] for additional information.
* '''%dbSNP''' should usually be higher for the '''PASS''' category than for the others, indicating that known SNPs are more likely to pass filters. However, because dbSNP also contains false positive variants, we do not necessarily expect '''%dbSNP''' to drop to zero in the filtered categories even if the filtering works perfectly.
* Because HM3 consists of common SNPs, '''%HM3 sens''' should be high when the sample size is large. Typically, if the sample contains only Europeans, the expected value of '''%HM3 sens''' is around 90%. If the sample includes individuals of African ancestry, the value will be close to 95% or higher.
* Note that some callers use the expected difference between Ts and Tv as a prior in their calling model. In that case, Ts/Tv could be biased by the prior and may not be a good indicator of SNP quality.
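
The first two expectations above could be turned into a simple sanity check on the PASS row, sketched below in Python. The default of 2.2 is the whole-genome value quoted above; the tolerance of 0.2 and the function name <code>check_pass_row</code> are arbitrary assumptions chosen for illustration, not recommended thresholds.

```python
# Illustrative sketch only; the tolerance is an arbitrary assumption.
def check_pass_row(known_tstv, novel_tstv, expected=2.2, tolerance=0.2):
    """Return a list of warnings for the PASS row of a whole-genome call set."""
    warnings = []
    # Known and novel SNPs should show similar Ts/Tv if filtering was effective.
    if abs(known_tstv - novel_tstv) > tolerance:
        warnings.append("Known and Novel Ts/Tv differ by more than the tolerance")
    # Each ratio should also be close to the expected whole-genome value.
    for label, value in (("Known", known_tstv), ("Novel", novel_tstv)):
        if abs(value - expected) > tolerance:
            warnings.append(f"{label} Ts/Tv {value:.2f} is far from {expected}")
    return warnings
```

An empty list means neither check fired. For exome call sets, <code>expected</code> would need to be raised (around 3, as noted above), and the check is less meaningful when the caller itself uses Ts/Tv as a prior.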
