From Genome Analysis Wiki
15:05, 2 February 2010
== Grouping ==
When evaluating read mappers, we should always focus on well-defined sets of reads:
* Reads with no polymorphisms.
* Reads with 1, 2, 3 or more SNPs.
* Reads with specific types of short indels (<10bp).
* Reads with larger structural variants (>100bp).
SNPs and sequencing errors are counted separately because a SNP can produce a mismatch at a high-quality base. In addition to aggregating results over the groups above, we could separate results by the number of errors in each read.
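As an illustration of the grouping above, here is a minimal sketch that assigns each simulated read to one group. The function name <code>read_group</code> and its arguments (<code>n_snps</code>, <code>indel_len</code>, <code>sv_len</code>) are hypothetical; it assumes the read simulator records these annotations for every read, and it reads "1, 2, 3 or more SNPs" as three separate bins.

```python
from collections import Counter


def read_group(n_snps, indel_len=0, sv_len=0):
    """Return a group label for one simulated read.

    All arguments are hypothetical annotations that the read simulator
    would need to record; lengths are in base pairs.
    """
    if sv_len > 100:
        return "structural variant (>100bp)"
    if 0 < indel_len < 10:
        return "short indel (<10bp)"
    if indel_len or sv_len:
        # Variant sizes between the two cut-offs are not covered by the list.
        return "other"
    if n_snps == 0:
        return "no polymorphisms"
    if n_snps >= 3:
        return "3 or more SNPs"
    return "1 SNP" if n_snps == 1 else "2 SNPs"


# Tally a few example reads, given as (n_snps, indel_len, sv_len).
example_reads = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (5, 0, 0), (0, 4, 0), (0, 0, 250)]
group_counts = Counter(read_group(*r) for r in example_reads)
```

Results for each mapper can then be reported per group rather than as one pooled number.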
== Bulk Statistics ==
* Speed (millions of reads per hour)
* Memory requirements
* Size of output files
* Raw count of mapped reads
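The speed figure is a simple unit conversion. The sketch below assumes the read count and the wall-clock time in seconds are available from the run; the function name <code>millions_of_reads_per_hour</code> is hypothetical.

```python
def millions_of_reads_per_hour(n_reads, elapsed_seconds):
    """Convert a read count and a wall-clock time into millions of reads per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    hours = elapsed_seconds / 3600
    return (n_reads / 1e6) / hours


# Example: 5 million reads mapped in 30 minutes.
speed = millions_of_reads_per_hour(5_000_000, 1800)
```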
== Mapping Accuracy ==
The key quantities are:
* How many reads were not mapped at all?
* How many reads were mapped incorrectly? '''This is the least desirable outcome'''.
* How many reads were mapped correctly?
Mapping correctness should be defined at one of the following levels:
* Most stringent: matches the simulated location and CIGAR string.
* Less stringent: overlaps the simulated location at the base-pair level; the CIGAR string and end positions may differ.
* Incorrect: does not overlap the simulated location.
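The definitions above could be applied to each read roughly as follows. This is a sketch under stated assumptions: the name <code>classify_mapping</code> is hypothetical, both placements are given as <code>(chrom, start, end, cigar)</code> tuples with 0-based, half-open coordinates, an unmapped read is passed as <code>None</code>, and strand is ignored.

```python
def classify_mapping(sim, aln):
    """Compare a reported alignment with the simulated placement of a read.

    sim, aln: (chrom, start, end, cigar) tuples, 0-based half-open
    (an assumed convention). aln is None when the read was not mapped.
    Returns "unmapped", "exact", "overlap" or "incorrect".
    """
    if aln is None:
        return "unmapped"
    s_chrom, s_start, s_end, s_cigar = sim
    a_chrom, a_start, a_end, a_cigar = aln
    if s_chrom != a_chrom:
        return "incorrect"
    # Most stringent: same start position and same CIGAR string.
    if a_start == s_start and a_cigar == s_cigar:
        return "exact"
    # Less stringent: the two intervals share at least one base.
    if a_start < s_end and s_start < a_end:
        return "overlap"
    return "incorrect"


simulated = ("chr1", 1000, 1036, "36M")
result = classify_mapping(simulated, ("chr1", 1002, 1038, "2S34M"))
```

Counting the four labels over all reads gives the unmapped, correctly mapped and incorrectly mapped totals under either level of stringency.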
== Mapping Qualities ==
We should evaluate mapping qualities by counting, for each mapping quality threshold, how many reads are assigned that mapping quality or greater and, among those, how many map correctly or incorrectly. This gives a Heng Li graph, in which the number of correctly mapped reads is plotted against the number of mismapped reads, with one point per threshold.
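The cumulative counts behind such a graph can be computed as in the sketch below. The function name <code>mapq_curve</code> is hypothetical, and the input is assumed to be one <code>(mapq, is_correct)</code> pair per mapped read, with correctness already decided by one of the definitions in the previous section.

```python
from collections import Counter


def mapq_curve(reads):
    """Cumulative correct/incorrect counts per mapping quality threshold.

    reads: iterable of (mapq, is_correct) pairs for mapped reads.
    Returns a list of (threshold, n_correct, n_incorrect), highest
    threshold first, counting reads with mapq >= threshold.
    """
    correct = Counter()
    incorrect = Counter()
    for mapq, is_correct in reads:
        if is_correct:
            correct[mapq] += 1
        else:
            incorrect[mapq] += 1

    curve = []
    n_correct = n_incorrect = 0
    # Walk from the highest observed quality down, accumulating counts.
    for q in sorted(set(correct) | set(incorrect), reverse=True):
        n_correct += correct[q]
        n_incorrect += incorrect[q]
        curve.append((q, n_correct, n_incorrect))
    return curve


example = [(60, True), (60, True), (30, True), (30, False), (0, False)]
points = mapq_curve(example)
```

Each returned tuple is one point of the graph: the second and third fields are the two axes, and the first field labels the point with its threshold.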