Difference between revisions of "Evaluating a Read Mapper on Simulated Data"

From Genome Analysis Wiki

Latest revision as of 23:19, 8 September 2010

Grouping

When evaluating read mappers, we should always focus on well-defined sets of reads:

  • Reads with no polymorphisms.
  • Reads with 1, 2, 3 or more SNPs.
  • Reads with specific types of short indels (<10bp).
  • Reads with larger structural variants (>100bp).

SNPs and sequencing errors should be counted separately because SNPs can lead to mismatches in high-quality bases. In addition to aggregating results over the categories above, we could separate results by the number of sequencing errors in each read.

Results should also be grouped according to whether reads are paired-end or single-end, and according to read length.
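The grouping above could be sketched as a function that assigns each simulated read to one group. This is only an illustration: the `SimRead` record and its field names are hypothetical, the precedence (structural variant, then indel, then SNP count) is an assumption, and "1, 2, 3 or more SNPs" is read here as the three groups 1, 2 and 3+.

```python
from collections import Counter
from typing import NamedTuple


class SimRead(NamedTuple):
    """Hypothetical per-read record produced by a read simulator."""
    n_snps: int        # number of SNPs covered by the read
    max_indel_bp: int  # length of the longest short indel in the read, 0 if none
    has_sv: bool       # True if the read spans a larger structural variant
    paired: bool       # True for paired-end, False for single-end
    length: int        # read length in bases


def group_key(read):
    """Return a (variant class, library layout, read length) grouping key."""
    if read.has_sv:
        variant = "SV"
    elif read.max_indel_bp > 0:
        variant = f"indel_{read.max_indel_bp}bp"
    elif read.n_snps == 0:
        variant = "no_polymorphism"
    elif read.n_snps >= 3:
        variant = "3+_snp"
    else:
        variant = f"{read.n_snps}_snp"
    layout = "PE" if read.paired else "SE"
    return (variant, layout, read.length)


def count_groups(reads):
    """Count how many reads fall into each group."""
    return Counter(group_key(r) for r in reads)
```

Results for every other metric on this page can then be reported per group key rather than over all reads at once.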

Bulk Statistics

  • Speed (millions of reads per hour)
  • Memory requirements
  • Size of output files
  • Raw count of mapped reads
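As a small worked example of the speed metric, the conversion from a read count and a wall-clock time to millions of reads per hour might look like this. The function name is made up for illustration, and how the elapsed time is measured is left to the evaluator.

```python
def millions_of_reads_per_hour(n_reads, elapsed_seconds):
    """Convert a read count and wall-clock seconds to millions of reads per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    reads_per_hour = n_reads * 3600.0 / elapsed_seconds
    return reads_per_hour / 1e6
```

For instance, 5,000,000 reads mapped in 1,800 seconds corresponds to 10 million reads per hour.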

Mapping Accuracy

The key quantities are:

  • How many reads were not mapped at all?
  • How many reads were mapped incorrectly? This is the least desirable outcome.
  • How many reads were mapped correctly?

Mapping correctness should be defined as:

  • Most stringent: matches the simulated location and CIGAR string.
  • Less stringent: overlaps the simulated location at base-pair level; the CIGAR string and end positions may differ.
  • Incorrect: does not overlap the simulated location.
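One way to turn these definitions into counts is sketched below. The `Placement` record, the half-open coordinate convention and the use of `None` for an unmapped read are assumptions made for the example; strand is not checked because the definitions above do not mention it.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Placement(NamedTuple):
    """Hypothetical alignment record; coordinates are half-open [start, end)."""
    chrom: str
    start: int
    end: int
    cigar: str


def classify(truth: Placement, aln: Optional[Placement], strict: bool = False) -> str:
    """Label one read as 'unmapped', 'correct' or 'incorrect'."""
    if aln is None:
        return "unmapped"
    if strict:
        # Most stringent: same location and same CIGAR string.
        same = (aln.chrom == truth.chrom
                and aln.start == truth.start
                and aln.cigar == truth.cigar)
        return "correct" if same else "incorrect"
    # Less stringent: at least one base pair shared with the simulated location.
    overlaps = (aln.chrom == truth.chrom
                and aln.start < truth.end
                and truth.start < aln.end)
    return "correct" if overlaps else "incorrect"


def tally(pairs, strict=False):
    """Count outcomes over an iterable of (truth, alignment-or-None) pairs."""
    return Counter(classify(t, a, strict) for t, a in pairs)
```

Running `tally` once with `strict=True` and once with `strict=False` gives the three key quantities under both definitions.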

Mapping Qualities

We should evaluate mapping qualities by counting, for each mapping quality value, how many reads are assigned that quality or greater and, among those, how many are mapped correctly and how many incorrectly. Plotting the number of correctly mapped reads against the number of mismapped reads across these thresholds gives a Heng Li graph.
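The cumulative counts behind such a graph could be computed roughly as follows. The input format, one `(mapq, is_correct)` pair per mapped read, is an assumption for this sketch; the correctness flag would come from whichever definition of correct mapping is chosen.

```python
def mapq_curve(records):
    """Return (threshold, n_correct, n_incorrect) tuples, highest threshold first.

    Each tuple counts the reads whose mapping quality is >= threshold.
    `records` is an iterable of (mapq, is_correct) pairs for mapped reads.
    """
    per_quality = {}
    for mapq, is_correct in records:
        counts = per_quality.setdefault(mapq, [0, 0])
        counts[0 if is_correct else 1] += 1

    curve = []
    n_correct = n_incorrect = 0
    # Walk from the highest quality down so the counts accumulate.
    for mapq in sorted(per_quality, reverse=True):
        n_correct += per_quality[mapq][0]
        n_incorrect += per_quality[mapq][1]
        curve.append((mapq, n_correct, n_incorrect))
    return curve
```

Each tuple is one point on the graph, with the incorrect count on one axis and the correct count on the other. A mapper with well-calibrated qualities should accumulate few incorrect reads at high thresholds.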