Difference between revisions of "Evaluating a Read Mapper on Simulated Data"

Revision as of 14:48, 8 February 2010

Grouping

When evaluating read mappers, we should always focus on well-defined sets of reads:

Reads with no polymorphisms.
Reads with 1, 2, 3 or more SNPs.
Reads with specific types of short indels (<10bp).
Reads with larger structural variants (>100bp).

SNPs and sequencing errors should be treated separately, because SNPs can lead to mismatches at high-quality bases. In addition to grouping reads by the categories above, we could separate results by the number of errors in each read.

Results should also be grouped according to whether reads are paired-end or single-end, and according to read length.
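The grouping described above can be sketched as a simple tally keyed by read category. This is only an illustration: the `ReadGroup` fields, the variant labels such as "snp2", and the outcome names are hypothetical and not part of any existing tool.

```python
from collections import Counter
from typing import NamedTuple


class ReadGroup(NamedTuple):
    """Hypothetical grouping key for one simulated read."""
    variant: str       # e.g. "none", "snp2", "del30" (labels are assumptions)
    paired: bool       # True for paired-end, False for single-end
    read_length: int   # read length in bases


def tally_by_group(records):
    """Count outcomes per group.

    records: iterable of (ReadGroup, outcome) pairs, where outcome is one of
    "correct", "incorrect" or "unmapped".
    """
    counts = {}
    for group, outcome in records:
        counts.setdefault(group, Counter())[outcome] += 1
    return counts


records = [
    (ReadGroup("none", False, 35), "correct"),
    (ReadGroup("none", False, 35), "unmapped"),
    (ReadGroup("snp2", True, 108), "incorrect"),
]
for group, outcomes in tally_by_group(records).items():
    print(group, dict(outcomes))
```

Keeping the key as a tuple makes it easy to add further dimensions later, such as the number of sequencing errors per read.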

Bulk Statistics

Speed (millions of reads per hour)
Memory requirements
Size of output files
Raw count of mapped reads
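As a small worked example of the speed metric, the conversion from a read count and a wall-clock time to millions of reads per hour might look like the sketch below. The function name is hypothetical and the numbers are made up for illustration.

```python
def millions_of_reads_per_hour(n_reads, wall_seconds):
    """Convert a read count and elapsed wall-clock seconds to millions of reads per hour."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return n_reads * 3600 / wall_seconds / 1e6


# Illustrative numbers only: 5 million reads mapped in 30 minutes.
speed = millions_of_reads_per_hour(5_000_000, 1800)
print(f"{speed:.1f} million reads per hour")
```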

Mapping Accuracy

The key quantities are:

How many reads were not mapped at all?
How many reads were mapped incorrectly? This is the least desirable outcome.
How many reads were mapped correctly?

Correctness of a mapping should be defined at the following levels:

Most stringent: matches the simulated location and CIGAR string.
Less stringent: overlaps the simulated location at the base-pair level; the CIGAR string and end positions may differ.
Incorrect: does not overlap the simulated location.
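A minimal sketch of these definitions follows. It assumes 0-based, half-open coordinates and treats "location" as chromosome plus start position; strand is ignored here, which is a simplification. The `Alignment` type and the label strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for a simulated or reported alignment."""
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # 0-based, exclusive (assumed convention)
    cigar: str


def classify(truth: Alignment, mapped: Optional[Alignment]) -> str:
    """Return "unmapped", "exact", "overlap" or "incorrect"."""
    if mapped is None:
        return "unmapped"
    same_chrom = mapped.chrom == truth.chrom
    # Most stringent: same start position and same CIGAR string.
    if same_chrom and mapped.start == truth.start and mapped.cigar == truth.cigar:
        return "exact"
    # Less stringent: the two intervals share at least one base.
    if same_chrom and mapped.start < truth.end and truth.start < mapped.end:
        return "overlap"
    return "incorrect"


truth = Alignment("chr1", 1000, 1035, "35M")
print(classify(truth, Alignment("chr1", 1000, 1035, "35M")))
print(classify(truth, Alignment("chr1", 1002, 1037, "35M")))
print(classify(truth, Alignment("chr2", 1000, 1035, "35M")))
print(classify(truth, None))
```

Reporting "exact" and "overlap" separately lets both the stringent and the relaxed accuracy be computed from the same pass over the data.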

Mapping Qualities

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and, among those, how many map correctly or incorrectly. This gives a Heng Li graph, in which one plots the number of correctly mapped reads against the number of mismapped reads.
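The cumulative counts behind such a graph can be sketched as below. The input format, a list of (mapping quality, is_correct) pairs for mapped reads, is an assumption made for illustration; plotting is left out.

```python
def mapq_curve(reads):
    """Cumulative correct/incorrect counts by mapping-quality threshold.

    reads: iterable of (mapq, is_correct) pairs for mapped reads.
    Returns a list of (threshold, n_correct, n_incorrect), ordered from the
    highest threshold down, where the counts cover reads with mapq >= threshold.
    """
    per_q = {}
    for mapq, is_correct in reads:
        bucket = per_q.setdefault(mapq, [0, 0])
        bucket[0 if is_correct else 1] += 1

    curve = []
    n_correct = n_incorrect = 0
    for q in sorted(per_q, reverse=True):
        n_correct += per_q[q][0]
        n_incorrect += per_q[q][1]
        curve.append((q, n_correct, n_incorrect))
    return curve


reads = [(60, True), (60, True), (30, True), (30, False), (0, False)]
for threshold, ok, bad in mapq_curve(reads):
    print(threshold, ok, bad)
```

Each tuple gives one point of the graph: the number of mismapped reads on one axis and the number of correctly mapped reads on the other.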

Available Test Datasets

Location: wonderland:~zhanxw/BigSimulation
Scenarios:

No polymorphism; 1, 2 or 3 SNPs; deletions of 5, 30 or 200 bp; insertions of 5 or 30 bp.

Quality String

Picked as the 75th percentile of the Sanger Illumina 108-mer test data set.

Format

Both base space and color space; both single-end and paired-end.

Program

generator

Usage:

        generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
        exact: Exact sample from the reference genome
        snpXX: Introduce a total of XX SNPs into a single read or a pair of reads
        indelXX: Insert a random XX-length piece into a single read, or at the same position for paired reads
        delXX: Delete a random XX-length piece from a single read, or at the same position for paired reads

e.g. ./generator bs se exact -n 100 -l 35
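To cover all listed scenarios, the command lines could be assembled programmatically. The sketch below only builds argument lists from the usage text above and does not run anything. The helper name is hypothetical, and the mode names (for example "snp2" or "del200") assume that XX is replaced by the plain number, which the usage text does not state explicitly.

```python
def generator_command(space, layout, mode, n, read_length, insert_size=None):
    """Build an argument list following the usage line of `generator`."""
    if space not in ("bs", "cs"):
        raise ValueError("space must be 'bs' or 'cs'")
    if layout not in ("se", "pe"):
        raise ValueError("layout must be 'se' or 'pe'")
    cmd = ["./generator", space, layout, mode, "-n", str(n), "-l", str(read_length)]
    if insert_size is not None:
        cmd += ["-i", str(insert_size)]
    return cmd


# Assumed spelling of the scenario names (XX replaced by the number).
MODES = ["exact", "snp1", "snp2", "snp3",
         "del5", "del30", "del200", "indel5", "indel30"]

for mode in MODES:
    print(" ".join(generator_command("bs", "se", mode, 100, 35)))
```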
