Changes

1,096 bytes removed , 23:19, 8 September 2010

no edit summary

Line 1: Line 1:

−

== Grouping ==

+

== Grouping ==

−

When evaluating read mappers, we should always focus on well defined sets of reads:

+

When evaluating read mappers, we should always focus on well defined sets of reads:

−

* Reads with no polymorphisms.

+

*Reads with no polymorphisms.

−

* Reads with 1, 2, 3 or more SNPs.

+

*Reads with 1, 2, 3 or more SNPs.

−

* Reads with specific types of short indels (<10bp).

+

*Reads with specific types of short indels (<10bp).

−

* Reads with larger structural variants (>100bp).

+

*Reads with larger structural variants (>100bp).

−

SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read.

+

SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read.

−

Should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read-length'''.

+

Should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read-length'''.

−

== Bulk Statistics ==

+

== Bulk Statistics ==

−

* Speed (millions of reads per hour)

+

*Speed (millions of reads per hour)

−

* Memory requirements

+

*Memory requirements

−

* Size of output files

+

*Size of output files

−

* Raw count of mapped reads

+

*Raw count of mapped reads

−

== Mapping Accuracy ==

+

== Mapping Accuracy ==

−

The key quantities are:

+

The key quantities are:

−

* How many reads were not mapped at all?

+

*How many reads were not mapped at all?

−

* How many reads were mapped incorrectly? '''This is the least desirable outcome'''.

+

*How many reads were mapped incorrectly? '''This is the least desirable outcome'''.

−

* How many reads were mapped correctly?

+

*How many reads were mapped correctly?

−

Correct mapping should be defined as:

+

Correct mapping should be defined as:

−

* Most stringent: matches simulated location and CIGAR string.

+

*Most stringent: matches simulated location and CIGAR string.

−

* Less stringent: overlaps simulated location at base-pair level, CIGAR string and end positions may differ.

+

*Less stringent: overlaps simulated location at base-pair level, CIGAR string and end positions may differ.

−

* Incorrect: Doesn't overlap simulated location.

+

*Incorrect: Doesn't overlap simulated location.
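The three levels of correctness above can be sketched in code. This is only an illustrative sketch: the `(chrom, pos, cigar)` tuple layout, the function names `reference_span` and `classify`, and the category labels are assumptions made for the example and are not part of the page. It assumes 1-based leftmost positions as in SAM and ignores strand.

```python
import re
from typing import Optional, Tuple

# CIGAR operations that consume reference bases (per the SAM specification).
_REF_CONSUMING = set("MDN=X")
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

# Hypothetical record layout: (chromosome, 1-based leftmost position, CIGAR).
Alignment = Tuple[str, int, str]


def reference_span(pos: int, cigar: str) -> Tuple[int, int]:
    """Return the 1-based inclusive (start, end) interval covered on the reference."""
    length = sum(int(n) for n, op in _CIGAR_RE.findall(cigar) if op in _REF_CONSUMING)
    return pos, pos + length - 1


def classify(truth: Alignment, mapped: Optional[Alignment]) -> str:
    """Compare a reported alignment with the simulated one."""
    if mapped is None:
        return "unmapped"
    if mapped == truth:
        # Most stringent: same chromosome, same position, same CIGAR string.
        return "exact"
    if mapped[0] == truth[0]:
        t_start, t_end = reference_span(truth[1], truth[2])
        m_start, m_end = reference_span(mapped[1], mapped[2])
        # Less stringent: at least one reference base in common.
        if m_start <= t_end and t_start <= m_end:
            return "overlap"
    return "incorrect"
```

For instance, a 35 bp read simulated at position 100 and reported at position 120 on the same chromosome would count as correct only under the less stringent definition.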

−

== Mapping Qualities ==

+

== Mapping Qualities ==

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.
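The cumulative counting described above could be computed roughly as follows. This is a minimal sketch under assumptions: each mapped read is represented as a hypothetical `(mapq, is_correct)` pair, and the function name `mapq_curve` is made up for the example. Each returned point gives, for a mapping quality threshold, the number of correctly and incorrectly mapped reads at that quality or greater.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def mapq_curve(reads: Iterable[Tuple[int, bool]]) -> List[Tuple[int, int, int]]:
    """Return (threshold, n_correct, n_incorrect) for reads with MAPQ >= threshold."""
    correct = Counter()
    wrong = Counter()
    for mapq, is_correct in reads:
        (correct if is_correct else wrong)[mapq] += 1

    curve = []
    n_correct = n_wrong = 0
    # Walk from the highest quality down so the counts accumulate "q or greater".
    for q in sorted(set(correct) | set(wrong), reverse=True):
        n_correct += correct[q]
        n_wrong += wrong[q]
        curve.append((q, n_correct, n_wrong))
    return curve
```

Plotting the second column against the third, one point per threshold, gives the kind of graph described in the paragraph above.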

−

~~== Available Test Datasets ==~~

−

*Location: wonderland:~zhanxw/BigSimulation

−

*Scenarios:

−

~~no polymorphism ;~~

−

~~1, 2, 3 SNP ;~~

−

~~Deletion 5, 30, 200;~~

−

~~Insertion 5, 30~~

−

* Quality String

−

~~Picked the 75th percentile of the Sanger Illumina 108-mer test data set~~

−

* Format

−

~~both base space and color space~~

−

~~both single end and paired end, and paired end reads are given insert size 1500.~~

−

* Program (generator)

−

~~Usage:~~

−

−

~~exact: Accurate sample from reference genome~~

−

~~snpXX: Bring total XXX SNP for a single read or a pair of reads~~

−

~~indelXX: Insert a random XX-length piece for a single read, or at the same position for a paired reads~~

−

~~delXX: Delete a random XX-length piece for a single read, or at the same position for a paired reads~~

−

~~e.g. ./generator bs se exact -n 100 -l 35~~

−

* Output

−

~~Simulation file are named like: BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read.~~

−

~~For each read, the tag was named in a similar way to Sanger's.~~

Goncalo


Evaluating a Read Mapper on Simulated Data

Revision as of 23:19, 8 September 2010
