Changes

Evaluating a Read Mapper on Simulated Data (view source)

Revision as of 05:08, 10 February 2010

1,527 bytes added, 05:08, 10 February 2010

no edit summary

Line 1: Line 1:

−

== Grouping ==

+

== Grouping ==

−

When evaluating read mappers, we should always focus on well defined sets of reads:

+

When evaluating read mappers, we should always focus on well defined sets of reads:

−

* Reads with no polymorphisms.

+

*Reads with no polymorphisms.

−

* Reads with 1, 2, 3 or more SNPs.

+

*Reads with 1, 2, 3 or more SNPs.

−

* Reads with specific types of short indels (<10bp).

+

*Reads with specific types of short indels (<10bp).

−

* Reads with larger structural variants (>100bp).

+

*Reads with larger structural variants (>100bp).

−

SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read.

+

SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read.

−

Results should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read length'''.

+

Results should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read length'''.

−

== Bulk Statistics ==

+

== Bulk Statistics ==

−

* Speed (millions of reads per hour)

+

*Speed (millions of reads per hour)

−

* Memory requirements

+

*Memory requirements

−

* Size of output files

+

*Size of output files

−

* Raw count of mapped reads

+

*Raw count of mapped reads

−

== Mapping Accuracy ==

+

== Mapping Accuracy ==

−

The key quantities are:

+

The key quantities are:

−

* How many reads were not mapped at all?

+

*How many reads were not mapped at all?

−

* How many reads were mapped incorrectly? '''This is the least desirable outcome'''.

+

*How many reads were mapped incorrectly? '''This is the least desirable outcome'''.

−

* How many reads were mapped correctly?

+

*How many reads were mapped correctly?

−

Correct mapping should be defined as:

+

Correct mapping should be defined as:

−

* Most stringent: matches simulated location and CIGAR string.

+

*Most stringent: matches simulated location and CIGAR string.

−

* Less stringent: overlaps simulated location at base-pair level, CIGAR string and end positions may differ.

+

*Less stringent: overlaps simulated location at base-pair level, CIGAR string and end positions may differ.

−

* Incorrect: Doesn't overlap simulated location.

+

*Incorrect: Doesn't overlap simulated location.
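The three correctness levels above can be sketched as a small classifier. This is only an illustration: the `Alignment` record, its field names, and the half-open coordinate convention are assumptions made for the example, not part of any mapper's output format, and strand is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Hypothetical record for one placement of a read on the reference."""
    chrom: str
    start: int   # 0-based leftmost reference position (assumed convention)
    end: int     # exclusive end position on the reference
    cigar: str


def classify(truth: Alignment, mapped: Optional[Alignment]) -> str:
    """Compare a reported alignment with the simulated (true) one."""
    if mapped is None:
        return "unmapped"
    if mapped.chrom != truth.chrom:
        return "incorrect"
    # Most stringent: same location and same CIGAR string.
    if mapped.start == truth.start and mapped.cigar == truth.cigar:
        return "correct_strict"
    # Less stringent: the two intervals share at least one base pair.
    if mapped.start < truth.end and truth.start < mapped.end:
        return "correct_overlap"
    return "incorrect"
```

Counting the four labels over all reads in a scenario gives the quantities listed under "The key quantities are".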

−

== Mapping Qualities ==

+

== Mapping Qualities ==

−

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and, among those, how many map correctly or incorrectly. This gives a Heng Li graph, where one plots the number of correctly mapped reads vs. the number of mismapped reads.

+

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and, among those, how many map correctly or incorrectly. This gives a Heng Li graph, where one plots the number of correctly mapped reads vs. the number of mismapped reads.
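A minimal sketch of the cumulative counting described above, assuming each mapped read has already been reduced to a `(mapq, is_correct)` pair. The function name and the input shape are hypothetical.

```python
def mapq_curve(reads):
    """Return (threshold, n_correct, n_incorrect) points, highest threshold first.

    Each point counts the reads whose mapping quality is >= threshold.
    `reads` is an iterable of (mapq, is_correct) pairs for mapped reads.
    """
    # Tally correct and incorrect reads at each distinct mapping quality.
    by_quality = {}
    for mapq, is_correct in reads:
        counts = by_quality.setdefault(mapq, [0, 0])
        counts[0 if is_correct else 1] += 1

    # Accumulate from the highest quality downwards.
    points = []
    n_correct = n_incorrect = 0
    for q in sorted(by_quality, reverse=True):
        n_correct += by_quality[q][0]
        n_incorrect += by_quality[q][1]
        points.append((q, n_correct, n_incorrect))
    return points
```

Plotting `n_correct` against `n_incorrect` for these points gives one curve per mapper and scenario.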

−

== Available Test Datasets ==

+

== Available Test Datasets ==

−

*Location: wonderland:~zhanxw/BigSimulation

+

*Location: wonderland:~zhanxw/BigSimulation

−

*Scenarios:

+

*Scenarios:

−

~~no polymorphism ;~~

−

~~1, 2, 3 SNP ;~~

−

~~Deletion 5, 30, 200;~~

−

~~Insertion 5, 30~~

−

* Quality String

−

~~Picked the 75th percentile of the Sanger Illumina 108-mer test data set~~

−

* Format

−

~~both base space and color space~~

−

~~both single end and paired end, and paired end reads are given insert size 1500.~~

−

* Program (generator)

+

No polymorphism; 1, 2, 3 SNPs; deletion 5, 30, 200; insertion 5, 30

+

*Quality String

+

Picked the 75th percentile of the Sanger Illumina 108-mer test data set

+

*Format

+

Both base space and color space; both single end and paired end. Paired-end reads are given an insert size of 1500.

+

*Program (generator)

+

Usage:

−

~~Usage:~~

exact: Accurate sample from reference genome

Line 61: Line 63:

e.g. ./generator bs se exact -n 100 -l 35

−

* Output

+

*Output

−

Simulation files are named like: BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read.

+

−

For each read, the tag was named in a similar way to Sanger's.

+

Simulation files are named like: BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. For each read, the tag was named in a similar way to Sanger's.
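The naming scheme can be decoded mechanically. The helper below is a sketch only: `parse_name` and the returned keys are hypothetical, and it assumes the first five underscore-separated fields always appear in the order shown in the example above.

```python
SPACES = {"BS": "base space", "CS": "color space"}
ENDS = {"SE": "single end", "PE": "paired end"}


def parse_name(name):
    """Split a simulation file name such as BS_SE_EXACT_1000000_35 into fields."""
    stem = name.split(".")[0]          # drop any file extension
    parts = stem.split("_")
    if len(parts) < 5:
        raise ValueError(f"unexpected file name: {name!r}")
    space, end, scenario, n_reads, length = parts[:5]
    return {
        "space": SPACES[space],
        "end": ENDS[end],
        "scenario": scenario,          # e.g. EXACT means no polymorphism
        "reads": int(n_reads),
        "read_length": int(length),
    }
```

For example, `parse_name("BS_SE_EXACT_1000000_35")` yields base space, single end, scenario `EXACT`, 1000000 reads and a read length of 35.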

+

<br>

+

== Bulk Statistics Results ==

+

<br>

+

{| class="wikitable"
! BWA (seconds) !! Karma (seconds) !! Scenario
|-
| 2594 || 7182 || BS_SE_DEL200_1000000_50.fastq
|-
| 2641 || -1 || BS_SE_DEL30_1000000_50.fastq
|-
| 2355 || -1 || BS_SE_DEL5_1000000_50.fastq
|-
| 441 || 7941 || BS_SE_EXACT_1000000_50.fastq
|-
| 809 || 282 || BS_SE_INDEL30_1000000_50.fastq
|-
| 2217 || -1 || BS_SE_INDEL5_1000000_50.fastq
|-
| 645 || 7206 || BS_SE_SNP1_1000000_50.fastq
|-
| 1102 || -1 || BS_SE_SNP2_1000000_50.fastq
|-
| 1142 || -1 || BS_SE_SNP3_1000000_50.fastq
|-
| 6536 || 8874 || BS_PE_DEL200_1000000_50_?.fastq
|-
| 6699 || 9017 || BS_PE_DEL30_1000000_50_?.fastq
|-
| 6468 || 9033 || BS_PE_DEL5_1000000_50_?.fastq
|-
| 1743 || 10112 || BS_PE_EXACT_1000000_50_?.fastq
|-
| 2305 || 231 || BS_PE_INDEL30_1000000_50_?.fastq
|-
| 5703 || 2989 || BS_PE_INDEL5_1000000_50_?.fastq
|-
| 1974 || 3718 || BS_PE_SNP1_1000000_50_?.fastq
|-
| 2396 || 3339 || BS_PE_SNP2_1000000_50_?.fastq
|-
| 2817 || 3131 || BS_PE_SNP3_1000000_50_?.fastq
|-
| 4362 || 16074 || CS_PE_DEL200_1000000_50_?.fastq
|-
| 4385 || -1 || CS_PE_DEL30_1000000_50_?.fastq
|-
| 4373 || 9287 || CS_PE_DEL5_1000000_50_?.fastq
|-
| 773 || -1 || CS_PE_EXACT_1000000_50_?.fastq
|-
| 1735 || 3142 || CS_PE_INDEL30_1000000_50_?.fastq
|-
| 4023 || 8591 || CS_PE_INDEL5_1000000_50_?.fastq
|-
| 1034 || 10528 || CS_PE_SNP1_1000000_50_?.fastq
|-
| 2236 || -1 || CS_PE_SNP2_1000000_50_?.fastq
|-
| 3810 || 6617 || CS_PE_SNP3_1000000_50_?.fastq
|-
| 7129 || 1493 || CS_SE_DEL200_1000000_50.fastq
|-
| 7115 || 1513 || CS_SE_DEL30_1000000_50.fastq
|-
| 7065 || 1542 || CS_SE_DEL5_1000000_50.fastq
|-
| 1544 || 1666 || CS_SE_EXACT_1000000_50.fastq
|-
| 2954 || 289 || CS_SE_INDEL30_1000000_50.fastq
|-
| 6547 || 1390 || CS_SE_INDEL5_1000000_50.fastq
|-
| 1690 || 1661 || CS_SE_SNP1_1000000_50.fastq
|-
| 2853 || 1449 || CS_SE_SNP2_1000000_50.fastq
|-
| 4039 || 1237 || CS_SE_SNP3_1000000_50.fastq
|}
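The timings above are wall-clock seconds per scenario, whereas the Bulk Statistics section asks for speed in millions of reads per hour. The conversion is sketched below under two assumptions: that the read count is the one encoded in the file name, and that a non-positive time such as -1 marks a missing measurement (the page does not say what -1 means). Whether 1000000 counts reads or read pairs in the paired-end files is also not stated.

```python
def million_reads_per_hour(n_reads, seconds):
    """Convert a run time in seconds into millions of reads per hour.

    Returns None for non-positive times, which are treated here as
    missing measurements (an assumption about the -1 entries).
    """
    if seconds <= 0:
        return None
    return n_reads / seconds * 3600 / 1e6


# BWA on BS_SE_EXACT_1000000_50.fastq took 441 seconds in the table above.
speed = million_reads_per_hour(1_000_000, 441)
print(f"{speed:.2f} million reads per hour")
```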

Zhanxw (255 edits)
