Difference between revisions of "Evaluating a Read Mapper on Simulated Data"

From Genome Analysis Wiki
Jump to: navigation, search
 
(8 intermediate revisions by one other user not shown)
Line 35: Line 35:
 
== Mapping Qualities  ==
 
== Mapping Qualities  ==
  
We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.  
+
We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.
 
 
== Available Test Datasets  ==
 
 
 
*Location: wonderland:~zhanxw/BigSimulation
 
*Scenarios:
 
 
 
no polymorphism ; 1, 2, 3 SNP ; Deletion 5, 30, 200; Insertion 5, 30
 
 
 
*Quality String
 
 
 
Picked the 75 percentile of Sanger Iluumina 108 mer test data set
 
 
 
*Format
 
 
 
both base space and color space both single end and paired end, and paired end reads are given insert size 1500.
 
 
 
*Program (generator)
 
 
 
Usage:
 
 
 
        generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
 
        exact: Accurate sample from reference genome
 
        snpXX: Bring total XXX SNP for a single read or a pair of reads
 
        indelXX: Insert a random XX-length piece for a single read, or at the same position for a paired reads
 
        delXX: Delete a random XX-length piece for a single read, or at the same position for a paired reads
 
        e.g. ./generator bs se exact -n 100 -l 35
 
 
 
*Output
 
 
 
Simulation file are named like: BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. For each read, the tag was named in a similar way to Sanger's.
 
 
 
<br>
 
 
 
= Bulk statistics result  =
 
Running time (all submitted to the MOSIX client node)
 
<br>
 
 
 
BWA(second) Karma(second) Scenarios
 
2594 7182 BS_SE_DEL200_1000000_50.fastq
 
2641 -1 BS_SE_DEL30_1000000_50.fastq
 
2355 -1 BS_SE_DEL5_1000000_50.fastq
 
441 7941 BS_SE_EXACT_1000000_50.fastq
 
809 282 BS_SE_INDEL30_1000000_50.fastq
 
2217 -1 BS_SE_INDEL5_1000000_50.fastq
 
645 7206 BS_SE_SNP1_1000000_50.fastq
 
1102 -1 BS_SE_SNP2_1000000_50.fastq
 
1142 -1 BS_SE_SNP3_1000000_50.fastq
 
6536 8874 BS_PE_DEL200_1000000_50_?.fastq
 
6699 9017 BS_PE_DEL30_1000000_50_?.fastq
 
6468 9033 BS_PE_DEL5_1000000_50_?.fastq
 
1743 10112 BS_PE_EXACT_1000000_50_?.fastq
 
2305 231 BS_PE_INDEL30_1000000_50_?.fastq
 
5703 2989 BS_PE_INDEL5_1000000_50_?.fastq
 
1974 3718 BS_PE_SNP1_1000000_50_?.fastq
 
2396 3339 BS_PE_SNP2_1000000_50_?.fastq
 
2817 3131 BS_PE_SNP3_1000000_50_?.fastq
 
4362 16074 CS_PE_DEL200_1000000_50_?.fastq
 
4385 -1 CS_PE_DEL30_1000000_50_?.fastq
 
4373 9287 CS_PE_DEL5_1000000_50_?.fastq
 
773 -1 CS_PE_EXACT_1000000_50_?.fastq
 
1735 3142 CS_PE_INDEL30_1000000_50_?.fastq
 
4023 8591 CS_PE_INDEL5_1000000_50_?.fastq
 
1034 10528 CS_PE_SNP1_1000000_50_?.fastq
 
2236 -1 CS_PE_SNP2_1000000_50_?.fastq
 
3810 6617 CS_PE_SNP3_1000000_50_?.fastq
 
7129 1493 CS_SE_DEL200_1000000_50.fastq
 
7115 1513 CS_SE_DEL30_1000000_50.fastq
 
7065 1542 CS_SE_DEL5_1000000_50.fastq
 
1544 1666 CS_SE_EXACT_1000000_50.fastq
 
2954 289 CS_SE_INDEL30_1000000_50.fastq
 
6547 1390 CS_SE_INDEL5_1000000_50.fastq
 
1690 1661 CS_SE_SNP1_1000000_50.fastq
 
2853 1449 CS_SE_SNP2_1000000_50.fastq
 
4039 1237 CS_SE_SNP3_1000000_50.fastq
 

Latest revision as of 22:19, 8 September 2010

Grouping

When evaluating read mappers, we should always focus on well defined sets of reads:

  • Reads with no polymorphisms.
  • Reads with 1, 2, 3 or more SNPs.
  • Reads with specific types of short indels (<10bp).
  • Reads with larger structural variants (>100bp).

SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read.

Should also be grouped according to whether reads are paired-end or single-end and according to read-length.

Bulk Statistics

  • Speed (millions of reads per hour)
  • Memory requirements
  • Size of output files
  • Raw count of mapped reads

Mapping Accuracy

The key quantities are:

  • How many reads were not mapped at all?
  • How many reads were mapped incorrectly? This is the least desirable outcome.
  • How many reads were mapped correctly?

Correct mapping should be defined as:

  • Most stringent: matches simulated location and CIGAR string.
  • Less stringent: overlaps simulated location at base-pair level, CIGAR string and end positions may differ.
  • Incorrect: Doesn't overlap simulated location.

Mapping Qualities

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.