Difference between revisions of "Evaluating a Read Mapper on Simulated Data"

From Genome Analysis Wiki
Jump to navigationJump to search
Line 159: Line 159:
 
<pre>
 
<pre>
 
BWA(second) Karma(second) Scenarios
 
BWA(second) Karma(second) Scenarios
2594 7182 BS_SE_DEL200_1000000_50.fastq
+
7561 4638 BS_PE_DEL200_1000000_50_?.fastq
2641 -1 BS_SE_DEL30_1000000_50.fastq
+
7548 4677 BS_PE_DEL30_1000000_50_?.fastq
2355 -1 BS_SE_DEL5_1000000_50.fastq
+
7225 4730 BS_PE_DEL5_1000000_50_?.fastq
441 7941 BS_SE_EXACT_1000000_50.fastq
+
975 6531 BS_PE_EXACT_1000000_50_?.fastq
809 282 BS_SE_INDEL30_1000000_50.fastq
+
1726 793 BS_PE_INDEL30_1000000_50_?.fastq
2217 -1 BS_SE_INDEL5_1000000_50.fastq
+
6199 4140 BS_PE_INDEL5_1000000_50_?.fastq
645 7206 BS_SE_SNP1_1000000_50.fastq
+
1193 4949 BS_PE_SNP1_1000000_50_?.fastq
1102 -1 BS_SE_SNP2_1000000_50.fastq
+
1646 4513 BS_PE_SNP2_1000000_50_?.fastq
1142 -1 BS_SE_SNP3_1000000_50.fastq
+
2064 4089 BS_PE_SNP3_1000000_50_?.fastq
6536 8874 BS_PE_DEL200_1000000_50_?.fastq
+
2594 3707 BS_SE_DEL200_1000000_50.fastq
6699 9017 BS_PE_DEL30_1000000_50_?.fastq
+
2641 3942 BS_SE_DEL30_1000000_50.fastq
6468 9033 BS_PE_DEL5_1000000_50_?.fastq
+
2355 4263 BS_SE_DEL5_1000000_50.fastq
1743 10112 BS_PE_EXACT_1000000_50_?.fastq
+
441 4228 BS_SE_EXACT_1000000_50.fastq
2305 231 BS_PE_INDEL30_1000000_50_?.fastq
+
809 764 BS_SE_INDEL30_1000000_50.fastq
5703 2989 BS_PE_INDEL5_1000000_50_?.fastq
+
2217 3932 BS_SE_INDEL5_1000000_50.fastq
1974 3718 BS_PE_SNP1_1000000_50_?.fastq
+
645 3808 BS_SE_SNP1_1000000_50.fastq
2396 3339 BS_PE_SNP2_1000000_50_?.fastq
+
1102 3473 BS_SE_SNP2_1000000_50.fastq
2817 3131 BS_PE_SNP3_1000000_50_?.fastq
+
1142 3267 BS_SE_SNP3_1000000_50.fastq
4362 16074 CS_PE_DEL200_1000000_50_?.fastq
+
6193 6909 CS_PE_DEL200_1000000_50_?.fastq
4385 -1 CS_PE_DEL30_1000000_50_?.fastq
+
6173 6636 CS_PE_DEL30_1000000_50_?.fastq
4373 9287 CS_PE_DEL5_1000000_50_?.fastq
+
6096 6702 CS_PE_DEL5_1000000_50_?.fastq
773 -1 CS_PE_EXACT_1000000_50_?.fastq
+
858 8496 CS_PE_EXACT_1000000_50_?.fastq
1735 3142 CS_PE_INDEL30_1000000_50_?.fastq
+
1743 948 CS_PE_INDEL30_1000000_50_?.fastq
4023 8591 CS_PE_INDEL5_1000000_50_?.fastq
+
5517 5412 CS_PE_INDEL5_1000000_50_?.fastq
1034 10528 CS_PE_SNP1_1000000_50_?.fastq
+
1253 8454 CS_PE_SNP1_1000000_50_?.fastq
2236 -1 CS_PE_SNP2_1000000_50_?.fastq
+
2113 7420 CS_PE_SNP2_1000000_50_?.fastq
3810 6617 CS_PE_SNP3_1000000_50_?.fastq
+
2622 6076 CS_PE_SNP3_1000000_50_?.fastq
7129 1493 CS_SE_DEL200_1000000_50.fastq
+
3878 1493 CS_SE_DEL200_1000000_50.fastq
7115 1513 CS_SE_DEL30_1000000_50.fastq
+
3859 1513 CS_SE_DEL30_1000000_50.fastq
7065 1542 CS_SE_DEL5_1000000_50.fastq
+
3775 1542 CS_SE_DEL5_1000000_50.fastq
1544 1666 CS_SE_EXACT_1000000_50.fastq
+
621 1666 CS_SE_EXACT_1000000_50.fastq
2954 289 CS_SE_INDEL30_1000000_50.fastq
+
1392 289 CS_SE_INDEL30_1000000_50.fastq
6547 1390 CS_SE_INDEL5_1000000_50.fastq
+
3525 1390 CS_SE_INDEL5_1000000_50.fastq
1690 1661 CS_SE_SNP1_1000000_50.fastq
+
874 1661 CS_SE_SNP1_1000000_50.fastq
2853 1449 CS_SE_SNP2_1000000_50.fastq
+
1965 1449 CS_SE_SNP2_1000000_50.fastq
4039 1237 CS_SE_SNP3_1000000_50.fastq
+
3314 1237 CS_SE_SNP3_1000000_50.fastq
 
</pre>
 
</pre>

Revision as of 11:52, 16 February 2010

Grouping

When evaluating read mappers, we should always focus on well defined sets of reads:

  • Reads with no polymorphisms.
  • Reads with 1, 2, 3 or more SNPs.
  • Reads with specific types of short indels (<10bp).
  • Reads with larger structural variants (>100bp).

SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read.

Should also be grouped according to whether reads are paired-end or single-end and according to read-length.

Bulk Statistics

  • Speed (millions of reads per hour)
  • Memory requirements
  • Size of output files
  • Raw count of mapped reads

Mapping Accuracy

The key quantities are:

  • How many reads were not mapped at all?
  • How many reads were mapped incorrectly? This is the least desirable outcome.
  • How many reads were mapped correctly?

Correct mapping should be defined as:

  • Most stringent: matches simulated location and CIGAR string.
  • Less stringent: overlaps simulated location at base-pair level, CIGAR string and end positions may differ.
  • Incorrect: Doesn't overlap simulated location.

Mapping Qualities

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.

Available Test Datasets

  • Location: wonderland:~zhanxw/BigSimulation
  • Scenarios:

no polymorphism ; 1, 2, 3 SNP ; Deletion 5, 30, 200; Insertion 5, 30

  • Quality String

Picked the 75 percentile of Sanger Iluumina 108 mer test data set BCCCCBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAAAAAAAAAA@@@@@@@@@@@@@@@???????????>>>>>>>>>>>>=========<<<<<<<<<<;;";


  • Format
Both base space and color space
Both single end and paired end, and paired end reads are given insert size 1500.
Forward strand and reverse strand are randomly assign with probability 1/2
  • Tag

@2:12345:F:SE:Exact @2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C @2:12345:F:PE+offset:SNP:2,12345,A,G (ref is A, read is G) @2:12345:F:PE+offset:Indel:25M30D5M

  • File Naming

BS_SE_EXACT_1M_50 BS_SE_SNP1_1M_50 CS_SE_INDEL1_1M CS_SE_INDEL30_1M CS_SE_INDEL200_1M CS_SE_DEL1_1M

For PE, appending "_1" and "_2", e.g.: PE_EXACT_1M_1 PE_EXACT_1M_2

  • Program (generator)

Usage:

        generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
        exact: Accurate sample from reference genome
        snpXX: Bring total XXX SNP for a single read or a pair of reads
        indelXX: Insert a random XX-length piece for a single read, or at the same position for a paired reads
        delXX: Delete a random XX-length piece for a single read, or at the same position for a paired reads
        e.g. ./generator bs se exact -n 100 -l 35
  • Output

Simulation file are named like: BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. For each read, the tag was named in a similar way to Sanger's.

  • Example

For illumina (from Sanger, 108mer hap1 test file): Example:

_1 file:
@20:14812275:F:217;None;None/1
AGTTGTTTACTTTCCTTTCCTACCTGGCTGCATCTGTCACATGCATATAGTGTCCCCTGACATGAAGCTCTGATATTGATCTGGAGCCCTATTGGTCTGCAAGTGACT
+
%27::2:::<70<<::95<<6/8<.)3;::9-,3:6/67731/.+)66;;53'31;9<815.%%%+%4-%%%90-)./26<831))(.%%%%%%%)%0%2%%%%%+%%

@15:59364621:R:-118;None;None/1
TGTTCAACCCACTATTAAGCCAGTATTAAATTGTTAATATCAGTTATTATACTTTTATTTCTAAAATTTCTATTTGATCCCTTTTTTTATAAACTCCAATGCATTCTC
+
%%2=;28>>>>=><>>>>>=>>=>>>;>=>9<1%+,//0+)<<91<4=;;<.%)2::8;;/9<;;;;8647<<;8;;066:<:4628;;;;5:9<<0/25752:3482

_2 file:
@20:14812275:F:217;None;None/2
CACTGGAGGGAATCCAATCCCAAATTAATATAACAAAACCAGAAGCTTGCTTAAAAAATATTTTATCAGATTCCAAAGTTGAGCTTGTGTTAGGGTGTACTGGAACTC
+
%%0;+250::-863486::599<9679/2%%))%+80%--7<;9/1%33,-%%)28/),3,67-8;56<1%)0/%%8;<;59/%%,())%%1%%+%).%099'4;+%-

@15:59364621:R:-118;None;None/2
AGAAATAAGACCACATGACAATGTTAAAAATAAAACAGGCAATAGCAATAGTCCCAGAGGTGGTTACAATATGATTTCATGCTCCAGAAAGTATAGGAGAAGACAAAG
+
%3===;==;7<<;7<5;==<<4<;9=8==<====:<<<<<;<==:=<58;===;:8'8:<===:.9:38908:=;;7;57)%.+%)967%%-%%'6:-%)7);<;0+%

Conclusion: If the first read is forward, then itself is the same as reference sequence and the second read is reverse complement to the reference sequence. If the first read is backward, then itself is reverse complement to the reference genome and the second read is the same as the reference sequence. The first strand always position can always obtain from tag, first two fields (seperated by colon). The second strand position is first strand position plus the offset.

For SOLiD (from Sanger, 50 mer hap1 test file) e.g.

_1 file:
>2:67043752:F:1445;2,67043761,A,G;None
T12221203021201200302123102221322000012301300211213
  22212031230012003021211022213220000123013022112123 (ref)
>4:125830377:R:-1541;None;None
T30002222300330113020203010322111010300030003230320

_2 file:
>2:67043752:F:1445;2,67043761,A,G;None
G13031223023023012201210020003310110111111203310211
  30312230230230122012100200033121201111113033112112 (ref)
 
>4:125830377:R:-1541;None;None
G13311131230200010201210032223330120312000301230032

Conclusion: The first strand and second strand have the same direction (both either same as the reference genome, or reverse complement to reference genome), where their positions are the same as Illumina reads.


Bulk statistics result

Running time (all submitted to the MOSIX client nodes)
Calculated by "./parseRunbatch.py batch2.log |cutrange 0,-1|charrange :-1".

Log file is from runbatch.pl and negative time means unfinished (at the moment of editing).

TODO: Add file size comparison; add link to memory page summarized by Dharknes.

BWA(second)	Karma(second)	Scenarios
7561	4638	BS_PE_DEL200_1000000_50_?.fastq
7548	4677	BS_PE_DEL30_1000000_50_?.fastq
7225	4730	BS_PE_DEL5_1000000_50_?.fastq
975	6531	BS_PE_EXACT_1000000_50_?.fastq
1726	793	BS_PE_INDEL30_1000000_50_?.fastq
6199	4140	BS_PE_INDEL5_1000000_50_?.fastq
1193	4949	BS_PE_SNP1_1000000_50_?.fastq
1646	4513	BS_PE_SNP2_1000000_50_?.fastq
2064	4089	BS_PE_SNP3_1000000_50_?.fastq
2594	3707	BS_SE_DEL200_1000000_50.fastq
2641	3942	BS_SE_DEL30_1000000_50.fastq
2355	4263	BS_SE_DEL5_1000000_50.fastq
441	4228	BS_SE_EXACT_1000000_50.fastq
809	764	BS_SE_INDEL30_1000000_50.fastq
2217	3932	BS_SE_INDEL5_1000000_50.fastq
645	3808	BS_SE_SNP1_1000000_50.fastq
1102	3473	BS_SE_SNP2_1000000_50.fastq
1142	3267	BS_SE_SNP3_1000000_50.fastq
6193	6909	CS_PE_DEL200_1000000_50_?.fastq
6173	6636	CS_PE_DEL30_1000000_50_?.fastq
6096	6702	CS_PE_DEL5_1000000_50_?.fastq
858	8496	CS_PE_EXACT_1000000_50_?.fastq
1743	948	CS_PE_INDEL30_1000000_50_?.fastq
5517	5412	CS_PE_INDEL5_1000000_50_?.fastq
1253	8454	CS_PE_SNP1_1000000_50_?.fastq
2113	7420	CS_PE_SNP2_1000000_50_?.fastq
2622	6076	CS_PE_SNP3_1000000_50_?.fastq
3878	1493	CS_SE_DEL200_1000000_50.fastq
3859	1513	CS_SE_DEL30_1000000_50.fastq
3775	1542	CS_SE_DEL5_1000000_50.fastq
621	1666	CS_SE_EXACT_1000000_50.fastq
1392	289	CS_SE_INDEL30_1000000_50.fastq
3525	1390	CS_SE_INDEL5_1000000_50.fastq
874	1661	CS_SE_SNP1_1000000_50.fastq
1965	1449	CS_SE_SNP2_1000000_50.fastq
3314	1237	CS_SE_SNP3_1000000_50.fastq