From Genome Analysis Wiki
Jump to: navigation, search

Evaluating a Read Mapper on Simulated Data

5,867 bytes removed, 23:19, 8 September 2010
no edit summary
== Mapping Qualities ==
We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.  == Available Test Datasets == *Location: wonderland:~zhanxw/BigSimulation *Scenarios: no polymorphism&nbsp;; 1, 2, 3 SNP&nbsp;; Deletion 5, 30, 200; Insertion 5, 30  *Quality String Picked the 75 percentile of Sanger Iluumina 108 mer test data set BCCCCBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAAAAAAAAAA@@@@@@@@@@@@@@@???????????>>>>>>>>>>>>=========<<<<<<<<<<;;";   *Format Both base space and color space  Both single end and paired end, and paired end reads are given insert size 1500.  Forward strand and reverse strand are randomly assign with probability 1/2 * Tag @2:12345:F:SE:Exact @2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C  @2:12345:F:PE+offset:SNP:2,12345,A,G (ref is A, read is G) @2:12345:F:PE+offset:Indel:25M30D5M * File Naming BS_SE_EXACT_1M_50 BS_SE_SNP1_1M_50 CS_SE_INDEL1_1M CS_SE_INDEL30_1M CS_SE_INDEL200_1M CS_SE_DEL1_1M For PE, appending "_1" and "_2", e.g.: PE_EXACT_1M_1 PE_EXACT_1M_2 *Program (generator) Usage:   generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize exact: Accurate sample from reference genome snpXX: Bring total XXX SNP for a single read or a pair of reads indelXX: Insert a random XX-length piece for a single read, or at the same position for a paired reads delXX: Delete a random XX-length piece for a single read, or at the same position for a paired reads e.g. ./generator bs se exact -n 100 -l 35 *Output Simulation file are named like: BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. For each read, the tag was named in a similar way to Sanger's.  * Example For illumina (from Sanger, 108mer hap1 test file): Example:<pre>_1 file:@20:14812275:F:217;None;None/1AGTTGTTTACTTTCCTTTCCTACCTGGCTGCATCTGTCACATGCATATAGTGTCCCCTGACATGAAGCTCTGATATTGATCTGGAGCCCTATTGGTCTGCAAGTGACT+%27::2:::<70<<::95<<6/8<.)3;::9-,3:6/67731/.+)66;;53'31;9<815.%%%+%4-%%%90-)./26<831))(.%%%%%%%)%0%2%%%%%+%% @15:59364621:R:-118;None;None/1TGTTCAACCCACTATTAAGCCAGTATTAAATTGTTAATATCAGTTATTATACTTTTATTTCTAAAATTTCTATTTGATCCCTTTTTTTATAAACTCCAATGCATTCTC+%%2=;28>>>>=><>>>>>=>>=>>>;>=>9<1%+,//0+)<<91<4=;;<.%)2::8;;/9<;;;;8647<<;8;;066:<:4628;;;;5:9<<0/25752:3482 _2 file:@20:14812275:F:217;None;None/2CACTGGAGGGAATCCAATCCCAAATTAATATAACAAAACCAGAAGCTTGCTTAAAAAATATTTTATCAGATTCCAAAGTTGAGCTTGTGTTAGGGTGTACTGGAACTC+%%0;+250::-863486::599<9679/2%%))%+80%--7<;9/1%33,-%%)28/),3,67-8;56<1%)0/%%8;<;59/%%,())%%1%%+%).%099'4;+%- @15:59364621:R:-118;None;None/2AGAAATAAGACCACATGACAATGTTAAAAATAAAACAGGCAATAGCAATAGTCCCAGAGGTGGTTACAATATGATTTCATGCTCCAGAAAGTATAGGAGAAGACAAAG+%3===;==;7<<;7<5;==<<4<;9=8==<====:<<<<<;<==:=<58;===;:8'8:<===:.9:38908:=;;7;57)%.+%)967%%-%%'6:-%)7);<;0+%</pre> Conclusion: If the first read is forward, then itself is the same as reference sequence and the second read is reverse complement to the reference sequence. If the first read is backward, then itself is reverse complement to the reference genome and the second read is the same as the reference sequence. The first strand always position can always obtain from tag, first two fields (seperated by colon). The second strand position is first strand position plus the offset. For SOLiD (from Sanger, 50 mer hap1 test file)e.g. <pre>_1 file:>2:67043752:F:1445;2,67043761,A,G;NoneT12221203021201200302123102221322000012301300211213 22212031230012003021211022213220000123013022112123 (ref)>4:125830377:R:-1541;None;NoneT30002222300330113020203010322111010300030003230320 _2 file:>2:67043752:F:1445;2,67043761,A,G;NoneG13031223023023012201210020003310110111111203310211 30312230230230122012100200033121201111113033112112 (ref) >4:125830377:R:-1541;None;NoneG13311131230200010201210032223330120312000301230032 </pre> Conclusion:The first strand and second strand have the same direction (both either same as the reference genome, or reverse complement to reference genome), where their positions are the same as Illumina reads. <br> = Bulk statistics result =Running time (all submitted to the MOSIX client nodes)<br>Calculated by "./ batch2.log |cutrange 0,-1|charrange :-1". Log file is from and negative time means unfinished (at the moment of editing).  TODO: Add file size comparison; add link to memory page summarized by Dharknes.<pre>BWA(second) Karma(second) Scenarios7561 4638 BS_PE_DEL200_1000000_50_?.fastq7548 4677 BS_PE_DEL30_1000000_50_?.fastq7225 4730 BS_PE_DEL5_1000000_50_?.fastq975 6531 BS_PE_EXACT_1000000_50_?.fastq1726 793 BS_PE_INDEL30_1000000_50_?.fastq6199 4140 BS_PE_INDEL5_1000000_50_?.fastq1193 4949 BS_PE_SNP1_1000000_50_?.fastq1646 4513 BS_PE_SNP2_1000000_50_?.fastq2064 4089 BS_PE_SNP3_1000000_50_?.fastq2594 3707 BS_SE_DEL200_1000000_50.fastq2641 3942 BS_SE_DEL30_1000000_50.fastq2355 4263 BS_SE_DEL5_1000000_50.fastq441 4228 BS_SE_EXACT_1000000_50.fastq809 764 BS_SE_INDEL30_1000000_50.fastq2217 3932 BS_SE_INDEL5_1000000_50.fastq645 3808 BS_SE_SNP1_1000000_50.fastq1102 3473 BS_SE_SNP2_1000000_50.fastq1142 3267 BS_SE_SNP3_1000000_50.fastq6193 6909 CS_PE_DEL200_1000000_50_?.fastq6173 6636 CS_PE_DEL30_1000000_50_?.fastq6096 6702 CS_PE_DEL5_1000000_50_?.fastq858 8496 CS_PE_EXACT_1000000_50_?.fastq1743 948 CS_PE_INDEL30_1000000_50_?.fastq5517 5412 CS_PE_INDEL5_1000000_50_?.fastq1253 8454 CS_PE_SNP1_1000000_50_?.fastq2113 7420 CS_PE_SNP2_1000000_50_?.fastq2622 6076 CS_PE_SNP3_1000000_50_?.fastq3878 1493 CS_SE_DEL200_1000000_50.fastq3859 1513 CS_SE_DEL30_1000000_50.fastq3775 1542 CS_SE_DEL5_1000000_50.fastq621 1666 CS_SE_EXACT_1000000_50.fastq1392 289 CS_SE_INDEL30_1000000_50.fastq3525 1390 CS_SE_INDEL5_1000000_50.fastq874 1661 CS_SE_SNP1_1000000_50.fastq1965 1449 CS_SE_SNP2_1000000_50.fastq3314 1237 CS_SE_SNP3_1000000_50.fastq</pre>

Navigation menu