Difference between revisions of "Evaluating a Read Mapper on Simulated Data"

Latest revision as of 23:19, 8 September 2010

Grouping

When evaluating read mappers, we should always focus on well-defined sets of reads:

Reads with no polymorphisms.
Reads with 1, 2, 3 or more SNPs.
Reads with specific types of short indels (<10bp).
Reads with larger structural variants (>100bp).

SNPs and sequencing errors should be treated separately, because SNPs can produce mismatches at high-quality bases. In addition to aggregating results over the categories above, we could separate results by the number of sequencing errors in each read.

Results should also be grouped by whether reads are paired-end or single-end, and by read length.

Bulk Statistics

Speed (millions of reads per hour)
Memory requirements
Size of output files
Raw count of mapped reads
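
As a small illustration of the speed metric, the conversion from a read count and a wall-clock running time to millions of reads per hour can be sketched as follows. The function name `reads_per_hour_millions` is a hypothetical helper introduced here, not part of any mapper or of the evaluation scripts.

```python
def reads_per_hour_millions(n_reads: int, wall_seconds: float) -> float:
    """Convert a read count and wall-clock time into millions of reads per hour.

    Hypothetical helper for illustration only.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    # (reads / 1e6) gives millions of reads; (3600 / seconds) scales to one hour.
    return n_reads / 1e6 * 3600.0 / wall_seconds
```

For example, one million reads mapped in exactly one hour (3600 seconds) corresponds to a speed of 1.0 million reads per hour.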

Mapping Accuracy

The key quantities are:

How many reads were not mapped at all?
How many reads were mapped incorrectly? This is the least desirable outcome.
How many reads were mapped correctly?

Correct mapping should be defined as:

Most stringent: matches simulated location and CIGAR string.
Less stringent: overlaps simulated location at base-pair level; CIGAR string and end positions may differ.
Incorrect: Doesn't overlap simulated location.
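
The definitions above could be turned into a classifier along the following lines. This is a minimal sketch under stated assumptions: the `Alignment` record, the category labels, and the use of 0-based half-open coordinates are all choices made for this illustration, and strand is ignored for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for a simulated or reported alignment."""
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)
    cigar: str


def classify(truth: Alignment, mapped: Optional[Alignment]) -> str:
    """Classify a reported alignment against the simulated location."""
    if mapped is None:
        return "unmapped"
    if mapped.chrom != truth.chrom:
        return "incorrect"
    # Most stringent: same start position and same CIGAR string.
    if mapped.start == truth.start and mapped.cigar == truth.cigar:
        return "correct_strict"
    # Less stringent: at least one reference base is shared.
    if mapped.start < truth.end and truth.start < mapped.end:
        return "correct_overlap"
    return "incorrect"
```

Counting the returned labels over all simulated reads gives the three key quantities (unmapped, incorrectly mapped, correctly mapped), with correct mappings split by stringency.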

Mapping Qualities

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and, among those, how many map correctly or incorrectly. This gives a Heng Li graph, in which the number of correctly mapped reads is plotted against the number of mismapped reads.
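
The cumulative counting described above can be sketched as follows. The function name `mapq_curve` and the input format (one `(mapping_quality, is_correct)` pair per mapped read) are assumptions made for this illustration; how correctness is decided is left to whichever definition of correct mapping is in use.

```python
from collections import Counter


def mapq_curve(records):
    """Return (threshold, n_correct, n_incorrect) for reads with MAPQ >= threshold.

    `records` is an iterable of (mapq, is_correct) pairs for mapped reads.
    Thresholds are listed from the highest observed mapping quality downward.
    """
    correct = Counter()
    incorrect = Counter()
    for mapq, is_correct in records:
        if is_correct:
            correct[mapq] += 1
        else:
            incorrect[mapq] += 1

    curve = []
    n_correct = n_incorrect = 0
    # Walk from high to low quality so the counts accumulate "this quality or greater".
    for q in sorted(set(correct) | set(incorrect), reverse=True):
        n_correct += correct[q]
        n_incorrect += incorrect[q]
        curve.append((q, n_correct, n_incorrect))
    return curve
```

Plotting the second column against the third, one point per threshold, gives the graph described above.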

@@ Line 35: / Line 35: @@
 == Mapping Qualities  ==
 We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.
-== Available Test Datasets  ==
-*Location: wonderland:~zhanxw/BigSimulation
-*Scenarios:
-no polymorphism; 1, 2, 3 SNP; Deletion 5, 30, 200; Insertion 5, 30
-*Quality String
-Picked the 75th percentile of the Sanger Illumina 108-mer test data set
-BCCCCBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAAAAAAAAAA@@@@@@@@@@@@@@@???????????>>>>>>>>>>>>=========<<<<<<<<<<;;";
-*Format
-; Both base space and color space
-; Both single end and paired end, and paired end reads are given insert size 1500.
-; Forward strand and reverse strand are randomly assigned, each with probability 1/2
-* Tag
-@2:12345:F:SE:Exact
-@2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C
-@2:12345:F:PE+offset:SNP:2,12345,A,G   (ref is A, read is G)
-@2:12345:F:PE+offset:Indel:25M30D5M
-* File Naming
-BS_SE_EXACT_1M_50
-BS_SE_SNP1_1M_50
-CS_SE_INDEL1_1M
-CS_SE_INDEL30_1M
-CS_SE_INDEL200_1M
-CS_SE_DEL1_1M
-For PE, appending "_1" and "_2", e.g.:
-PE_EXACT_1M_1
-PE_EXACT_1M_2
-*Program (generator)
-Usage:
-         generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
-         exact: Accurate sample from reference genome
-         snpXX: Introduce a total of XX SNPs into a single read or a pair of reads
-         indelXX: Insert a random XX-length piece into a single read, or at the same position in a pair of reads
-         delXX: Delete a random XX-length piece from a single read, or at the same position in a pair of reads
-         e.g. ./generator bs se exact -n 100 -l 35
-*Output
-Simulation files are named like BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. For each read, the tag is named in a similar way to Sanger's.
-* Example
-For illumina (from Sanger, 108mer hap1 test file):
-Example:
-<pre>
-_1 file:
-@20:14812275:F:217;None;None/1
-AGTTGTTTACTTTCCTTTCCTACCTGGCTGCATCTGTCACATGCATATAGTGTCCCCTGACATGAAGCTCTGATATTGATCTGGAGCCCTATTGGTCTGCAAGTGACT
-+
-%27::2:::<70<<::95<<6/8<.)3;::9-,3:6/67731/.+)66;;53'31;9<815.%%%+%4-%%%90-)./26<831))(.%%%%%%%)%0%2%%%%%+%%
-@15:59364621:R:-118;None;None/1
-TGTTCAACCCACTATTAAGCCAGTATTAAATTGTTAATATCAGTTATTATACTTTTATTTCTAAAATTTCTATTTGATCCCTTTTTTTATAAACTCCAATGCATTCTC
-+
-%%2=;28>>>>=><>>>>>=>>=>>>;>=>9<1%+,//0+)<<91<4=;;<.%)2::8;;/9<;;;;8647<<;8;;066:<:4628;;;;5:9<<0/25752:3482
-_2 file:
-@20:14812275:F:217;None;None/2
-CACTGGAGGGAATCCAATCCCAAATTAATATAACAAAACCAGAAGCTTGCTTAAAAAATATTTTATCAGATTCCAAAGTTGAGCTTGTGTTAGGGTGTACTGGAACTC
-+
-%%0;+250::-863486::599<9679/2%%))%+80%--7<;9/1%33,-%%)28/),3,67-8;56<1%)0/%%8;<;59/%%,())%%1%%+%).%099'4;+%-
-@15:59364621:R:-118;None;None/2
-AGAAATAAGACCACATGACAATGTTAAAAATAAAACAGGCAATAGCAATAGTCCCAGAGGTGGTTACAATATGATTTCATGCTCCAGAAAGTATAGGAGAAGACAAAG
-+
-%3===;==;7<<;7<5;==<<4<;9=8==<====:<<<<<;<==:=<58;===;:8'8:<===:.9:38908:=;;7;57)%.+%)967%%-%%'6:-%)7);<;0+%
-</pre>
-Conclusion:
-If the first read is forward, it is identical to the reference sequence and the second read is the reverse complement of the reference sequence.
-If the first read is reverse, it is the reverse complement of the reference genome and the second read is identical to the reference sequence.
-The first-strand position can always be obtained from the first two fields of the tag (separated by a colon).
-The second-strand position is the first-strand position plus the offset.
-For SOLiD (from Sanger, 50 mer hap1 test file)
-e.g.
-<pre>
-_1 file:
->2:67043752:F:1445;2,67043761,A,G;None
-T12221203021201200302123102221322000012301300211213
-  22212031230012003021211022213220000123013022112123 (ref)
->4:125830377:R:-1541;None;None
-T30002222300330113020203010322111010300030003230320
-_2 file:
->2:67043752:F:1445;2,67043761,A,G;None
-G13031223023023012201210020003310110111111203310211
-  30312230230230122012100200033121201111113033112112 (ref)
->4:125830377:R:-1541;None;None
-G13311131230200010201210032223330120312000301230032
-</pre>
-Conclusion:
-The first strand and second strand have the same direction (both are either identical to the reference genome or its reverse complement),
-and their positions are determined in the same way as for Illumina reads.
-<br>
-= Bulk statistics result  =
-Running time (all submitted to the MOSIX client nodes)
-<br>
-Calculated by "./parseRunbatch.py batch2.log  |cutrange 0,-1|charrange :-1".
-Log file is from runbatch.pl and negative time means unfinished (at the moment of editing).
-TODO: Add file size comparison; add link to memory page summarized by Dharknes.
-<pre>
-BWA (seconds)	Karma (seconds)	Scenario
-	4638	BS_PE_DEL200_1000000_50_?.fastq
-	4677	BS_PE_DEL30_1000000_50_?.fastq
-	4730	BS_PE_DEL5_1000000_50_?.fastq
-	6531	BS_PE_EXACT_1000000_50_?.fastq
-	793	BS_PE_INDEL30_1000000_50_?.fastq
-	4140	BS_PE_INDEL5_1000000_50_?.fastq
-	4949	BS_PE_SNP1_1000000_50_?.fastq
-	4513	BS_PE_SNP2_1000000_50_?.fastq
-	4089	BS_PE_SNP3_1000000_50_?.fastq
-	3707	BS_SE_DEL200_1000000_50.fastq
-	3942	BS_SE_DEL30_1000000_50.fastq
-	4263	BS_SE_DEL5_1000000_50.fastq
-	4228	BS_SE_EXACT_1000000_50.fastq
-	764	BS_SE_INDEL30_1000000_50.fastq
-	3932	BS_SE_INDEL5_1000000_50.fastq
-	3808	BS_SE_SNP1_1000000_50.fastq
-	3473	BS_SE_SNP2_1000000_50.fastq
-	3267	BS_SE_SNP3_1000000_50.fastq
-	6909	CS_PE_DEL200_1000000_50_?.fastq
-	6636	CS_PE_DEL30_1000000_50_?.fastq
-	6702	CS_PE_DEL5_1000000_50_?.fastq
-	8496	CS_PE_EXACT_1000000_50_?.fastq
-	948	CS_PE_INDEL30_1000000_50_?.fastq
-	5412	CS_PE_INDEL5_1000000_50_?.fastq
-	8454	CS_PE_SNP1_1000000_50_?.fastq
-	7420	CS_PE_SNP2_1000000_50_?.fastq
-	6076	CS_PE_SNP3_1000000_50_?.fastq
-	1493	CS_SE_DEL200_1000000_50.fastq
-	1513	CS_SE_DEL30_1000000_50.fastq
-	1542	CS_SE_DEL5_1000000_50.fastq
-	1666	CS_SE_EXACT_1000000_50.fastq
-	289	CS_SE_INDEL30_1000000_50.fastq
-	1390	CS_SE_INDEL5_1000000_50.fastq
-	1661	CS_SE_SNP1_1000000_50.fastq
-	1449	CS_SE_SNP2_1000000_50.fastq
-	1237	CS_SE_SNP3_1000000_50.fastq
-</pre>
